Alexander Sczyrba; Bielefeld University, Bielefeld
The increasingly widespread availability and application of high-throughput technologies in the life sciences, such as (meta-)genomics studies or imaging applications, generate an exponentially increasing amount of experimental data. The number of specialized databases distributed around the world is also growing rapidly. As a result, the storage, integration and processing of this data have become the bottleneck of analysis workflows, which require infrastructure for data storage as well as services for data processing, analysis and, in some cases, special access approval.
According to the definition of the National Institute of Standards and Technology (NIST), “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. Cloud computing plays an important role in many modern bioinformatics analysis workflows, from data management and processing to data integration and analysis, including data exploration and visualization. It provides massively scalable computing and storage infrastructures and can therefore represent the key technology for overcoming the aforementioned problems.
Two exemplary BIBI projects are presented on the following pages.
Cloud computing (bioinformatics) services are often divided into the following areas:
Growth of the Sequence Read Archive (SRA) database hosted at the National Center for Biotechnology Information (NCBI), USA. The archive, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high-throughput sequencing data. It accepts data from all branches of life as well as from metagenomic and environmental surveys. [1]
Virtual environments such as virtual machines (VMs) or Docker and Singularity containers provide maximal flexibility to users. In contrast to classical high-performance computing environments, they are independent of the operating system, software stacks and libraries installed on the host. Special requirements can therefore be met easily and without side effects. In addition, virtual environments make it easy to exchange analysis workflows, and publishing these environments makes research reproducible.
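As a minimal sketch of how a containerized analysis step can be described reproducibly, the following Python snippet assembles a `docker run` command line for a pinned image version. The image name `example/assembler`, its tag, the paths and the tool arguments are hypothetical placeholders and not part of any BIBI workflow; the snippet only builds and prints the command, it does not execute it.

```python
import shlex


def container_command(image, tag, host_dir, tool_args, mount_point="/data"):
    """Build a `docker run` argument list for an explicitly pinned image.

    All names passed in are illustrative assumptions; nothing is executed.
    """
    # A floating tag such as "latest" can change over time, so reject it
    # to keep the described environment reproducible.
    if tag in ("", "latest"):
        raise ValueError("pin an explicit image tag for reproducibility")
    return [
        "docker", "run",
        "--rm",                            # remove the container after it exits
        "-v", f"{host_dir}:{mount_point}",  # mount the host data directory
        "-w", mount_point,                 # run the tool inside the mounted directory
        f"{image}:{tag}",
        *tool_args,
    ]


if __name__ == "__main__":
    cmd = container_command(
        "example/assembler",        # hypothetical image name
        "1.2.0",                    # hypothetical pinned version
        "/home/user/project",       # hypothetical host directory
        ["assemble", "reads.fastq"],  # hypothetical tool invocation
    )
    print(shlex.join(cmd))
```

Because the image version is fixed in the command itself, the same line can be shared alongside a publication and rerun on any host that provides a container runtime, independent of the software installed there.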
The cloud computing department of BIBI develops and provides environments and workflows for bioinformatics analyses, mainly in the field of (meta-)genomics. A mirror of the SRA’s metagenomics data sets, hosted at the de.NBI Cloud site in Bielefeld, enables large-scale analyses that integrate publicly available data. Examples of such projects are described in the following sections.
[1] National Center for Biotechnology Information (NCBI). https://www.ncbi.nlm.nih.gov/sra/
[2] The Cost of Sequencing a Human Genome. http://genome.gov/sequencingcosts