In metagenomics, high-throughput sequencing is used to determine the genomic information of all microorganisms in a selected community without the need for isolation or cultivation. In the past, many metagenome studies were limited to the analysis of certain marker genes (16S rDNA genes) to study the composition of the microbiome ("Who is out there?"). In this way, both spatial and temporal changes of the microbiome can be observed. Recent developments in sequencing technologies make it possible to address the functional potential of a community ("What are they doing?"). However, this requires a much greater sequencing depth and thus an enormous computational effort to reconstruct longer contiguous genomic sequences (contigs) by assembling billions of individual reads. Reconstructing the complete genome of every member of the community down to the species or even subspecies level is the subject of active research and remains an unsolved problem. Instead, only fragments of the genomes can be reconstructed, in the form of independent contigs whose origin remains unknown. Computational methods for grouping the contigs into genome-specific groups ("binning") fail for closely related organisms that have very similar genome sequences. Despite these limitations, metagenome studies have been successfully used to identify genes and genomes of interest.
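The binning difficulty described above can be illustrated with a small sketch. One widely used signal for binning is sequence composition, for example tetranucleotide frequencies. The Python example below is only an illustration under that assumption, not the method used in our projects; the function names and the toy contigs are hypothetical. It shows that two contigs from near-identical sequences have almost the same composition profile, so composition alone cannot separate them.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Return the normalized k-mer frequencies of a contig as a fixed-order list."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; others (e.g. with N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_similarity(a, b):
    """Cosine similarity of two equally long frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy contigs (hypothetical data): contig_b differs from contig_a by one base,
# contig_c has a completely different composition.
unit = "ATGCGATACGCTTGA"
contig_a = unit * 20
contig_b = unit * 19 + "ATGCGATACGCTTCA"
contig_c = "GGGCCCGGGCCCGCGC" * 20

sim_related = cosine_similarity(kmer_profile(contig_a), kmer_profile(contig_b))
sim_unrelated = cosine_similarity(kmer_profile(contig_a), kmer_profile(contig_c))
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

In practice, binning tools combine composition with further signals such as coverage across samples, but closely related genomes tend to remain hard to separate with these as well.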
Integrating the various omics technologies allows a comprehensive understanding of the investigated microbial communities. Metatranscriptome studies enable extensive analysis of expression patterns in microbial communities. Computational methods still need to be developed to compare the transcription rates of the various abundant organisms across multiple samples. Current RNA-Seq algorithms are limited to comparing expression patterns of individual genomes.
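To make the comparison problem concrete, the following Python sketch shows one conceivable normalization: read counts are scaled within each organism, so that a change in an organism's abundance between samples is not mistaken for a change in gene expression. This is an assumption chosen for illustration, not an established method from our work; the function name, the organism and gene labels, and the counts are hypothetical.

```python
from collections import defaultdict


def normalize_within_organism(counts, scale=1_000_000):
    """Scale raw read counts per organism.

    `counts` maps (organism, gene) -> raw read count for one sample.
    The result gives, for each gene, the reads per `scale` reads that
    were assigned to the same organism in that sample.
    """
    totals = defaultdict(int)
    for (organism, _gene), n in counts.items():
        totals[organism] += n
    return {
        (organism, gene): (n * scale / totals[organism] if totals[organism] else 0.0)
        for (organism, gene), n in counts.items()
    }


# Hypothetical counts: organism A is twice as abundant in sample 2,
# but the relative expression of its genes is unchanged.
sample_1 = {("orgA", "geneX"): 100, ("orgA", "geneY"): 900, ("orgB", "geneZ"): 50}
sample_2 = {("orgA", "geneX"): 200, ("orgA", "geneY"): 1800, ("orgB", "geneZ"): 50}

norm_1 = normalize_within_organism(sample_1)
norm_2 = normalize_within_organism(sample_2)
print(norm_1[("orgA", "geneX")], norm_2[("orgA", "geneX")])
```

After normalization, geneX of organism A has the same value in both samples, although its raw count doubled. A real method would additionally have to deal with gene length, reads that cannot be assigned unambiguously, and statistical testing.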
We develop cloud-based analysis workflows using modular containerized applications and integrate the results into web-based systems to support interpretation by our partners in biology projects. NoSQL databases store the genomic sequences and their metadata so that interesting features can be identified and extracted efficiently in parallelized pipelines. For comparative analysis with thousands of already published metagenomic datasets, we use the Apache Spark framework. Fast access is provided through HTML5-based web applications built on technologies that are otherwise used in search engines.
Although we have managed to assemble a large number of genes and genomes from a metagenome as complex as the cow rumen, there is still a need for metagenome-specific assemblers. Current short-read assemblers were designed specifically for the assembly of isolate genomes, and metagenome data sets pose a number of additional challenges for assembly. We are developing new tools and approaches for the metagenomic assembly problem.
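One of these challenges can be sketched in a few lines of Python. Many short-read assemblers are based on de Bruijn graphs; the example below is a minimal toy version, not one of our tools, and the function names and the two "genomes" are hypothetical. It shows how a sequence shared by two organisms in the same sample creates a branch in the graph, at which an assembler cannot tell which path belongs to which genome.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return all nodes with more than one successor, sorted alphabetically."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two toy genomes (hypothetical) that share the segment "GATTACA".
# For simplicity, each genome is treated as a single error-free read.
genome_a = "AAGTCGATTACAGGCT"
genome_b = "CCTGAGATTACATTGC"

graph = build_de_bruijn([genome_a, genome_b], k=5)
branches = branching_nodes(graph)
print(branches)  # → ['TACA']
```

The only branch is at the end of the shared segment, where the paths of the two genomes diverge. In real metagenomes, such branches arise in large numbers from conserved genes and closely related strains, in addition to uneven coverage and sequencing errors.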