- Mavromatis, Konstantinos;
- Ivanova, Natalia;
- Barry, Kerrie;
- Shapiro, Harris;
- Goltsman, Eugene;
- McHardy, Alice;
- Rigoutsos, Isidore;
- Salamov, Asaf;
- Korzeniewski, Frank;
- Land, Miriam;
- Hugenholtz, Phil;
- Kyrpides, Nikos C.
To evaluate methods used to analyse metagenomes, we constructed three synthetic metagenomic datasets of increasing complexity by combining reads from a selection of 113 isolate genome sequencing projects available through the Joint Genome Institute. Isolate genomes were selected to represent populations in metagenomic datasets based on similar patterns of genome size, GC content and relative phylogenetic position. Reads were randomly sampled from the selected genomes to match the read depth of their corresponding populations in the metagenomic assemblies. Sampled reads were then assembled using three programs used to assemble the real metagenomic data (Phrap, Arachne and JAZZ). Assembled contigs were binned using three different methods (oligonucleotide frequency, pattern discovery, best BLAST hit) and genes were called using two gene prediction pipelines (fgenesh, Critica/Glimmer). Because the composition of the simulated metagenomes is known, we were able to evaluate the quality of each step in the process. We explore how the population distribution and the algorithms used affect the quality of the final metagenomic dataset used for further analysis. The results show that each step depends on the dataset profile, and they expose the different limitations of the methods. The results of the analysis, as well as the sequences, will be available to the public. New methods for assembly, binning and gene calling can be tested and compared to each other using these datasets.
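The read-sampling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sampling without replacement, the stopping rule (stop once sampled bases reach target depth times genome size) and the seeding scheme are all assumptions made for illustration, and reads are modelled as plain strings.

```python
import random


def sample_reads_to_depth(reads, genome_size, target_depth, seed=0):
    """Randomly pick reads (without replacement) until the sampled bases
    reach target_depth * genome_size, or the read pool is exhausted.

    Hypothetical helper: the stopping rule is an assumption, not the
    procedure described in the paper.
    """
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)  # random order gives a uniform sample of the pool

    target_bases = target_depth * genome_size
    sampled, total_bases = [], 0
    for i in order:
        if total_bases >= target_bases:
            break
        sampled.append(reads[i])
        total_bases += len(reads[i])
    return sampled


def build_synthetic_dataset(read_pools, genome_sizes, target_depths, seed=0):
    """Combine depth-matched samples from several isolate genomes.

    All three arguments are dicts keyed by a (hypothetical) genome name.
    """
    dataset = []
    for offset, name in enumerate(sorted(read_pools)):
        dataset.extend(
            sample_reads_to_depth(
                read_pools[name],
                genome_sizes[name],
                target_depths[name],
                seed=seed + offset,  # a different stream per genome
            )
        )
    return dataset
```

For example, a pool of fifty 100-base reads from a 1,000-base genome sampled to 2x depth yields twenty reads. Lowering `target_depths` for one genome makes that population rarer in the combined dataset, which is one way to vary the complexity of a simulated community.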