Assessment of metagenomic assembly using simulated next generation sequencing data

Research output: Contribution to journal, Journal article, Research, peer-review

Standard

Assessment of metagenomic assembly using simulated next generation sequencing data. / Mende, Daniel R; Waller, Alison S; Sunagawa, Shinichi; Järvelin, Aino I; Chan, Michelle M; Arumugam, Manimozhiyan; Raes, Jeroen; Bork, Peer.

In: PLoS ONE, Vol. 7, No. 2, 2012, p. e31386.

Harvard

Mende, DR, Waller, AS, Sunagawa, S, Järvelin, AI, Chan, MM, Arumugam, M, Raes, J & Bork, P 2012, 'Assessment of metagenomic assembly using simulated next generation sequencing data', PLoS ONE, vol. 7, no. 2, p. e31386. https://doi.org/10.1371/journal.pone.0031386

APA

Mende, D. R., Waller, A. S., Sunagawa, S., Järvelin, A. I., Chan, M. M., Arumugam, M., Raes, J., & Bork, P. (2012). Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE, 7(2), e31386. https://doi.org/10.1371/journal.pone.0031386

Vancouver

Mende DR, Waller AS, Sunagawa S, Järvelin AI, Chan MM, Arumugam M et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE. 2012;7(2):e31386. https://doi.org/10.1371/journal.pone.0031386

Author

Mende, Daniel R ; Waller, Alison S ; Sunagawa, Shinichi ; Järvelin, Aino I ; Chan, Michelle M ; Arumugam, Manimozhiyan ; Raes, Jeroen ; Bork, Peer. / Assessment of metagenomic assembly using simulated next generation sequencing data. In: PLoS ONE. 2012 ; Vol. 7, No. 2. p. e31386.

Bibtex

@article{5a0bf5d4b30640af9aa770c87dce16ff,
title = "Assessment of metagenomic assembly using simulated next generation sequencing data",
abstract = "Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.",
keywords = "Computational Biology, Computer Simulation, Contig Mapping, DNA, Bacterial, Genome, Bacterial, Genomics, Metagenome, Metagenomics, Models, Genetic, Probability, Quality Control, Reproducibility of Results, Sequence Analysis, DNA, Software",
author = "Mende, {Daniel R} and Waller, {Alison S} and Shinichi Sunagawa and J{\"a}rvelin, {Aino I} and Chan, {Michelle M} and Manimozhiyan Arumugam and Jeroen Raes and Peer Bork",
year = "2012",
doi = "10.1371/journal.pone.0031386",
language = "English",
volume = "7",
pages = "e31386",
journal = "PLoS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "2",

}

RIS

TY - JOUR

T1 - Assessment of metagenomic assembly using simulated next generation sequencing data

AU - Mende, Daniel R

AU - Waller, Alison S

AU - Sunagawa, Shinichi

AU - Järvelin, Aino I

AU - Chan, Michelle M

AU - Arumugam, Manimozhiyan

AU - Raes, Jeroen

AU - Bork, Peer

PY - 2012

Y1 - 2012

AB - Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.

KW - Computational Biology

KW - Computer Simulation

KW - Contig Mapping

KW - DNA, Bacterial

KW - Genome, Bacterial

KW - Genomics

KW - Metagenome

KW - Metagenomics

KW - Models, Genetic

KW - Probability

KW - Quality Control

KW - Reproducibility of Results

KW - Sequence Analysis, DNA

KW - Software

U2 - 10.1371/journal.pone.0031386

DO - 10.1371/journal.pone.0031386

M3 - Journal article

C2 - 22384016

VL - 7

SP - e31386

JO - PLoS ONE

JF - PLoS ONE

SN - 1932-6203

IS - 2

ER -

ID: 43975678