About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature. / Tantoso, Erwin; Eisenhaber, Birgit; Sinha, Swati; Jensen, Lars Juhl; Eisenhaber, Frank.

In: Biology Direct, Vol. 18, 7, 2023.

Harvard

Tantoso, E, Eisenhaber, B, Sinha, S, Jensen, LJ & Eisenhaber, F 2023, 'About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature', Biology Direct, vol. 18, 7. https://doi.org/10.1186/s13062-023-00362-0

APA

Tantoso, E., Eisenhaber, B., Sinha, S., Jensen, L. J., & Eisenhaber, F. (2023). About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature. Biology Direct, 18, Article 7. https://doi.org/10.1186/s13062-023-00362-0

Vancouver

Tantoso E, Eisenhaber B, Sinha S, Jensen LJ, Eisenhaber F. About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature. Biology Direct. 2023;18:7. https://doi.org/10.1186/s13062-023-00362-0

Author

Tantoso, Erwin ; Eisenhaber, Birgit ; Sinha, Swati ; Jensen, Lars Juhl ; Eisenhaber, Frank. / About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature. In: Biology Direct. 2023 ; Vol. 18.

Bibtex

@article{5212c8334e7b4f41900dda19c53d5829,
title = "About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature",
abstract = "Background: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by the scientific literature and how close we are towards the goal of a complete list of E. coli gene functions. Results: The scientific literature about E. coli protein-coding genes has been mapped onto the genome via the mentioning of names for genomic regions in scientific articles both for the case of the strain K-12 MG1655 as well as for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genome. The article match was quantified with the ratio of a given gene name{\textquoteright}s occurrence to the mentioning of any gene names in the paper. The various genome regions have an extremely uneven literature coverage. A group of elite genes with ≥ 100 full publication equivalents (FPEs, FPE = 1 is an idealized publication devoted to just a single gene) attracts the lion share of the papers. For K-12, ~ 65% of the literature covers just 342 elite genes; for the softcore genome, ~ 68% of the FPEs is about only 342 elite gene families (GFs). We also find that most genes/GFs have at least one mentioning in a dedicated scientific article (with the exception of at least 137 protein-coding transcripts for K-12 and 26 GFs from the softcore genome). Whereas the literature growth rates were highest for uncharacterized or understudied genes until 2005–2010 compared with other groups of genes, they became negative thereafter. At the same time, literature for anyhow well-studied genes started to grow explosively with threshold T10 (≥ 10 FPEs). Typically, a body of ~ 20 actual articles generated over ~ 15 years of research effort was necessary to reach T10. Lineage-specific co-occurrence analysis of genes belonging to the accessory genome of E. coli together with genomic co-localization and sequence-analytic exploration hints previously completely uncharacterized genes yahV and yddL being associated with osmotic stress response/motility mechanisms. Conclusion: If the numbers of scientific articles about uncharacterized and understudied genes remain at least at present levels, full gene function lists for the strain K-12 MG1655 and the E. coli softcore genome are in reach within the next 25–30 years. Once the literature body for a gene crosses 10 FPEs, most of the critical fundamental research risk appears overcome and steady incremental research becomes possible.",
keywords = "Cryptic prophage, Escherichia coli, Gene function discovery rate, Gene function space, Protein function, Uncharacterized genes, yahV, yddL",
author = "Erwin Tantoso and Birgit Eisenhaber and Swati Sinha and Jensen, {Lars Juhl} and Frank Eisenhaber",
note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",
year = "2023",
doi = "10.1186/s13062-023-00362-0",
language = "English",
volume = "18",
pages = "7",
journal = "Biology Direct",
issn = "1745-6150",
publisher = "BioMed Central Ltd.",

}

RIS

TY - JOUR

T1 - About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature

AU - Tantoso, Erwin

AU - Eisenhaber, Birgit

AU - Sinha, Swati

AU - Jensen, Lars Juhl

AU - Eisenhaber, Frank

N1 - Publisher Copyright: © 2023, The Author(s).

PY - 2023

Y1 - 2023

AB - Background: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by the scientific literature and how close we are towards the goal of a complete list of E. coli gene functions. Results: The scientific literature about E. coli protein-coding genes has been mapped onto the genome via the mentioning of names for genomic regions in scientific articles both for the case of the strain K-12 MG1655 as well as for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genome. The article match was quantified with the ratio of a given gene name’s occurrence to the mentioning of any gene names in the paper. The various genome regions have an extremely uneven literature coverage. A group of elite genes with ≥ 100 full publication equivalents (FPEs, FPE = 1 is an idealized publication devoted to just a single gene) attracts the lion share of the papers. For K-12, ~ 65% of the literature covers just 342 elite genes; for the softcore genome, ~ 68% of the FPEs is about only 342 elite gene families (GFs). We also find that most genes/GFs have at least one mentioning in a dedicated scientific article (with the exception of at least 137 protein-coding transcripts for K-12 and 26 GFs from the softcore genome). Whereas the literature growth rates were highest for uncharacterized or understudied genes until 2005–2010 compared with other groups of genes, they became negative thereafter. At the same time, literature for anyhow well-studied genes started to grow explosively with threshold T10 (≥ 10 FPEs). Typically, a body of ~ 20 actual articles generated over ~ 15 years of research effort was necessary to reach T10. Lineage-specific co-occurrence analysis of genes belonging to the accessory genome of E. coli together with genomic co-localization and sequence-analytic exploration hints previously completely uncharacterized genes yahV and yddL being associated with osmotic stress response/motility mechanisms. Conclusion: If the numbers of scientific articles about uncharacterized and understudied genes remain at least at present levels, full gene function lists for the strain K-12 MG1655 and the E. coli softcore genome are in reach within the next 25–30 years. Once the literature body for a gene crosses 10 FPEs, most of the critical fundamental research risk appears overcome and steady incremental research becomes possible.

KW - Cryptic prophage

KW - Escherichia coli

KW - Gene function discovery rate

KW - Gene function space

KW - Protein function

KW - Uncharacterized genes

KW - yahV

KW - yddL

U2 - 10.1186/s13062-023-00362-0

DO - 10.1186/s13062-023-00362-0

M3 - Journal article

C2 - 36855185

AN - SCOPUS:85149153019

VL - 18

JO - Biology Direct

JF - Biology Direct

SN - 1745-6150

M1 - 7

ER -

ID: 339997594