A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. / Westergaard, David; Stærfeldt, Hans-Henrik; Tønsberg, Christian; Jensen, Lars Juhl; Brunak, Søren.
In: PLoS Computational Biology, Vol. 14, No. 2, e1005962, 2018.
RIS
TY - JOUR
T1 - A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts
AU - Westergaard, David
AU - Stærfeldt, Hans-Henrik
AU - Tønsberg, Christian
AU - Jensen, Lars Juhl
AU - Brunak, Søren
PY - 2018
Y1 - 2018
AB - Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
KW - Abstracting and Indexing as Topic
KW - Area Under Curve
KW - Computational Biology/methods
KW - Data Mining/methods
KW - False Positive Reactions
KW - Genes
KW - Information Storage and Retrieval
KW - MEDLINE
KW - Periodicals as Topic
KW - Proteins/genetics
KW - ROC Curve
KW - Software
KW - Terminology as Topic
U2 - 10.1371/journal.pcbi.1005962
DO - 10.1371/journal.pcbi.1005962
M3 - Journal article
C2 - 29447159
VL - 14
JO - PLoS Computational Biology (Online)
JF - PLoS Computational Biology (Online)
SN - 1553-734X
IS - 2
M1 - e1005962
ER -
ID: 191215867