A Guide to Dictionary-Based Text Mining

Research output: Chapter in Book/Report/Conference proceeding › Book chapter › Research › peer-review

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross-reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
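
To make the first steps listed in the abstract concrete, the sketch below shows tokenization, dictionary lookup (named entity recognition), and normalization on a toy scale. It is an illustration only and not the JensenLab tagger or its input format: the dictionary entries, the identifiers such as GENE:0001, and the function names tokenize and tag are all assumptions introduced here. It assumes a greedy longest-match strategy over lowercased tokens, which is one common way to do dictionary matching.

```python
import re

# Hypothetical toy dictionary: lowercased surface form -> normalized identifier.
# The identifiers are illustrative placeholders, not real database accessions.
DICTIONARY = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "tumor protein p53": "GENE:0001",
    "brca1": "GENE:0002",
}

TOKEN_RE = re.compile(r"\w+")

# Longest dictionary name, measured in tokens, bounds the lookup window.
MAX_TOKENS = max(len(TOKEN_RE.findall(name)) for name in DICTIONARY)


def tokenize(text):
    """Split text into (token, start, end) tuples, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tag(text):
    """Return (start, end, surface, identifier) for each dictionary match.

    Matching is greedy: at each position the longest candidate is tried first,
    and matched tokens are not reused for shorter overlapping names.
    """
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            # Normalize the candidate by lowercasing and joining with spaces.
            candidate = " ".join(tok.lower() for tok, _, _ in tokens[i:i + n])
            if candidate in DICTIONARY:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                matches.append((start, end, text[start:end], DICTIONARY[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return matches


if __name__ == "__main__":
    sentence = "Mutations in tumor protein p53 and BRCA1 are common; P53 is well studied."
    for start, end, surface, identifier in tag(sentence):
        print(start, end, surface, identifier)
```

Because every synonym maps to one identifier, the different spellings of the same entity collapse into a single structured key, which is the property that makes the extracted data easy to cross-reference. A real dictionary would also need handling for ambiguous names and stop words, which this sketch omits.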

Original language: English
Title of host publication: Bioinformatics and Drug Discovery
Editors: Richard S. Larson, Tudor I. Oprea
Number of pages: 17
Volume: 1939
Publisher: Humana Press
Publication date: 2019
Edition: 3
Pages: 73-89
ISBN (Print): 978-1-4939-9088-7
ISBN (Electronic): 978-1-4939-9089-4
Publication status: Published - 2019
Series: Methods in Molecular Biology
ISSN: 1064-3745

ID: 223876548