Publication
Extracting medication information from unstructured public health data: a demonstration on data from population-based and tertiary-based samples
Downloadable Content
- Persistent URL
- Last modified
- 05/21/2025
- Type of Material
- Authors
- Robert Chen, Centers for Disease Control and Prevention
- Joyce Ho, Emory University
- Jin-Mann S. Lin, Centers for Disease Control and Prevention
- Language
- English
- Date
- 2020-10-15
- Publisher
- BMC
- Publication Version
- Copyright Statement
- © The Author(s) 2020
- License
- Final Published Version (URL)
- Title of Journal or Parent Work
- Volume
- 20
- Issue
- 1
- Start Page
- 258
- End Page
- 258
- Grant/Funding Information
- This work was conducted primarily while RC was supported in part by an appointment to the Research Participation Program at the Centers for Disease Control and Prevention, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the U.S. Department of Energy and the Centers for Disease Control and Prevention.
- Supplemental Material (URL)
- Abstract
- Background: Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing before it can be analyzed. Doing this manually is labor-intensive, slow and subject to error. In this study, we propose an automation framework for extracting and processing unstructured data. Methods: The proposed automation framework consisted of two natural language processing (NLP) based tools for unstructured text data, one for medications and one for reasons for medication use. We first checked spelling using a spell-check program trained on publicly available knowledge sources and then applied NLP techniques. We mapped medication names to generic names using vocabulary from publicly available knowledge sources. We used WHO’s Anatomical Therapeutic Chemical (ATC) classification system to map generic medication names to medication classes. We processed the reasons for medication use with the Lancaster stemmer method and then grouped and mapped them to disease classes based on organ systems. Finally, we demonstrated this automation framework on two data sources for Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS): tertiary-based (n = 378) and population-based (n = 664) samples. Results: A total of 8681 raw medication records were used for this demonstration. The 1266 distinct medication names (omitting supplements) were condensed to 89 ATC classification system categories. The 1432 distinct raw reasons for medication use were condensed to 65 categories via NLP. Compared with completing the entire process manually, our automation process reduced the number of terms requiring manual mapping by 84.4% for medications and 59.4% for reasons for medication use. Additionally, this process improved the precision of the mapped results. Conclusions: Our automation framework demonstrates the usefulness of NLP strategies even when there is no established mapping database. For a less established database (e.g., reasons for medication use), the method is easily modifiable as new knowledge sources for mapping are introduced. The capability to condense large numbers of features into interpretable ones will be valuable for subsequent analytical studies involving techniques such as machine learning and data mining.
- Author Notes
- Keywords
- Research Categories
- Health Sciences, Health Care Management
- Health Sciences, Public Health
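
The medication branch of the pipeline described in the abstract (spell-check, map to a generic name, map to an ATC class) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny vocabulary, the function names (`correct_spelling`, `map_to_atc`, `condense`) and the similarity cutoff are all assumptions, and Python's standard-library `difflib` stands in for the spell-check program the study trained on public knowledge sources. The reasons-for-use branch (Lancaster stemming and grouping by organ system) is not covered here.

```python
from difflib import get_close_matches

# Toy vocabulary for illustration only; the study drew its vocabulary
# from publicly available knowledge sources, which are not reproduced here.
BRAND_TO_GENERIC = {
    "advil": "ibuprofen",
    "motrin": "ibuprofen",
    "tylenol": "paracetamol",
    "zoloft": "sertraline",
}
# Generic name -> ATC class prefix (first four characters of the ATC code).
GENERIC_TO_ATC = {
    "ibuprofen": "M01A",    # anti-inflammatory products, non-steroids
    "paracetamol": "N02B",  # other analgesics and antipyretics
    "sertraline": "N06A",   # antidepressants
}
KNOWN_NAMES = sorted(set(BRAND_TO_GENERIC) | set(GENERIC_TO_ATC))


def correct_spelling(raw, cutoff=0.8):
    """Return the closest known medication name, or None if nothing is close."""
    token = raw.strip().lower()
    matches = get_close_matches(token, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def map_to_atc(raw):
    """Map a raw free-text medication entry to an ATC class, or None."""
    name = correct_spelling(raw)
    if name is None:
        return None  # no confident match: leave for manual review
    generic = BRAND_TO_GENERIC.get(name, name)
    return GENERIC_TO_ATC.get(generic)


def condense(records):
    """Group raw entries by ATC class and collect the ones left unmapped."""
    by_class, unmapped = {}, []
    for raw in records:
        atc = map_to_atc(raw)
        if atc is None:
            unmapped.append(raw)
        else:
            by_class.setdefault(atc, []).append(raw)
    return by_class, unmapped


if __name__ == "__main__":
    by_class, unmapped = condense(["Tylenol", "ibuprofin", "Advil", "zolof", "xyzzy"])
    print(by_class)
    print(unmapped)
```

Entries that fail to match are returned separately rather than forced into a class, mirroring the idea that automation reduces, but does not eliminate, the terms needing manual mapping.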
Relations
- In Collection:
Items
| Thumbnail | Title | File Description | Date Uploaded | Visibility | Actions |
|---|---|---|---|---|---|
| | Publication File - vqh4d.pdf | Primary Content | 2025-05-01 | Public | Download |