The COVID-19 pandemic is the most devastating public health crisis in at least a century and has affected the lives of billions of people worldwide in unprecedented ways. Compared to pandemics of this scale in the past, societies are now equipped with advanced technologies that can mitigate the impacts of pandemics if utilized appropriately. However, opportunities are currently not fully utilized, particularly at the intersection of data science and health. Health-related big data and technological advances have the potential to significantly aid the fight against such pandemics, including the current pandemic’s ongoing and long-term impacts. Specifically, the field of natural language processing (NLP) has enormous potential at a time when vast amounts of text-based data are continuously generated from a multitude of sources, such as health/hospital systems, published medical literature, and social media. Effectively mitigating the impacts of the pandemic requires tackling challenges associated with the application and deployment of NLP systems. In this paper, we review the applications of NLP to address diverse aspects of the COVID-19 pandemic. We outline key NLP-related advances on a chosen set of topics reported in the literature and discuss the opportunities and challenges associated with applying NLP during the current pandemic and future ones. These opportunities and challenges can guide future research aimed at improving the current health and social response systems and pandemic preparedness.
BACKGROUND: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification. METHODS AND RESULTS: We included free-text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non-Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer-based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held-out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives. CONCLUSIONS: Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
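The sliding window idea mentioned in the preceding abstract can be illustrated with a minimal sketch. The abstract does not give the window size, the overlap, or how per-window predictions are combined, so everything below is an assumption for illustration: the names `sliding_windows` and `classify_long_note`, the 512/256 defaults, the max-score aggregation, and the `score_chunk` callable, which stands in for any model that returns a positive-class probability for one chunk.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], window: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping chunks of at most `window` tokens.

    Window and stride values are illustrative assumptions, not the paper's settings.
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end of the note
        start += stride
    return chunks


def classify_long_note(tokens: Sequence[str],
                       score_chunk: Callable[[List[str]], float],
                       window: int = 512, stride: int = 256,
                       threshold: float = 0.5) -> bool:
    """Label a note positive if any chunk scores at or above the threshold.

    Max-pooling over chunk scores is one plausible aggregation (assumed here);
    averaging or majority voting would be equally reasonable alternatives.
    """
    scores = [score_chunk(chunk) for chunk in sliding_windows(tokens, window, stride)]
    return max(scores) >= threshold
```

With max aggregation, a single chunk containing strong evidence is enough to flag the whole note, which suits cases where the relevant mention appears only once in a long record.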
Migraine is a highly prevalent and disabling neurological disorder. However, information about migraine management in real-world settings is limited to traditional health information sources. In this paper, we (i) verify that there is substantial migraine-related chatter available on social media (Twitter and Reddit), self-reported by those with migraine; (ii) develop a platform-independent text classification system for automatically detecting self-reported migraine-related posts, and (iii) conduct analyses of the self-reported posts to assess the utility of social media for studying this problem. We manually annotated 5750 Twitter posts and 302 Reddit posts, and used them for training and evaluating supervised machine learning methods. Our best system achieved an F1 score of 0.90 on Twitter and 0.93 on Reddit. Analysis of information posted by our 'migraine cohort' revealed the presence of a plethora of relevant information about migraine therapies and sentiments associated with them. Our study forms the foundation for conducting an in-depth analysis of migraine-related information using social media data.
Social media platforms are increasingly being used by intimate partner violence (IPV) victims to share experiences and seek support. If such information is automatically curated, it may be possible to conduct social media-based surveillance and even design interventions over such platforms. In this paper, we describe the development of a supervised classification system that automatically characterizes IPV-related posts on the social network Reddit. We collected data from four IPV-related subreddits and manually annotated the data to indicate whether a post is a self-report of IPV or not. Using the annotated data (N=289), we trained, evaluated, and compared supervised machine learning systems. A transformer-based classifier, RoBERTa, obtained the best classification performance with overall accuracy of 78% and IPV-self-report class F1-score of 0.67. Post-classification error analyses revealed that misclassifications often occur for posts that are very long or are non-first-person reports of IPV. Despite the relatively small annotated data, our classification methods obtained promising results, indicating that it may be possible to detect and, hence, provide support to IPV victims over Reddit.
Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performances in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. For pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvement in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and extended pretraining using SAPT and TSPT can further improve performance.
Background Opioid use disorder (OUD) is a major public health crisis for which buprenorphine-naloxone is an effective evidence-based treatment. Analysis of Reddit data yields detailed information about firsthand experiences with buprenorphine-naloxone that has the potential to inform treatment of OUD. Methods We conducted a thematic analysis of posts about buprenorphine-naloxone from a Reddit forum in which Reddit users anonymously discuss topics related to opioid use. We used an application programming interface to retrieve posts about buprenorphine-naloxone, then applied natural language processing to generate meta-information and curate samples of salient posts. We manually categorized posts according to their content and conducted natural language processing-aided analysis of posts about buprenorphine tapering strategies, withdrawal symptoms, and adjunctive substances/behaviors useful in the tapering process. Results A total of 16,146 posts from 1933 redditors were retrieved from the /r/suboxone subreddit. Thematic analysis of sample posts (N = 200) revealed descriptions of personal experiences (74%), nonpersonal accounts (24%), and other content (2%). Among redditors who reported tapering to termination (N = 40), 0.063 mg and 0.125 mg were the most common termination doses. Fatigue, gastrointestinal disturbance, and mood disturbance were the most frequent adverse effects, and loperamide and vitamins/dietary supplements were the most frequently discussed adjunctive substances/behaviors. Conclusions Discussions on Reddit are rich in information about buprenorphine-naloxone. Information derived from analysis of Reddit posts about buprenorphine-naloxone may not be available elsewhere and may help providers improve treatment of people with OUD through better understanding of the experiences of people who have used buprenorphine-naloxone.
Introduction: Medications such as buprenorphine and methadone are effective for treating opioid use disorder (OUD), but many patients face barriers related to treatment and access. We analyzed two sources of data (social media and published literature) to categorize and quantify such barriers. Methods: In this mixed methods study, we analyzed social media (Reddit) posts from three OUD-related forums (subreddits): r/suboxone, r/Methadone, and r/naltrexone. We applied natural language processing to identify posts relevant to treatment barriers, categorized them into insurance- and non-insurance-related, and manually subcategorized them into fine-grained topics. For comparison, we used substance use-, OUD- and barrier-related keywords to identify relevant articles from PubMed published between 2006 and 2022. We searched publications for language expressing fear of barriers, and hesitation or disinterest in medication treatment because of barriers, paying particular attention to the affected population groups described. Results: On social media, the top three insurance-related barriers included having no insurance (22.5%), insurance not covering OUD treatment (24.7%), and general difficulties of using insurance for OUD treatment (38.2%); while the top two non-insurance-related barriers included stigma (47.6%), and financial difficulties (26.2%). For published literature, stigma was the most prominently reported barrier, occurring in 78.9% of the publications reviewed, followed by financial and/or logistical issues to receiving medication treatment (73.7%), gender-specific barriers (36.8%), and fear (31.5%). Conclusion: The stigma associated with OUD and/or seeking treatment and insurance/cost are the two most common types of barriers reported in the two sources combined. Harm reduction efforts addressing barriers to recovery may benefit from leveraging multiple data sources.
Intimate partner violence (IPV) increased during the COVID-19 pandemic. Collecting actionable IPV-related data from conventional sources (e.g., medical records) was challenging during the pandemic, generating a need to obtain relevant data from non-conventional sources, such as social media. Social media, like Reddit, is a preferred medium of communication for IPV survivors to share their experiences and seek support with protected anonymity. Nevertheless, the scope of available IPV-related data on social media is rarely documented. Thus, we examined the availability of IPV-related information on Reddit and the characteristics of the reported IPV during the pandemic. Using natural language processing, we collected publicly available Reddit data from four IPV-related subreddits between January 1, 2020 and March 31, 2021. Of 4,000 collected posts, we randomly sampled 300 posts for analysis. Three individuals on the research team independently coded the data and resolved the coding discrepancies through discussions. We adopted quantitative content analysis and calculated the frequency of the identified codes. 36% of the posts (n = 108) constituted self-reported IPV by survivors, of which 40% regarded current/ongoing IPV, and 14% contained help-seeking messages. A majority of the survivors’ posts reflected psychological aggression, followed by physical violence. Notably, 61.4% of the psychological aggression involved expressive aggression, followed by gaslighting (54.3%) and coercive control (44.3%). Survivors’ top three needs during the pandemic were hearing similar experiences, legal advice, and validating their feelings/reactions/thoughts/actions. Albeit limited, data from bystanders (survivors’ friends, family, or neighbors) were also available. Rich data reflecting IPV survivors’ lived experiences were available on Reddit. Such information will be useful for IPV surveillance, prevention, and intervention.
Background: Social media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. Objective: Our objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. Methods: We created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. Results: FDA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed misinformation, information, political, and conspiracy theories to be prominent topics. 
Conclusions: Our proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network–based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.
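The intuition in the preceding abstract, that a heavily promoted product produces an anomalous jump in chatter volume, can be sketched as follows. The abstract does not name the anomaly detection algorithm, so this example assumes a simple trailing-window threshold (mean plus k standard deviations). The function names, the 7-day window, k = 3, and the floor on the standard deviation are all illustrative assumptions, not the authors' settings.

```python
from datetime import date
from statistics import mean, stdev
from typing import Optional, Sequence


def first_anomaly_index(counts: Sequence[float], window: int = 7,
                        k: float = 3.0, min_sigma: float = 1.0) -> Optional[int]:
    """Return the index of the first day whose count exceeds mean + k * sigma
    of the preceding `window` days, or None if no such day exists.

    `min_sigma` is an assumed floor that avoids flagging tiny fluctuations
    when the recent history is nearly constant.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to estimate a standard deviation")
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        threshold = mean(history) + k * max(stdev(history), min_sigma)
        if counts[i] > threshold:
            return i
    return None


def lead_days(signal_date: date, letter_date: date) -> int:
    """Days by which a signal precedes a warning letter (negative if it came later)."""
    return (letter_date - signal_date).days
```

For a key phrase with daily post counts, `first_anomaly_index` gives the day of the first signal, and `lead_days` compares that date with the corresponding letter issuance date. A thresholding rule of this kind needs only basic arithmetic over daily counts, which is consistent with the abstract's point that the approach does not depend on high-performance computing.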
OBJECTIVES: Xylazine is an alpha-2 agonist increasingly prevalent in the illicit drug supply. Our objectives were to curate information about xylazine through social media from People Who Use Drugs (PWUDs). Specifically, we sought to answer the following: 1) what are the demographics of Reddit subscribers reporting exposure to xylazine? 2) is xylazine a desired additive? and 3) what adverse effects of xylazine are PWUDs experiencing? METHODS: Natural Language Processing (NLP) was used to identify mentions of "xylazine" from posts by Reddit subscribers who also posted on drug-related subreddits. Posts were qualitatively evaluated for xylazine-related themes. A survey was developed to gather additional information about the Reddit subscribers. This survey was posted on subreddits that were identified by NLP to contain xylazine-related discussions from March 2022 to October 2022. RESULTS: 76 posts mentioning xylazine were extracted via NLP from 765,616 posts by 16,131 Reddit subscribers (January 2018 to August 2021). People on Reddit described xylazine as an unwanted adulterant in their opioid supply. 61 participants completed the survey. Of those that disclosed their location, 25/50 (50%) participants reported locations in the Northeastern United States. The most common route of xylazine use was intranasal use (57%). 31/59 (53%) reported experiencing xylazine withdrawal. Frequent adverse events reported were prolonged sedation (81%) and increased skin wounds (43%). CONCLUSIONS: Among respondents on these Reddit forums, xylazine appears to be an unwanted adulterant. PWUDs may be experiencing adverse effects such as prolonged sedation and xylazine withdrawal. This appeared to be more common in the Northeast.