About this item:

Author Notes:

To whom correspondence should be addressed: Matthew.Scotch@asu.edu

AM designed and trained the neural network, ran the experiments, performed the error analysis and wrote most of the manuscript.

DW proposed the idea of using distant supervision to improve the CRF NER’s performance in the previous manuscript, created the distant supervision dataset, supervised the experiments and wrote revisions of the manuscript.

AS reviewed, restructured and contributed many sections and revisions of the manuscript.

MS and GG provided overall guidance on the work and edited the final manuscript.

The authors would also like to acknowledge Karen O’Connor, Megan Rorison and Briana Trevino for their efforts in the annotation processes.

The authors are grateful to the anonymous reviewers for their valuable feedback and comments to improve the quality of the paper.

The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Conflict of Interest: none declared.


Research Funding:

Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number R01AI117011.


  • Science & Technology
  • Life Sciences & Biomedicine
  • Technology
  • Physical Sciences
  • Biochemical Research Methods
  • Biotechnology & Applied Microbiology
  • Computer Science, Interdisciplinary Applications
  • Mathematical & Computational Biology
  • Statistics & Probability
  • Biochemistry & Molecular Biology
  • Computer Science
  • Mathematics

Deep neural networks and distant supervision for geographic location mention extraction


Conference Name:

26th Annual Conference on Intelligent Systems for Molecular Biology (ISMB)


Conference Place:

Chicago, IL


Volume 34 | Issue 13

Type of Work:

Conference | Final Publisher PDF


Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.

Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.
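As an illustration of the token-level framing described in the abstract (each token is either a toponym or not), the sketch below shows one common way distant supervision can produce extra training labels: matching text against a list of known place names. This is an assumption made for illustration only; the abstract does not state how the authors generated their distant supervision data. The function name `distant_labels`, the `GAZETTEER` set and the example sentence are all hypothetical.

```python
from typing import List, Set, Tuple


def distant_labels(
    tokens: List[str],
    gazetteer: Set[Tuple[str, ...]],
    max_len: int = 3,
) -> List[int]:
    """Return 1 for tokens inside a gazetteer match, 0 otherwise.

    Assumption: place names are stored as lowercased token tuples, and
    the longest match starting at each position wins.
    """
    labels = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in gazetteer:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = 1
            i += matched
        else:
            i += 1
    return labels


# Hypothetical gazetteer and sentence, for illustration only.
GAZETTEER = {("chicago",), ("new", "york"), ("hong", "kong")}
sentence = "Samples were collected in Hong Kong and Chicago in 2009".split()
labels = distant_labels(sentence, GAZETTEER)
```

Labels produced this way are noisy (a place name can also be a person's name or a strain identifier), which is why such data is normally used to supplement human annotations rather than replace them.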

Copyright information:

© The Author(s) 2018. Published by Oxford University Press. All rights reserved.

This is an Open Access work distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/).
