Publication

Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

Persistent URL
Last modified
  • 05/15/2025
Type of Material
Authors
  • Payam Karisani, Emory University
  • Zhaohui Qin, Emory University
  • Eugene Agichtein, Emory University
Language
  • English
Date
  • 2018-01-01
Publisher
  • Oxford University Press (OUP): Policy C - Option A
Publication Version
Copyright Statement
  • © The Author(s) 2018. Published by Oxford University Press.
License
Final Published Version (URL)
Title of Journal or Parent Work
ISSN
  • 1758-0463
Volume
  • 2018
Grant/Funding Information
  • This work was partially funded by NIH Grant 1U24AI117966-01 (Year 3 Pilot projects), and by Emory University.
Abstract
  • The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie.
Author Notes
  • Corresponding author: Eugene Agichtein, Department of Computer Science, Mathematics & Science Center, Emory University, Suite W401, 400 Dowman Drive NE, Atlanta, Georgia 30322, USA. Tel: +1 (404) 727-7962; Fax: +1 (404) 727-5611; Email: eugene.agichtein@emory.edu
Research Categories
  • Mathematics
  • Computer Science
