$1.9 million NIH grant helps UD computer scientists make biomedical research easier to peruse

Whether you visit The New York Times or BuzzFeed, you probably first glance at images and then scan headlines and captions to see what the articles are about. Now, imagine that you could sort through tens of thousands of articles and get an idea of what they are about in just a few minutes.

This ability could accelerate fundamental, potentially life-saving biomedical research. Researchers often need to examine hundreds or even thousands of published journal articles in short order. Finding the pivotal, evidence-supported information within these publications is a tricky task when essential experimental evidence lies in images and captions that are not automatically searchable by Google or PubMed.

Large-scale analysis of images along with the text could soon be possible, thanks to new research led by Hagit Shatkay, an associate professor of computer and information sciences at the University of Delaware. She has received a $1.9 million grant from the National Institutes of Health to pursue this direction.

Accelerating biomedical research progress

Biomedical researchers around the world study a variety of conditions by examining gene expression and mutation in model organisms, such as yeast, fruit flies, C. elegans (worms), and mice. Understanding these species’ genes helps researchers discover how similar genetic abnormalities in humans elevate our risk of health problems.

When scientists publish their discoveries in journals, the papers often include graphs, charts, photographs, and other images. For example, papers about genetic discoveries often include images from gel electrophoresis experiments, which are used to identify fragments of proteins or DNA. These images look like columns with bands or blots of varied light and dark shades that convey important information.

“In biology, a lot of the evidence lies in the images,” said Shatkay.

However, this information can be difficult to find. Conventional search engines such as Google or PubMed do not use image contents as part of their search. Google assigns text tags or keywords to images as labels, so you can enter certain search terms and find images labeled with those terms, but no public search engine uses the images themselves as a basis for finding other related images or documents.
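To make the distinction concrete, here is a toy sketch (purely illustrative, with hypothetical file names and tags, not any search engine's actual machinery): tag-based search can only match human-assigned labels, while content-based search compares the images themselves, here via a crude intensity histogram.

```python
# Toy contrast between tag-based and content-based image search.
# Illustrative only; the file names and tags below are hypothetical.
import numpy as np
import cv2  # pip install opencv-python

def histogram(path):
    """Grayscale intensity histogram as a crude content descriptor."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h = cv2.calcHist([img], [0], None, [64], [0, 256]).ravel()
    return h / h.sum()

# Tag-based search: an image is found only if a human-assigned
# label happens to match the query word.
tags = {"fig1.png": {"gel", "electrophoresis"}, "fig2.png": {"microscopy"}}
print([name for name, t in tags.items() if "gel" in t])

# Content-based search: rank images by similarity of the images
# themselves (histogram overlap), with no labels involved at all.
ref = histogram("fig1.png")
scores = {name: float(np.dot(ref, histogram(name))) for name in tags}
print(sorted(scores, key=scores.get, reverse=True))
```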

Even conceptually simpler image-related tasks, such as identifying individual image panels within compound scientific figures, are not handled well by conventional tools and require attention. A new tool called FigSplit, recently developed by Pengyuan Li, a doctoral student in Shatkay’s lab, and published by Shatkay and her collaborators in the premier journal Bioinformatics, supports one such task.
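FigSplit’s actual method is described in the Bioinformatics paper; the sketch below is only a naive stand-in for the task, assuming panels are separated by wide bands of near-white pixels (an assumption real compound figures often violate).

```python
# Naive compound-figure panel splitting along near-white gutters.
# Illustrative sketch of the task only; this is NOT FigSplit's algorithm.
import cv2  # pip install opencv-python

def split_panels(img, axis=0, min_gap=10, white_thresh=250):
    """Recursively cut the image along rows (axis=0) or columns (axis=1)
    that are almost entirely near-white, alternating axes for grids."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
    # A row/column is "blank" if >99% of its pixels exceed the threshold.
    blank = (gray > white_thresh).mean(axis=1 - axis) > 0.99
    # Cut at the midpoint of every blank run at least min_gap lines long.
    cuts, start = [], None
    for i, b in enumerate(blank):
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start >= min_gap:
                cuts.append((start + i) // 2)
            start = None
    if not cuts:
        return [img]  # no gutter along this axis: treat as one panel
    bounds = [0] + cuts + [gray.shape[axis]]
    panels = []
    for lo, hi in zip(bounds, bounds[1:]):
        piece = img[lo:hi] if axis == 0 else img[:, lo:hi]
        panels.extend(split_panels(piece, axis=1 - axis, min_gap=min_gap))
    return panels

# Usage: panels = split_panels(cv2.imread("compound_figure.png"))
```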

Under the new grant, Shatkay and a team of collaborators including Liz Marai from the University of Illinois at Chicago, and UD’s Cecilia Arighi, Chandra Kambhamettu and Cathy Wu, aim to identify figures, classify them, and combine them with text to improve the identification of scientific evidence.

“The goal of this project is to make discovery easier, to help researchers more readily find exactly what they need,” said Shatkay.

The tools Shatkay and her collaborators are developing could be especially useful to scientific database curators, who combine and organize material obtained from the literature into databases so that the most up-to-date information is available for researchers designing new experiments. The technology is being developed in collaboration with curators in Europe and in the U.S. at centers such as the Jackson Laboratory in Maine and the California Institute of Technology’s WormBase Project.

Shatkay’s research could also benefit healthcare experts who analyze medical and clinical data to predict patient outcomes, such as the likelihood of complication or recovery.

And although this system is being designed for use with biomedical publications, its use could extend to other academic journals or even consumer publications such as newspapers and magazines.

“We are pushing the envelope in analyzing documents in general,” said Shatkay.

The researchers are developing and using machine learning methods, which allow computers to adapt their actions based on data, to speed up the organization and classification of the information they are collecting. Shatkay explains it like this: a typical machine-learning scenario starts with a large amount of data. Say you have many examples of a complex phenomenon (such as images of melanoma versus benign tumors, or records of fraudulent credit card transactions versus non-fraudulent ones). You can train a machine to recognize the phenomenon, identifying melanoma or catching fraudulent transactions, without having to specify explicitly what constitutes a melanoma image or what is fraud.
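In code, that scenario looks something like the following minimal scikit-learn sketch, with synthetic data standing in for real labeled examples; it is a generic illustration, not the project’s actual pipeline.

```python
# Minimal supervised-learning sketch: learn to separate two classes
# from labeled examples, with no hand-written rules. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data such as image features or
# transaction records, each row labeled 0 or 1.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model infers what distinguishes the classes from the examples.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```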

“The underlying typical assumption is that when we have much data, we have many examples to learn from, and the machine can learn effectively,” she said. However, learning becomes much harder when you are looking at a rare phenomenon, like a disease that affects less than 0.001 percent of the population, experiments that are not often performed, or rare genomic mutations. You essentially have a lot of data that is irrelevant to what you are looking for and only a small amount that is relevant. For instance, you may have massive volumes of data from healthy patients and only a tiny bit from sick ones.
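One common mitigation, sketched below purely as an illustration (the group’s own techniques are described in their publications), is to weight errors on the rare class more heavily and to judge the model by per-class precision and recall rather than raw accuracy, which a model that always predicts the majority class can trivially inflate.

```python
# Sketch of one standard way to cope with class imbalance: class
# weighting. Illustrative only, not Shatkay's group's actual method.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data with only 1% positives, mimicking a rare phenomenon.
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes mistakes on the rare class more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```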

“Doing machine learning in the face of such imbalance is a hurdle that got little attention thus far, and my group focuses on addressing it,” said Shatkay. They presented some of this research at the 2017 IEEE International Conference on Bioinformatics and Biomedicine, and more work is in the pipeline.

For this project, Shatkay is collaborating with several investigators: from outside UD, Liz Marai, an expert on scientific visualization and biomedical image analysis at the University of Illinois at Chicago; and from UD, Cathy Wu, the Unidel Edward G. Jefferson Chair in Engineering and Computer Science; Chandra Kambhamettu, professor of computer and information sciences; and Cecilia Arighi, an expert biocurator, chair of the International Society for Biocuration (ISB) and a research associate professor at the Center for Bioinformatics and Computational Biology.

Wu, a leading figure in the field of bioinformatics, has developed multiple resources that will be utilized in this project, including the Protein Information Resource and Protein Ontology. Kambhamettu is an expert in computer vision and image analysis. Arighi concentrates on the applicability of the tools being designed, providing relevant literature for protein-protein interaction experiments and much insight into the biocuration process and the evaluation of the tools.

Pengyuan Li and Xiangying Jiang, both doctoral students in computer science, also contribute to the project. Past contributors include Scott Sorenson, Xiaolong Wang, Abhishek Kolagunda, Zhenzhu Zheng and Kaidi Ma.

Together, they all aim to push the frontiers of knowledge.

“So far we’ve been relying a lot on the text to find information, and figures have a lot to say that we haven’t uncovered yet,” Arighi said.

Photo by Kathy F. Atkinson