Researchers are cracking text analysis one dataset at a time

Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

As explained in a blog post, the company analyzed the New York Times Annotated Corpus (a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned) and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

“A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge.” (Dr. Olivier Lichtarge, Baylor College of Medicine)

Essentially, the dataset is meant to give researchers a baseline understanding of which entities are important within a particular piece of content, an understanding that can then be complemented with background data sources that provide even more information. The number of times a person or company is mentioned in an article can be a very strong sign of importance, especially when compared with how often that word usually appears (one of the early methods for ranking search results). A more telling method of ranking importance, though, would also draw on existing knowledge of broader concepts to capture important words that don’t stand out by volume alone.

For example, in an article about NBA coach Becky Hammon, the blog post’s authors explain:

“‘Basketball’ is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about.

“Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once….

“Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.”
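To make the surface features in that last quote concrete, here is a minimal Python sketch of a salience baseline built only from mention count, a background-frequency weight and position in the document. It is not Google's method or the format of its dataset; the function name, the `doc_freq` background table and the weighting formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def salience_scores(mentions, n_tokens, doc_freq, n_docs):
    """Score each entity in one document using surface features only.

    mentions: list of (entity, token_offset) pairs found in the document.
    n_tokens: document length in tokens (must be > 0).
    doc_freq: hypothetical background table, entity -> number of corpus
              documents that mention it.
    n_docs:   size of that hypothetical background corpus.
    """
    counts = Counter(entity for entity, _ in mentions)

    # Remember the earliest offset at which each entity appears.
    first_pos = {}
    for entity, offset in mentions:
        first_pos[entity] = min(offset, first_pos.get(entity, offset))

    scores = {}
    for entity, count in counts.items():
        # Smoothed inverse document frequency: entities that are rare in
        # the background corpus get a larger weight.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(entity, 0))) + 1.0
        # 1.0 for an entity introduced at the very start, near 0.0 at the end.
        earliness = 1.0 - first_pos[entity] / n_tokens
        # Position scales the score between 0.5x (late) and 1.0x (early).
        scores[entity] = count * idf * (0.5 + 0.5 * earliness)
    return scores


# Made-up numbers for an article like the Becky Hammon example.
mentions = [("Becky Hammon", 0), ("Spurs", 5), ("Becky Hammon", 20),
            ("Becky Hammon", 40), ("WNBA", 90)]
doc_freq = {"Becky Hammon": 10, "Spurs": 500, "WNBA": 200}
scores = salience_scores(mentions, n_tokens=100, doc_freq=doc_freq, n_docs=10000)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these invented inputs, the WNBA lands at the bottom of the ranking because it appears once and late, which is exactly the kind of miss the authors expect background knowledge to fix.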

This type of work is important, because although there’s a lot of talk about advances in artificial intelligence, the truth is that we have a long way to go before machines can match human capabilities. As the post’s authors also note, “Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task?”

Source: IBM / Baylor College of Medicine

The end results of accomplishing this mission (and accomplishing it well) will include better search results, sure, but possibly better science and medicine as well. A team of researchers from IBM and Baylor College of Medicine just published a paper as part of the KDD 2014 conference detailing how they used IBM’s Watson system to analyze more than 70,000 scholarly articles about a particular protein. The program they created, called KnIT, analyzed the relationships between the target protein and others based, in simplified terms, on how often and how closely they appear together in the articles.

The system predicted seven of the nine proteins that have since been identified as important in tumor suppression.
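The simplified description of KnIT, scoring proteins by how often and how closely they appear near the target in the literature, can be illustrated with a small proximity-weighted co-occurrence count. This Python sketch is only a toy under stated assumptions: the function, the `candidates` set, the token window and the 1/distance weighting are invented here and are not taken from the IBM-Baylor paper.

```python
from collections import defaultdict


def cooccurrence_scores(documents, target, candidates, window=10):
    """Score candidate terms by how often and how closely they appear
    near `target`.

    documents:  iterable of token lists (already tokenized and lowercased).
    target:     the term of interest, e.g. a protein name.
    candidates: set of terms we are willing to score.
    window:     maximum token distance that counts as a co-occurrence.
    """
    scores = defaultdict(float)
    for tokens in documents:
        target_positions = [i for i, tok in enumerate(tokens) if tok == target]
        if not target_positions:
            continue  # this document never mentions the target
        for i, tok in enumerate(tokens):
            if tok not in candidates or tok == target:
                continue
            # Distance to the nearest mention of the target (always >= 1).
            distance = min(abs(i - p) for p in target_positions)
            if distance <= window:
                # Closer mentions contribute more: 1.0 when adjacent,
                # 1/window at the edge of the window.
                scores[tok] += 1.0 / distance
    return dict(scores)


# Placeholder names; no real proteins or papers are implied.
docs = [
    ["kinase1", "phosphorylates", "target"],
    ["target", "binds", "kinase1", "and", "kinase2"],
    ["kinase2", "is", "unrelated", "to", "anything", "here"],
]
result = cooccurrence_scores(docs, "target", {"kinase1", "kinase2"}, window=5)
print(sorted(result, key=result.get, reverse=True))
```

A real system would work over entity-normalized text and far richer statistics, but the ranking principle (frequent, nearby mentions score higher) is the same one the simplified explanation describes.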

In a separate paper published as part of KDD 2014, Allen Institute for Artificial Intelligence researchers detailed a question-answering system designed to read natural language questions and derive answers by scouring public knowledge bases such as Freebase. Oren Etzioni, the Allen Institute’s executive director, used a Monday morning keynote at the conference to talk about that research as well as the institute’s flagship, Project Aristo, which aims to build a system that can reason over what it reads at the same level as a fourth-grader (to begin with).

A diagram of how the Allen Institute’s question-answering system breaks down a query. Source: Allen Institute for Artificial Intelligence

He also discussed a project called Semantic Scholar, which is similar in aim to the IBM-Baylor one, except that it wants to enable semantic search over scholarly papers so researchers don’t need to nail their keywords in order to find what they want over an ever-growing body of work. Going back to the WNBA example in Google’s work, one could imagine searching for papers about the league and missing one that only used the exact term sparingly — if at all — but is nonetheless very relevant.
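The WNBA scenario can be mimicked with a very crude stand-in for semantic search: expanding the query with related terms from a background table before matching. The Python sketch below is an assumption-laden illustration, not how Semantic Scholar works; the `related` table, the paper titles and the whitespace tokenization are all made up.

```python
def expand_query(query_terms, related):
    """Add background-knowledge neighbors of each query term."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(related.get(term, ()))
    return expanded


def search(papers, query_terms, related):
    """Return titles of papers that share at least one term with the
    expanded query, ordered by overlap size and then by title.

    papers:  dict mapping title -> body text.
    related: hypothetical background table, term -> set of related terms
             (single lowercase tokens only in this sketch).
    """
    expanded = expand_query(query_terms, related)
    hits = []
    for title, text in papers.items():
        overlap = expanded & set(text.lower().split())
        if overlap:
            hits.append((len(overlap), title))
    return [title for _, title in sorted(hits, key=lambda h: (-h[0], h[1]))]


papers = {
    "A": "Hammon joins the Spurs bench",
    "B": "WNBA attendance rises",
    "C": "Protein folding advances",
}
background = {"wnba": {"hammon"}}  # invented link, for illustration only

print(search(papers, ["wnba"], {}))          # keyword match only
print(search(papers, ["wnba"], background))  # with background knowledge
```

Plain keyword matching finds only the paper that uses the exact term, while the expanded query also surfaces the paper that never mentions the league by name.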

Of course, these are just the latest examples of many new approaches to language understanding, including recent deep learning projects coming out of places such as Google, Stanford and DARPA, targeting use cases such as automatically detecting the meanings of words, sentiment analysis and anomaly detection.

One of the Baylor researchers, quoted in a press release about its work with IBM, explained the promise of all this work to build systems that can start to understand what text is actually about: “A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge.”

Essentially, computers can read a lot, and fast, and when programmed correctly they can help us find things we might never have the time to find ourselves.

Correction: The IBM research was in conjunction with Baylor College of Medicine, not Baylor University as originally written.

3 Responses to “Researchers are cracking text analysis one dataset at a time”

  1. Peter Fretty

    I love watching the evolution. The better we can gain perspective to this often unstructured data, the more powerful it becomes as we attempt to leverage, position and gain insights that can lead to meaningful actions. According to a recent SAS survey, organizations have four primary use cases when it comes to unstructured data: insight discovery, process or resource optimization, improved customer experience, and compliance execution.

    Peter Fretty