Text. English. Chinese. Multi-structured. Language. Fuzzy. Logs. Hard-to-parse. Rich. Semi-structured. Whatever you want to call data that doesn’t fit neatly into tidy little rows and columns these days, can we please stop calling it “unstructured”? I feel a bit like Don Quixote in even pursuing this topic, but after 15-plus years of (mostly) working on search and text processing (including writing a book on the subject), I can’t help but feel that it’s time for the word “unstructured” to be retired and for us to find a better term to describe all of this stuff spewing from us and our computational creations.
Why all the (somewhat tongue-in-cheek) vitriol towards such a simple word? When I’m feeling cynical, I think that, in the early days of databases, someone coined “unstructured” as a derogatory term to mean “all the stuff a database isn’t good at working on.” If “structured” is good, then “un”-structured must be bad, right? The problem is that working with text is one of the defining computational challenges of our time. We need our best and brightest working on it, and not just so we can better target ads to consumers. It’s too full of promise to describe with such a dismissive word as “unstructured.” Numerical data? Child’s play! Text? Now there’s a real challenge.
Text is easily one of the most highly structured data types we face, and it comes filled with misspellings, misdirection, flowery language, ambiguity and implicit knowledge. Text is so often misunderstood that researchers in the field even have a metric (inter-annotator agreement) that tracks how often two people examining the same piece of text agree on the answer to some question about the text. Authors like Faulkner and Joyce treat language as an art form, yet Joe Forum User can’t, for whatever reason, write a complete sentence. I hate and love them all at the same time. How good do you think a computer can be at parsing a single sentence that spans multiple pages, much less making sense of a sentence that doesn’t even follow basic grammar rules?
Sure, we’ve made great progress, especially in recent years, in dealing with rich data like text, and big data and deep learning techniques hold even more promise for unlocking some of the mystery. We can now detect the end of a sentence with a high degree of accuracy, find sentiment in tweets and locate mentions of people on a page, just like your average fifth grader!
Yet, despite all of these advances, we need an advance of at least an order of magnitude (if not two or more) in our ability to process rich data across a variety of domains for us to truly harness the opportunity this data presents in moving civilization forward. I hope you’ll forgive me if the word “unstructured” leaves me feeling a bit empty inside when I think about that opportunity and about how little inspiration the word offers potential contributors. As for me, I’ll start by calling it “rich data” from here on out, windmills be damned.
Grant Ingersoll, who will be speaking at Structure:Data on March 20-21, is the CTO and cofounder of LucidWorks. He also coauthored Taming Text, cofounded Apache Mahout and is a long-standing committer on the Apache Lucene and Solr open source projects. He’s engineered a variety of search, question-answering and natural-language processing applications. You can follow Grant on Twitter @gsingers.