
Summary:

LucidWorks’ Grant Ingersoll argues that it’s time to stop using language to diminish the importance of text, one of the defining computational challenges of our time.

Text. English. Chinese. Multi-structured. Language. Fuzzy. Logs. Hard-to-parse. Rich. Semi-structured. Whatever you want to call data that doesn’t fit neatly into tidy little rows and columns these days, can we please stop calling it “unstructured”? I feel a bit like Don Quixote in even pursuing this topic, but after 15-plus years of (mostly) working on search and text processing (including writing a book on the subject) I can’t help but feel that it’s time for the word “unstructured” to be retired and for us to find a better term to describe all of this stuff spewing from us and our computational creations.

Why all the (somewhat tongue-in-cheek) vitriol towards such a simple word? When I’m feeling cynical, I think that, in the early days of databases, someone coined “unstructured” as a derogatory term to mean “all the stuff a database isn’t good at working on.” If “structured” is good, then “un”-structured must be bad, right? The problem is that working with text is one of the defining computational challenges of our time. We need our best and brightest working on it; and not just so we can better target ads to consumers. It’s too full of promise to describe with such a diminutive word as “unstructured.” Numerical data? Child’s play! Text? Now there’s a real challenge.

Text is easily one of the most highly structured data types we face, filled with misspellings, misdirection, flowery language, ambiguity and implicit knowledge. Text is so often misunderstood that researchers in the field even have a metric (inter-annotator agreement) that tracks how often two people examining the same piece of text agree on the answer to some question about the text. Authors like Faulkner and Joyce treat language as an art form, yet Joe Forum User can’t, for whatever reason, write a complete sentence. I hate them and love them all, all at the same time. How good do you think a computer can be at parsing a single sentence that spans multiple pages, much less at making sense of a sentence that doesn’t even follow basic grammar rules?
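For readers who have not met inter-annotator agreement before, here is a minimal Python sketch of how it can be measured. It computes raw percent agreement and Cohen's kappa, which is one common chance-corrected statistic; the article does not say which statistic is meant, so the choice of kappa, the function names and the toy sentiment labels are all assumptions made for illustration.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical sentiment labels from two people reading the same eight tweets.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos", "pos", "neg"]

print(observed_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```

On these made-up labels the two annotators agree on six of eight items (0.75), but because half of that agreement would be expected by chance, kappa comes out at 0.5.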

Sure, we’ve made great progress, especially in recent years, in dealing with rich data like text, and big data and deep learning techniques hold even more promise for unlocking some of the mystery. We can now detect the end of a sentence with a high degree of accuracy, find sentiment in tweets and locate the mentions of people on a page, just like your average fifth grader!

Yet, despite all of these advances, we need at least an order of magnitude advance (if not two or more) in our ability to process rich data across a variety of domains for us to truly harness the opportunity this data presents in moving civilization forward. I hope you’ll forgive me if the word “unstructured” leaves me feeling a bit empty inside when I think about that opportunity and the lack of inspiration it provides to potential contributors. As for me, I’ll start by calling it “rich data” from here on out, windmills be damned.

Grant Ingersoll, who will be speaking at Structure:Data on March 20-21, is the CTO and cofounder of LucidWorks. He also coauthored Taming Text, cofounded Apache Mahout and is a long-standing committer on the Apache Lucene and Solr open source projects. He’s engineered a variety of search, question-answering and natural-language processing applications. You can follow Grant on Twitter @gsingers.

 

  1. Especially in certain information domains, like health care and electronic medical records, “unstructured” does seem like the best word to describe what’s happening or not happening. Progress notes written into a health record are rich data, as you propose, with importance and meaning; but unencoded, it’s low in value and leads to interpretation, duplication in assessments and diagnostics, and other inefficiency. So we could call it rich, unstructured, unencoded, or more optimistically, harvestable.

  2. I find it ironic that on this page for the article “Can we please stop saying “unstructured” data?” there is an advert for “Gigaom Structure Data” (conference).

    1. Tim,

      Indeed. It was intentional, as I am speaking at ‘Structure’!

      -Grant

  3. J. Andrew Rogers Sunday, March 17, 2013

    Similar to your cynical definition above, I have asserted for many years that “unstructured” means “structured data my database handles poorly”. This is why there are so many conflicting definitions about what is and is not unstructured data.

    Data has structure by definition. If you look at the internals of a database designed for “unstructured” data, no matter how you choose to define it, the representation of that data is highly structured. People are conflating optimization choices made by their data platforms with the intrinsic properties of the data they put in their platforms.

  4. Michael Brill Sunday, March 17, 2013

    Not sure I understand why “unstructured” is inaccurate. Maybe “poorly-structured” or “structured with a bewildering array of interpretations?” Obviously there are plenty of scenarios where natural language is the only reasonable content format, but for most of commerce it’s nearly useless.

    Maybe instead of throwing so much money at something that is so complex, we put some of that towards making it easier for people to translate their intent into structured content. There’s got to be something in-between typing in natural language and entering data into a form.

    1. Michael,

      While it’s somewhat tongue-in-cheek, I guess my argument would be that language and other rich data are usually highly structured, which is why they don’t fit neatly into traditional database approaches. I do like your “bewildering array” view, but that doesn’t exactly roll off the tongue.

      FWIW, most of the work these days on handling rich data is to put it into a format that is more consumable by computers (e.g., triple stores).

  5. I agree that the term is not perfect. Moreover, neither is the term “structured data.” It isn’t only the neat structure of relational (table-based) data that is significant. It also connotes a bias towards a particular use case for ‘structured’ data – record keeping and transaction processing.

    When folk refer to the difficulty of managing unstructured data they are often indicating – by way of contrast – information that is resistant to the assembly-line metaphor to which transactional data is so readily subjected by being structured into tables.

    And yes, the thorniest version of unstructured data is text – but it also includes non-text data like image, video and audio.

    A deeper issue is the binary nature of the language as well – hierarchical data in directories and XML applications of various semantic flavours are often edge cases that are somewhat easier to process than human-produced text but still substantially trickier than neatly formed two-dimensional records.

    But in the end I don’t think it really matters. As much as it pains me from time to time, it isn’t the word that matters so much as its meaning and what people intend and understand when it is used.

    In my professional life as an Enterprise Architect, and earlier as an Information Architect, I have generally found that people understand exactly what is meant by ‘unstructured data’ – what it denotes and connotes and why it matters to understand it.

    That the word is not causing confusion and poor thinking is really the test of it. And I really haven’t seen any evidence of confused ignorance or harmful misunderstanding in my work.

    1. When you say that “the test of it is that it doesn’t cause confusion and poor thinking”, that seems similar to me writing an article called “Can we please stop saying dude!”, and then someone making a comment of “Well, the word dude doesn’t seem to be causing confusion and poor thinking”. Okay, bad example, I like the word dude.

      But in a similar vein to your statement, I would argue that the author doesn’t really feel like Don Quixote, because Don Quixote is a fictional character that I don’t believe anyone actually feels like; he’s just an enjoyable and common literary reference.

  6. Michael Brill Sunday, March 17, 2013

    I guess what I’m suggesting is that we may be able to avoid the harder parts of NLP by helping people communicate in a more structured way in the first place. It’s not so much about database storage or query processing per se, but more about facilitating the elicitation and capture of explicit intent… ideally as close as possible to the thoughts in our heads.

    The argument is that if we can capture intent in a semantically unambiguous way as the user is expressing themselves, then life gets a lot simpler and more interesting. If we wait until the user expresses themselves, unaided by technology, and then try to figure out what in the world they’re saying later, it’s obviously much harder.

    I guess this is part of a larger gripe that we have made it so easy to create low-signal data that we now have to invest *massively* in a downstream NLP/big data/machine learning layer to try to get some value out of the mess of data that’s being created. And, as you mentioned, we’re very far from being able to do that beyond a fifth-grade level.

    I realize this is a bit off-topic from your original post, but I wonder what you think about pushing processing back to the point of data creation where the user has a chance to disambiguate or even extend their communication – ultimately to help them reach the right recipients instead of getting lost in the cloud.

    1. Michael Brill Sunday, March 17, 2013

      BTW, a very primitive version of this would be hashtags. It’s kind of a grunt when it comes to semantics, but it’s a start. FB interest graph is another step in the right direction.

  7. In my view we will soon talk simply about data; the distinctions between the types will disappear as the big data environment matures.

  8. I’m not sure I share your sense of injustice.
    I don’t think there’s a pervading sense that “structured=good, unstructured=bad” to be confronted.
    Nor do I think there is much to be redeemed in arguing that, actually, text is “highly structured”.

    For me, “structured” is often the dirty word here – suggesting system-imposed constraints on expression, whereas “unstructured” is the more liberated data type, allowing greater freedom of human expression.

    There is a spectrum of data types which runs from the “very precise” to what you might call “free-form”.
    In order of precision:
    1) a primary key – good for identifying exactly one thing only
    2) a boolean flag – representing 2 things e.g. dead or alive
    3) a drop-down field to identify a range of “top-down” classifications
    4) a hashtag – kind of free-form, but a “bottom-up” attempt at imposing classification
    5) “free-text” or video – capable of expressing endless concepts

    Unstructured or free-form content is not a failure to provide structure – it’s usually a deliberate choice to allow this freedom of expression. I don’t feel calling it “unstructured” content is in any way demeaning.

    1. Hi Mark,

      I like the “greater freedom of expression” viewpoint and fully appreciate your view. In my view, the meaning of the word “unstructured” depends on who is using it. For people who, like you, embrace it, it probably is the simplest way (although rich data is pretty nice, IMO, too). Oftentimes, though, I find people use it to mean all the other stuff that we don’t know how to handle very well using our big ol’ database hammer.

      1. As we move away from the trusty database hammer and incorporate more of the unstructured with things like faceted search we create a fundamental usability problem – how do we satisfactorily explain that the beautifully structured results (pie charts, timelines, heatmaps) are all founded on some pretty fuzzy matching underneath (“more like this”, fallible synonyms etc)?
        Database users are used to formal set logic (we either count something or we don’t) while search is more about fuzzy sets (things match to a degree).
        Not seen a good answer to this question yet – or even general acknowledgement that it is a problem.
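One way to see the gap described above is a small Python sketch that counts the same documents two ways: with exact, database-style matching and with a graded similarity score. Token-level Jaccard similarity stands in here for whatever scoring a real search engine uses, and the function names, sample documents and thresholds are hypothetical.

```python
def crisp_match(query, doc):
    """Formal set logic: a document either matches the query or it does not."""
    return query.lower() == doc.lower()


def fuzzy_match(query, doc):
    """Fuzzy membership: Jaccard overlap of word sets, from 0.0 to 1.0."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    if not query_tokens and not doc_tokens:
        return 1.0
    return len(query_tokens & doc_tokens) / len(query_tokens | doc_tokens)


# Hypothetical mini-corpus and query.
docs = ["unstructured data", "rich data", "structured data tables"]
query = "rich data"

# Database-style count: only exact matches are counted.
crisp_count = sum(crisp_match(query, d) for d in docs)

# Search-style scores: every document matches to some degree.
scores = {d: fuzzy_match(query, d) for d in docs}


def fuzzy_count(threshold):
    """Number of documents a chart would report for a given score cutoff."""
    return sum(score >= threshold for score in scores.values())


print(crisp_count)
print(scores)
print(fuzzy_count(0.3), fuzzy_count(0.5))
```

The crisp count is fixed at one, while the fuzzy count moves from two to one as the threshold rises from 0.3 to 0.5, which is exactly the kind of detail a pie chart built on top of it hides.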

  9. Luis Moreno Campos Monday, March 18, 2013

    100% Agree. Saying “unstructured” has been an easy way to say “variety”, but the increasing intelligence in multi-data mining algorithms has made the term almost offensive.

  10. Grayson Daughters Monday, March 18, 2013

    “Authors like Faulkner and Joyce treat language as an art form, yet Joe Forum User can’t, for whatever reason, write a complete sentence.” That made my Quote Du Jour!

    And yeah, “Text Du Jour” may be “unstructured”, but I bet I could trademark it. In other words, if you have a way with words you have a gift, not a database.

