Can we please stop saying “unstructured” data?


Text. English. Chinese. Multi-structured. Language. Fuzzy. Logs. Hard-to-parse. Rich. Semi-structured. Whatever you want to call data that doesn’t fit neatly into tidy little rows and columns these days, can we please stop calling it “unstructured”? I feel a bit like Don Quixote in even pursuing this topic, but after 15-plus years of (mostly) working on search and text processing (including writing a book on the subject) I can’t help but feel that it’s time for the word “unstructured” to be retired and for us to find a better term to describe all of this stuff spewing from us and our computational creations.

Why all the (somewhat tongue-in-cheek) vitriol towards such a simple word? When I’m feeling cynical, I think that, in the early days of databases, someone coined “unstructured” as a derogatory term to mean “all the stuff a database isn’t good at working on.” If “structured” is good, then “un”-structured must be bad, right? The problem is that working with text is one of the defining computational challenges of our time. We need our best and brightest working on it; and not just so we can better target ads to consumers. It’s too full of promise to describe with such a diminutive word as “unstructured.” Numerical data? Child’s play! Text? Now there’s a real challenge.

Text is easily one of the most highly structured data types we face, filled with misspellings, misdirection, flowery language, ambiguity and implicit knowledge. Text is so often misunderstood that researchers in the field even have a metric (inter-annotator agreement) that tracks how often two people examining the same piece of text agree on the answer to some question on the text. Authors like Faulkner and Joyce treat language as an art form, yet Joe Forum User can’t, for whatever reason, write a complete sentence. I hate them and love them all, all at the same time. How good do you think a computer can be at parsing a single sentence that spans multiple pages, much less try to make sense of a sentence that doesn’t even follow basic grammar rules?
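To make the inter-annotator agreement idea concrete, here is a minimal sketch in Python. It assumes Cohen's kappa as the agreement statistic, which is one common choice and not necessarily the one any given research group uses. The function name `cohens_kappa` and the sentiment labels are hypothetical, made up purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items where both chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labeled independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels from two people reading the same six snippets.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two annotators agree on four of six snippets, but half of that agreement would be expected by chance, so kappa comes out at roughly 0.33. Correcting for chance is the reason to prefer a statistic like this over raw percentage agreement.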

Sure, we’ve made great progress, especially in recent years, in dealing with rich data like text, and big data and deep learning techniques hold even more promise for unlocking some of the mystery. We can now detect the end of a sentence with a high degree of accuracy, find sentiment in tweets and locate the mentions of people on a page, just like your average fifth grader!

Yet, despite all of these advances, we need at least an order of magnitude advance (if not two or more) in our ability to process rich data across a variety of domains for us to truly harness the opportunity this data presents in moving civilization forward. I hope you’ll forgive me if the word “unstructured” leaves me feeling a bit empty inside when I think about that opportunity and the lack of inspiration it provides to potential contributors. As for me, I’ll start by calling it “rich data” from here on out, windmills be damned.

Grant Ingersoll, who will be speaking at Structure:Data on March 20-21, is the CTO and cofounder of LucidWorks. He also coauthored Taming Text, cofounded Apache Mahout and is a long-standing committer on the Apache Lucene and Solr open source projects. He’s engineered a variety of search, question-answering and natural-language processing applications. You can follow Grant on Twitter @gsingers.

 

24 Comments

Grant

Mark Harwood,

(for some reason, I can’t reply inline to your last reply)

Excellent points. I think you are starting to already see a new wave of education happening around it with the rise of machine learning and search via dominant companies like Google, FB, etc. as well as the big data movement such that people are being forced to think in probabilistic terms more and more just to stay afloat in the sea of data.

Jonathan

I agree. The term “Unstructured” is inappropriate when you think about the data that is likely to be stored. However, when talking about the worst case scenarios of what *could* be stored the term fits exactly. In those fields it is overwhelmingly more often than not that the data needs to be handled as if there were no structure at all. How many of these “Unstructured” fields are filtered by some sort of imposed NLP parsed data type after all? So we use that term to describe how we need to handle the field, not what the field may contain. However, when our free text data eventually becomes forced to be human readable then we will need a new term. I don’t mind “Rich Data” but the problem there is that there are many levels of more or less structured data that could be considered rich. Today, instead of using the term “Unstructured” I prefer calling it “Unrestricted Data”. That describes both the way we should handle it and the potential content.

Craig

I agree with the headline, too many people do indeed refer to data in XML, JSON and the like as unstructured, but they are just as structured as the relational tuple model.

ksankar

Multi-structured is a term that makes a lot of sense because there are many interpretations depending on the context. I am almost sure ‘multi-structure’ will replace the ‘structure’ conference in a few years ;o)

Grant

I sometimes use multi-structured as well, but it doesn’t exactly roll off the tongue.

Joe

Structure is semantics, and all data has semantics in a given context, thus all data is structured.

If you don’t know what your data means and how to interpret it, game over. The terms are nice for marketing, but little more.

Grayson Daughters

“Authors like Faulkner and Joyce treat language as an art form, yet Joe Forum User can’t, for whatever reason, write a complete sentence.” That made my Quote Du Jour!

And yeah, “Text Du Jour” may be “unstructured”, but I bet I could trademark it. In other words, if you have a way with words you have a gift, not a data base.

Luis Moreno Campos

100% Agree. Saying “unstructured” has been an easy way to say “variety”, but the increasing intelligence in multi-data mining algorithms has made the term almost offensive.

Mark Harwood

I’m not sure I share your sense of injustice.
I don’t think there’s a pervading sense that “structured=good, unstructured=bad” to be confronted.
Nor do I think there is much to be redeemed in arguing that, actually, text is “highly structured”.

For me, “structured” is often the dirty word here – suggesting system-imposed constraints to expression whereas “unstructured” is the more liberated data type, allowing greater freedom of human expression.

There is a spectrum of data types which runs from the “very precise” to what you might call “free-form”.
In order of precision:
1) a primary key – good for identifying exactly one thing only
2) a boolean flag – representing 2 things e.g. dead or alive
3) a drop-down field to identify a range of “top-down” classifications
4) a hashtag – kind of free-form, but a “bottom-up” attempt at imposing classification
5) “free-text” or video – capable of expressing endless concepts

Unstructured or free-form content is not a failure to provide structure – it’s usually a deliberate choice to allow this freedom of expression. I don’t feel calling it “unstructured” content is in any way demeaning.

Grant

Hi Mark,

I like the “greater freedom of expression” viewpoint and fully appreciate your view. In my view the meaning of the word “unstructured” depends on who is using it. For people who, like you, embrace it, it probably is the simplest way (although rich data is pretty nice, IMO, too). Oftentimes, though, I find people use it to mean all the other stuff that we don’t know how to handle very well using our big ol’ database hammer.

Mark Harwood

As we move away from the trusty database hammer and incorporate more of the unstructured with things like faceted search we create a fundamental usability problem – how do we satisfactorily explain that the beautifully structured results (pie charts, timelines, heatmaps) are all founded on some pretty fuzzy matching underneath (“more like this”, fallible synonyms etc)?
Database users are used to formal set logic (we either count something or we don’t) while search is more about fuzzy sets (things match to a degree).
Not seen a good answer to this question yet – or even general acknowledgement that it is a problem.

Nico

In my view we will soon just talk about data; the distinctions between the types will disappear as the big data environment matures.

Michael Brill

I guess what I’m suggesting is that we may be able to avoid the harder parts of NLP by helping people communicate in a more structured way in the first place. It’s not so much about database storage or query processing per se, but more about facilitating the elicitation and capture of explicit intent… ideally as close as possible to the thoughts in our heads.

The argument is that if we can capture intent in a semantically unambiguous way as the user is expressing themselves then life gets a lot simpler and interesting. If we wait until the user expresses themselves, unaided by technology, and then try to figure out what in the world they’re saying later, it’s obviously much harder.

I guess this is part of a larger gripe that we have made it so easy to create low signal data that we now have to invest *massively* in a downstream NLP/big data/machine learning layer to try to get some value out of the mess of data that’s being created. And, as you mentioned, we’re very far from being able to do that beyond a 5th grade level.

I realize this is a bit off-topic from your original post, but I wonder what you think about pushing processing back to the point of data creation where the user has a chance to disambiguate or even extend their communication – ultimately to help them reach the right recipients instead of getting lost in the cloud.

Michael Brill

BTW, a very primitive version of this would be hashtags. It’s kind of a grunt when it comes to semantics, but it’s a start. FB interest graph is another step in the right direction.

RJP

I agree that the term is not perfect. Moreover neither is the term structured data. It isn’t only the neat structure of relational (table-based) data that is significant. It also connotes a bias towards a particular use case for ‘structured’ data – record keeping and transaction processing.

When folk refer to the difficulty of managing unstructured data they are often indicating – by way of contrast – information that is resistant to the assembly-line metaphor to which transactional data is so readily subjected by being structured into tables.

And yes the thorniest version of unstructured data is text – but it also includes non-text data like image, video and audio.

A deeper issue is the binary nature of the language as well – hierarchical data in directories and XML applications of various semantic flavours are often edge-cases that are somewhat easier to process than human-produced text but still substantially trickier than neatly formed two-dimensional records.

But in the end I don’t think it really matters. As much as it pains me from time to time, it isn’t the word that matters so much as its meaning and what people intend and understand when it is used.

In my professional life as an Enterprise Architect, and earlier as an Information Architect, I have generally found that people understand exactly what is meant by ‘unstructured data’ – what it denotes and connotes and why it matters to understand it.

That the word is not causing confusion and poor thinking is really the test of it. And I really haven’t seen any evidence of confused ignorance or harmful misunderstanding in my work.

Mark Miller

When you say that “the test of it is that it doesn’t cause confusion and poor thinking”, that seems similar to me writing an article called “Can we please stop saying dude!”, and then someone making a comment of “Well, the word dude doesn’t seem to be causing confusion and poor thinking”. Okay, bad example, I like the word dude.

But in a similar vein to your statement, I would argue that the author doesn’t really feel like Don Quixote, because Don Quixote is a fictional character that I don’t believe anyone actually feels like, he’s just an enjoyable and common literary reference.

Michael Brill

Not sure I understand why “unstructured” is inaccurate. Maybe “poorly-structured” or “structured with a bewildering array of interpretations?” Obviously there are plenty of scenarios where natural language is the only reasonable content format, but for most of commerce it’s nearly useless.

Maybe instead of throwing so much money at something that is so complex, we put some of that towards making it easier for people to translate their intent into structured content. There’s got to be something in-between typing in natural language and entering data into a form.

Grant

Michael,

While it’s somewhat tongue in cheek, I guess my argument would be that language and other rich data are usually highly structured, which is why they don’t fit neatly into traditional database approaches. I do like your “bewildering array” view, but that doesn’t exactly roll off the tongue.

FWIW, most of the work these days on handling rich data is to put it into a format that is more consumable by computers (e.g. triple stores).

J. Andrew Rogers

Similar to your cynical definition above, I have asserted for many years that “unstructured” means “structured data my database handles poorly”. This is why there are so many conflicting definitions about what is and is not unstructured data.

Data has structure by definition. If you look at the internals of a database designed for “unstructured” data, no matter how you choose to define it, the representation of that data is highly structured. People are conflating optimization choices made by their data platforms with the intrinsic properties of the data they put in their platforms.

Tim Hodgson

I find it ironic that on this page for the article “Can we please stop saying “unstructured” data?” there is an advert for “Gigaom Structure Data” (conference).

Peter

Especially in certain information domains, like health care and electronic medical records, “unstructured” does seem like the best word to describe what’s happening or not happening. Progress notes written into a health record are rich data, as you propose, with importance and meaning; but unencoded, it’s low in value and leads to interpretation, duplication in assessments and diagnostics, and other inefficiency. So we could call it rich, unstructured, unencoded, or more optimistically, harvestable.
