Weekly Update

When Does Data Become Complex?

Last week, GigaOM’s Structure Big Data event came to New York City, and attendees dug into the challenges of working with data at scale. For those unable to make it to New York, Brett Sheppard’s latest report, Putting Big Data to Work: Opportunities for Enterprises, provides an introduction to the topic and some valuable pointers on the next generation of opportunities and challenges in big data.

In particular, Sheppard suggests big data typically exhibits one or more of three characteristics: volume, complexity or speed. (It should be noted that the industry as a whole lacks definitive metrics in all three areas; questions such as how big is big enough, how structured or inter-connected data must be to count as complex, and how near to real time speed must take us remain largely unanswered.) Volume and speed are, for the most part, reasonably uncontroversial criteria, but complexity is a concept that would benefit from further exploration.

Much big data is obviously complex, with large numbers of data elements or intricate inter-dependencies between different parts of a data set. But some very valuable data streams are actually very simple. Sheppard, for example, quotes UC Berkeley’s Joe Hellerstein, who discusses “data factories” that read product barcodes, GPS locations and RFID tags. These, along with online clickstreams, tweets and the constant flow of data from smart meters, Electronic Point of Sale (EPoS) devices and urban traffic management systems, are all extremely simple in structure. A pair of geographic coordinates linked to a textual description of a place, for example, is not complex by any measure. Nor is a single stream of numbers representing domestic power utilization every hour, every minute or even every second. Yet those simple coordinates, locating individual houses and businesses, combined with the separate streams of simple meter data for power consumption at each location, become complex to integrate, manipulate, visualize and manage in a timely fashion.
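To make that point concrete, here is a minimal sketch that joins two such feeds on a shared meter identifier: a static table of coordinates and a stream of consumption readings. The field names and values are invented for illustration; the point is only that each input is trivially simple, while the combination already demands joining, timestamp handling and aggregation.

```python
from collections import defaultdict
from datetime import datetime

# Two individually simple feeds (hypothetical field names, for illustration only):
# 1. Static location records: meter_id -> (latitude, longitude, description)
locations = {
    "meter-0001": (40.7128, -74.0060, "Apartment building, Lower Manhattan"),
    "meter-0002": (40.7306, -73.9866, "Corner grocery, East Village"),
}

# 2. A stream of per-minute consumption readings: (meter_id, ISO timestamp, kWh)
readings = [
    ("meter-0001", "2011-03-28T09:00:00", 0.42),
    ("meter-0002", "2011-03-28T09:00:00", 1.10),
    ("meter-0001", "2011-03-28T09:01:00", 0.45),
]

# Each feed is trivial on its own; the work starts when they are combined.
# Group readings by location so they can be mapped, binned and compared.
by_location = defaultdict(list)
for meter_id, ts, kwh in readings:
    lat, lon, label = locations[meter_id]   # join on the shared key
    when = datetime.fromisoformat(ts)       # normalize timestamps
    by_location[(lat, lon, label)].append((when, kwh))

for (lat, lon, label), series in by_location.items():
    total = sum(kwh for _, kwh in series)
    print(f"{label} ({lat:.4f}, {lon:.4f}): {total:.2f} kWh over {len(series)} readings")
```

Even at this toy scale, the awkward parts are visible: keys that may fail to match, timestamps that need normalizing, and series that must be aggregated before anything can be drawn on a map. Multiply by millions of meters arriving every second and the complexity lies in the combination, not in any single record.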

It’s in these combinations and recombinations of data sets (whether individually simple or not) that some of the greatest opportunities for new social, scientific and business insights lie. Relatively well-understood and essentially routine technological advances in storage, transfer and querying will ensure that we continue to see ever-greater volumes of data processed with ever-more speed. Effectively handling complexity requires advances that are certainly possible, but far less straightforward, since they combine hardware, software and a wide range of human skills.

Visualization is one way of reducing complexity. Take, for instance, Nathan Yau’s FlowingData site, which explores how designers, scientists and others interpret data through pictures, including last week’s visualization of Los Angeles traffic. Sheppard, meanwhile, uses LinkedIn’s InMaps as an example of visualization, and it is one that I and others have also used to demonstrate how good visualization can reduce complexity and let patterns and trends emerge: hundreds or thousands of contacts are grouped and displayed in such a way that one’s real network of acquaintances can be seen, making it easy to identify the areas in which new business opportunities might most readily be found.
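InMaps itself works on the full graph of who knows whom, but even a crude sketch shows why grouping helps: collapsing a flat contact list by one shared attribute turns hundreds of indistinguishable rows into a handful of clusters that a visualization can then color and lay out. The records and field names below are invented purely for illustration; this is not LinkedIn’s method, just the grouping step in miniature.

```python
from collections import Counter

# A flat contact list (hypothetical records): hard to read as raw rows.
contacts = [
    {"name": "A. Rivera", "company": "Acme Analytics"},
    {"name": "B. Chen",   "company": "Acme Analytics"},
    {"name": "C. Okafor", "company": "Metro Utility"},
    {"name": "D. Silva",  "company": "Metro Utility"},
    {"name": "E. Jones",  "company": "Riverside Press"},
]

# Grouping by a shared attribute collapses many rows into a few clusters,
# which is the raw material a network visualization would color and position.
clusters = Counter(c["company"] for c in contacts)

for company, size in clusters.most_common():
    print(f"{company}: {size} contacts")
```

The grouping is the easy part; the judgment about which attribute to group on, and what the resulting picture does or does not show, is where the human skill comes in.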

Visualization, though, is as dangerous as it is powerful. Compelling pictures are inherently believable, but they may mask poor data, shoddy analysis and deliberate attempts at distortion. Elections are one area in which the same data can be presented to support different points of view; during the 2008 U.S. Presidential election, different visualizations of the same results led to strikingly different interpretations of voter behavior. Alongside the tools for analysis and display, we need the skills to understand, interpret and explain data. Data communicators may end up being far more important than the technologies upon which they will depend.

Question of the Week

How is complex data defined, and how can we best manage it?