As unstructured data heats up, will you need a license to webcrawl?

Cheap computing and low-cost storage have made big data a big business. But amid the frenzy to gather that data, especially unstructured information scraped from web sites or collected by crawling them, companies might be pushing the boundaries of polite (or ethical) behavior. They may also be stealing valuable IP. So is it stoppable, and could the current solutions lead to the demise of the open web?

The web is full of all kinds of data, some in data-friendly formats such as CSV files and some in hard-to-parse text or pricing formats that companies must clean before loading into a database. Companies such as Infochimps or Microsoft’s Windows Azure Marketplace are trying to take some of the messier files and offer them to people. Other companies, such as Factual and Datafiniti, are building businesses based on scraping the content from sites and then creating customized databases for clients.

And scraping is complex. The act of indexing a web page and then pulling the data from it can be a beneficial action, such as when Google indexes your web site, but not everyone is a good scraper. When done without regard to the host site, it can suck up a site’s bandwidth or even appropriate its intellectual property. Some argue that the behavior is problematic, while others argue that preventing it hurts consumers, society and maybe even the open web. So should one have a license to webcrawl?

If everyone scrapes, sites suffer?

Datafiniti gets its data from the webscraping efforts of its parent company 80 Legs. Shion Deysarkar, the CEO of both companies, explained that access to all kinds of data via scraping web sites is becoming more of a competitive differentiator, but added that getting it isn’t the most challenging thing in the world. For example, 80 Legs offers a web-crawling service, but anyone could build their own now that the cost of crawling the web has gone down significantly.

But pinging a web site to grab its information exacts a toll on the site, and an overzealous crawler or hundreds of sites gathering data at any one time could create problems for the crawled site. A bunch of rapid web crawls can look similar to a denial of service attack, simply because the site has to respond to so many requests.
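
One common way a crawler can avoid looking like a denial of service attack is to wait a minimum interval between requests to the same host. The sketch below is only an illustration of that idea, not any company’s actual crawler: the class name PoliteThrottle and the two-second default are assumptions made for the example.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; the 2-second default is an
    assumption, not a figure from any site's published policy.
    """

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Wait out whatever remains of the per-host interval.
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

A crawler would call wait(url) before each fetch. Requests to different hosts do not slow one another down, but repeated hits on one site are spaced out.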

It’s an issue the founders of BlackLocus considered as they built their company. BlackLocus scrapes pricing data from web sites in order to deliver its online price-matching service for retailers. Its IP is in the ability to match products across different sites quickly and easily, but to get that information it first has to crawl the web. BlackLocus does the crawling itself, but COO Eric Rob Taylor says it tries to follow the rules set forth by the sites it crawls.
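
Sites usually publish those rules in a robots.txt file. As a rough sketch of what following them can look like, the example below uses Python’s standard urllib.robotparser module. The robots.txt contents, the host shop.example.com and the agent name example-price-bot are all made up for illustration, and Crawl-delay is a non-standard directive that not every site or crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an online retailer (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Crawl-delay: 5
"""

AGENT = "example-price-bot"  # assumed crawler name, not a real product


def build_parser(robots_text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Product pages are not disallowed, so the crawler may fetch them.
can_read_product = parser.can_fetch(AGENT, "http://shop.example.com/products/123")

# Anything under /cart/ is off limits to every user agent.
can_read_cart = parser.can_fetch(AGENT, "http://shop.example.com/cart/add")

# Seconds the site asks crawlers to wait between requests, if stated.
delay = parser.crawl_delay(AGENT)
```

In a real crawler the parser would be loaded from the site’s own /robots.txt, and the crawl delay would feed into whatever rate limiting the crawler already does.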

However, I’ve spoken with less-ethical scrapers and crawlers who will resort to tricks such as trying to impersonate a Google crawler to try to rapidly access a site either to copy its content, as happens often to news organizations on the web, or to just grab information for later packaging.
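
A site can catch the Google impersonation trick with a two-step DNS check: look up the host name for the visitor’s IP address, confirm it ends in a Google-owned domain, then resolve that name back and confirm it returns the same IP. The sketch below assumes the googlebot.com and google.com suffixes and uses a hypothetical function name; the lookup functions are parameters so the logic can be exercised without a network.

```python
import socket

# Assumed host name suffixes for genuine Google crawlers.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True only if `ip` passes reverse and forward DNS checks."""
    try:
        # Step 1: reverse DNS, IP address -> host name.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        # Step 2: forward DNS, host name -> list of IP addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The claim holds only if the name resolves back to the same IP.
    return ip in addresses
```

A user-agent string alone proves nothing because any client can set it, which is why the check relies on DNS records the impersonator does not control.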

On the flip side I have met a startup that attempts to prohibit other crawlers or at least rein them in. Sean Harmer, VP of business development at, a startup that offers publishers a service that prevents scrapers from stealing their content, confirmed that such tricks are on the rise.

But shouldn’t data be free?

But what about the flip side of the debate? Not all scrapers and crawlers are out to defraud publishers of ad revenue — some are foisting their robots on the web in a legitimate way to offer consumers a service as BlackLocus does, while others use it for academic research. Even journalists scrape data from web sites for their stories.

Still, there are efforts to halt even these practices. For example, any time you have to click to put something in your shopping basket to see the price while you are shopping at an online retailer, you’re jumping through that hoop as a means to keep a scraper from seeing that price. And when it comes to researching and accessing certain forms of unstructured data, such as Facebook or Twitter feeds, the default is to pay for access, with no scraping allowed.

A burgeoning number of businesses such as Gnip are paying companies, including Twitter, to access their data firehoses, in hopes of selling services around their ability to access that raw data. Facebook keeps its data close to the vest, but has apparently started making use of it via partnership deals. This frustrates Kevin Burton, the CEO of Spinn3r, a company that makes its money “indexing the blogosphere.”

Burton, whose company also provides users access to social media data, believes about 60 percent of the web has been “walled off” from services such as his, up from about 10 percent when he formed Spinn3r in 2005. He says the closed-off nature of the social web today, such as not being able to crawl Twitter or Amazon’s user reviews, places all kinds of constraints on innovation and legitimate research.

“The market has shifted. When we started, the web was very open, but you would never be able to start a Google nowadays,” he said. “You would have to license a large percentage of your data from Twitter. All sorts of people are calling it the social web, but in the data it’s anti social.”

The data is the asset and the service is the value.

Unfortunately for Burton, the web isn’t going to become more open as time goes on. Most businesses recognize that their value is in owning the end user and the end user’s data, whether or not the end user herself recognizes that. Twitter’s massive valuation isn’t based on its platform; it’s based on the information it has about the tweets hitting its servers every second.

Even a company like Yelp, which went public based on the content provided by users — content that it vigorously defended from Google’s indexing — is taking advantage of user-generated data to enrich itself. Protecting that asset from becoming scraped, commoditized and turned into revenue for others seems like a no-brainer.

Yet, if the web does end up divided into a handful of data fiefdoms connected by APIs, the open web and some of the freedom that has allowed services to flourish there will diminish in importance. The move from freedom to implementing some kind of legal thicket is an almost-certain stop in any industry on its way to maturity (look at the mobile phone industry and the current patent fights), but like growing up in general, something precious will have been lost.