What do tech luminaries Andy Bechtolsheim, Sky Dayton, Joi Ito and Brad Garlinghouse have in common? They’re all backing Diffbot, the startup that’s building visual robot technology that parses web site content to make it easier to reuse.
Diffbot, the first company funded out of Stanford’s StartX accelerator program, makes its APIs available to users wanting to extract the components of web pages in a way that makes that content reusable and easier to mash up into apps, Diffbot founder and CEO Michael Tung told me this week. The company has identified 18 web page types. Its APIs handle two of them so far, front page and article, and it is building support for the others. GigaOM’s Ryan Kim covered the launch of Diffbot’s first APIs last fall.
Unlocking web content
“We’ve got this great thing, the Internet, full of web pages. The problem is they’re made for human beings to read and understand, particularly people in front of a browser … but that’s inaccessible to software applications, hundreds of thousands of apps like Siri(s aapl), that only work with a handful of APIs that they’re hard-coded for,” Tung said.
“Yelp is great for searching places, Flipboard is great for discovering news. Our main insight is the web can be broken down into 18 types of pages: news, people, places, photos, etc., and our goal is to teach a machine to understand all that,” Tung said. The company is working on more APIs to bring all that content into its reach.
At a recent hackathon, one participant built a web reader for his blind father using Diffbot’s APIs. “For a blind person, using the web is miserable. [Today’s] screen readers read all the text starting at the top, including the nav bar and scroll down. Diffbot analyses that page, determines the title, author, text and can read it in a more natural way,” Tung said.
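To make the reader idea concrete, here is a minimal sketch in Python of how an app might submit a page for extraction and then arrange the result for a screen reader. It is an illustration under assumptions, not the hackathon project or Diffbot's documented API: the endpoint URL, the `token` and `url` parameters, and the `title`, `author` and `text` field names are hypothetical stand-ins based only on the fields Tung mentions.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; the real URL and
# parameters should come from Diffbot's own documentation.
ARTICLE_ENDPOINT = "https://api.example.com/article"


def build_request_url(page_url: str, token: str) -> str:
    """Build the GET URL that submits one page for extraction."""
    query = urlencode({"token": token, "url": page_url})
    return ARTICLE_ENDPOINT + "?" + query


def reading_script(article: dict) -> str:
    """Put the extracted fields in the order a person would read them.

    `article` is assumed to be the parsed JSON response, with optional
    "title", "author" and "text" keys (assumed names). Navigation bars
    and other page chrome never appear because only these fields are used.
    """
    parts = []
    title = (article.get("title") or "").strip()
    author = (article.get("author") or "").strip()
    text = (article.get("text") or "").strip()
    if title:
        parts.append(title)
    if author:
        parts.append("By " + author)
    if text:
        parts.append(text)
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up sample response; no network call is performed here.
    sample = {"title": "Example headline", "author": "A. Writer", "text": "Body copy."}
    print(build_request_url("http://example.com/story", "YOUR_TOKEN"))
    print(reading_script(sample))
```

The point of the sketch is the ordering step: once a page has been reduced to named fields, the app decides what gets read and in what sequence, rather than starting at the top of the markup.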
Diffbot can look at web pages created for human beings and analyze them visually so the app can treat the web as a big database. It is now processing more than 100 million API calls monthly for software developers using the service for Web site mobilization, tag generation and other functions.
Bechtolsheim, a co-founder of Sun Microsystems(s orcl); Sky Dayton, founder of Earthlink and Boingo; Joi Ito, director of the MIT Media Lab; and Brad Garlinghouse, a former Yahoo(s yhoo) exec and now CEO of YouSendIt (see disclosure), all invested in this $2 million seed round, as did Jonathan Heiliger, the Facebook(s fb) vet now at North Bridge Venture Capital Partners.
The company is using a freemium model, encouraging developers and others to submit URLs to the system for content extraction. The service is free up to a certain number of API calls. “We want to apply Diffbot to the entire web, but it’s expensive to build a web crawler; we only analyze the URLs that people send us,” Tung said.
John Davi, Diffbot’s VP of product and a Cisco(s csco) veteran, said the submissions in themselves will be valuable. “Our long-term vision is to avail ourselves of the cream of the content that comes out. We’ll be able to see the important pages — the articles and recipes that people submit — and we think there’s value in knowing that.”
Disclosure: YouSendIt is backed by Alloy Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media.