Datafiniti re-architects its web-crawler engine and moves to the cloud

Datafiniti, a startup that wants to apply search to big data applications, has rejiggered the technology behind its 80legs web crawler and moved to the cloud. With its new system, CEO and founder Shion Deysarkar claims that a user can scan every single homepage on the internet in roughly a week and then load that data into whatever database he may be using.

Although there’s a lot of information on the internet that an organization could usefully harness (Yelp reviews, websites with customer feedback and so on), a company without a lot of manpower will find it difficult to collect all of that data and make it presentable.

That’s where 80legs comes in. Using a web crawler (a fancy word for an automated script), a user can automate the time-consuming task of scanning thousands of websites to find whatever it is that she wants to discover. After gathering that content, Datafiniti can dump the data (in the form of JSON logs) into any database the user prefers, giving her easy access to loads of web content.
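As a rough illustration of that last step, here is a minimal Python sketch of loading JSON log lines into a database. It assumes one JSON object per line and uses SQLite as a stand-in for "any database"; the field names (`url`, `title`), the `pages` table and the `load_json_log` helper are hypothetical and not taken from 80legs' actual output format.

```python
import json
import sqlite3

# Hypothetical crawl output: one JSON object per line. The field names
# are assumptions for illustration, not the real 80legs schema.
SAMPLE_LOG = """\
{"url": "http://example.com/", "title": "Example Domain"}
{"url": "http://example.org/", "title": "Example Org"}
"""


def load_json_log(lines, conn):
    """Insert each JSON record into a 'pages' table; return the number loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (record["url"], record.get("title")),
        )
        count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
loaded = load_json_log(SAMPLE_LOG.splitlines(), conn)
```

Once loaded, the crawl results can be queried like any other table, for example `SELECT title FROM pages WHERE url = ?`.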

Datafiniti used to take “a local data center approach” to the infrastructure behind its 80legs system, but found that approach difficult to scale. The startup also wanted to harness newer cloud technology like Redis, the open-source in-memory key-value database, which Pinterest employs to keep track of its users and their followers.

The startup chose Amazon Web Services as its cloud provider, and the new system also connects with volunteer computers around the globe. Deysarkar said there are now 70,000 machines available for Datafiniti to tap into, so scaling is no longer an issue.

80legs Diagram

“All of these machines in our system are basically sitting there,” said Deysarkar. “They spend their lives asking for work from the master distributor.”
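The pull model Deysarkar describes, in which idle machines ask a central distributor for work, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for remote machines, a shared in-process queue stands in for the master distributor, and the `run_crawl` function is hypothetical rather than anything from Datafiniti's code. Unlike the real machines, these workers exit once the queue is empty instead of waiting for more jobs.

```python
import queue
import threading


def run_crawl(urls, num_workers=4):
    """Let workers pull URLs from a shared queue until none are left."""
    jobs = queue.Queue()  # stands in for the master distributor
    for url in urls:
        jobs.put(url)

    results = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                url = jobs.get_nowait()  # ask the distributor for work
            except queue.Empty:
                return  # nothing left to do, so this worker stops
            # A real worker would fetch and parse the page here.
            with lock:
                results.append((worker_id, url))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


urls = [f"http://example.com/page{i}" for i in range(10)]
done = run_crawl(urls)
```

Because workers request jobs rather than having jobs pushed to them, adding machines does not require the distributor to track which ones are free.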

Datafiniti has also revamped its interface, which now lets users sign in and use a web form to generate a crawl or, if they know JavaScript, write a custom script to search for a particular bit of information.

With the new 80legs, Deysarkar said, PayPal has been able to scan millions of merchant websites to learn whether those sites may be fraudulent. With the data gleaned from the web crawlers, PayPal hopes to prevent scammers from taking advantage of its customers.

Since the new 80legs rolled out last February, Deysarkar said, Datafiniti has doubled its customer base, which includes MailChimp, Cox Digital Solutions and Healthcare MDM.

Post and thumbnail images courtesy of Shutterstock user pattara puttiwong.
