Hortonworks is making progress on its mission (via a project called Stinger) to speed up SQL-like queries in Hadoop using Apache Hive. New features in the latest version of Hortonworks’ Hadoop distribution have made Hive tens of times faster in some instances, and the company is aiming for 100x improvements soon. Hortonworks has also added support for new types of SQL data. Competitor Cloudera opted to forgo Hive in favor of its own Impala technology for interactive queries.
eBay has acquired Seattle-based price-prediction startup Decide.com, and the service will shut down on Sept. 30. The entire team will head over to eBay to help the e-commerce giant improve its experience through predictive modeling. The entire team except co-founder and CTO Oren Etzioni, that is: the University of Washington computer science professor, Madrona Venture Group partner and former Farecast founder is heading up Paul Allen’s new Allen Institute for Artificial Intelligence.
Google has added another new capability to its BigQuery analytics service. This one lets users derive correlation values between similar data points, something Google highlighted using sensor data from its recent I/O conference. Read more »
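For readers who want to see what a correlation value measures, here is a minimal sketch in plain Python. It assumes the common Pearson coefficient and uses made-up sensor readings; it is not BigQuery syntax and is not drawn from Google's I/O data. The function name `pearson` and both sample lists are hypothetical.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to each variance).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings: humidity falls steadily as temperature rises.
temperature = [20.0, 21.0, 22.0, 23.0, 24.0]
humidity = [50.0, 48.0, 46.0, 44.0, 42.0]
r = pearson(temperature, humidity)
print(round(r, 3))
```

A value near 1 means the two series rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.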
Location-data startup Placed has been tracking the businesses that consumers visit for about a year, and now it’s tying that data to their TV habits and interests. Where do “The Biggest Loser” viewers hang out? Bakeries. Read more »
Big data startup HStreaming is now part of Swiss advertising firm Adello Group. HStreaming had standout technology by all accounts, but the business never scaled enough to survive in a tough market. Read more »
There’s some business context here around GoDaddy’s new focus on building products to help small businesses become relevant in a digital world, but there’s also video of Jean-Claude Van Damme playing the pan pipes. Read more »
This post from Slate is spot on, in my humble opinion. It might be overkill, but I can say the same about my own posting habits, and did last year. (I can’t say the same about my wife, though …) There are plenty of reasons to not want a digital profile you didn’t ask for, and advances in behavioral analysis and facial recognition are only making them worse.
Marathon is a new framework that turns Mesos — a favorite of Twitter — into a more dynamic tool for running different applications on a single set of machines. Marathon comes from a startup called Mesosphere, founded by two former Airbnb engineers who know Mesos cold. Read more »
SwiftKey, a London-based startup that sells a popular “smart” keyboard for Android devices, has closed a $17.5 million series B led by Index Ventures. The company plans to spend the money on research to “fuel further innovation in the fields of Natural Language Processing and Machine Learning,” among other things, according to a press release. That’s probably not a bad idea given Google’s vested interest in keyboard dominance and focus on cutting-edge text analysis.
Twitter has open sourced a “streaming MapReduce” system called Summingbird that makes Hadoop and Storm play nicer together so applications that require both batch and stream processing can do their jobs with as little complexity as possible. Read more »
Former VMware CTO Steve Herrod joined General Catalyst Partners in January, and his first investment as a venture capitalist is a big one — $25 million in cloud backup service Datto. DealBook has a good writeup of Datto’s story, but the other angle is what the deal says about Herrod’s investment strategy and about GCP’s push into enterprise software.
Ex-NASA CIO and CTO, and current Nebula co-founder and CEO, Chris Kemp came on the Structure Show podcast this week to talk about everything from upgrading NASA’s infrastructure to commercializing the OpenStack software he helped create while there. Here are some highlights. Read more »
Facebook is hosting a Kaggle competition in order to identify candidates for a data scientist position. Résumés are so passé when you can just have applicants prove their skills first. Read more »
A provocative — and thoroughly researched — post from IEEE Spectrum about the shortage of workers with science, technology, engineering and math skills. I’m not skeptical enough to think it’s all manufactured concern so employers can keep salaries low, but I’ve read enough about the push for more immigrant visas for tech workers to know there’s something there.
Researchers have released a tool that lets anyone track the whereabouts of Twitter and Instagram users who allow geotagging of their posts. They want social media users to be aware that geotagging exists and what kind of information it provides. Read more »
I’d argue this is a prime example of when metadata is used correctly. If the other nearly 150,000 phone numbers were never investigated and the records were deleted once the feds found their guys, any invasion of privacy is only theoretical. There’s a big difference between this and GPS-tracking, or what the NSA is doing.
A London-based startup called import.io has built a service that lets users take information from websites and turn it into structured data that can populate a spreadsheet or feed an application via API. And it doesn’t require any coding. Read more »
File-sharing service Hotfile was found guilty of copyright infringement in a U.S. federal court case decided on Wednesday. But just because Hotfile appears guilty, that doesn’t mean cyberlockers are inherently evil, regardless of what the MPAA says. Read more »
Hortonworks has released a set of icons for illustrating the roles of various Hadoop-ecosystem components in flow charts and other architectural diagrams. Earth-shattering? No. Helpful if you’re stuck trying to build a PowerPoint slide about your big data environment? Probably. Read more »
LinkedIn’s new University Pages are a case study in how to build a big data application. Ideas are great and pretty web design is great, but you also need people who can find and format the data, and the systems in place to make everything work. Read more »
Couchbase, a startup selling a NoSQL database of the same name, has raised a $25 million series D round. Adams Street Partners led the round and was joined by existing investors Accel Partners, Mayfield Fund, North Bridge Venture Partners and Ignition Partners. Couchbase doesn’t have the huge user base of MongoDB or the edginess of HBase, but it does have some big-name users (including Orbitz) and the company claims sales jumped 400 percent in the last year.
How often does the U.S. government request data from U.S. web properties? A lot. Here are eight charts showing data from Facebook, Google, Microsoft and Twitter about how many requests they get from across the globe. Read more »
MongoDB creator 10gen has changed its name to MongoDB, Inc. It’s probably not a bad idea to align the company’s name with its sole product, but it will take a little getting used to. Read more »
Violin Memory has filed for a $173 million initial public offering, although it did so without much of the hype traditionally associated with Violin news. The company is on pace for $100 million in revenue this year, but it’s now part of a crowded flash market. Read more »
Hadoop-based analytics startup Tresata last week open sourced a set of machine learning libraries built on Scalding and designed to run in Hadoop and make use of the Apache Mahout project. Tresata is calling the project Ganita, and has also written a couple of explanatory blog posts about it, including how to do k-means clustering. The barriers to doing good work on big data just keep getting lower.
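As a rough illustration of the k-means clustering mentioned above, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not Ganita's, Scalding's or Mahout's implementation, and it does not run on Hadoop; the function name `kmeans`, the sample points and the starting centroids are all hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The starting centroids are passed in explicitly to keep the run deterministic; real libraries typically choose them randomly or with a seeding heuristic, and distributed versions split the assignment step across machines.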
Publishing analytics startup Parse.ly has raised $5 million and has released its first report showing the top sources of traffic across its customer base. It claims hundreds of them, including big-name ones like Atlantic Media, Reuters and Mashable. Read more »
Based on the data scientists I’ve met and the “how to become a data scientist” talks I’ve seen, it’s hard to disagree. But SQL and coding skills can be really helpful if you need to get stuff done beyond pure statistical analysis.
Amazon Web Services experienced a brief outage on Sunday afternoon. It lasted only about 60 minutes, but appears to have taken down popular sites such as Instagram, Flipboard and Vine for short periods. Read more »
Google cloud platform manager Greg DeMichillie was on our Structure Show podcast this week to defend Google’s position in the cloud computing market. He makes some fair points, but will they be enough to lure in developers and companies en masse? Read more »
This is a good presentation about Facebook’s graph-processing engine, Giraph, from a big data event held at the company’s Menlo Park campus in early June. The PRISM story kind of took over the news cycle that week, but the event also produced some news (for big data geeks, at least): Facebook’s Presto engine for interactive queries of its 250-petabyte Hadoop data warehouse.
Researchers have devised a method for identifying fake Twitter accounts that proved highly accurate across 27 popular black-market merchants. With Twitter’s cooperation, they spotted and deleted millions of accounts, using only data generated during the account-registration process. Read more »
The last day’s NSA headlines have been about how it broke the law and even violated the Constitution. But that’s just a small part of an opinion that raises more questions than answers, and that underscores the complex nature of data privacy. Read more »
It’s natural to hear all the hype about big data and sense a bubble is forming, but the speakers at this year’s Structure: Europe conference have proven it’s for real — and they know how to make it happen. Read more »
A database vendor called Objectivity has created a mobile app called GraphMyLife that aims to let consumers explore links between the people and content in their various social networks. I say “aims” because although the idea is pretty cool, the app is a bit laggy and confusing (at least on my phone). But cut Objectivity a break: it’s a specialized (and old) enterprise-tech company trying to humanize its graph database software.
A data science consultancy has published a report analyzing the design of retirement- and investment-industry websites, but the lessons are universal: Better design means better business. Read more »
Facebook, Ericsson, MediaTek, Nokia, Opera, Qualcomm and Samsung are launching an initiative called internet.org that aims to connect the whole world with internet access via cheaper devices, better business models and better infrastructure. Read more »
In a candid interview last week, Hortonworks CEO Rob Bearden discussed a variety of topics — including personnel, profitability and a public offering — in some detail. Hortonworks is a Hadoop startup that spun out of Yahoo in June 2011. Read more »
10gen has added some new features to its MongoDB connector for Hadoop, including support for Hive and the ability to back up MongoDB files in HDFS. Read more »
Business intelligence and analytics startup Birst has raised a $38 million Series E round led by Sequoia Capital. Birst has been very busy in the past couple of years, moving from SaaS to on-prem software, rethinking the data warehouse and even launching a Hadoop-based service. It looks like Birst is positioned to test the IPO waters like QlikTech and Tableau before it.
Cleversafe, a Chicago-based provider of object-storage systems for housing massive amounts of data, has raised a $55 million series D round led by New Enterprise Associates. Apart from traditional storage workloads, Cleversafe has also made a name for itself as a replacement for HDFS in Hadoop environments. According to Crunchbase, the company has now raised $91.4 million since 2007.