Hadoop

Cloudera bought DataPad because data scientists need tooling, too

Cloudera has acquired a data-visualization startup called DataPad, the founding team of which specializes in data analysis using the Python programming language. As Hadoop competition heats up, Cloudera might be ramping up its Python tooling in order to attract more data scientists and developers.

Data wranglers, don’t fear. Trifacta is here

Making sense of big data can be hard enough without spending untold hours having to write code or manually clean datasets that simply won’t work with existing BI tools. Trifacta is trying to automate that process with a new software product it announced on Tuesday.

Hadoop 2 is finally GA

We have been hearing about things like YARN and high availability for a few years — they’ve even been incorporated into some commercial Hadoop…

Apache Giraph at Analytics @ Web Scale

https://www.facebook.com/photo.php?v=10151890165548109&set=vb.9445547199&type=2&theater This is a good presentation about Facebook’s graph-processing engine, Giraph, from a big data event held at the company’s Menlo Park…

Is Qubole proving a demand for Hadoop in the cloud?

Hadoop-in-the-cloud startup Qubole says its customers used more than 100,000 nodes to run more than 350,000 jobs and process more than a petabyte of data in July. Those aren’t Facebook numbers, but they seem to signal an appetite among smaller users.

CSC buys Infochimps and its big data platform

IT services and consulting specialist CSC has acquired Infochimps, a startup that sells a big data query and processing platform. Infochimps had raised about $5 million in equity and debt financing since launching in 2009.

Cloudera buys machine learning startup Myrrix

Hadoop vendor Cloudera has acquired its first company, a London-based machine learning startup called Myrrix. Machine learning is becoming a big use case for big data, and Cloudera is wise to get some expertise in-house.

Netflix shows off how it does Hadoop in the cloud

Netflix is at it again, this time showing off its homemade architecture for running Hadoop workloads in the Amazon Web Services cloud. It’s all about the flexibility of being able to run, manage and access multiple clusters while eliminating as many barriers as possible.

Meet 7 startups that could define the Chinese cloud

I recently spent 11 days in Beijing meeting lots of companies trying to make it in cloud computing and big data. Here are seven with which I had a chance to sit down and learn about their businesses and how to sell cloud computing in China.

A big data top 20 for 2012

A lot happened in the world of data analysis this year. Here’s a list of the most-popular and generally most-interesting things I’ve had the fortune to cover in 2012 — from Hadoop to the Supreme Court to Bollywood stars.

Continuuity gets $10M to free Hadoop from itself

Hadoop is nothing without applications, and Continuuity aims to deliver those apps by making Hadoop something developers can work and innovate with. Its efforts haven’t gone unnoticed — the company just closed a $10 million Series A round from a who’s who of big data VCs.

Rackspace versus Amazon: The big data edition

Rackspace is busy building a Hadoop service, giving the company one more avenue to compete with cloud kingpin Amazon Web Services. However, the two services — along with several others on the market — highlight just how different seemingly similar cloud services can be.

Forget your fancy data science, try overkill analytics

Carter S. won his first-ever Kaggle competition — our own GigaOM WordPress Challenge — using a brute force method of data science he calls overkill analytics. Rather than spend untold hours perfecting complex models, Carter used simple algorithms and let powerful microprocessors do the rest.

Who’s connected to whom in Hadoop world [infographic]

To say there are a lot of companies involved in the Hadoop ecosystem would be an understatement. To say partnership strategies are broad would be one, too. The folks at Datameer created this infographic to show just how expansive and interconnected the Hadoop ecosystem is.

Survey: Hadoop clusters not that big, not changing the world (yet)

The results of a recently released survey from Hadoop-focused startup Karmasphere show that while Hadoop use is picking up among mainstream (read “non-web”) companies, it’s still far from the all-powerful and ubiquitous insight engine its supporters (myself included) believe it will become.

Netflix analyzes a lot of data about your viewing habits

Netflix’s algorithms for recommending movies to customers might not be perfect, but it isn’t for lack of trying. The company is capturing and analyzing incredible amounts of data, even from the videos themselves, to try and figure out what you want to watch next.

How Facebook keeps 100 petabytes of Hadoop data online

It’s no secret that Facebook stores a lot of data in Hadoop, but how it keeps that data available whenever it needs it isn’t necessarily common knowledge. Today at the Hadoop Summit, Facebook engineer Andrew Ryan highlighted that solution, which Facebook calls AvatarNode.

VMware aims for Hadoop on VMs with ‘Serengeti’ project

VMware is launching a new open source project, called “Serengeti,” that aims to let the Hadoop data-processing platform run on the virtualization leader’s vSphere hypervisor. VMware apparently smells a lucrative opportunity in Hadoop and isn’t about to miss out on getting a piece of the pie.

How big data helps Ancestry.com map people, places and time

Online genealogy service Ancestry.com is trying to become like the Amazon or Netflix of family trees. Much like those companies use customer data to recommend products or movies customers might like, Ancestry.com is using machine learning to make learning about ancestors a lot less work.

10 ways companies are using Hadoop (for more than ads)

If you just pay attention to the largest Hadoop users, you might think the platform is just a way of powering search engines or analyzing customer behavior for ad-serving. Of course that’s not the case, but finding those broader use cases can still be kind of difficult.

How Climate Corp. is pitting big data against Mother Nature

When your business is to insure farmers against the effects of bad weather, you’d better have some seriously accurate data on your side. Mother Nature, after all, can be somewhat unpredictable. The Climate Corporation thinks the answer is lots of data and lots of computing power.

VMware buys big data startup Cetas

VMware has acquired Cetas, a startup that provides analytics atop the Hadoop platform. Terms of the deal haven’t been disclosed, but Cetas is an 18-month-old company with tens of paying customers that didn’t need to rush into an acquisition. So, why did VMware buy it?

Satellite imagery and Hadoop mean $70M for Skybox

Skybox Imaging, a startup that wants to capture and analyze high-resolution photos and videos of the Earth, has raised $70 million in Series C funding. The money will help Skybox expand its lineup of software engineers and data scientists that might be its secret sauce.

Hadoop in 17 syllables

If you’re an amateur poet and love big data, high-performance system vendor AMAX has a deal for you. The company is conducting a contest to find the best haiku on big data. But I’m sharing my poems right here.

Meet TempoDB, a database startup with an eye for time

TempoDB, a startup out of Chicago, has built a database-as-a-service offering specifically for time-series data thrown off by thermostats, servers and automotive telematics. Does the world (or the Internet of Things) need a specialty time-series database hosted in the cloud?

Sungard wants to sell you Hadoop as a service

Managed hosting provider Sungard is getting into the big data space with a new Hadoop service that gives users on-demand access to the popular data-storage and processing platform. Called Unified Analytics Service, Sungard’s new offering joins the growing ranks of cloud-based Hadoop offerings.

Microsoft snags Yahoo chief scientist

Raghu Ramakrishnan, who was the top scientist for several of Yahoo’s key technology efforts, is now a technical fellow with Microsoft’s server and tools unit. This is the latest sign that Yahoo is struggling to retain key technologists.

Is big data new, or have we forgotten its old heroes?

Seemingly overnight, big data became the behemoth to conquer. But the truth is, tried and true technologies have been tackling the problem for years. Versant’s Robert Greene gives respect to three unsung heroes of big data.

Is machine learning coming to a system near you?

If you like the idea of your analytics system’s getting more accurate with each piece of data it ingests, it looks like you are in for an exciting run, because machine learning appears to be catching fire across the ecosystem of big data vendors.

LexisNexis puts MarkLogic to work in big data makeover

LexisNexis is pressing MarkLogic’s technology into service for its just-launched Lexis Advance legal service. MarkLogic’s document storage, search and analytics technology replaces legacy home-built code as part of a platform modernization and big data push.

HPCC Systems tunes big data platform for Amazon cloud

HPCC Systems, the division of LexisNexis that’s pushing a big-data processing-and-delivery platform to compete with Hadoop, has tuned its software to run on Amazon’s cloud computing platform. Interested developers can now experiment with the open source software without having to wrangle physical servers.

Why some execs think Hadoop ain’t all that

So the big data backlash begins. The Hadoop framework does a lot, but some experts — including those who push non-Hadoop options — say it’s not enough for many specialized apps where a build-your-own Hadoop implementation costs too much to be a real contender.

Big data reveals Mac users book pricier hotels

It appears as if Apple users’ willingness to shell out a little more cash for a premium experience doesn’t stop at computers. Orbitz’s data-crunching has found that Mac users also spend about $20 more a night on hotels than do Windows users.

Splunk connects with Hadoop to master machine data

Splunk has integrated its product with Apache Hadoop to enable large-scale batch analytics on top of the product’s existing capabilities around real-time search, analysis and visualization of machine-generated data. Users can bring Hadoop data back into Splunk for visualization or run MapReduce jobs from Splunk.

With $40M for Cloudera, how much is Hadoop worth?

Hadoop isn’t the only thing going in big data, but it’s driving the bus at this point and it seems to have a reverse Midas touch: everything that touches it turns to gold. The latest to experience this is Cloudera, which has raised another $40 million.

Cloudera founder’s new project shows Hadoop’s future

Cloudera founder Christophe Bisciglia launched a new company today called Odiago, whose WibiData product utilizes Hadoop and HBase to let businesses make the most of online user data. Big-name investors aside, under the covers WibiData shows the future of how Hadoop-based products will look.

IBM doing Hadoop as a service in its cloud

IBM joined the big-data-in-the-cloud fray, announcing Monday that its Hadoop-based InfoSphere BigInsights product will be available as a service on the IBM SmartCloud platform. Big Blue’s timing is good, as Hadoop will likely have a far greater presence across public clouds within the next year.

Why Uncle Sam might be ready for Hadoop in the cloud

The federal government has been gung ho over cloud computing in the past few years, but is it ready to do big data in the cloud? Federal contractor GCE Federal is offering a cloud service based on Hadoop and designed for federal agencies to outsource analytics.

How Amazon uses big data to prevent warehouse theft

Amazon has become the cloud king, with its AWS offerings providing cloud-based storage and processing that takes a lot of the cost out of deploying new products and applications. Netflix, Dropbox and Yelp are all AWS clients, but the most important user might be Amazon itself.

Top 5 things to watch for at Oracle OpenWorld

Oracle customers have lots of questions for the database giant. If you’re one of the 50,000 people Oracle expects to converge on the Moscone Center starting Sunday–or even if you’re not–here are some key things to look out for at the big Oracle OpenWorld 2011 Conference.

Yes, VMware has plans for Hadoop, big data

Much like everyone has some product or strategy to optimize on “the cloud,” momentum is already gathering around the next big technology trend to drive buzzwords: big data. VMware is no exception, so I spoke with Steve Herrod, the company’s CTO, to find out more.

Karmasphere pushes new big data workflow

Hadoop is all the rage in analytics, but it still isn’t easy for mere mortals to utilize the big data framework. A handful of companies are trying to solve this problem, including Karmasphere with the latest version of its Analyst Big Data product.

Big data meets Bruce Perens: an open-source “covenant”

Balancing an open-source community with commercial interests can be difficult, which is why HPCC Systems sought the help of Bruce Perens before open-sourcing its eponymous big-data-processing software. Essentially, the company either ensures the existence of a free version or pulls contributed code.

Concurrent raises $900K to make Hadoop easier

Concurrent, the company providing the Cascading data workflow API, has raised a $900,000 seed round to capitalize on the newfound excitement around Hadoop. Cascading is an open-source API for creating and running data workflows atop Hadoop clusters.

Are companies addicted to Hadoop?

The size of Hadoop deployments appears to have tripled since October, according to statistics that Cloudera is sharing. If accurate, they help prove assumptions that Hadoop usage grows quickly once organizations wrap their heads around how it is used.

Twitter buys BackType to dig deeper with big data

Twitter announced Tuesday it has acquired BackType, an analytics platform aimed at helping companies and brands gauge their social media impact. The possible rationale for the deal is BackType’s Storm real-time big data processing platform that could help Twitter offer well-defined analytics.

Ravel open-sources tool for analyzing graph data like Google

Ravel now offers an open-source graph database that looks to bring the benefits of Google’s Pregel project to the masses. Graph databases don’t get the attention of other big-data technologies such as Hadoop or NoSQL, but every Twitter user is familiar with what they can do.

EMC Makes a Big Bet on Hadoop

EMC is throwing its weight behind Hadoop. Today, at EMC World, the storage giant announced a slew of Hadoop-centric products, including a specialized appliance for Hadoop-based big data analytics and two separate Hadoop distributions. EMC’s entry is going to shake up the Hadoop market.

Syncsort Adds More Fuel to Hadoop Fire

Data-integration specialist Syncsort is releasing two new Hadoop tools that it says will give Hadoop users a better, faster experience than they can achieve using Apache Hadoop alone. Unlike some other recent announcements, however, Syncsort is looking to improve Hadoop rather than replace aspects of it.

Can Yahoo, Cloudera and IBM Split the Hadoop Pot?

If Yahoo plans to spin off its white-hot Hadoop business, it would make Yahoo the third vendor operating alongside Cloudera and IBM — fighting for what, right now, are only speculative customer dollars. Would Yahoo’s spinout have what it takes to compete?

Why Cloudera Isn’t Sweating the Hadoop Competition

Cloudera released version 3.0 of its distribution of Apache Hadoop (CDH3) Tuesday. CDH3 is a big reason why, despite a recent spate of Hadoop-based big data products either on the market or about to be there, Cloudera says it isn’t sweating all the new competition.

Hadoop: From Boardrooms to Newsrooms

A handful of new releases and partnerships this week — as well as a big award — illustrate just how versatile the data-processing tool Hadoop is and how widespread its use might become. Hadoop is becoming a more viable tool for everyone from business users to journalists.

Why Big Data Will Need Big Gear

Hardware rarely comes up in discussions about big data, save for those centered on data warehouse appliances. But the omission hardly means hardware is irrelevant. In fact, big gear might become a big deal as companies look to bolster the performance of their big data systems.

Why Big Data Startups Should Take a Narrow View

One of the statements that struck me most from Structure: Big Data was CA CTO Donald Ferguson’s notion that big data represents a “very promising” opportunity for startups, particularly those targeting specific use cases. I think he’s right, particularly with regard to the latter part.

Is Big Data Making Us Dumber?

As organizations strive to analyze more data than ever and to do it faster than ever, the results they’re getting might actually be worse than those in the pre-big-data and real-time world — at least temporarily.

Tap11 Tries to Tame the Twitter Data Firehose

When it comes to social data, one of the biggest firehoses around is the one that comes from Twitter. Trying to make sense of 140 million tweets a day in something close to real-time is a significant challenge, says Tap11 chief technology officer Braxton Woodham.

Hadoop Proving Itself for Targeted Video Ads

Using Hadoop to process data for targeted web advertising efforts is nothing new, but this week, two companies in the video advertising space also stepped forward to highlight how Hadoop is helping them deliver the right ads to the right viewers for their clients.

Yahoo Suggests MapReduce Overhaul to Improve Hadoop Performance

Just over a month after discontinuing its Hadoop distribution to focus on the flagship Apache Hadoop project, Yahoo is proposing some changes to the Hadoop MapReduce component that could significantly improve processing performance. The proposal illustrates just how beneficial Yahoo’s renewed focus could be.

How Facebook Is Powering Real-Time Analytics

Facebook is working on a real-time analytics dashboard to let users determine which content is getting the most attention from visitors. As described in an educational session on Wednesday night in Facebook’s Seattle office, the service is built atop HBase and tracks about 100 metrics.

Are Big Data Startups Eyeing the Enterprise Already?

Two popular big data startups, Karmasphere and 10gen, made management changes this week, which might signal that the companies’ boards feel they’re poised to make runs at the big time and need seasoned leadership to take them to the next level.

How Vendors Are Lowering Big Data Barriers

It was a big week for big data, with two key trends adding fuel to claims that data management and analysis will never be the same. Even laggards will be tempted to give big data tools a try to see what all the hype is about.

Piccolo Project Tries to Speed Past Hadoop

Few would argue that Hadoop doesn’t have a bright future as a foundational element of big data stacks, but Piccolo, a new project out of New York University, is moving data in-memory in an attempt to improve parallel-processing performance beyond what Hadoop and/or MapReduce can do.

Why 2011 Will Be a Big Year for Big Data

With enterprise data volumes growing, business and IT leaders face significant opportunities and challenges from big data. The space, of course, is not without its obstacles — including plenty of privacy concerns — but in 2011, there are numerous sales-growth opportunities and new business models finally surfacing.

Why Yahoo Is Discontinuing Its Hadoop Distribution

Yahoo is ceasing development of its Yahoo Distribution of Hadoop and will be folding it back into the Apache Hadoop project. The company cites a goal “to make Apache Hadoop THE open source platform for big data” as a driving force behind its new strategy.

Jan. 19: What We’re Reading About the Cloud

Today’s links offer further proof that technologies like Hadoop and NoSQL aren’t going anywhere — and might even be expanding — and that choosing the right cloud computing solution really should be about what’s best for the individual business (e.g., public vs. private, or available vs. reliable).

Cloudera Puts Its Money Where Its Tech Is

Hadoop startup Cloudera has rounded out its support of the Apache Software Foundation by becoming a Silver-level sponsor. Cloudera already contributes code and personnel to the Apache Hadoop project, and Cloudera’s Doug Cutting (also Hadoop’s creator) is the ASF chairman.

Why Hadoop-like ‘Dryad’ Could Be Microsoft’s Big Data Star

On Friday, Microsoft’s HPC division opened up the company’s Dryad parallel-processing technologies as a Community Technology Preview (CTP). Dryad could be a rousing success, in part because Hadoop — which is written in Java — is not ideally suited to run atop Windows or support .NET applications.

Dec. 15: What We’re Reading About the Cloud

There was much talk about cloud computing today, all of it hitting different aspects — from how IT organizations will adopt it to what makes a “niche” cloud to how AT&T’s spotty network helped drive the need for it. Hadoop and Cassandra news also caught my eye.

Dec. 13: What We’re Reading About the Cloud

Web infrastructure is a hot topic today, after Amazon Web Services experienced an outage over the weekend, and after Facebook released some interesting details about its Hadoop cluster on Friday. Even LinkedIn is making headlines by expanding into a new Los Angeles data center.

Dec. 2: What We’re Reading About the Cloud

Chalk another one (two, actually) up for Hadoop. Among the big news today is Apple stepping up its Hadoop development efforts, and Datameer targeting social-gaming companies for its Hadoop-powered spreadsheet application. Elsewhere, data center spending is still high, and IBM is looking to revolutionize high-end processors.

Why Nobody’s Searching for ‘Big Data’

Matthew Aslett at The 451 Group posted some Google Trends graphs showing that searches for “Hadoop” far exceed searches for “big data.” I ran some of my own to dig deeper. Users, it seems, are just concerned with tools to help them ride the big data wave.

Nov. 2: What We’re Reading About the Cloud

It’s not always good news with cloud computing, and we saw that today with someone calling out Enomaly’s new SpotCloud, someone else detailing the difficulties of developing a mobile app in Windows Azure, and the Cloudscaling boss calling out the traditional definition of cloud computing.

Big Data and NoSQL March to the Enterprise

It was a big year for NoSQL and big data, but now those vendors need to buckle down on their revenue models and make a head-on charge to the enterprise. Because, let’s face it: while the web leads the innovation, the enterprise leads the economy.

Cloudera Raises $25 Million in New VC Money

Hadoop startup Cloudera has raised another $25 million, bringing its total funding to $36 million. The new funding bolsters Cloudera’s position as the hub of the commercial Hadoop world, and the belief that Hadoop will become the centerpiece of many Big Data efforts.

Hadoop World: Cloudera Makes More Big Data Friends

Hadoop World is taking place today, and, indicative of the general momentum around Hadoop, there is plenty of news coming from the event. As one should expect, Cloudera is driving the action, but it brings vendors and service providers of all stripes into the mix.

Hadoop: From Open Source Project to Big Data Ecosystem

The Hadoop hoopla is generating increasing numbers of announcements from more and more vendors. From startups to large established players, new products and partnerships are emerging which confirm the emergence of a vibrant Apache Hadoop. Hall explains the three emerging layers in the “Hadoop stack.”

Survey: Hadoop is Great, but Challenges Remain

Commercial Hadoop startup Karmasphere today released the results of a survey of 102 Hadoop developers regarding adoption, use and future plans. The results provide some interesting insights into how Hadoop grows within organizations and underscore its status as an extremely valuable, but none-too-simple analytics tool.

Cloudera: All Your Big Data Are Belong to Us

As Big Data gathers steam within the consumer web, Cloudera is making it possible for mainstream IT to tap into this trend through its distribution of Hadoop, as the company’s customer growth suggests. Lower costs and improved ease-of-use are making Hadoop a reality for enterprise.

Why a DIY Big Data Stack Is a Better Option

While settling on a standard big data stack is deeply important to the big data industry as a whole, I’m nonetheless questioning the operational and competitive consequences for companies who choose to buy into this standard without first considering the value of building a proprietary solution.

What We’re Reading About the Cloud: August 19

Hadoop, the big data analytics software, is so hot right now. Heck, anything big data is so hot right now. Today’s links offer insights into Hadoop alternatives, how to use Hadoop and an endorsement of Microsoft’s platform-as-a-service strategy.

Meet the Big Data Equivalent of the LAMP Stack

Hadoop, thanks to the growing importance of Big Data Analytics, is gaining traction inside the enterprise. What’s been missing for Big Data Analytics has been a LAMP-like stack. Fortunately, a stack for Big Data aggregation, processing and analytics is on its way.

The Incredible, Growing, Commercial Hadoop Market

A few months ago, I posited that additional funding for Cloudera and Karmasphere signifies a large market opportunity for solutions that utilize the open-source analytics tool Hadoop. The news generated this week by Yahoo’s third annual Hadoop Summit has only affirmed that belief.

Why Hadoop Users Shouldn’t Fear Google’s New MapReduce Patent

Nearly six years after first applying for it, Google has finally received a patent for its MapReduce parallel programming model. The question now is how this will affect the various products and projects that utilize MapReduce, such as Apache’s MapReduce-inspired Hadoop project.

Yahoo Touts Its Tweaked Version of Hadoop

At the Hadoop Summit in Silicon Valley today, Yahoo announced the availability of the Yahoo Distribution of Hadoop, a source-only…

Microsoft, Now Loving Hadoop

Last week, OStatic noted the rumor, first reported by VentureBeat, that Microsoft intended to buy Silicon Valley semantic search engine Powerset for…

The Cloud Opens Up

We are only ten days away from Structure’08, our web infrastructure conference. As part of our preparation for this event, our team…