Amazon’s newly launched Data Pipeline will help Amazon Web Services customers get a better grip on handling their data scattered throughout the various AWS data repositories and third-party databases, Amazon CTO Werner Vogels said Thursday.
The tool will make it easy for AWS customers to create automated, scheduled workflows that move data from DynamoDB and S3 storage to Elastic MapReduce, or wherever else it is needed. “It’s pre-integrated with AWS data sources and easily connected to third-party and on-premise data sources,” Vogels said.
The proliferation of data — machine logs, sensor data and plain old database data — is driving the need for automating the flow of that data from databases to storage to applications and back. “You have to put everything in logs which creates even more data … in AWS,” Vogels said.
Users build their workflows with a drag-and-drop interface and schedule them to run periodically. Because the service makes it easy to consolidate data in one place, customers will be better able to run big batch analytics on their logs and other information.
There were not many details beyond that, but judging by AWS’s track record, the service should be available soon. Stay tuned for updates.


This is a great step forward for those who store Big Data sets in the Cloud. It will reduce the administration required to do periodic batch runs of data sets, for sure.
But is Amazon looking at ways to offer in-memory services that would reduce the need for batch schedules? That would be an interesting proposition.
Actually, workflows don’t need to touch EMR. If you have in-memory DBs in EC2, or are using Redshift, for example, you can move data there too. Or they could go from EMR to an in-memory system. It’s really whatever flow you want, I’ve been told.
As in-memory data grids become the backbone of next-generation online applications, their dependency on any specific data storage technology will matter less. AWS’s Data Pipeline service could bridge the divide between local and cloud data and allow HDFS to become the consolidated data storage platform of choice.
More on this here: http://mark.chmarny.com/2012/11/hdfs-has-won-now-de-facto-standard-for.html
Whoa, now that’s something I can use – data pipeline, great idea.