Pivotal, the big data and cloud computing company that spun out of EMC and VMware in 2013, is putting its financial and engineering muscle behind an open source project called Tachyon. Developed at AMPLab, the same University of California, Berkeley group responsible for the wildly popular Apache Spark data-processing framework, Tachyon is a distributed in-memory file system that lets multiple applications access the same data at the same high speed.
In a blog post published Tuesday, Pivotal explained its interest in Tachyon as part of the “data lake” strategy the company is pushing, whereby companies would store all their data using Pivotal’s Hadoop technology and access it using a variety of tools layered on top:
In partnership with the AMPLab at UC Berkeley, Pivotal envisions this future architecture will incorporate an in-memory data exchange platform based on Tachyon and an in-memory compute layer augmented by Apache Spark.
The result is a next-generation data lake implementation based on Spark and Tachyon, which Pivotal is referring to as a “butterfly architecture.” Within this model, Tachyon provides an efficient memory-centric caching layer for disparate data sources, and allows the tracking of data lineage, independent of the computation framework. It will serve as an efficient memory-based data exchange layer within the data lake, and is pluggable, enabling existing storage and processing systems to co-exist with the new framework.
The post also notes EMC’s interest in Tachyon, citing work being done to integrate it with the DSSD flash-storage technology EMC acquired in May, as well as the company’s Isilon file system technology.
As I noted in an August feature on the big data tools coming out of AMPLab, co-director (and Databricks co-founder and CEO) Ion Stoica is especially excited about the prospects for Tachyon, which should become an Apache Software Foundation project soon. Not only will Tachyon help current frameworks such as Spark, MapReduce and SQL engines run faster and share data, but it’s also the foundation of some interesting new tools.
One of those tools, a NoSQL project called Succinct, is designed to run relatively complex queries on compressed data without first decompressing it and without requiring secondary indexes. Rachit Agarwal, an AMPLab researcher leading the Succinct project, told me then, “What you could do previously with 1,000 machines, Succinct allows you to do in 100 machines.”