Today's leading minds talk Data Storage with Host Enrico Signoretti
Prior to founding VAST, Renen was the VP of R&D of XtremIO… the fastest growing IT product in history. In 3 short years, XtremIO went from GA to $3B of global sales of systems powering mission critical database and virtualization applications. Renen joined XtremIO at its inception as the first engineer and left Dell EMC having managed a team of over 300 engineers. Renen has a master's in computer science with a specialization in computational complexity. He is based out of VAST Data’s NYC HQ.
Enrico Signoretti: Welcome everybody. This is Voices in Data Storage brought to you by GigaOm. I'm your host Enrico Signoretti and today we'll talk about innovative storage architectures made possible by new technologies. Usually when new technologies become available to vendors and end users, nobody is able to take full advantage of them from day one, and data storage is no exception. At the beginning everybody goes for the low hanging fruit, meaning they just put these new devices, this new technology, in the system and they get whatever it gives them. Then there are several iterations of optimization and tuning, a form of tweaking, and we get better performance and better use of these technologies.
But the real benefit can be exploited only when you start from scratch with a radical and not opinionated new architecture design, especially if you design around, and for, these new technologies. My guest for this episode is Renen Hallak, CEO and founder of VAST Data, an innovative storage startup that exited from stealth very recently at Storage Field Day 18. At this event they presented the company and the architecture of the product. They made a lot of bold choices and there is nothing that you can call ‘traditional’ in their product. But this is also the biggest differentiator of VAST Data, which promises to bring the capacity and cost of hard drives together with the performance of flash memory. Pure utopia or did they really manage to do it? Let's discover it together. Hi Renen, how are you?
Renen Hallak: Very good. Thank you for having me.
So thank you again for joining me today. And maybe we could start with a brief introduction of yourself and your company?
Sure. My name is Renen Hallak, as you said, founder and CEO of VAST Data. The company is a young one; we started three years ago. The idea came a little bit before then, but the company was founded in the beginning of 2016. Basically we're trying to redo storage. I'll tell you a little bit about the company and then we'll dive right into the technology. So the company today is 80 people, headquartered in New York. That's where we have sales and marketing; operations and support are in the Bay Area. And we are at the stage where we've just started selling product. We went GA [at the] end of 2018 [and] a year before that we had our alpha release. So we've had a long time in the field with customers. And before that it was a year to a year and a half of architecting and building and understanding what it is that we need to build.
Great. So let's dive into the technology. I was very impressed by your presentation at Storage Field Day 18. You made a lot of bold choices for the design of your storage system and you also promise this performance at hard disk drive level prices. Let's start with a brief description of the performance and capacity you can bring to the table, and for which use cases.
Sure, so basically we're trying to break tradeoffs: to make it easy for the customer, with no need to make any compromises with respect to storage. What we bring to the table is a single storage system that comes in at a price point that is on par with today's lowest cost Tier 5 hard drive-based systems, while not compromising on performance. Our performance is on par with today's Tier 1 all flash storage systems and even a little bit better because we're not all flash. We're 99% flash and 1% 3D XPoint.
So you get even better performance than you would expect from all flash systems, and all of this at a capacity level that can fit the entire storage pyramid, basically eliminating the need for storage tiering and allowing the consolidation of different workloads onto a single storage system. We do this in a very dense form factor: 1 petabyte per rack unit. And what we say is, if you have a single storage system that is big enough and cheap enough to hold all of your data, and fast enough that you're guaranteed sub-millisecond latencies to the entirety of your data set, why put your data on anything else?
There is a lot to take in here. So you're telling [us] that you can support a lot of workloads and bring both capacity and performance to the table, again at very low prices. That is a lot to take in so quickly, so let's dive a little deeper into how you do it.
First of all, I started by talking about the new challenges posed by new technology adoption, and in the last few years we have seen storage evolving radically. It once was the bottleneck of every infrastructure, but now, with the new devices coming in and latencies at the microsecond level, we're starting to see CPUs and networks saturated. SLC was used mostly as a Tier 0 or for caching, then MLC became our Tier 1, and then we had all the iterations down to QLC, and this brought the price down. But actually we got a lot of other issues with it, such as endurance, for example, and performance, which is not that great. How did you overcome all these limitations of QLC? Because you are thinking about consumer grade QLC, right?
Yes. So QLC, and low cost flash in general, is a big part of our story. It is one of the enablers that lets us reach the price points that we need. It's not the only one but it's a key one. In order for us to use QLC, we had to re-architect the system around it. If you were to take QLC drives with their low endurance and their poor write performance today and put them in any other ‘all flash’ system, they would wear out very, very quickly, and that is because of something known as write amplification.
When you write to a drive, it garbage collects internally and moves data around, such that the NAND gets more writes than you thought you wrote to it, and all existing all flash storage systems suffer from this. In order to use QLC you need to do a few things. The write performance limitations are mainly around inconsistent latencies and the ability to write from multiple streams at the same time. To overcome them, we use 3D XPoint, Intel's new NVRAM technology, and we leverage it intensively, both to overcome the deficiencies of the low cost flash and to enable new types of algorithms and metadata structures that allow us to build a brand new architecture and bring the effective price of storage even lower than the price of QLC that we need to pay.
The way we do it is by always writing to 3D XPoint from the application and acknowledging that write before we actually migrate the data off to low cost flash. Then, in the background, after we've had time to order the data into exact erase blocks, and after we've had time to understand when this data is expected to be overwritten, only then do we actually migrate it to QLC flash using a single stream and perfect erase blocks. Why do perfect erase blocks make a difference? Because doing so means that the drive doesn't have any write amplification internally, and at our level we can make sure that we don't move data around because we have better insight into the data and the data write patterns.
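To put rough numbers on the write amplification described here, the following is a minimal sketch based on a common simplified garbage-collection model. It is an illustrative assumption, not something stated in the interview: if an erase block still holds a fraction `u` of live data when it is reclaimed, that data has to be copied first, so each host write costs roughly 1 / (1 - u) NAND writes. The function name is hypothetical.

```python
def write_amplification(valid_fraction: float) -> float:
    """NAND writes per host write when reclaimed erase blocks still hold
    `valid_fraction` live data that must be copied elsewhere first.

    Simplified model (an assumption for illustration): freeing one block
    yields (1 - valid_fraction) of a block of new space, at the cost of
    rewriting valid_fraction of a block.
    """
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)


# Blocks that are overwritten as a whole (no live data left at reclaim time)
# give a factor of 1.0; blocks that are still mostly live cost far more.
for u in (0.0, 0.5, 0.75):
    print(f"live data at reclaim: {u:.0%} -> write amplification {write_amplification(u):.1f}x")
```

Under this model, grouping data that is expected to be overwritten together into the same erase block pushes `u` toward zero, which is the effect the "perfect erase blocks" are aiming for.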
You're saying two important things: one is that you are optimizing the writes, and the other is that you get the performance from that Optane memory. OK. But we are still far from reaching the hard drive price point, which is four or five times lower. So there are other pieces of this technology that are important to make it work, right?
Yes. So our architecture is the one that allows us to do these new things. When you look at our architecture it's very different than existing scale out storage systems. In fact in many cases it's the opposite. Instead of having nodes that are responsible for a specific piece of the namespace we leverage NVMe over fabrics such that every one of our nodes can see all of the devices over an Ethernet network as if they were directly attached.
This gives us a global view and we leverage this global view in order to get economies of scale. What does that mean? The first piece of it is the use of QLC flash. The second piece of it is a new type of data protection system that doesn't waste a lot of space. The third piece of it is very, very aggressive data reduction that couldn't be done without this new architecture, couldn't be done without 3D XPoint, and enables us to get much better reduction ratios on data that has been notorious for not being compressible and not being deduplicable.
Again, let's analyze these points one at a time. So what do you do for data protection? Because you told us many times that you don't use erasure coding to limit the number of writes, but actually up to now that was considered the best mechanism for data protection.
Yes. So we have a new way of doing data protection that leverages this global view that we have. In data protection historically, you always had to choose between two of three things. Those three things being: efficient capacity overhead (not wasting a lot of space), good resiliency (the ability to lose multiple drives at the same time without losing data) and rebuild speeds. How fast are your rebuilds? How long do you need to wait before you regain redundancy when a bad thing happens? What we try to do is break the tradeoff between all three.
And because we're using consumer grade flash, we want to give our users the best resiliency that we possibly can. But the flash is more expensive than hard drives, and so we want to give our applications as much of the capacity as we can, obviously without hurting rebuild speeds. How do we do it? We do very wide stripes. If you can imagine a RAID 6 stripe with four data drives plus two redundancy drives, we do much wider. We start at 150+4. So you can have four failures at the same time without losing any data, but you only pay about 3% in capacity overhead. And this mechanism is scalable, such that we grow the stripes with the system, and when we have more SSDs you can imagine 500+10 or 985+15.
The number of drives that can fail at the same time keeps growing such that you get much better resiliency, but the overhead keeps shrinking down to 2% or 1.5% such that you pay less for that resiliency.
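The overhead figures quoted here follow from simple arithmetic, sketched below: the fraction of raw capacity spent on redundancy is parity / (data + parity). The stripe geometries are the ones mentioned in the conversation; the function name and the list are hypothetical and only there for illustration.

```python
def parity_overhead(data_drives: int, parity_drives: int) -> float:
    """Fraction of raw capacity spent on redundancy in a data+parity stripe."""
    return parity_drives / (data_drives + parity_drives)


# Stripe geometries mentioned in the interview, with RAID 6 (4+2) as the baseline.
STRIPES = [(4, 2), (150, 4), (500, 10), (985, 15)]

for data, parity in STRIPES:
    overhead = parity_overhead(data, parity)
    print(f"{data}+{parity}: tolerates {parity} failures, overhead {overhead:.1%}")
```

This works out to roughly 33% for 4+2, 2.6% for 150+4, 2.0% for 500+10, and 1.5% for 985+15, which matches the "about 3%" and "2% or 1.5%" figures in the discussion.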
OK, I see, so you are not paying the sort of penalty that you usually have with traditional data protection systems. I mean the capacity penalty. But still, even if you don't pay that 20% and you reduce it to 2% or 3%, we need to get to data footprint optimization to get to the right price. And you told us that you don't do compression and you don't do deduplication, but you use a new technique.
Correct. So in most cases, when you take into account a TCO (total cost of ownership) calculation, using the low cost flash and not wasting space using our data protection puts us close to or at the price point of hard drive-based systems. For example, this system is warranted for 10 years. You can keep it on the floor for a lot longer than you would hard drive-based systems. We fit a petabyte in a rack unit. You need a lot less space, a lot less power, a lot less cooling. The system is very easy to maintain.
You don't need to go into the data center and replace parts because everything is fail in place, and so the operational benefits bring us to a point where we are on par without anything further. But we wanted to do better than that. We wanted to be better on a TCA (total cost of acquisition) analysis and have all those operational benefits come as a bonus to our customers. For that, what we did was design a brand new way of doing data reduction. When we started, we were told by everyone that the unstructured data sets that we're going after, at least initially, large data sets in the tens to hundreds of petabytes, sometimes exabytes, don't compress very well and don't deduplicate very well. And the reason for that is that the application is in a better spot to compress data, because compression is a local process. It's done at the block or at the file level and it goes all the way down to byte level granularity, which is the good part about compression, but if the application has already done it before the data reached the storage system, there's not a lot left for us to do.
The reason deduplication usually doesn't work on these data sets is that at tens and hundreds of petabytes you don't normally keep exact copies of your data. And while deduplication is a global process that is best done at the storage level, it works at coarse granularity. You have to have full blocks that are identical in order to find something, and you only get full blocks that are identical when you have exact copies of your data.
So what we did, which is very different, is we looked at the data itself and we saw that when looking at a global namespace you can find a lot of commonality, a lot of similarity that isn't being taken advantage of by compression because it's local and by deduplication because it doesn't go down to byte level granularity. The differences between blocks are very slight and they are being missed by deduplication because it's sensitive to noise.
What we did was, instead of looking for identical blocks the way deduplication does, we look for similar blocks, of which there are surprisingly many in these data sets. And the advantage we have once we find those similar blocks is that we can compress them against each other, in effect doing a global compression scheme that goes all the way down to byte level granularity.
However, when you want to read something, you don't need to decompress the entire namespace. You can read only the blocks that you actually want to read. Again, this mechanism is made possible because of some new algorithms that we've developed, namely a similarity hash function, but also because of the global view that NVMe over fabrics affords us and because of 3D XPoint, which acts as a very large write buffer and as a very large metadata store that we use to keep our global dictionary.
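The idea of finding similar blocks and compressing them against each other can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not VAST's algorithm: the similarity sketch is a simple min-hash over byte shingles, and the "compress against" step uses zlib's preset-dictionary feature (`zdict`) with the similar block as the dictionary. The block size, function names, and parameters are all hypothetical.

```python
import random
import zlib

BLOCK = 4096  # hypothetical block size; small enough to fit zlib's 32 KiB window


def sketch(block: bytes, shingle: int = 8, k: int = 16) -> frozenset:
    """Toy similarity sketch: the k smallest CRC32 values over all byte shingles."""
    hashes = {zlib.crc32(block[i:i + shingle]) for i in range(len(block) - shingle + 1)}
    return frozenset(sorted(hashes)[:k])


def similarity(a: bytes, b: bytes) -> int:
    """Number of sketch entries two blocks share (higher means more alike)."""
    return len(sketch(a) & sketch(b))


def compress_against(block: bytes, reference: bytes) -> bytes:
    """Deflate `block` using `reference` as a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=reference)
    return c.compress(block) + c.flush()


def decompress_against(payload: bytes, reference: bytes) -> bytes:
    """Inverse of compress_against; needs only the one reference block."""
    d = zlib.decompressobj(zdict=reference)
    return d.decompress(payload) + d.flush()


rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(BLOCK))  # random, so incompressible on its own
noisy = bytearray(base)
for pos in (100, 2000, 4000):  # a few changed bytes are enough to defeat exact-match dedup
    noisy[pos] ^= 0xFF
noisy = bytes(noisy)
unrelated = bytes(rng.getrandbits(8) for _ in range(BLOCK))

print("shared sketch entries (similar, unrelated):",
      similarity(base, noisy), similarity(base, unrelated))
print("compressed alone:", len(zlib.compress(noisy, 9)), "bytes")
print("compressed against similar block:", len(compress_against(noisy, base)), "bytes")
```

In this toy setup the two blocks are not identical, so block-level deduplication finds nothing, and each is random, so local compression finds nothing, yet the second block shrinks to a small fraction of its size once it is encoded relative to the first. Reading it back requires only its reference block, not the whole data set.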
Are you telling me that the worst case scenario for your data footprint optimization technique is that you are as efficient as deduplication, but that you can do way better even on very scattered kinds of data sets, with a lot of encryption and also compressed data?
Yes, it can be proven mathematically that what we do is a superset of the best compression and the best deduplication. So we will always be at least as good as those mechanisms, but more than that, on data that is not deduplicable and not compressible, and in some cases already pre-compressed, we see very good data reduction ratios where those mechanisms would not see anything.
I can give one example: backup data. Backup software already compresses the data and already deduplicates it. So when we act as a backup target we get the data after those operations have completed. In many cases we see that we can still give 5:1 or 7:1 data reduction ratios on top of those mechanisms, which is a testament to the fact that there is more similarity within the data than those mechanisms can take advantage of, and that is our unfair advantage.
Yes, in fact this is impressive. And I suggest our listeners watch the Storage Field Day videos, because in your demonstration at the end you showed some of this with a couple of examples from your customers, and they actually are very, very good.
So we talked a lot about the technologies involved in your system, but just to recap a little bit: you have these storage nodes, which are in an H.A. configuration, meaning two servers for each capacity tray (if I can call it that), and front-end nodes that manage all the data protection and data reduction in a scale-out fashion. Did I miss something here?
No, that's exactly right. We disaggregate between logic and state. So we have these H.A. enclosures that hold all of the state (a lot of low cost flash and a little bit of 3D XPoint) and give us NVMe over fabrics access to these devices, but don't have any of the logic in them. Our containers, which are the second part of the architecture, hold all of the logic but are completely stateless.
This allows a few very nice benefits to the customer. For example, you can grow capacity independently of performance. You can grow performance and shrink it dynamically by spinning up more containers and spinning them back down. We also leverage this new disaggregated architecture to build what we call a shared everything cluster, where each of these containers can see all of the devices and they don't have a specific responsibility, so they can answer all of the applications' requests without talking to each other. This lends itself to much more massive scale than you would expect from traditional storage systems. We can scale this up to tens of thousands of containers and up to 1,000 of these H.A. enclosures, bringing us to very high performance levels on one hand and to exabyte-level scale under a single namespace on the other.
OK, so you're telling me that this solution can start at one petabyte and can grow up to exabytes, meaning that it could be for large enterprises as well as web scalers, these kinds of guys. So does it also mean that you are open to providing the hardware, but also to collaborating with your customers to take advantage of their hardware if they want to?
Yes. So at the smaller scale, we find that most of our customers want to get an appliance, buy all of the hardware from us, and have our containers on servers that we sell them, and that is perfectly fine. We are happy to do that all day long. In the larger deployments we see that customers usually already have some type of container infrastructure in their data center, some type of Kubernetes orchestration layer, and they are at that point happy to just buy the H.A. enclosures from us, which have all of the unique hardware pieces (the low cost flash, the 3D XPoint, the NVMe over fabrics enabled NICs), and run our containers on their existing compute infrastructure. In fact the only thing we require then is for those compute nodes to have access into the H.A. enclosures, either Ethernet access or InfiniBand access.
And so for the mid-range, we find that most of them would like to utilize their existing compute infrastructure and not buy servers from us, not buy switches from us. On the very extreme, in cases where we are dealing with hyperscalers that have exabytes of data, we are in conversations about a software-only solution, although I must admit we have not deployed one yet.
OK, so we're talking about the hardware, right? But actually all the magic happens at the software level in your system. So we didn't mention how you organize the data in the back end. We know now that you use that Optane memory as a landing stage for everything, and then, when the data is organized in the proper way, it is flushed down to the QLC in a way that is as balanced as possible. But we didn't mention the protocols you expose and how you organize the data in the back end, at least at a high level.
Yes. So the fact that we are designing a new system for low cost flash and 3D XPoint affords us the ability to start over, and the fact that we're building this in a new type of architecture requires us to build new types of metadata structures. I'll give an example: all of our containers can access the same 3D XPoint metadata layer at the same time. If we didn't build a new type of metadata structure, we would run into contention issues. We would run into consistency issues.
And so all of our metadata is built such that every single container can read it at the same time as everyone else without taking any locks and they can do this at the same time someone is writing to the data without dealing with any consistency issues. Basically it's a tree, a very wide fan out tree that allows us to update it using atomic operations. And so each container can crash between any two lines of code and nobody cares.
This shared metadata structure remains consistent. It also gives us a very unique advantage from the point of file systems and object stores. It used to be that file systems had limitations and at some scale you had to move over to an object store and at some performance requirement you had to move over to a block device. We for the first time can give a file system at the scale of object and at the performance of block. And the reason is because of this new architecture, because of this new media type that we are leveraging.
So you can have the directory structures in whichever way you would like: very wide, very deep. You can have files that are very small, trillions of them, or a few very large ones, multiple petabytes in size, and our metadata structures always look the same. From a protocol perspective, today we expose two protocols, NFS and S3, file and object, because those are the protocols that our early access customers are asking us for. But the nice thing is that internally we've abstracted our own protocol and we can put these external protocols on top of that abstraction side by side.
So all of the protocols that we have today and all of the ones that we will add in the future are all native to this abstraction. So they all get the same performance levels. Also they can all access the same data, so you can write something through NFS and read it through S3 or vice versa and this will remain true for all of the protocols that we will add in the future.
Very good, and now I have a tough question. Storage guys are usually very conservative people. I have gone through the architecture several times since your presentation, and let's say that I'm convinced that you are doing it right. But what do these guys think the first time you meet with them? I mean, it's too good to be true, okay? And somehow it's difficult to understand at the beginning. It's difficult to keep all these things together.
Yes. So customers usually don't believe the story the first time they hear it and it takes us a while of explaining how it's done and why it can be done today and couldn't be done in the past before they realize that maybe this is possible. But the real proof comes when they start testing it. And we've had multiple dozens of customers over the last year and a half starting with our alpha and then with our beta and now in GA that have tested the product and they have tried to break it and they are always surprised with how resilient it is.
The reason is both that we have a very experienced team that has built many storage systems in the past, but more than that it's an architecture that lends itself to simplicity, and to resilience. And so it's very hard to break and customers are seeing that through PoCs and through testing engagements that we have. In fact today we have a dozen customers that are already in production that have purchased the system and we're growing that very quickly.
Are you still focused on the US market, or are you looking more toward worldwide adoption?
So we are focused on the US to start. The US is both the biggest market and home to most of the early adopters. But surprisingly we have had inquiries from Europe and from the Far East, from companies that have heard about us and want to adopt us as well. So I believe that by the end of 2019 we will start looking beyond the US.
Very good. I think we can wrap up the episode now. We got a lot of insight into your architecture and your product. Maybe you can share with us a few links to your website and where we can find more information about VAST Data, and if you have a Twitter account, you can share that one too, so people can continue the discussion online.
Of course, so our website is: www.vastdata.com. We also have a Twitter account under the same name, as well as on LinkedIn, and as you mentioned, we were on Storage Field Day last week, and we gave a deep dive into the architecture, as well as a full session about early access customers and use cases that we're seeing a lot of success in, so those videos are part of Tech Field Day and can be found on YouTube.
Okay Renen, that was a great episode, thank you very much again for spending the time with me today, and bye bye.
Thank you, it was my pleasure.