Today's leading minds talk Data Storage with Host Enrico Signoretti
Charles Fan is co-founder and CEO of MemVerge. Prior to MemVerge, Charles was the CTO of Cheetah Mobile leading its global technology teams, and an SVP/GM at VMware, founding the storage business unit that developed the Virtual SAN product. Charles also worked at EMC and was the founder of the EMC China R&D Center. Charles joined EMC via the acquisition of Rainfinity, where he was a co-founder and CTO.
Charles received his Ph.D. and M.S. in Electrical Engineering from the California Institute of Technology, and his B.E. in Electrical Engineering from the Cooper Union.
Enrico Signoretti: Welcome, everybody! This is Voices in Data Storage brought to you by GigaOm. I am Enrico Signoretti and today we will talk about in-memory storage.
You probably already know, last year Intel presented a new product called Intel Optane. This is something that sits in the middle between traditional RAM and NAND. It’s cheaper than RAM and faster than NAND, so it fits exactly in the middle, between the two. This opened up a lot of opportunities, both for storage vendors as well as for developers. There are several form factors of this new type of media. It is accessible as a traditional SSD, so through the NVMe interface. It was also launched as a DIMM, so you can install it in your server.
Of course, there are a lot of implications of this new form factor, and today, to discuss it, I invited Charles Fan, CEO and co-founder of MemVerge. Hi Charles! How are you?
Charles Fan: Hi, Enrico! How are you? Happy to be here.
MemVerge is one of the few startups that [is] now working on building an in-memory storage system. In-memory storage is something that is really compelling. It’s something that we started mentioning a few years back, but actually the cost of putting that much RAM in a single system was a little bit too high to make these systems available to the general public. Maybe Intel Optane will change this a little bit.
Before continuing in this conversation, maybe, Charles, you can give us a little bit of background about you and MemVerge?
Sure. I’m a co-founder and CEO of MemVerge, which is a software startup, headquartered in Silicon Valley. We started about two and a half years ago, really on the thesis of a memory-centric data center and how storage-class memory can be an enabler to make that happen. Intel actually introduced their first 3D XPoint-based product two years ago, as an SSD, as you, Enrico, just mentioned. They introduced a memory form factor, the DIMM form factor, earlier this year in Q2. Our software is built on top of this second product, this memory product, to take full advantage of the benefits it has. Number one, it is much faster than the SSD, by a couple of orders of magnitude. It is really at a nanosecond scale in terms of latency, compared to tens or hundreds of microseconds for NAND flash SSDs.
Secondly, it’s bigger. You can have up to 3 terabytes per socket with the ‘generation one’ of this new memory; meaning for a two-socket server, you can have 6 terabytes of this memory available. We are basically taking advantage of both these benefits: bigger byte-addressable memory that is persistent and faster than storage itself. We are creating a virtualized, distributed software platform on top of it so that existing applications, as well as new applications, can enjoy much bigger memory and much faster storage. At the same time, they do not have to change their existing application. Our memory virtualization platform essentially allows that to happen. That’s, in a nutshell, what we do.
Just to recap a little bit about Intel Optane. Essentially, you can access it in two ways: one is memory mode and the other one is application mode. Can you give us some information about how it works? What are the benefits of one system and the other?
This is a pretty revolutionary, or disruptive, technology in terms of being a persistent memory. Memor[ies], as we know, are volatile, meaning you can allocate your data structures and variables in memory, but if you power the machine off, all the data is lost. A persistent memory can survive a power cycle. If you reboot your computer, or if you somehow lose power to your computer and turn it on again, your data is still there, but the application will still need to find its data. To allocate your data structure in persistent memory, you need to remember where it is after a power cycle, which means you need to have some kind of a name, which allows you to find where your pointer is and have the right offset to that pointer.
To program in its native mode, which Intel calls App Direct, or application direct, mode, essentially requires a new API for a developer to program to, which allows these memory pointers to be stateful so that you can remember where they are after a reboot. This is very powerful. If you design your application this way, you can potentially make the application faster and easier, but it’s also, for existing applications, not the easiest to use. For that reason, Intel introduced the second mode, which it calls memory mode.
Memory mode is a mode that is backward-compatible with volatile memory, or DRAM. In this mode, the operating system and the application see persistent memory just like DRAM. It will have no persistence feature and you can just use regular malloc to use this persistent memory. Existing applications will just work, except they see a bigger memory. This memory is slightly slower than DRAM but it’s much bigger. Applications that need bigger memory can run better in this mode. The downside of memory mode is that it doesn’t have the persistence capability that the underlying hardware has.
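As a rough illustration of the App Direct idea described above (a named region plus a stored offset that lets an application find its data again after a restart), here is a minimal Python sketch. It is not Intel's API or MemVerge's code: an ordinary memory-mapped file stands in for persistent memory, and the region layout, the function names `create_region` and `reopen_region`, and the payload are all hypothetical.

```python
import mmap
import os
import struct
import tempfile

REGION_SIZE = 4096               # size of the stand-in "persistent" region
ROOT = struct.Struct("<Q")       # 8-byte root offset kept at a fixed location (byte 0)
LENGTH = struct.Struct("<I")     # 4-byte length prefix in front of the payload


def create_region(path: str, payload: bytes) -> None:
    """Write a payload into a named region and record where it lives."""
    with open(path, "wb") as f:
        f.truncate(REGION_SIZE)  # zero-filled file of the requested size
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), REGION_SIZE) as region:
        data_offset = 64         # hypothetical placement inside the region
        LENGTH.pack_into(region, data_offset, len(payload))
        start = data_offset + LENGTH.size
        region[start:start + len(payload)] = payload
        # Store an offset, not a raw address: addresses change between runs.
        ROOT.pack_into(region, 0, data_offset)
        region.flush()


def reopen_region(path: str) -> bytes:
    """Simulate a restart: reopen by name, follow the stored offset."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), REGION_SIZE) as region:
        (data_offset,) = ROOT.unpack_from(region, 0)
        (length,) = LENGTH.unpack_from(region, data_offset)
        start = data_offset + LENGTH.size
        return bytes(region[start:start + length])


path = os.path.join(tempfile.mkdtemp(), "pmem_region.bin")
create_region(path, b"hello persistent world")
recovered = reopen_region(path)
print(recovered)
```

The file name plays the role of the "name" and the root offset plays the role of the stateful pointer. In memory mode, none of this bookkeeping is needed, because nothing is expected to survive a power cycle.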
Got it. In memory mode, you still have some memory, some RAM in the system, so how do they interact?
If you’re using the Intel memory mode, it employs an algorithm it calls 2LM, or 2-level memory, which is essentially a memory caching algorithm that Intel implemented in their controller. It uses DRAM as a caching tier and persistent memory as the actual memory tier, to create an overall volatile memory service. Intel recommends a 1:4 ratio between DRAM and persistent memory. For example, if you have, say, 1.5 terabytes of persistent memory, then it will recommend something like 384 gigabytes of DRAM to also be in the system. Those will be grouped together through this 2LM mechanism and made available in memory mode as 1.5 terabytes of combined volatile memory for the applications.
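The sizing arithmetic in this answer can be checked with a few lines of Python. This is only a sketch of the 1:4 rule of thumb as stated in the conversation; the function name `recommended_dram_gib` is made up for illustration, and binary units (1 TB = 1024 GB) are assumed so that the numbers match the example.

```python
def recommended_dram_gib(pmem_gib: float, dram_parts: int = 1, pmem_parts: int = 4) -> float:
    """DRAM cache size suggested by a dram_parts:pmem_parts ratio."""
    return pmem_gib * dram_parts / pmem_parts


pmem_gib = 1.5 * 1024                      # 1.5 TB of persistent memory
dram_gib = recommended_dram_gib(pmem_gib)  # DRAM acting as the 2LM cache

# In memory mode the DRAM is a cache, so applications see only the
# persistent memory capacity, not the sum of the two.
visible_gib = pmem_gib

print(dram_gib, visible_gib)
```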
You just mentioned ‘memory controller.’ Does it mean that I need the latest version of Intel CPUs to make it work because the memory controller is usually something that comes with the CPU?
The first answer is, yes. In fact, to support the Optane DC persistent memory, you do need the latest CPU from Intel, which they also shipped in the second quarter. The code name is Cascade Lake, and the official marketing name is the second-generation Xeon Scalable processor, or something like that. This CPU or the later CPUs will support this new memory.
The second part of the answer is that the memory controller I referred to was actually the controller chip of the DIMM card itself. What comes with Optane is this DIMM card. On the card, it has a controller. It also has the media, which is 3D XPoint. The actual algorithms are being implemented right on the DIMM card.
The reason you need a new CPU is that, in order to support this new memory, Intel implemented a new variant of the DDR memory protocol it calls DDR-T, which allows the DIMM slots to support memory of different frequencies, so that they can coexist peacefully together. You can have DRAM, which is at a different frequency from the Optane memory, plugged into the same memory channels that are connected to the CPU through DDR-T. This new CPU supports DDR-T and all future CPUs will support DDR-T; that’s the reason you need a new CPU.
How do you implement Intel Optane in your system?
That’s a great question. Intel, I think, [is] doing some ground-breaking, revolutionary work with this new hardware. The question is ‘What do we do?’ Most immediately, we are developing software we call Distributed Memory Objects. What we do is essentially a virtualizer of the persistent memory. We access the persistent memory on these servers in App Direct mode, its native mode, so that we can take advantage of the persistence, the speed, and the byte addressability of this new media. We then present memory and storage interfaces to the applications, both of which are backward-compatible with existing interfaces.
We are like a translator. We are speaking this new language of persistent memory and we are translating to the existing language of volatile memory and persistent storage at the same time. Those applications can take advantage of both bigger memory and faster storage, through these interfaces, without rewriting or recompiling their application. They can be deployed as is. We basically serve as a translator, to allow existing applications to adopt this new technology in the easiest way possible.
You are this translation layer, okay. Now I have two questions: one is about redundancy and the other one is about the fact that you said you can expose different protocols on top, so it’s a shared storage.
Maybe let me answer the second question first and then I’ll come back to the replication – the redundancy question. What we offer is not just a storage solution. We are different from existing storage solutions in that we are a converged solution of memory and storage. We offer both a memory API and a storage API. We are offering that not just on a single node but on a scaled-out cluster of servers that can scale up to 128 nodes.
On the memory side, there are two methods by which an application can use us. It can either use the overloading method, where we intercept the malloc functions of the application, so we can handle the memory allocation and deallocation for the application, on top of the persistent memory and DRAM. Or, if the users have control over their application, they can also link to our library and call our malloc function directly.
On the storage side, similarly, we can offer an SDK for the application to call our storage read/write and memory-mapped file functions directly, or we offer another two storage APIs. One is a locally mountable file system so that applications can see us just like locally attached storage. Anything that can write to a local file system can write to us, even though behind this local file system is actually our distributed file system across multiple nodes, on top of the Optane memory. In addition to this, we also support an HDFS-compatible API. All the applications that are using HDFS can use that protocol to access us directly as well. These are the methods by which applications can access us, either as memory or as storage.
To answer your first question: we do not protect data for our memory service. The reason is that we want to optimize for performance and we want to try not to slow down applications when they are running through us and the memory. We are actually just handling the allocation part of the memory. We are not really on the data path, on the actual load-store operations of the memory, so we do not replicate or have redundancy of data across those memories, other than what the applications do themselves. On the storage side, we do support failure resiliency and redundancy of data across the media of different nodes so that we can recover gracefully from failures.
I see. There is a lot to digest here. You’re talking about, at the end of the day, a massive cluster, 128 nodes that share a massive amount of memory. Applications can access memory as if it’s local, but actually it’s a distributed memory cluster. Or you can use the same memory, the same cluster, to provide storage services and they can mount them as a local file system, as HDFS, or probably you can put NFS on top, if you use this to slide and things like that. I don’t know if it makes sense from a performance perspective, but theoretically, you can do everything you want because you have Linux machines, I suppose, where you’re installed.
This is really, really big at the end, so you can really change the way these kinds of clusters are designed. The question that comes immediately after [is]: What are the best use cases? I’m pretty sure that big data analytics is the first one.
Yes, you are absolutely right. In fact, big data analytics are some of the first use cases that we explored with our alpha customers and beta customers, and so are the machine learning and AI applications. We’ve been working with a number of customers on those. In short, to judge what is the best use case, any data-centric application that needs either a bigger memory or faster IO is probably a good candidate for us.
Most recently, in the last few months, after Intel shipped the actual hardware, we have found the biggest traction with financial services customers, especially with the low latency, high performance infrastructure processing market data in support of the trading applications.
That low latency, high performance infrastructure is where the applications are constantly pushing the boundary on the amount of memory they need and how low they would like the latency of the infrastructure to be. Our solution offers [an] order of magnitude improvement over the status quo – what’s available between DRAM and NAND flash today. I think our system, sitting on top of the storage-class memory that Intel pioneered, really offers the next generation of infrastructure, for example for pub/sub applications for financial data, for market data replay, and for quantitative model back-testing. For these use cases, we found a very good sweet spot for this solution.
What about the networking? Do you need specialized hardware or just any RoCE-enabled switches [will] work?
Right now, we best support RoCE, RDMA over Converged Ethernet. Typically, we do recommend a 100 gigabit RDMA-RoCE network between the servers, but we have also worked with 40 gigabit and 25 gigabit RoCE networks. We are working to support InfiniBand and Solarflare-type networks as well. If you are only using our solution as local memory and storage, but not remote memory, then we can support 10 gig Ethernet as well. We have a mode that can work with just 10 gig Ethernet. To have the full features of the product, we do recommend RDMA networking.
Looking a little bit at the architecture and how it works, 10 gigabits doesn’t look [like] that much.
Right. That was just for some particular legacy environments, but to really take full advantage of storage-class memory and the power of this memory-converged infrastructure that we deliver, RDMA would be the best networking for it.
You were talking about the first customers. How many customers do you have now for this product?
We have about a dozen beta customers today. We are right now actively in beta testing. We are certainly open to new beta customers to sign up. Certainly if you are interested, any listeners, just email to [email protected] and we will look to sign you up into the beta program. Among the dozen beta customers, we have customers in the financial services, as I mentioned; they’re banks, hedge funds, exchanges.
We also have customers who are internet service providers, or even cloud service providers, who typically have the big data or the machine-learning use cases in their environment. We also have some high performance computing and AI companies, as our beta customers. Those are our initial sets of customers.
Do you have any benchmark or maybe real-world comparison between an old configuration and a MemVerge configuration?
Yes, we do, and these are measurements that we got from our actual beta customers. For example, we have a use case in a recommendation engine with the distributed training TensorFlow framework. Before us, the customers were using the combination of DRAM and SSD on those servers and HDFS as the permanent store for this distributed job. Now, with us, we’re basically replacing both the local storage and the HDFS and we extend the memory of such systems. We can improve the convergence of training of this recommendation engine by up to six times. This is an end-to-end training time. We have significant savings.
We can also enable better fault tolerance of this distributed training by enabling check-pointing of the in-memory state on to persistence in a more efficient manner, so that it can be done at more frequent intervals. This is because [of] the superior memory-to-storage speed on a persistent memory system. Essentially, we have a speed-up on the training, especially for models such as recommendation engines, which tend to use a lot of memory space. We can really enable better fault tolerance by more frequent checkpoints.
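To make the link between checkpoint speed and checkpoint frequency concrete, here is a small back-of-the-envelope sketch in Python. The model and all numbers are assumptions for illustration, not measurements from the interview: a checkpoint takes `checkpoint_seconds`, the job is willing to spend at most a fraction `max_overhead` of its wall-clock time on checkpointing, and a failure is assumed to strike at a uniformly random point between checkpoints, so about half an interval of work is lost on average.

```python
def min_interval_seconds(checkpoint_seconds: float, max_overhead: float) -> float:
    """Smallest compute interval T between checkpoints such that the
    fraction of time spent checkpointing, c / (T + c), stays <= max_overhead.

    Rearranging c / (T + c) <= m gives T >= c * (1 - m) / m.
    """
    if not 0 < max_overhead < 1:
        raise ValueError("max_overhead must be strictly between 0 and 1")
    return checkpoint_seconds * (1 - max_overhead) / max_overhead


# Hypothetical checkpoint durations: a slower storage path vs. a 10x faster one.
slow_interval = min_interval_seconds(checkpoint_seconds=60.0, max_overhead=0.05)
fast_interval = min_interval_seconds(checkpoint_seconds=6.0, max_overhead=0.05)

# Average work lost per failure under the uniform-failure assumption.
slow_expected_loss = slow_interval / 2
fast_expected_loss = fast_interval / 2

print(slow_interval, fast_interval)
print(slow_expected_loss, fast_expected_loss)
```

Under this simple model, making each checkpoint ten times faster lets checkpoints be taken ten times as often at the same overhead budget, which shrinks the expected rework after a failure by the same factor.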
You are going GA [general availability] soon I imagine because we met the first time a few months ago. What would happen next? What can we expect from MemVerge in the next version of the product? What is your road map?
We are planning to GA our product in March 2020, so stay tuned for our launch announcement, when we can unveil more details behind our product and share more customer use cases. We have a very busy road map ahead of us. Maybe let me first share the vision we have of where we think the end goal is for our solution.
In our vision, the future data center will be, number one, more software-defined. Whether you’re talking about public cloud or private cloud, the infrastructure will be more [fully] delivered through software, so it is more composable and more flexible to be used by the applications. Not only that, we also believe the future data center will be more memory-centric, meaning with the emergence of storage-class memory as enabler. We believe the memory tier will be much more powerful than the memory tier we know today.
By the combination of DRAM and the new storage-class memory, we will see the scale of the memory tier at 100 times bigger than the memory tier today. Natively, it will have persistence capabilities. With the combination of software such as the ones we deliver, it can potentially make primary storage disappear; meaning the persistence capability for data can be enabled from the memory tiers themselves. We believe this is a fundamental and a big change to the data center, with the disappearance of performance-driven and primary tier of storage. We think our mission is to create the software tier necessary on top of this memory infrastructure, to make that happen.
That’s the long-term vision we have, which is a software-defined, memory-centric data center across the public cloud and the private cloud and the disappearance of storage.
Okay. Let’s say this is a very bold vision but somehow, I agree. We will have more and more capacity deployed on object storage and a very fast tier for interaction with data. Probably [this] will be more and more the case in the future.
Exactly. I think we are speaking on the same frequency here. Then our road map is obviously a journey, a pathway towards this vision, which I think is going to take the next five to ten years to realize. The first version that we’re going to GA next March will be just a first small step towards this vision. Now, even the first step will be quite powerful. As you mentioned, we’re going to have a memory infrastructure that can scale to 768 terabytes of memory.
We also support a second tier for the storage service that’s SSD, and we can support SSDs as low-cost as QLC. That means that with a 128-node cluster, we can have a storage service spanning over 40 petabytes of storage. While this is fast, this also can be very big, so it can really handle the case of big and fast, where you can have petabytes of storage at nanosecond latency available to the applications.
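The cluster-level figures quoted here follow from the per-socket numbers given earlier in the conversation. The short Python sketch below simply redoes that arithmetic; decimal units and an even spread of storage across nodes are assumptions made for illustration.

```python
NODES = 128                 # maximum cluster size mentioned in the interview
SOCKETS_PER_NODE = 2        # two-socket servers, as in the earlier example
PMEM_TB_PER_SOCKET = 3      # 'generation one' persistent memory limit per socket

pmem_tb_per_node = SOCKETS_PER_NODE * PMEM_TB_PER_SOCKET
cluster_pmem_tb = NODES * pmem_tb_per_node

STORAGE_PB = 40             # "over 40 petabytes" for the storage service
storage_tb_per_node = STORAGE_PB * 1000 / NODES  # assumes 1 PB = 1000 TB

print(pmem_tb_per_node, cluster_pmem_tb, storage_tb_per_node)
```

So the 768 terabytes of memory corresponds to 6 terabytes on each of 128 nodes, and a 40 petabyte storage service works out to a little over 300 terabytes per node.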
On the road map, I think we’re going to gradually add in more data services available, on top of this data, on this journey to make primary storage less and less necessary. Also, we’re going to support more hardware, as Intel introduces the future generation of the Optane memory. We believe there will be other memory vendors who will be joining this market and delivering other storage-class memories and we will be supporting them as well. In fact that’s going to increase the value we offer through this virtualization layer. Essentially, we can make it abstract to the applications, whichever vendors are supplying the underlying memory as well.
We’re going to make progress on the software data services and on the hardware support, improve the memory services we deliver, and potentially create better APIs to support a future generation of software that can be optimized for this memory-centric infrastructure.
I see. That’s again, a bold vision, potentially great road map. I will keep an eye on the development of this platform because it sounds really, really cool. At the same time, I would like to share with our listeners a few links so that they can dig a little bit more in your technology and what you do. Maybe you can share with us your social media handles and also the website for MemVerge.
Great. If you are interested to learn more, come to our website www.memverge.com or come to our Twitter with the handle, @memverge. Also LinkedIn: you will find MemVerge there as well. Look forward to sharing with you more about what we do.
Fantastic, Charles. That was a great conversation and that’s all for today. Bye-bye.