Today's leading minds talk Data Storage with Host Enrico Signoretti
Chadd Kenney serves as the Vice President of Product and Solutions at Pure Storage, where he leads product and solutions strategy, messaging, and solutions engineering. Previously Chadd was the VP & Chief Technology Officer for the Americas, driving the long-term vision and technical strategy of Pure Storage. Chadd joined Pure Storage in 2012 as an early founding member of the technical staff, driving the development of the go-to-market strategy and creation of the sales force that disrupted the storage industry. Before Pure Storage, Chadd spent 8 years at EMC in various roles, including Field Chief Technical Officer and Principal Engineer, as well as roles in project management.
Enrico Signoretti: Welcome to a new episode of Voices in Data Storage, brought to you by GigaOm. I'm your host Enrico Signoretti, and today we will talk about NVMe: why it is so important for the storage industry, what it brings to the table, how to take advantage of it, how it is shaping next-generation shared storage architectural designs, and what we can expect from it in the next 12-18 months. To help me with this task, I asked Chadd Kenney, VP of Products and Solutions at Pure Storage, to join me today. The company is a market leader in enterprise storage, and in his role he has access both to end users and to the engineering teams at Pure. Hi Chadd, how are you today?
Chadd Kenney: I'm doing great, Enrico. Thank you so much for having me here today.
So I don't think that anybody needs a presentation about Pure, but maybe just for the few listeners who don't know you or your company, we can start with that.
Awesome. So my name is Chadd Kenney. I head up product and solutions here at PURE Storage, but I've had a bunch of titles here at PURE over the seven years that I've been here: from CTO of Americas to ‘Chief Flash Geek,’ which was one of my favorite titles here. That'll tell you how early I started in the company. Our mission is effectively to change the customer experience overall with innovation and enable innovators to change the world with data, which sounds like kind of a crazy mission but it's a lot of fun to be part of. I'd say at the end of the day, we provide a bunch of different solutions that are all flash storage both on prem and also hybrid. We do a bunch of cloud data management and fun stuff to add business value to customers and reduce costs.
Very great, so let's start then with the basic definition: what is NVMe?
So NVMe stands for Non-Volatile Memory Express. It's a highly optimized interface specification and storage protocol for next-generation solid state devices. It came into play to get rid of the old SCSI protocol, which was developed way back in the 80s and 90s for disk-based architectures, and to optimize the protocol stack to properly support these next-generation all-flash devices, bringing in this new world of all-flash performance and lower latency.
Okay, but why is it so important for the storage industry?
So if you look at the industry as a whole, we started off kind of in a disk based architecture for the most part. Everything was very SCSI based and it was attached, whether it be a networking infrastructure or just internally within the system. This evolution occurred relatively quickly where we transitioned to flash I think faster than what most people expected.
And so what happened at the beginning was disk drives were pulled out and flash was put in and everyone just kind of, for quick replacement, kept with the same SCSI interfaces that they used with disk drives. As things progressed and flash became more predominant in the data center, there was a much bigger focus on figuring out how we optimize the stack to be able to truly leverage the capabilities of these devices, because SCSI became an inherent bottleneck within the system. And so why it's so important to the storage industry is: it's finally kind of taking away the old and bringing in the new of this ‘all flash’ arena.
So most of the advantages come from latency and parallelism that this interface has compared to the old SCSI. Do we have numbers to understand what it really means?
Yeah, I think the comparisons are sometimes tough to discern, so I like to use analogies for a lot of this. But to give you high-level numbers and the reason why this became a big bottleneck: if you look at SCSI, it came in a couple of different flavors, SAS and SATA. SATA had a single queue with 32 commands for that one queue, and SAS had a little bit more: 254 commands for that single queue. Now in a disk world, a single queue was not a bad thing to have to handle. But imagine you are going to a grocery store that has a single [cashier] checking people out. When 40 people are shopping at that grocery store, it's not a big deal, because not everybody checks out at the exact same time. So if the [cashier] can check out 32 people on a single line, like in a normal SATA environment, it's pretty easy for them to handle the workload.
Applications, though, have transitioned dramatically toward much more concurrency and much more parallelism; the density of applications has increased dramatically. So now imagine 400,000 people show up to that same grocery store: that single [cashier] is in trouble and unable to handle the workload, and the only way to cope is to add more and more [cashiers] to check people out. This is fundamentally what happened with disk back in the day, where you were spindle-bound.
So NVMe brought in a completely new queuing methodology that is much more attuned to flash as a whole. Instead of one queue, it went up to 64,000 queues, and within each of those queues, instead of having 32 or 254 commands, you can have up to 64,000 commands. It may sound like a lot, but we're opening up a new protocol that will hopefully last for many, many years, and with modern multicore processing systems this becomes incredibly important to enable this new level of concurrency and parallelism.
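To put those queue depths in perspective, here is a quick back-of-the-envelope comparison using the round numbers cited in the conversation (the NVMe specification's exact maxima differ slightly, but the order of magnitude is the point):

```python
# Maximum outstanding commands per device, using the round numbers
# cited in the discussion: queues x commands-per-queue.

protocols = {
    "SATA": {"queues": 1, "commands_per_queue": 32},
    "SAS":  {"queues": 1, "commands_per_queue": 254},
    "NVMe": {"queues": 64_000, "commands_per_queue": 64_000},
}

for name, p in protocols.items():
    total = p["queues"] * p["commands_per_queue"]
    print(f"{name:5s} {p['queues']:>6} queue(s) x {p['commands_per_queue']:>6} commands = {total:,}")
```

The jump is from a few hundred outstanding commands to roughly four billion, which is why a single multicore host can keep many flash devices busy in parallel.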
That's incredible, and what about latency? So what does the reduction of the stack bring to the table?
Great question. Latency actually has a bigger and broader impact, both internally within the systems and over the network infrastructure. Let's start with the network: you reduce latency quite dramatically by reducing the overhead of the protocol stack, and you'll see even more with RDMA and such, which we can talk about later.
Inside the array, with more queues you can satisfy the workload much faster, so you reduce the latency of accessing the devices themselves; and over the network, the concurrency and parallelism let you drive higher network utilization and more bandwidth, which pushes more data over the network as well.
Did NVMe change the way you are designing your systems? I mean, you had this transition last year, introducing NVMe in the backend, so maybe your engineers had to do something to take full advantage of NVMe.
So when we looked at this, we were trying to figure out what actually gives the best benefits to customers. You really have two different areas that you can focus on from an engineering perspective: the internals of the array and the externals of the array.
Internally within the array, there's actually a whole bunch of things to do. Remember, back in the day you took legacy disk devices connected via SAS or SATA out and put in SSDs instead, and you got some performance boost, but it wasn't much in comparison to what you could get if you accessed the flash in a much more native way.
And so our engineering teams focused a lot of effort on optimizing the internals of the system first. We actually built our own SSDs that were much more attuned to the access patterns of our software, and we took the intelligence that usually sits inside these devices and moved it into software in the controller pair, so we could manage things much more globally. This not only added concurrency but reduced latency: we saw about a 4x increase in performance just by swapping in NVMe and optimizing the software to take on the parallelism and concurrency. And then on the external network side, it's all about integration within the stack and trying out different protocols to bring new benefits to customers, whether that's RDMA or NVMe over Fibre Channel and the like.
So you mentioned that you designed your own SSD module, but in general terms, if we look at the entire industry, do you think that NVMe is also contributing to these new form factors that are surfacing? I mean the rulers and all that other fancy stuff that we can now put in our arrays, or even in our servers.
So I think that you're finding now that everyone's re-looking at the way that we used to define a storage device both in protocol stack and also in form factor. If you were to open up a 2.5 inch SSD, you'd find a truckload of just empty space in it. And I think people started to realize quickly that there's really no point in continuing to keep that same form factor just because it existed in the past. And this became even more important as you started to look at mobile devices. And your laptops of today are so skinny and thin that you couldn't just do a 2.5 inch form factor in those devices and it wastes a lot of space.
So you started to get into a world where not only did you get a more optimized protocol stack, you actually started to embed things much closer to the CPU, right on the PCI bus, which was really where NVMe was focusing a lot of its efforts. New form factors such as M.2 came about, where you could take a much smaller device and put it into a system with direct on-motherboard access. This was great, but on the bigger data center side of the world, people also realized that the 2.5 inch form factor was not the most dense solution.
And so you saw solutions like the Intel ruler, where they went for different form factors so they could get much higher density and populate a system with more and more devices, and, as NAND densities increased, get that density of performance as well. And you're seeing new standards groups out there, like the EDSFF working group, coming out with really innovative solutions to reach higher and higher densities.
Yeah. As you said earlier, all of this parallelism is, at the end of the day, what lets you take advantage of these denser devices. You mentioned the network a couple of times, and this is, I think, a very good moment to change the topic a little bit: not talking about NVMe as a device or a protocol, but about the 'over fabrics' part on top of it.
So if we look not only at the device or the internals of the array but actually we look at the storage network, we have NVMe over fabric. What is the difference between NVMe and NVMe over fabric?
So NVMe was very focused on the interface of the device and the protocol stack to talk to those devices. What NVMe over Fabrics does is extend that over a networking fabric. If you think about it, NVMe was optimizing the 'e' part of NVMe, PCI Express, so it was really focused on a much more efficient mechanism for getting from the device to the CPU.
Now, over a networking infrastructure, there were opportunities to do PCI networking, but it wouldn't go very far and it was pretty limited in its scale. What NVMe over Fabrics does is allow you to scale a PCIe-like connection over a networking infrastructure and get all the benefits of NVMe all the way out to the host. So instead of using that device locally in a server and getting the NVMe benefits there, you can now do that over an expanded network infrastructure. And it comes in a bunch of different flavors.
But effectively you're using NVMe over that networking infrastructure, and there are some benefits to doing one flavor versus the other, but people are really looking to scale out those PCIe connections over a network. And if you look at why: you've got multicore processors and much faster networks than you've ever had in the past. Being able to connect those cores directly out to the storage media enables an even higher level of parallelism and concurrency than you'd get in a single system.
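The scaling argument here can be sketched with a toy model (the class and function names are illustrative, not a real NVMe-oF API): in NVMe over Fabrics, each host CPU core can get its own submission/completion queue pair to the remote subsystem, so parallelism grows with cores and hosts instead of funneling through a single queue.

```python
from dataclasses import dataclass

@dataclass
class QueuePair:
    """One NVMe submission/completion queue pair (illustrative model)."""
    host: str
    core: int
    depth: int  # max outstanding commands on this pair

def connect_fabric(hosts: list[str], cores_per_host: int,
                   depth: int = 128) -> list[QueuePair]:
    """Model NVMe-oF connection setup: one queue pair per host core,
    so no two cores contend for the same queue."""
    return [QueuePair(h, c, depth)
            for h in hosts
            for c in range(cores_per_host)]

# Two hypothetical database hosts, 32 cores each:
pairs = connect_fabric(["db-01", "db-02"], cores_per_host=32)
print(len(pairs), "queue pairs,",
      sum(p.depth for p in pairs), "total outstanding commands")
```

Adding a host or a core adds queue pairs linearly, which is the "scaling the access points" idea discussed above.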
So, at least from the perspective of Pure, what is the difference between a Fibre Channel based network and an NVMe based network?
So first and foremost, fibre channel has been known as the lowest latency enterprise wide mission critical application solution, and it's done so because it has very low overhead in the way that it transports data between those two devices. But again it's been held to the traditional SCSI protocols of the past. What NVMe does is again takes that new queuing methodology and concurrency and then applies it over multiple different types of networking infrastructure, some of which are even more optimized than what fibre channel would be.
As an example, we decided early on to go with RDMA over Converged Ethernet (RoCE), which effectively enables a converged Ethernet adapter to do RDMA transfers between the host and the array. This allowed for optimizations to the point where we see about a 25% or 30% boost in performance in comparison to Fibre Channel, which was the industry-leading low latency solution. Getting 25% or 30% better latency than Fibre Channel was a really big win for us, and it shows you how truly optimized this protocol stack is, not to mention the concurrency that you get within the software stack as well.
And somehow NVMe over Fabrics brings the best of both worlds: very high performance, like DAS or internal storage, with the flexibility of a storage network. So do you see NVMe over Fabrics as more suitable for bare metal scale-out clusters, like Big Data or HPC applications, or more as a new, faster option for building SANs in the traditional sense?
I'd say all of the above, and it's kind of interesting: it's going to be a transition that happens in multiple phases. My theory is that the areas that already use very fast networking infrastructures, especially around Ethernet, which you find in a lot of these bare metal scale-out environments, were going after a very low latency result, and that is why they decided to put devices in the server. So you saw a lot of Fusion-io and Virident cards originally playing in this particular space: a PCIe-based flash device in the server, super low latency, and you could really tune the application to it.
The unfortunate part of that particular type of solution is that it was highly inefficient: applications have different ratios of CPU and capacity, so you would end up with stranded CPU or capacity all over the place. People were willing to pay for it back in the day, but as they continued to consolidate multiple applications, they started looking for better efficiency.
If you look at what our hyperscale friends did out there, they started off in web scale very similar to what people were doing initially with these scale-out applications, and then, longer term, they started to disaggregate, and they disaggregated because of the efficiencies they could gain. So we see phase 1 of this falling into more of the Ethernet realm, meaning that those bare metal servers are typically connected over Ethernet and mostly running Linux, which already has a well-ratified driver for NVMe. And so we chose, at least initially, to go with RDMA over Converged Ethernet, because it gave huge benefits around efficiency, not to mention about a 25% CPU offload from reducing the I/O stack.
So that, I believe, is going to be phase 1: you'll find more of the bare metal servers fall into this arena, and there were some huge benefits we saw with MongoDB, MariaDB, and Cassandra: between 30% and 50% more transactions per second, even compared to locally attached SAS SSDs. But I think phase 2 is going to play out more in the traditional space, where VMware, I think, will lead as soon as they support NVMe over Fibre Channel in particular. Then I think the industry will shift dramatically to start taking on advancements within this protocol stack.
OK, so let's say that the early adopters are the usual web scalers, so large infrastructures, and immediately after that we will see adoption also in enterprises with more traditional protocols, I mean more traditional storage networks, like FC for example.
Do you have any comparison of a Pure Storage array with NVMe over Fabrics against DAS, or is it just a matter of perception? I mean, we can all think about deduplication, compression, the fact that every volume is thin provisioned, and so if you give one to a bare metal server it's better than having a pile of hard drives or SSDs installed in the system, because in the end the dollar per gigabyte may even be better for the shared array than for internal drives. But do you have any information from the field, measured in the field?
Yes. Where we're actually seeing the most advancement in this particular space is around software as a service (SaaS) solutions. We've been doing really well with cloud-based solutions, because those customers are looking for deep innovation and new ways of deploying in a highly efficient manner. As an example: imagine you were deploying on a two rack unit server, and you did this because you wanted many drives, or many PCI slots, within that server. We're seeing people consolidate down to smaller and smaller servers, and even put these on blades instead of rack-mounted servers, which they were never really able to do in the past because they couldn't get those super fast drives directly into the server.
In that arena, we see about 4x or more CPU density per rack that people are able to achieve with disaggregation. We also see about a 25% CPU offload with RDMA, so this gives a lot of CPU back to the application, not to mention that you can further consolidate the amount of density that you have within that rack.
Now, on the capacity front, this is where things get really interesting. We see some applications out there that really only need two to four servers' worth of CPU, but they're running across 50 servers because of capacity. That's the stranded CPU versus capacity problem I mentioned. And we're seeing some numbers that are nuts: from 4 to 20 times consolidation of the capacity that's being provisioned. So a top-of-rack flash solution becomes a massive benefit, because you can consolidate the CPU down a huge amount, 20x in some cases, when you were really scaling nodes by capacity.
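The stranded-CPU arithmetic behind the "two to four servers of CPU, fifty servers of capacity" example is easy to sketch. The per-node figures below are assumptions chosen only to reproduce that shape, not measurements from the discussion:

```python
import math

# Assumed per-node resources (illustrative, not measured values)
node_capacity_tb = 8        # usable flash per node
node_cores = 32             # CPU cores per node

# Assumed workload demand
required_capacity_tb = 400  # dataset size
required_cores = 128        # CPU the workload actually needs

# A scale-out cluster is sized by whichever resource runs out first.
nodes_for_capacity = math.ceil(required_capacity_tb / node_capacity_tb)
nodes_for_cpu = math.ceil(required_cores / node_cores)
nodes_deployed = max(nodes_for_capacity, nodes_for_cpu)

# Every node bought for capacity brings CPU the workload never uses.
stranded_cores = nodes_deployed * node_cores - required_cores

print(f"nodes deployed: {nodes_deployed}, stranded cores: {stranded_cores}")
```

With these numbers the cluster needs 50 nodes for capacity but only 4 for CPU, stranding 1,472 cores. Disaggregating the flash into a shared top-of-rack pool lets you deploy only the CPU-driven node count, which is where consolidation factors in the 4x-20x range come from.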
And then from a performance perspective, we're seeing between 30% and 50% faster transactions per second. This is because, as nodes scale, the application becomes less and less performant, since there are more replica sets and the like. If I distill all of these things down, the performance benefits customers get are really around DAS-like latency and NVMe concurrency, but a lot of it is really around consistency of performance. The business impact of that is huge: being able to spin up new instances, use snapshot technology, or just get consistent performance as a whole.
Yeah, the numbers are impressive. And you didn't mention, for example, power consumption: because you are consolidating so many things that were not possible to consolidate before, the data center footprint shrinks, and so on. So again, the TCO is probably way better than for any single device that you can put into a server.
But back to NVMe over Fabrics: it is supported on FC, InfiniBand, and Ethernet. As you said, Ethernet is probably getting the most attention at the moment, with InfiniBand in HPC environments. Do you think FC will ever take some market share? Or, in the end, [are] people getting used to Ethernet, so that even traditional enterprises will move to that medium for NVMe over Fabrics?
There have been discussions for years and years that Fibre Channel was going to die off, [but] it's been a pretty resilient stack. I'd say that in the traditional sense of enterprise applications, it has a lot of longevity. I do think there are a lot of people (myself included) who believe this convergence of the networking infrastructure adds a lot of simplicity and a lot of efficiency. Going to a converged infrastructure adapter, just as an example, removes the need to run Fibre Channel and Ethernet as separate stacks, and gives you some inherent benefits like RDMA offload as well. But Fibre Channel is a very low friction way to get NVMe over Fabrics, and since it exists today and is supported for both NVMe over Fabrics and traditional SCSI Fibre Channel, I think you're going to see it get good traction in the enterprise, because people will be able to deploy it without needing any new cards or changes to their infrastructure.
But I think, longer term, people are going to realize there is a much more efficient model to deploy. I mentioned a lot of these benefits in the previous section, such as higher densities, but you get to a point where, if you stay in the exact same configuration, you have huge reductions in power, cooling, and rack space, not to mention the reduction in cost and the simplicity of deploying these things easily. So I'd say customers are looking for low friction ways to deploy this now, and longer term they'll look for higher efficiency, like the hyperscale architecture we've been deploying as our top-of-rack flash solution.
And do you think that NVMe over fabrics will have an impact also on scale out design? What's your take on it?
So I'd say “Yes.” The cool part about NVMe over fabrics is you really can scale out the access points and concurrency within the system that you weren't able to do on a traditional SCSI device in the past, and NVMe over fabrics just allows you to scale that even further over the networking infrastructure.
From a scale-out design perspective, I think people will start to look at new ways of deploying architecture. We call this hyperscale architecture a 'top-of-rack' flash solution. If you look at compute within a rack and resiliency across racks, a top-of-rack flash solution becomes incredibly valuable and efficient when you have an efficient protocol to access it. Then you scale this almost like you stamp them out, and infrastructure becomes built as code, automated and orchestrated around this new type of architecture. So I think NVMe and NVMe over Fabrics open up worlds that didn't exist before, letting customers deploy much more efficient and cost effective architectures.
And so... we always mention NVMe over Fabrics, but actually the industry is already working on another transport, still based on NVMe, which is NVMe/TCP. It's easier to implement because it doesn't need RDMA-capable NICs or other special Ethernet cards, and maybe we can compare it to the introduction of iSCSI back in the old days. What's your take on NVMe/TCP?
So first off, TCP is such a robust protocol that it will be here for a long period of time, so I know that NVMe over TCP will have a decent uptick. Now, the downside is that it's still a relatively heavy stack, so compared to the benefits you see in, say, RoCE or NVMe over Fibre Channel, there won't be as big an impact. But it will enable new concurrency, and it will enable new ways of optimizing software to access the storage media itself.
So I think you'll definitely see an uptick in it. You just won't see as big an impact as with the very, very low latency solutions that are focusing their efforts on RDMA over Converged Ethernet and NVMe over Fibre Channel. But we're innovating in that space, and we think customers will take advantage of it.
OK I get it. And Chadd, it was a very, very nice conversation and I think our listeners will enjoy it. So just to wrap up a little, where can we find some more documentation about NVMe? I don't know if PURE has something educational on your website?
Yeah, well, first off, thank you so much for the time today. I really enjoyed the discussion. We have a whole bunch of details around our launch of DirectFlash Fabric, which is our NVMe over Fabrics solution, as well as FlashArray//X, which is our 100% NVMe based solution. You can find those on our website under Products, under FlashArray.
We also have a bunch of blogs focused on how we deploy these across next-generation scale-out applications such as MariaDB, MongoDB, and Cassandra, and a lot of really cool solutions on how we're getting better efficiencies than direct-attached storage, while also scaling these new hyperscale architecture approaches into the enterprise.
And are you on Twitter, if somebody wants to contact you and continue this chat?
Yeah that would be great, I'm @ChaddKenney on Twitter.
Very great. Thank you again and bye bye.