Today's leading minds talk Data Storage with Host Enrico Signoretti
Bill Borsari leads the sales engineering team and is focused on customer success. With more than 20 years in the storage industry, Bill has held senior positions with Nimble Storage, Seamicro and Parascale enabling him to bring a unique and broad perspective of the storage industry to Datera clients. In 2015 Bill organized the 30th-anniversary event for the Amiga Computer at the Computer History Museum in Mountain View, CA. Bill holds a BA from University of Maryland College Park where he double-majored in Decision & Information Systems and Management Science with Statistics.
Enrico Signoretti: Welcome everybody, I'm Enrico Signoretti and this is Voices in Data Storage brought to you by GigaOm. Today we will talk about how data storage architecture has evolved over the years: from monolithic to scale-out design, from hardware to software defined, different media types, different technologies, everything that changed how data is organized, protected and stored in the back end or accessed in the front end.
My guest for this episode is Bill Borsari, Director of System Engineering at Datera, a company that embraces these new architecture and design paradigms. I met him last February at Storage Field Day 18, after almost three years, and we had an interesting conversation about the flexibility and adaptability of [their] new storage platform when compared to other ones. Hi Bill, how are you?
Bill Borsari: Hi Enrico, I'm doing very well today. Thank you.
So thanks for accepting my invite and joining me today. Why don't we start with a brief introduction of yourself and your company?
Sure. Thank you very much. So my name is Bill Borsari, I'm the Director of Systems Engineering for Datera. Me and my team help customers understand and deploy the Datera technology in their data center environments. Datera is first and foremost a storage technology, really designed to help bridge a gap between the traditional infrastructure requirements and the modern infrastructure requirements.
And the reason why we feel this product is necessary is there's a tremendous change underway in terms of how services are delivered in the data center, how companies’ relationships with the cloud [has changed], whether that be an on prem or a third party technology like an Amazon or Google. And for the customers that Datera is helping, also dealing with the legacy technologies, the technology that they've been executing in their infrastructure and running for many years, and helping them bridge that gap between those two aspects.
So we've said differences between traditional and next generation data storage systems and helping organizations to bridge the gap between legacy and next generation architectures. So there is a lot to talk about here, probably too much for a single episode, and I'm not even sure where to start.
But let's start from the basics. So in your opinion, what are the major factors that drove this change? I mean new media types like flash, fast CPUs, networks from the infrastructure point of view, but also from the outside, I mean the cloud and new agile kind of development processes. What was the main factor?
Sure. I think what's happening now is a change in the way applications are designed and delivered is really driving the largest impact to the data center. If we think back to 2008 when flash media was beginning to become available in a number of form factors and the prices were beginning to come down, we saw a number of entrants into the storage market. Companies such as Pure Storage and Nimble, Tintri, the list is very, very lengthy to rattle them off. Those companies were formed around Flash.
Flash was a revolution in storage that they were trying to attack in the marketplace. And I think as we come forward, what's happened is the applications themselves have also been adapted and changed in such a way that it really challenges what it means to have traditional shared storage. Take for example an application like Cassandra, -- very, very popular way of delivering a database style service in a scale out application platform. You don't want to run all of your Cassandra instances off a traditional dual head array because you lose some of the benefit of what Cassandra can offer in terms of distribution across nodes, racks, really changing the failure domain. It's done that because it has its own built in data services.
So that's just one example of a new style application that's really challenging traditional infrastructure, whether it's Flash or disk or a Skylake or Cascade Lake CPU, -- those are all great technologies, but at the end of the day, I think the big paradigm right now that's driving the need for a rethink of the storage environment is the application side.
So application design changed a little bit how we consume storage, or what we expect from storage. But on the other hand, we had these conservative storage approaches. I mean systems based on tightly coupled controllers, meaning they look like HA clusters, from 2 to 8 nodes: the old monolithic arrays or the two-controller arrays that we have had for decades now.
And that was kind of perfect: it was very consistent, very predictable and also easy to use. Well, now these large systems are designed around the ‘share nothing, scale out’ cluster layout. I don't know. Is this compromise acceptable? Aren't we losing something there by having this fancy new scale-out design?
No, that's a great question around what changes... I think the comment around the availability of a traditional, tightly coupled array design is very true, but that availability came at the price of agility, and when you start thinking about the amount of resources available to perform the job... When you have two controller heads and one of them fails, you're still running but you've lost 50% of your resources. On the scale out side, with a technology like Datera's, say I have a 10 node cluster, so I'm using 10 commodity servers. Each one has its own media and they're operating together. Then if one of those nodes fails, I've only lost a tenth. I have nine nodes that are continuing to operate in the environment so, fundamentally, the larger the systems scale out, the [more] opportunity is there to make them more resilient.
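The arithmetic behind that comparison can be written down in a few lines. This is only a sketch, assuming every controller or node contributes an equal share of the system's resources; the function name `surviving_fraction` is made up for illustration.

```python
def surviving_fraction(nodes: int, failed: int = 1) -> float:
    """Fraction of resources still available after `failed` of `nodes` go down.

    Assumes (for illustration only) that every node contributes equally.
    """
    if nodes <= 0 or not 0 <= failed <= nodes:
        raise ValueError("need nodes > 0 and 0 <= failed <= nodes")
    return (nodes - failed) / nodes


# Dual-head array versus the 10 node cluster from the conversation.
for n in (2, 10):
    print(f"{n} nodes, 1 failure: {surviving_fraction(n):.0%} of resources remain")
```

With two heads a single failure leaves half the resources; with ten nodes it leaves nine tenths, which is the point being made about larger clusters.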
What's changed, and why this is more possible today than it's been in the past (and we've seen a number of storage technologies over the last 20 years try this), is really the networking. The networks are so fast, the plumbing is easy to get right, and so it's very easy now to build a scale out storage system using modern networking that is able to deliver the predictable performance and the reliability while still changing the way that the dual head architectures work and delivering significant value through the storage.
So you mean that the data path from the host to the media, where data is stored safely at the end, drastically changed from a two-controller design to a multiple-controller design. A single monolithic box was a really simple design. Now we have scale out, but we have new technology that helps to reduce the latency or reduce the complexity somehow, because it's easier than in the past to build high speed networks and so on.
So you are somehow telling me that now, with the microsecond latency we have in the network, it's very easy to build this kind of system, and with commodity hardware, also we got it right. But actually the complexity around this scale out system is still there. I mean that you need a lot of mechanisms that should govern all the data movement in the back end, and for example, metadata handling is something really complex compared to a single system.
Yeah absolutely. I mean one of the things that makes all of this work is very smart software. And you bring up the metadata portion of it, and that's a great way of distinguishing what Datera does versus other scale out technologies in the marketplace. So for example, if you look at the computer science of managing data at scale, you have to ask the system, ‘Where are you storing that relationship between the customer's requested block that you're saving and the physical drive where it's being saved to?’ And as you think about a scale out architecture, the more drives you have, the more nodes you have, the bigger this problem gets.
If you just think about it in 4K terms, a 4TB hard drive multiplied by 24 per server, multiplied by 10 servers is a lot of 4K blocks to track. And what we've seen is a lot of architectures use a consistent hash and other procedural mechanisms to calculate that relationship. Say I'm a consumer, an operating system with a block device: I'm going to write the header of the disk, I'm going to save that to logical block address ‘zero’, and then the back end storage system has to put that somewhere, and it uses a calculation technique to decide where.
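To put rough numbers on that, and to show what a "calculation technique" can look like, here is a minimal sketch. It assumes decimal terabytes and a generic consistent-hash ring with virtual nodes; the node names, the `vnodes` count and the volume name are invented for illustration and do not describe any particular product.

```python
import bisect
import hashlib

BLOCK_SIZE = 4096            # 4K blocks
DRIVE_BYTES = 4 * 10**12     # one 4TB drive, decimal terabytes (assumption)
DRIVES_PER_SERVER = 24
SERVERS = 10


def total_blocks() -> int:
    """Number of 4K blocks across the whole example cluster."""
    return DRIVE_BYTES // BLOCK_SIZE * DRIVES_PER_SERVER * SERVERS


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring: placement is computed, never looked up."""

    def __init__(self, nodes, vnodes=64):
        # Each node appears at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, volume: str, lba: int) -> str:
        """Return the node that owns this (volume, logical block address)."""
        idx = bisect.bisect_right(self._points, _hash(f"{volume}:{lba}"))
        return self._ring[idx % len(self._ring)][1]


print(f"{total_blocks():,} blocks to track")
ring = HashRing([f"node-{i}" for i in range(SERVERS)])
print("LBA 0 of vol-a lands on", ring.node_for("vol-a", 0))
```

Because the owner is derived purely from the hash, every block in the deployment is placed by the same rule, which is the rigidity discussed next.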
The challenge with the calculation technique is that it becomes very rigid. You have to treat all of the blocks for a given deployment, a given calculation, as the same. But Datera architecture treats metadata very differently. And what we've done is we've sort of created a two stage system, where you've got a high level of routing decision: who owns the data, which of the nodes in the infrastructure actually have it, and then the node itself where the data is landing, worries about the metadata on disk. So by separating those two things out, we feel that we've built a much more efficient and a much larger scale capability than some of the other technologies that are out there.
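A two stage arrangement of that kind could be sketched roughly as follows. This is an illustrative toy, not Datera's implementation: the `Router` and `StorageNode` classes, the extent size and the in-memory dictionaries are all assumptions made for the example. The cluster-level table only knows which nodes own an extent; each node keeps its own map from block to local slot.

```python
from dataclasses import dataclass, field

EXTENT_BLOCKS = 1024  # hypothetical routing granularity, in blocks


@dataclass
class StorageNode:
    """Stage two: the node alone decides where a block lives on its media."""
    name: str
    block_map: dict = field(default_factory=dict)   # (volume, lba) -> slot
    slots: list = field(default_factory=list)       # stand-in for local media

    def write(self, volume: str, lba: int, data: bytes) -> None:
        key = (volume, lba)
        if key in self.block_map:
            self.slots[self.block_map[key]] = data   # overwrite in place
        else:
            self.block_map[key] = len(self.slots)    # allocate a new slot
            self.slots.append(data)

    def read(self, volume: str, lba: int) -> bytes:
        return self.slots[self.block_map[(volume, lba)]]


class Router:
    """Stage one: a coarse table of which nodes own which extent."""

    def __init__(self):
        self._owners = {}  # (volume, extent index) -> list of StorageNode

    def assign(self, volume: str, extent: int, nodes) -> None:
        self._owners[(volume, extent)] = list(nodes)

    def owners(self, volume: str, lba: int):
        return self._owners[(volume, lba // EXTENT_BLOCKS)]

    def write(self, volume: str, lba: int, data: bytes) -> None:
        for node in self.owners(volume, lba):        # one copy per owner
            node.write(volume, lba, data)

    def read(self, volume: str, lba: int) -> bytes:
        return self.owners(volume, lba)[0].read(volume, lba)


a, b = StorageNode("node-a"), StorageNode("node-b")
router = Router()
router.assign("vol-1", 0, [a, b])
router.write("vol-1", 0, b"disk header")
print(router.read("vol-1", 0))
```

The routing table grows with the number of extents rather than the number of blocks, and nodes can differ internally because the cluster layer never looks inside them.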
OK. So you somehow overcome the limitation that we usually see on systems like object storage. They have this huge distributed hash table, and that has a lot of limitations in terms of cluster layout modifications, or it becomes more and more complex to query this hash table when the system grows. So by doing it the way you describe, you probably have a certain number of operations and they are always the same, so you can get some consistency from your queries, right?
Yeah. And it goes one step beyond that, and it's really about agility. So because the system, the distributed system, only really cares about how data moves from the client to the storage, and it's the node's local storage job to figure out which block the data is stored on and how it's protected in the local context, it also means we're very agile. It means that as you think about adding storage to that deployment, it really doesn't matter whether the one that you're adding looks like anything you currently have. And that's incredibly interesting for a lot of our customers because it gives them strong cost control: as we know, not all flash media costs the same. A premium NVMe device at a large number of terabytes is significantly more expensive than a SATA consumer-grade flash device.
So the Datera architecture does a great job of managing those different types of media through being different types of servers. Then our metadata handling is very good through policy, at allowing customers to say this application should be on this type of system, and then it becomes a question of what systems you've plugged in together to deliver the storage to the application.
If I may, as an example, one of the folks we're working with is using a competitive product and they're running their enterprise business on a product that only supports two copies of data, which means that it's effectively RAID 1 across servers, and as they look at their requirements and their scale, they're nervous about that. With the Datera architecture, one of our policy settings is the number of replica copies. And when it comes to high performance scale out block storage, replicas are preferred versus other techniques such as erasure coding because of the performance implications.
And so that customer is very excited about the Datera architecture because one of the things we can do is actually have different types of systems participating for the same volume, so they can have two of the replica copies on Flash Media and a third copy on a hybrid media, and we can do that in our architecture because of the decisions we've made from a technology perspective, and the customers are very excited by that. This is an example of the agility that our technology provides.
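A policy like "two replicas on flash, a third on hybrid" can be pictured as a small placement function. The sketch below is purely illustrative: the `Node` type, the media labels and the first-fit selection are assumptions for the example, not the product's actual policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    media: str  # e.g. "flash" or "hybrid" (labels are illustrative)


def place_replicas(policy: dict, nodes: list) -> list:
    """Pick distinct nodes so each media type gets the requested replica count.

    `policy` maps media type -> number of replicas wanted on that media.
    """
    chosen = []
    for media, count in policy.items():
        candidates = [n for n in nodes if n.media == media and n not in chosen]
        if len(candidates) < count:
            raise ValueError(f"not enough {media} nodes for {count} replicas")
        chosen.extend(candidates[:count])  # naive first-fit choice
    return chosen


cluster = [
    Node("flash-1", "flash"),
    Node("flash-2", "flash"),
    Node("hybrid-1", "hybrid"),
    Node("hybrid-2", "hybrid"),
]
placement = place_replicas({"flash": 2, "hybrid": 1}, cluster)
print([n.name for n in placement])
```

A real system would also weigh capacity, load and failure domains when choosing among candidates; first-fit is used here only to keep the idea visible.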
I see, and in the research I'm conducting for GigaOm, I examine system capabilities in three groups: table stakes, critical features and game changing capabilities. Table stakes would include everything you take for granted in an enterprise storage system, like data protection, snapshots, high availability and so on.
I don't think those make any difference to the end users, or at least they don't make a difference anymore. In critical features I put things like analytics, NVMe, NVMe over Fabrics, cloud integration and so on. From your point of view, looking at your customer base, what are the most interesting capabilities for your users, and what do they expect to see from the next generation of data storage systems?
Yeah it's a great question. It's sort of ‘why are we in the room?’ And I agree that as we look at traditional capabilities, that sort of marketing checkboxes, there's not a lot of differentiation available between the different products and there's typically a lot of storage products on the market. But I like your list around analytics and what not. I don't think that there is a single capability, a single most interesting capability that jumps out for users.
We see different organizations struggling with different challenges. One of the things that we do at Datera is we operate on standard protocols. And so as customers evaluate this transition from the legacy architecture to the modern architecture, just a little bit more definition: legacy would be a number of applications that are very familiar to data center architects, like Oracle and many other applications that were fundamentally designed for the infrastructure to provide availability. And the modern applications are designed for the cloud, for Amazon, where your storage is three nines and where the applications provide that availability. That's a key differentiation.
So having a system that can support that type of difference -- who's providing availability -- is very interesting to a lot of these companies because they need something that can on one hand provide full reliability for an Oracle database, hence a three copy deployment, or on the other hand be able to do rack scale deployment so that they're not having a challenge when a rack fails with their Cassandra deployments. So that's one aspect, but it all works together.
So the concept of a scale out storage system built on different types of nodes does require a lot of analytics internally in the system to take the weight off of the system administrators from having to figure the day to day minutia problems out. Having that capability in the system allows us to react to failures, whether it be a rack level failure or a single link failure. How the system reacts to that and manages that traffic is very important.
The cloud integration is a fantastic challenge that the customers have. How do they get that data freedom between their legacy and traditional data centers and their new cloud deployments? And these are all things that we have answers to at Datera. And so I don't know if customers are looking for a single most interesting capability.
I think to your point, there's a shift: the game-changing technologies [are] becoming table stakes is what we're seeing in general, and it's [as] much of a checkbox as we had previously. How do you integrate with the cloud? How does your platform leverage analytics? Can you support the new Flash Media, not only NVMe but also the 3D XPoint from Intel, delivered in the Optane family today and coming onto the motherboard in a DIMM form factor in the future?
Do you think that if you look at the market today and the technology available, there are some advancements that we will see in the next... 12 to 18 months that will be a real game changer, or will it be just an evolution of what we have already?
Yeah that's an interesting question around what's coming in 12 to 18 months. I think the next big shift is going to be 3D XPoint storage on the memory bus. So this is the Optane DIMM. Intel has a new name, forgive me if I don't remember it, but the idea is that you can now have a server with 24TB of stable storage that is somewhere between the fastest NVMe and RAM. One of the early use cases for this technology is for virtualization and offsetting traditional system RAM for the memory of VMs for those workloads that don't require the current DDR4 memory bus speed.
I think this is just the beginning of what that technology means, but I think it's also going to take quite a while for the applications to... and when I say applications, you know what's driving the data center today in a lot of ways [is] the open source applications and it's going to take a while for these open source applications to really respond to this new technology and really think through how to take the most advantage of it.
From a shared storage perspective, I think what ends up happening... If we look at traditional RAM prices in the market today and around databases, the IO pattern of a shared storage device historically had been dominated by reads with some writes, and now we're seeing customers routinely deploy a terabyte of RAM for their database servers. And what that does is it takes most of the reads into the local cache, and it means that your storage system has to deal with writes.
And so we have some customers where their write workload is 80% of what the system is doing. And that's another thing Datera has done: we built an innovative data plane that uses a lock-less coherency protocol as opposed to the traditional style of having to lock the back end in order to do updates. We can do all that at scale using traditional protocols without having to take those locks.
And so that's another major problem that scale out technology has had in the past that Datera's overcome. But the point I'm getting to is that the pattern has shifted, and as we start thinking about what the additional high performance memory or stable memory is going to do on these nodes, I think that's where we're going to have the biggest change in terms of what's happening in the system versus what's happening on the shared storage side.
So you said a lot of interesting things. Maybe now I can be a little bit more provocative and ask you a couple of questions. You mentioned the fact that memory class storage will finally happen and I couldn't agree more in the sense that the signals are all there, to bringing storage closer to the CPU, faster storage, process CPU and use it in a layered or tiered fashion would be a game changer.
But actually, there is another part of this story, which is that new technology is coming. So we saw for example, with virtualization, VMware especially exposed a lot of APIs, a lot of functionalities to the storage systems, and now they can interact very easily. It's very, very seamless to work with storage: the hypervisors know the kind of resources that are available and associated with the VMs, what is needed and so on, and even more so with Kubernetes or container orchestrators.
You don't even think about storage anymore, but it is just a set of APIs and you just ask for the resource, you got it and you use it and then maybe a few minutes later you can destroy it. So there is no human interaction anymore. And somehow everything becomes more transparent but also flat. If I can provide you that kind of resource, who cares about the back end?
Absolutely. And Kubernetes is a great use case for Datera and you hit the nail on the head. The developer consumer does not care about the back end. They do, if it's not delivering what the policy states or the expectation, but the idea that you have an orchestration tool like Kubernetes where it...
Let’s break down Kubernetes versus VMware very briefly. So VMware was a server virtualization technology. The construct is: I have servers and I want to put a bunch of them on the same computer. So VMware's technology is really around that problem: what does it mean to have a server that is not on bare metal, and how do you drive that operational capability? And in the VMware days with the VMFS data stores there used to be the discussion around the IO blender, because I have now taken a bunch of servers and I've stuck them together into the same storage profile. And to your point around changes VMware has made, they have been working on this problem for a long time and they have a number of technologies to address it.
Now let's look at the Kubernetes side. Kubernetes is around managing containers and delivering a controlling platform for the operation of containers. And fundamentally with the container you make resource requests dynamically. Whereas if you think about traditional servers, there's nothing dynamic about a traditional server. It is physically plugged into a network. It has a physical motherboard with components. That's the paradigm that VMware was designed around. When in the Kubernetes space, the organizations want to be able to spin up that application, that set of pods with all of the rules around affinity or anti affinity, all the storage requirements and networking requirements, security requirements all enumerated and defined in that YAML file and they want to be able to do it whether it's on prem or on Amazon. Fantastic, and the technology easily allows this.
So the question becomes ‘What do you do from a storage perspective?’ Well the container world is very much ephemeral at the moment. And so we see that the majority of use cases for containerization are not actually storage bound. But now those organizations that have started to adopt Kubernetes are realizing the value of being able to declaratively deploy an application and to be able to introspect that application. The security ramifications are huge when the security team has the ability to understand all of the component layers of an application, and can trigger an update through the simple termination of the running pod, so that when it gets rescheduled and restarted, it picks up the latest security changes. In a virtualization environment like VMware that is much more difficult and requires a lot more coordination between the server deployment and the security team. Can you even introspect the Apple OS, can you even log in?
But so these are the driving reasons why people want access to Kubernetes. When it comes to the storage side, there's multiple consumers and there's multiple people involved in storage. There's the consumer, the developer who's saying I want 40GB of flash storage and then you've got the operators who are making sure the system runs and then you've got the architects who are the ones designing and deploying.
So as you start looking at those different consumers and the different roles they play, when you look at that dual head architecture, going back to that traditional storage challenge, they can support Kubernetes because as you mentioned there's a plug in technology, a connection between Kubernetes and the storage that's well understood. It's open source, it's easy to figure out and now with CSI it's even easier. That's the Container Storage Interface.
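The developer's side of that request, "I want 40GB of flash storage", is typically expressed as a PersistentVolumeClaim. The sketch below builds one as a plain Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The claim name `app-data` and the storage class name `flash-tier` are hypothetical; the storage class is what would map to a CSI driver and media policy on the operator's side, and the size is written as `40Gi` here.

```python
import json


def pvc_manifest(name: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],          # single-node read/write
            "storageClassName": storage_class,         # hypothetical class name
            "resources": {"requests": {"storage": size}},
        },
    }


manifest = pvc_manifest("app-data", "40Gi", "flash-tier")
print(json.dumps(manifest, indent=2))
```

The developer never names an array, a node or a drive; the claim only states size, access mode and class, and the storage back end is expected to satisfy it.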
So now you have a situation where I can have my developers creating volumes, storage requests based on their policies and their requirements. How does your storage manage that? If you have a traditional dual head product with maybe three tiers of storage and you've got developers constantly creating and destroying, that's not what those systems were designed for, and can they even create things fast enough?
Can they deliver different qualities of service at the storage level? Not only IOps control but also media control. And what do you do when your Kubernetes environment scales out to 50 hosts, and you've got a dual head array, and you've got to go buy more heads and another array and then more storage? And then you have to figure out, ‘Where am I going to place this next one?’ Is that figured out by Kubernetes? Is that the driver? Is that the user?
That's where this scale out technology from Datera comes in, where we have a control plane. That's basically the brains of the operation that not only mitigates any kind of failures and keeps the system running optimally but also allows us to say, ‘Oh, your Kubernetes environment has scaled out to 50 nodes.’ Our analytics can tell you how many nodes your storage environment needs to scale out to, and more importantly, what those nodes need to be in order to be optimal. Everyone would love to deploy NVMe flash at scale and just be done with it, but organizations that are truly operating at scale are very mindful about the cost factor. So all this stuff sort of comes together, and Kubernetes is a great example of ‘modern’, and in a lot of ways modern is self-service, and how you build a storage system to react to that. It's a serious challenge and it's a challenge that Datera solves.
Somehow at the beginning, you mentioned operators, architects, and developers but actually, you didn't mention another role, which in this case, is the one that benefits most from this, which is the guy that pays for storage.
Yeah absolutely. I mean the financial aspect is huge and I think the companies that have embraced software-defined storage in every instance, especially at scale, -- not every company’s at scale, but the ones that embrace it at scale, the financial impact is staggering in what they've been able to do with that dollar.
If your dollar gets you half a gigabyte because of your platform, you know, that's okay. If that dollar gets you a usable gigabyte or two usable gigabytes, and that's across all applications whether data reduction is there or not, that's pretty interesting. So different companies operate at different scales, they have different challenges, but what we've seen is that at companies that were deploying traditional storage technologies, CFOs and leadership have been very happy with the price of moving to commodity hardware with intelligent software for their storage. Absolutely.
Another thing that is always overlooked, I think, is quality of service. I was okay with the lack of quality of service when we had the two controller system or even one of these big systems like the VMax or HDS USPs, because most of the workloads were very similar, and sometimes they were connected to very few machines.
But now, especially with scale out systems, they are more storage infrastructures than storage systems, and they can be pretty large. They can serve many different storage workloads: bare metal, virtualized infrastructure, as well as containers, and these work very differently from each other. And at the end of the day, I think that quality of service is now one of these things that is really necessary. I know on the other side everybody doubts me: with the performance brought by Flash today, nobody really cares. But I'm still there every now and then looking at this aspect of storage, and still struggling to understand, when I talk with both customers and vendors, what the truth is. What's your opinion on this?
Quality of service is a bit of a loaded term. At the highest level all it means is you have a service and what's its quality. A lot of the implementation of that historically had been through media selection and then also through slowing things down on an IOps or throughput basis: to say that I'm going to provide this application ten thousand IOps and I'm not going to put it on disks. And I agree around the traditional systems and the challenge there with the dual head architecture... and it goes back to limited resources. As you start thinking through the problem at scale and asking “OK how much throughput can I get to the array and how much does it cost me?” You can get arrays that can drive a significant amount of throughput, but if you're driving that throughput, you'll lose the IOps because there's only so many network connections that a dual head array can support.
And even if you look at some of the technologies that could go beyond two heads and build a tightly coupled 8 node storage system (this is like your VMaxs and whatnot, the director class), they could support significantly more bandwidth, but then the problem is that all that bandwidth is coming back to one place in the data center. So if you have an infinite amount of bandwidth and an infinite amount of storage performance, then you start looking at how do I meter out individual volumes' capability so that it doesn't overrun the system resources?
That's really the value of that IOPS-level QoS: to sort of bring it all together. In that server virtualization model where everything is static and controlled, it's good to have those limits, but they become less necessary because you can sort of architect around that. In a self-service model like Kubernetes, where the number of volumes that are going to exist is almost unknown because you don't know how your developers are going to consume it, you really need to start putting on rails and making sure that you have quota systems, and that includes performance, so that you do allocations and you manage through that allocation. Whether that's a capacity allocation or a volume count or a QoS in terms of number of IOps, you have to have that. And then you start thinking about a storage system that's doing both of these things at the same time: it's supporting a traditional VMware deployment and it's also supporting Kubernetes. Then it becomes even more important that you have good quality of service, and that you can say this tenant is fenced off from this other tenant. And there are different technologies that can be deployed to get there.
So I think QoS is significantly more important in a cloud style deployment and a self-service style model where there's no planning cycle for the developer for the deployment of an application.
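One common way to enforce an IOps ceiling per volume or per tenant is a token bucket. The sketch below is a generic illustration of that technique, not a description of how any particular array implements QoS; the class name, the limits and the explicit `now` timestamps (used so the behaviour is deterministic) are all assumptions for the example.

```python
class TokenBucket:
    """Generic token-bucket limiter: `iops_limit` tokens refill per second."""

    def __init__(self, iops_limit: float, burst: float, now: float = 0.0):
        self.rate = float(iops_limit)    # sustained IOs per second
        self.capacity = float(burst)     # maximum burst size in IOs
        self.tokens = float(burst)       # start with a full bucket
        self.last = now                  # time of the last refill

    def allow(self, now: float, ios: int = 1) -> bool:
        """Return True if `ios` requests may proceed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= ios:
            self.tokens -= ios
            return True
        return False                     # caller should queue or delay the IO


# One bucket per tenant keeps a noisy tenant from starving a quiet one.
limits = {"tenant-a": TokenBucket(10, burst=5), "tenant-b": TokenBucket(1000, burst=100)}
burst = [limits["tenant-a"].allow(now=0.0) for _ in range(6)]
print(burst)
```

Tenant A's sixth request in the same instant is refused because its burst allowance of five is spent, while tenant B's bucket is untouched; half a second later tenant A's bucket has refilled.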
You made the comment earlier about the people that pay for it. Well, one of the ways that traditional server architecture is managed is you open a ticket, the ticket outlines what you need to have, and then someone goes and buys a server, they rack it, they install stuff and then they start your VM. We've heard stories going back to the bare metal world of six months before you got your resource; in the virtualization world, it's anywhere between days and months depending on the organization. And in this self-service Kubernetes cloud world you're talking about, you mentioned minutes; we're looking at seconds, and the number of seconds it takes to deliver a volume is important. And so it's very difficult to sort of build a single storage platform that can actually do all of those things at the same time. And that's what we think we have at Datera: a technology that can actually do all of this at the same time.
Yes I see. I see your point and I agree that somehow it all depends on the use cases and how the storage is really consumed. So I think that now it's time to wrap up, but it was a great conversation. Thank you again for joining me today. And maybe you want to give a few links about Datera on the web, a Twitter account for you and your company, so if our audience wants to continue the conversation online they can follow you and ask you questions.
Sure. The official twitter is @DateraInc. Our CEO is Guy Churchwood and he is @GuyChurchwood on Twitter. You can also follow us on LinkedIn. Our web page Datera.io has access to all our social media. We have videos up on YouTube as well describing the architecture. As you mentioned, we've done a number of Storage Tech field days where we've [done a] deep dive into various aspects and answered questions. So there's a lot of material online. People are welcome to reach out to me via LinkedIn as well. And you know we're happy to answer any questions.
Very good. Thank you again and bye bye.
Thank you so much for having us. Really appreciate it.