Voices in Data Storage – Episode 14: A Conversation with Avinash Lakshman of Hedvig


In this episode Enrico Signoretti speaks with Avinash Lakshman, founder of Hedvig, about the relationship between data storage and data management.


Avinash is the CEO and Founder of Hedvig. He founded Hedvig in 2012 after co-inventing Dynamo while at Amazon (2004-2007) and Cassandra at Facebook (2007-2011).


Enrico Signoretti: Welcome everybody. I'm your host Enrico Signoretti and this is Voices in Data Storage brought to you by GigaOm. In this episode we will talk again about how data storage has dramatically evolved in the last few years, and how data management is becoming more and more important for organizations of all sizes.

Today I'll be joined by Avinash Lakshman, CEO and co-founder of Hedvig, a Silicon Valley startup with a software-defined, scale-out storage solution. Before founding Hedvig, Avinash Lakshman was at Amazon and Facebook where he designed large scale systems like Cassandra and DynamoDB. His experience will help us to dig a little bit deeper into the relationship between data storage and data management. Hi Avinash, how are you?

Avinash Lakshman: Good and thanks a lot for taking the time to talk to me and I really appreciate the opportunity. Although regarding the introduction, I had one small correction: although I am flattered when you associate my name with DynamoDB, I had nothing to do with DynamoDB. My co-creation was Dynamo, which preceded DynamoDB.

Okay, very well then. In a recent paper that I wrote for GigaOm, I separated system functionality into three categories: table stakes, critical capabilities and near-term game changing technology. Table stakes is the set of functionality that people take for granted, including data protection, snapshots, remote replication and so on, while the critical capabilities are those that really make an impact in terms of TCO, flexibility etc., like analytics or integration with the cloud for example. What are the features of storage systems that really make a difference in your opinion, Avinash?

That's actually a fantastic question. In fact when you talk about table stakes, you talk about snapshots and remote replication, but I think in this day and age replication has become a lot more sophisticated than what people were used to seeing with traditional storage systems. For instance, if you were to take remote replication, it's typically driven by a separate entity that sits on top of the traditional array and does the replication, and it's very cumbersome to set up. It's point to point. It's for the most part unidirectional. If there is any configuration issue, it becomes a nightmare to track down and to debug, and it also adds a lot of clutter in the data center.

More modern systems have taken a very different look at how replication could be done. In fact given that we are getting into an era of you know where we are in a global economy but the data governance, data sovereignty laws are all local. So you have entered an era where applications mandate the geographical boundaries within which the data ought to reside.

So what I believe is required from a replication standpoint is the ability for one to be able to declaratively pose the question or expect from an underlying infrastructure that for my application I want its data to be replicated across regions A, B, & C and it's up to the infrastructure to actually live up to that. And that kind of paradigm never existed before and it is becoming more and more a necessity to deliver these kind of capabilities.

That's about remote replication. Regarding snapshots too, I think now for protection against things like ransomware, which is becoming rampant and which every enterprise is looking at, having a very, very sophisticated snapshotting capability becomes key. So although snapshotting, like you said, is table stakes, the requirements have become a lot more sophisticated than what people are used to seeing. That's my take on it.

When you look at the other side which is what you call critical capabilities, you're very right in terms of TCO and flexibility. The key is not just CapEx but the key is to drive and/or increase operational efficiency. And that is what's going to give modern enterprises a maximum bang for the buck. And we can talk more about that as we continue.

Yes so your take on remote replications and snapshots is really interesting. Are you talking about a sort of policy engine, so user defined policies more than rules to make it real? So not static rules but actually more declarative?

Exactly. That's exactly what I'm talking about. In fact if you look at computer science in general, any time there has been a major primitive that made its way into idea or into systems it first starts out by being a programmatic function and then over time you'll make it so simple that you make it declarative in nature.

The biggest example of that is transactions. When transactions were first introduced, system developers were forced to program transactions for different apps. But over time the way it evolved is transactions got buried into the runtime, and application developers just had to state it in their code: they had to just annotate a piece of code to say this should be part of a new transaction or part of an existing transaction, and the runtime would respect that.

I think the same analogy has to be applied to remote replication, especially given the data governance laws that exist today. So for instance one should be able to do things like OK I have this application X that's going to run on this data infrastructure Y, but I want my data to be replicated across regions A, B & C. The underlying infrastructure must be able to adhere to those policies. You can have a totally different application that has a totally different replication set, again mandated by some kind of a governance law. You could say “I have this application that should be replicated across regions X Y Z,” and again, the infrastructure should be able to adhere to those policies. So it's more policy driven and it is extremely dynamic in nature and it is done at an application level granularity.
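As an illustration of the declarative, per-application policy described here, the following is a minimal sketch and not Hedvig's actual API. The names `ReplicationPolicy`, `missing_regions` and `disallowed_regions` are hypothetical. The application only states the regions its data must live in, and the infrastructure side compares that statement with where replicas currently exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Declarative statement of where an application's data must live (hypothetical type)."""
    app: str
    regions: frozenset


def missing_regions(policy, current_replicas):
    """Regions the policy requires that do not hold a replica yet."""
    return sorted(policy.regions - set(current_replicas))


def disallowed_regions(policy, current_replicas):
    """Regions that hold a replica but fall outside the policy's boundary."""
    return sorted(set(current_replicas) - policy.regions)


# Application X declares regions A, B and C; it says nothing about how to copy data.
policy_x = ReplicationPolicy(app="app-x", regions=frozenset({"A", "B", "C"}))

print(missing_regions(policy_x, ["A"]))          # ['B', 'C']
print(disallowed_regions(policy_x, ["A", "D"]))  # ['D']
```

A second application would simply carry a different `ReplicationPolicy`, for example regions X, Y and Z, and the same two checks would apply without any application-side replication code.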

You talk about region, that region usually is associated with cloud providers. Do you also...

No, I use that term very loosely. So the way you want to map what typical enterprises... When you have different physical sites, they would say you need multi site replication capability. When you go to a cloud provider, you should think of each region as a different site, so it's just a different term. But what is needed is multi site replication capabilities naturally provided through the infrastructure.

And if we think about data management more in general, again not just data services as we always thought about them, like replication or snapshots, it becomes more data management. Okay. So you think about data: where it should be stored, and other things like retention probably come into play here, so everything that defines the policies of how to manage data, its placement and so on.

But at the same time, data management has a different meaning for different people. At least in this case we talked about snapshots, but automated tiering, for example, is another data management feature.

So you have the ability to decide where to put your data if in a cheaper storage that is less accessible, meaning with more latency for example, but it doesn't have anything to do with compliance or something like that. In your broad experience both with storage and databases, what does data management mean for you? Or better: what does it mean for you in the context of data storage?

That's a very good question. And you're very right, it's a very broadly and very loosely used term, to which some of the things that you've talked about are applicable. For instance, if you want to move the data for performance reasons, perhaps between hard disk drives or spinning media and some form of SSDs. I mean nowadays SSD is also a term that is very loosely used, I mean you have PCIe devices, NVMe, what have you.

But being able to move the data unbeknownst to the application between spinning media and these kind of more performant media could be perceived as data management, and it's the job of the infrastructure to do so. In addition to that, when you are looking at systems like perhaps the one that is from Hedvig, anything that is implicitly ‘scale out’ and is designed to run across you know hundreds or even thousands of nodes, you also want to take care of things like when one creates a volume out of the system and says I want to replicate it three ways—having the capability to make sure that the system is always going to try to honor to have three copies of the data in the system. This also includes in the event of hardware failures or disk failures or network failures or complete site outages etc., could also be deemed data management.

When you have a largely distributed cluster, when one wants more horsepower or more capacity, you typically roll in a rack of servers with some software on it to do the management, and the ability to be able to rebalance the data in order to take advantage of this newer real estate that is now available could also be deemed data management. So this is just some of the things that we believe should come under the umbrella of data management.

Just to take Hedvig as an example, you offer several protocols to access data on your system, both object and block. Different applications of course are taking advantage of this capability, but actually they manage data differently from each other.

You know especially objects, they have this rich metadata that is associated to each single entity that you store in the file, so then it enables searching and also augmentation of the data itself and so on. All of this pans out with the organization of the company. I mean do they take advantage of these functionalities right now? Is it still on the application side of things or is that storage becoming more core for these kind of things?

No, that's a fantastic question. And if you remember when I spoke about operational efficiency, I think this was a very key aspect of that too. If you look at the larger Internet companies: I'm talking about companies like Amazon, Google, Facebook etc. Even the larger cloud providers like Amazon, Azure, how do you think they have solved the data deluge problem? They do so by standardizing on an infrastructure. If you go into these companies, 90% of the applications that developers internally build runs right off the bat on an infrastructure as a service (IaaS) that's been built out for them.

Now I believe that is the key to achieving operational efficiency. The way you want to deliver that is you want to have one single platform on which one can get complete protocol consolidation. You need to be able to create "virtual disks" or "volumes," which applications should be able to consume either as a block device, as an NFS or SMB share, or as an S3 bucket.

Now one of the key tenets to achieving that is to have a very rich metadata subsystem as part of your platform. Now that's where we believe most of the hard work comes into play; and different protocols have different metadata needs and knowing how to lay that out into a singular metadata subsystem is where all the IP goes into and delivering this enables the following for the enterprise. If you look at how they have typically been looking at storage or data management is they talk to a set of vendors for their SAN needs, they talk to a totally different vendor for their NAS needs. They're all trying to figure out what their object storage initiatives need to look like, as in this day and age, there's been tremendous disruption happening even in the secondary storage market where traditional backup targets are being thrown out and newer ones are coming in.

Now you don't want to go from one set of silos to a totally different set of newer silos. The way you can drive operational efficiency is if one could provide one system that can lend itself to all these different workloads, either by changing hardware SKU or policies in the system. That would be the best place to be in. Now since you've said ‘Hedvig,’ we believe Hedvig is the only platform that exists today that can deliver that capability right out of the box. And it's not just that, we have customers who start in one area and automatically see the capabilities and push us into other areas, which traditionally wasn't the case. So standardizing on an infrastructure is a key to achieving operational efficiency.

Yeah, you're talking about a storage infrastructure instead of a storage system. This is a concept that I really love, in the sense that I usually associate it with a large infrastructure that can serve multiple workloads at the same time as well as multiple product lines. And it's just easier to manage.

And it is very good at scale but actually I have a few concerns for this kind of even if it's a scale out, and theoretically you can start small. The economics for smaller organizations maybe with a scaleout system that can do a little bit of everything, could be a problem, and at the same time, quality of service comes into play, which is very important.

I couldn't agree more. In fact when we... see that's why you never want to go, if you've done any large enterprise or even medium to large enterprises you never go in and tell them [to] rip out everything, throw it away and use this new one. That's the wrong approach. You start with a particular kind of a use case, which any vendor knows can be a perfect fit for them. It could be a big bang for the buck for them and for the customer.

Once you're able to go in there, showcase the capabilities, in our experience what we have seen is they automatically realize that this has a lot more capabilities and could be used in a lot more areas. They've also always been met with this doubt: “oh you know what you're saying is too good to be true,” but there is only one way to address that problem. Try it, try it for one use case and then you will see the value and we have done that with some really, really large enterprises that are now putting petabytes of data on us. And then once they see the value in one arena, they'll automatically start pulling us into other areas.

For example, for a large aviation company (I can't mention the company), we now power the infrastructure for their entire internal cloud. They have their applications virtualized on perhaps every hypervisor that's out there in the market, including containers. And now they've also started using us for object storage and for big data lakes. It need not be the same physical cluster. You could have multiple clusters for multiple needs, but the key is the way you operate and manage these environments is the same because you need to train your folks on just one system. And that's where the beauty comes in.

Sorry if I play devil's advocate here, but there are a couple more questions that come up. One is, okay, so you have this incredible infrastructure that can do all the protocols, but actually if I think about object storage and containers, it's like night and day. With containers you need data for maybe sometimes seconds. You need a very fast way to generate new volumes for these containers, and then you have to destroy them almost instantly, a few seconds or minutes later, while object storage stores data forever. So they are quite incompatible.

If I think about the workload, if I think about the infrastructure underneath, how do you solve these kind of problems?

That's a very good question. Let me answer that. It's going to be a long winded answer but you will get what you were looking for. When you're looking for something that is performant you will need, and I'm gonna keep it high level so that we don't get caught up in the weeds. Now if you remember I'd mentioned data and the metadata associated with it, and all this put together is what delivers to you a system, right? If you're looking for a performant environment, let's say you were running VMs or containers on top of a system X. Now for that you would want to have a cluster set up where perhaps you have SSDs in place, where auto tiering of data happens, where data moves from perhaps the hard disk layer into SSDs and out of it, unbeknownst to the application.

Maybe you want your metadata to reside only on SSDs and you will probably use replication as a means of data protection, as opposed to erasure coding and things like that. If you look at object storage, it's typically a ‘write once, read many’ kind of an environment. Typically you need a reasonable ingress, not that many reads perhaps. And for that what you could do is you could keep your metadata also on hard disk drives, and you probably don't need any kind of auto tiering of data.

Instead of using replication you could probably just use erasure coding, because people typically put data there for durability rather than for performance, so you could do 4-2 erasure coding or 8-2, 8-3, what have you. There again the reads are going to become reasonably expensive because anytime you use erasure coding in a distributed system, you will always have to read from a number of different nodes to put the data back together. That's where CDNs etc. come to help out a lot if it's in a large setting, or perhaps you need some kind of caching if you want some kind of a reasonable read response rate. But those are all, that's why I mentioned it, a change of SKU in the hardware and a change of policy. These are the two things that you typically need to look at while you're catering your infrastructure to the needs of the application. Does that make sense?
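To make the replication versus erasure coding trade-off concrete, here is a small arithmetic sketch. It assumes that "4-2" means 4 data fragments plus 2 parity fragments, that each fragment or copy sits on a different node, and that rebuilding an object needs as many fragments as there are data fragments. These are common conventions for erasure codes, not details confirmed in the conversation, and the function names are hypothetical.

```python
def replication_profile(copies):
    """Cost profile of keeping `copies` full replicas of the data."""
    return {
        "raw_per_usable": float(copies),   # raw bytes stored per usable byte
        "failures_tolerated": copies - 1,  # copies that can be lost without data loss
        "min_nodes_per_read": 1,           # any single replica can serve a read
    }


def erasure_profile(data_fragments, parity_fragments):
    """Cost profile of a data+parity erasure code (assumed: one fragment per node)."""
    total = data_fragments + parity_fragments
    return {
        "raw_per_usable": total / data_fragments,
        "failures_tolerated": parity_fragments,
        "min_nodes_per_read": data_fragments,  # fragments needed to reassemble the data
    }


print(replication_profile(3))  # 3.0x raw capacity, 1 node per read
print(erasure_profile(4, 2))   # 1.5x raw capacity, 4 nodes per read
print(erasure_profile(8, 3))   # 1.375x raw capacity, 8 nodes per read
```

Under these assumptions, three-way replication and 4+2 erasure coding both survive two lost nodes, but erasure coding uses half the raw capacity and touches four nodes on every read, which matches the point about reads becoming more expensive.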

Totally, so you are talking about building storage profile depending on the application and depending on the use cases and then...

Exactly, that's why you need to be able to have this capability when you create these volumes. You want to have some kind of a notion of what the app is going to look like, and how it is going to consume you. It's kind of application of that infrastructure that becomes very key in order to deliver these kind of capabilities.

Well that's interesting because somehow to manage all this complexity you need to do other things. I mean one is the analytics because without analytics you can't get a grasp of what is actually happening in the backend or how the system is performing, if there is something wrong or not.

And on the other side you need automation, I mean APIs, because some of these systems really need to be automated. I think about Kubernetes for example, but there are a lot of other things that you need to automate in a large infrastructure. You know, the ratio between the number of administrators and petabytes, you can keep it down with these kind of things.

Absolutely. I mean the moment you say “software defined,” you need to have complete programmable capabilities for both the data and control plane. And again it's that programmability on the control plane that helps systems like Hedvig to integrate into vSphere, to integrate into OpenStack, to integrate into Kubernetes, to integrate into Mesosphere, what have you. And in fact the entire product is deployable via Ansible.

So if anyone wants to take what we have and tie it into a broader application framework and be able to automate deployments they can drive the whole thing through Ansible, and since every API that we consume internally in the system is exposable also via REST based APIs especially for control plane, you can build your own orchestration around the entire product and without that, it's a non-starter because if you think about the cloud, you go in there and you click a few buttons and everything is configured and provisioned for you. It's the same kind of semantics that these kind of systems have to bring to the table. And that's exactly what we strive to do.

Do you think that integration between systems like yours for example and the cloud is important? I mean, I meet a lot of companies that are already in the hybrid stage of their cloud life, and they are evolving to some sort of multi cloud, meaning that the hybrid cloud remains one of the subset, one of the cases that they are experiencing, but actually it's becoming not only on premises to a single cloud but on premises and multiple clouds.

Absolutely. I think that the lines between on prem and any cloud provider are blurred at this point. In fact I think even the lines between hybrid and multi cloud should be blurred, and we become more relevant in cloud environments, because if you look at cloud environments in general, you can pick any favorite. If you look at the fundamental storage primitives that they provide you, none of them provide any fault tolerance across availability zones or across regions, and forget across clouds, that doesn't even exist, because there is no incentive for them to provide that capability.

One could deploy a Hedvig system that spans multiple regions of AWS and multiple regions of Azure and treat it like a single pane of glass that is available to the consumer and be able to replicate data across regions not only within AWS but also across AWS and Azure. In fact the new... there is a big push towards what is called and there is also a new phrase that has been coined in the industry now. It's called cloud adjacency [and] is the new ‘on premise.’

So when you have folks in co-location facilities provided by say an Equinix, you are a millisecond or two away from any of the cloud providers. Now all of a sudden deploying a fabric that spans on prem and any of these cloud providers with a system like ours becomes a reality. And you could use it like as if you were using an on prem environment without even being hit by the laws of physics, when it comes to latency from going from on prem to any of the cloud providers. So we make this even more relevant in those environments.

Yes, somehow with this kind of approach you avoid creating cloud silos, meaning small data repositories that are accessed by a single or a few applications in a single cloud, which is a scenario that we saw in the past in traditional IT.

I think you hit the nail on the head. I mean that's exactly what we [do], we liberate applications and make them more mobile to be able to move the app alone from one region to another or from one cloud provider to another, without being forced to pull the data along with the app. If you have set up the replication policies in the appropriate way using a system like Hedvig, application mobility becomes just point and click like semantics. It will become very, very easy for one to realize that.

Yes, and again many of these companies start with simple use cases like disaster recovery. Not that disaster recovery is simple, but actually it's simple in terms of replicating data to another location and having it ready.

Now with all the tools that we have in the upper layers, transporting an application from one cloud to another is becoming easier and easier. Data has gravity, and if you don't think about it that heavily in the deployment stage of your infrastructure, that becomes a huge issue later.

Absolutely. I couldn't agree more.

Avinash, this was a great conversation I think, and I hope that we will be able to continue this online. So why don't you provide us Twitter accounts for your company and maybe yours if you have one, just that our audience can ask questions if they want to know more.

So please send the questions over to Hedvig Inc, @hedviginc, and we'll be more than happy to take any questions and see how we can help you out.

Just to wrap up the episode, what is the website for Hedvig?

It's www.Hedvig.io.

Oh that's great. Avinash, that was a great chat again. Thank you for your time today and I hope that we will keep the conversation going online. Thank you very much.

Thanks a lot, Enrico.
