Today's leading minds talk Data Storage with Host Enrico Signoretti
Before co-founding Flexify.IO as CEO, Sergey Kandaurov started as a software developer and project manager in the telecom industry and grew all the way up to a director-level position in product management, responsible for the B2B segment, at an international backup and data storage company. Sergey has a unique combination of hands-on software development and management skills, and experience building multiple software products from a vision to business success.
Enrico Signoretti: Welcome to a new episode of Voices in Data Storage, brought to you by GigaOm. I'm your host, Enrico Signoretti, and today we'll talk about multi-cloud storage, cloud migrations, and how to avoid cloud silos. To help me with this task, I asked Sergey Kandaurov, co-founder and CEO of Flexify.IO, to join me for this episode. His company built what I define in Multi-Cloud Data Controllers, one of my recent reports, as a tool that allows end users to use a single front-end for multiple object stores, with data placement ruled by a set of user-defined policies. This creates a level of abstraction that brings more freedom of choice and flexibility while avoiding potential lock-in. Hi, Sergey, welcome to the show.
Sergey Kandaurov: Hi, Enrico, and thank you for inviting me.
Let's start with a brief history of yourself and the company.
Sure. I started as a software engineer a number of years ago and then grew all the way up to director of product management for a large international company in the storage space. At some point, my co-founders and I decided to create a company that would help enterprises address modern cloud storage, specifically avoiding lock-in and the challenges of migrating data between clouds; basically, multi-cloud storage challenges.
We started with a prototype that we finished in 2016, demonstrated it at One Convention in Hong Kong to a number of customers, got positive feedback, and understood that our technological ideas were feasible. It took us about one and a half years to prepare the production version of the software, of the service, and we launched the product at the beginning of 2018 and have been improving it since then.
Okay, very good. Let me explain a little bit about multi-cloud data controllers to our listeners. They are important, I think, because there are at least a couple of trends that we have to take into account today. On one side, we are moving from a first phase of cloud adoption with a single-service-provider strategy to hybrid and multi-cloud infrastructures. It's not always a strategy. I mean, sometimes there are contingencies, or just solutions that we have chosen that are provided on different platforms. Again, this is something that is happening.
At the same time, we see this competition between clouds, cloud wars on services and prices. I mean, not just a lower dollar per gigabyte or a lower dollar per CPU cycle but, more generally, the fact that these service providers are building huge sets of services. Sometimes one is better than the other. I found a lot of organizations, for example, using one cloud provider for the majority of the services that they need but starting to use another one because there is a specific workload or application that works much better in the other cloud, and they can save money; they can get results quicker.
All this leads to a series of issues and challenges, and you want your data close to the application. Dollar per gigabyte is now very low, so storing data in the cloud doesn't cost that much. Actually, egress fees can kill you. Latency can kill you as well, because if you have to access data that is located remotely, you pay for a lot of CPU just to access the data.
At the same time, if we look at reality, many organizations, probably the vast majority of organizations, are still not multi-cloud or not even hybrid. They are quickly becoming hybrid, with a mixed infrastructure between on-premises and the cloud, but multi-cloud is also becoming very relevant for these organizations. I mean, it is now in every conversation, and they are thinking about it.
Sergey, what do you think are the challenges for this type of organization? Do you agree with me on the introduction?
Yeah, absolutely, Enrico, I agree with you. Multi-cloud deployments have a number of challenges. First of all, we are talking about enormous amounts of data. We are talking about hundreds of terabytes, petabytes, or sometimes hundreds of petabytes of data that need to be managed, and it needs to be managed in multi-cloud environments in a unified way. Moving such amounts of data between clouds is not a simple task. Even if you are talking about cloud-native speeds, it may take quite a while. Don't forget about [05:46] and all the bad stuff that can happen in the process. It also needs to be properly managed.
Another challenge of multi-cloud storage is differences in APIs. While the S3 API is the de facto standard, not all cloud providers offer the S3 API out of the box. For example, Azure Blob Storage only has the Azure Blob Storage API, and Alibaba's object storage service, Alibaba OSS, only has its own API. This also complicates data and application migration between clouds, and situations where you want to distribute a specific application across multiple clouds.
The biggest challenge in multi-cloud storage and data migration is the downtime that migration often implies. The thing is that when you start to move or copy data from source to destination, you usually are not able to stop changes to the source. This way, after the data has already been migrated, it might have been changed in the source, and the changes need to be synchronized later on. A typical solution to this problem is to run an incremental synchronization, an incremental migration, after the initial bulk, long migration finishes, but Flexify.IO introduces a new, innovative approach to this problem by allowing you to combine multiple cloud storages into a single virtual unified namespace that appears to the application as a single storage that is just distributed across a number of clouds.
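The typical incremental approach Sergey contrasts with here, a bulk copy followed by passes that transfer only what changed at the source, can be sketched by diffing the two listings by ETag. This is a minimal illustration; the function name and data shapes are assumptions for the example, not Flexify.IO's implementation:

```python
# Hypothetical sketch of incremental synchronization between two object
# stores. Each listing maps object key -> ETag (a content fingerprint).

def plan_incremental_sync(source: dict, destination: dict) -> dict:
    """Return the keys to copy and to delete so destination matches source."""
    to_copy = [k for k, etag in source.items()
               if destination.get(k) != etag]            # new or changed objects
    to_delete = [k for k in destination if k not in source]  # removed at source
    return {"copy": to_copy, "delete": to_delete}

# After the initial bulk migration, only the delta is transferred:
source = {"a.txt": "etag1", "b.txt": "etag2-changed", "c.txt": "etag3"}
destination = {"a.txt": "etag1", "b.txt": "etag2", "old.txt": "etag0"}
plan = plan_incremental_sync(source, destination)
```

The catch, as discussed above, is that each pass takes time, during which the source can change again, which is why the unified-namespace approach avoids a hard cutover.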
Just to recap a little bit: the challenges are, as always, first the size of the data. Data has gravity, so it's very, very difficult to move huge amounts of data between clouds. It takes time because the physics of networks is not avoidable in any way. If you move data, you do it as a copy; let's say you need to move from Amazon to Azure for any reason. There is no way you get it done in seconds if you have a multi-petabyte environment.
The other thing is that you don't want to stop your application while doing this data movement. Then, once you have migrated your data, there is another issue that is probably a challenge in itself: the fact that the two clouds use totally different APIs. They are not compatible at all. This means that it's very complicated to move not just the data but also the applications. By providing this abstraction layer, you write the application once and you can do the migration on the back-end, and of course, no matter the quantity of data you migrate, you can do it asynchronously in the back-end because nobody sees the change in the front-end, right? Is that correct?
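The "write the application once" idea above can be sketched as a thin gateway: the application codes against one S3-style interface, and the back-end behind it can be swapped without touching application code. The class names are hypothetical and the in-memory backends just stand in for real clouds; a real gateway like Flexify.IO translates the wire protocol itself:

```python
# Illustrative sketch of an abstraction layer over different object stores.

class InMemoryBackend:
    """Stands in for one cloud's object store (S3, Azure Blob, etc.)."""
    def __init__(self):
        self._objects = {}

    def put_object(self, key, data):
        self._objects[key] = data

    def get_object(self, key):
        return self._objects[key]

class S3StyleGateway:
    """One stable interface for the application; the back-end behind it
    can be replaced during a migration without changing application code."""
    def __init__(self, backend):
        self.backend = backend

    def put(self, key, data):
        self.backend.put_object(key, data)

    def get(self, key):
        return self.backend.get_object(key)

aws_like = InMemoryBackend()
gw = S3StyleGateway(aws_like)
gw.put("report.csv", b"rows")

# Later: data has been migrated to another cloud; repoint the gateway.
azure_like = InMemoryBackend()
azure_like.put_object("report.csv", b"rows")
gw.backend = azure_like
data = gw.get("report.csv")   # application code is unchanged
```

The application never sees the switch, which is the point of doing migration behind the abstraction layer.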
Yeah, so there are basically three main challenges: data size, the necessary downtime, and the differences in APIs.
There is another aspect to this problem, because we're talking about Amazon and Azure. We mentioned these two, but actually there are many other service providers and, of course, other APIs as well. The other thing is that you can have the same issue if you want to migrate from on-premises to the cloud. Again, S3 is the de facto standard also for on-premises installations, but sometimes you can have scalability issues in the on-premises installation, or maybe you don't want to spend too much to expand your internal infrastructure, and not just the infrastructure but the number of workloads you support, so you want to move data on the back-end without letting the application know, right?
Yeah, absolutely, Enrico. There are a lot of reasons why a company may want to add public cloud to their on-premises storage or to their private cloud. What we see as well is the trend of data repatriation, when a company has already moved data to the cloud but at some point realizes that there are now a number of solutions that allow them to build scale-out object storage in their own private data center. Here we talk not only about data migration. The important point is that this whole mix of hybrid-cloud and multi-cloud data storage needs to be managed in a unified way, and data migration between private cloud and public cloud, and between different public clouds, must be seamless for the applications.
I totally agree with you, but there are a couple of important characteristics that a multi-cloud controller must have. One of them, I think, is transparency. We are converting APIs, so we standardized on S3, for example. Then I want to be sure I can access my data directly if I need to. I mean, if I build this abstraction layer and I'm writing through S3 to Amazon, to Azure Blob Storage, and to whatever else, I want to be sure that if I remove the abstraction layer I can still access my data.
Yes, absolutely. One of the reasons why companies deploy cloud storage controllers is to avoid lock-in. It would be very strange if they just traded lock-in to cloud providers for lock-in to a specific data controller. That's why it's a very important property of a data controller not to create lock-in to itself. At Flexify.IO, we implemented a transparent architecture that does not change the content or metadata of the objects as they are stored in the specific clouds. This way, it's always possible to switch from using Flexify.IO to using the clouds directly. Of course, in this case you lose the functionality that Flexify.IO provides on top of cloud storage, but you still have all your data in the cloud. You still have your original objects accessible directly through the cloud.
What does that mean? Does it mean that you translate the protocol but leave the data intact, so I can concurrently access the same data through your application as well as directly? Or do you need to maintain an index or a database of the joined bucket, or whatever it is, to make it happen?
That is a very important differentiator of Flexify.IO: we don't need to maintain any database or catalog of user data. All changes, whether done through Flexify.IO or directly to the cloud storage, are instantaneously visible to the application. It's possible to switch Flexify.IO on and off, accessing the cloud directly or through Flexify.IO, and see the same data.
Such an architecture also allows us to achieve virtually unlimited scalability because our components, our engines, are stateless, so they work independently from each other. They don't need to communicate with each other. We can deploy as many of them as needed, in as many geographical regions as needed, while keeping all the functionality of smaller deployments.
You have a central control plane that distributes the policies but the single front-end access point is located close to the application in every cloud or on premises. They work together or maybe they don't even work together; they just interact through the control plane and give the end-user the ability to access multi-cloud transparently.
Yes, this is correct. The components that actually handle users' data are the engines. Our recommendation is to deploy engines as close to the data as possible, exactly to reduce latency and eliminate egress traffic fees. Our control plane is basically a way to manage the configuration of those engines in a centralized way. The system can be deployed even without the control plane; it is basically a utility that simplifies centralized management of a number of distributed engines.
How do you manage scalability then? Can you have multiple instances of the engine in the same cloud accessing the back-end, to parallelize operations?
Yes, absolutely. Because our engines' architecture is completely stateless, we have unlimited scalability. Another differentiator of Flexify.IO is that this scalability can be used both in processing application requests and in migrating data. Basically, we can split a migration among a number of engines, and each engine will migrate its own part of the task, its own objects, independently from all other engines. This way, we can achieve really cloud-native speeds, over 40 gigabits per second; basically, the only limitations we have are the cloud storages and the cloud links themselves.
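One common way to split a migration across stateless workers with no coordination, consistent with what Sergey describes, is to hash each object key to an engine so every engine can independently claim a disjoint shard. This is a sketch under that assumption, not Flexify.IO's actual partitioning scheme:

```python
# Sketch: partition a migration's object keys across N stateless engines
# by hashing, so each engine works on its own shard with no coordination.
import hashlib

def shard_for(key: str, num_engines: int) -> int:
    """Deterministically map an object key to one engine."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_engines

def keys_for_engine(all_keys, engine_id: int, num_engines: int):
    """The subset of keys engine `engine_id` should migrate."""
    return [k for k in all_keys if shard_for(k, num_engines) == engine_id]

keys = [f"obj-{i}" for i in range(1000)]
shards = [keys_for_engine(keys, e, 4) for e in range(4)]
```

Because the mapping is deterministic, every engine computes its own shard from the listing alone; no shared state or messaging between engines is needed, which is what makes the approach scale.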
How interesting. So an end user can instantiate as many of these engines as they want and then run operations in parallel, moving more data concurrently.
Yes, that's right, and it's a property of cloud-native applications to be able to scale to any size. Flexify.IO is cloud-native from the start; it was originally designed to have unlimited scalability, fitting demands of any size.
What is the trade-off in terms of latency when I add this layer?
Well, multi-cloud storage starts to make sense when we are talking about large amounts of data, hundreds of terabytes or petabytes of data. At such scale, managing a database or catalog of objects would be very challenging and would add latency that most customers most likely won't be able to tolerate. Our architecture instead depends entirely on the cloud storage. We see the cloud storage itself as the database: cloud storage already gives us a map between object keys and object data. We utilize this idea to eliminate the need to manage any kind of database or catalog on our own.
This design helps with latency because, by combining data from multiple clouds, we are in fact able to deliver to a customer, to an application, the version of the data, the copy of the object, from the cloud that has the shortest latency relative to the application engine and its geographical location.
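The replica-selection idea in this answer reduces to choosing, among the clouds that hold a copy of the object, the one with the lowest observed latency. A trivially simplified sketch; the cloud names and latency figures are made up for illustration:

```python
# Sketch: serve an object from whichever cloud replica has the lowest
# measured latency for this engine/application location.

def pick_replica(replicas: dict) -> str:
    """replicas maps cloud/region name -> measured latency in ms.
    Return the name of the best (lowest-latency) replica."""
    return min(replicas, key=replicas.get)

# Hypothetical measurements from one engine's point of view:
latencies = {"aws-us-east": 12.0, "azure-west-europe": 95.0, "on-prem": 3.5}
best = pick_replica(latencies)
```

A production system would refresh these measurements continuously and fall back to other replicas on failure, but the selection step itself is this simple.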
I see. One of the features that I liked the most during the demo of Flexify.IO was the fact that you can combine multiple buckets from different providers in a single view, so that the end user can access, through the S3 protocol, a bucket that is actually dispersed across multiple clouds. It is very useful if you want to consolidate your data for an application that needs to access it.
Yes, that is right. We see multi-cloud storage not only in terms of being able to manage data in multiple clouds; we see it as a layer that can transparently combine data from whatever clouds or cloud storage buckets are attached to the solution. In fact, when we are talking about multi-cloud management or a multi-cloud data controller, there are at least two planes. One is the management plane, which allows a user or an aggregator to see analytics and statistics about their cloud storages. The second, probably the most important, is the data plane. It actually allows combining data from multiple clouds into a single unified storage.
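The data-plane behavior described above, several per-cloud buckets presented as one virtual bucket, can be sketched as a merged listing. Dicts stand in for per-cloud bucket listings here, and the collision policy is an assumption for the example; a real controller would apply user-defined placement policies:

```python
# Sketch: combine listings from buckets in different clouds into a
# single virtual namespace, as in the demo described above.

def unified_listing(*bucket_listings: dict) -> dict:
    """Merge per-cloud listings (key -> location) into one namespace.
    On key collisions, later listings win; a real gateway would apply
    a configured policy instead."""
    merged = {}
    for listing in bucket_listings:
        merged.update(listing)
    return merged

aws_bucket = {"logs/1.txt": "s3://bucket/logs/1.txt"}
azure_container = {"logs/2.txt": "azure://container/logs/2.txt"}
view = unified_listing(aws_bucket, azure_container)
```

To the application, `view` looks like one bucket; where each object physically lives is a back-end detail.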
You mentioned AWS S3 and Azure Blob because, coming from the two most common service providers, these are the two most common services in the enterprise space. Do you support any other API or cloud?
Yes, we support most of the public cloud providers currently on the market, and we see significant interest in DigitalOcean Spaces and, obviously, Wasabi. Some customers also look at Google Cloud Storage and at Alibaba. The thing is, the cloud storage market is becoming really commoditized, meaning intense competition and a lot of choice. Data owners, the enterprises that own the data, like to have this choice: the choice to place their data in the cloud, whether private or public, that best fits their needs for that data at this very moment.
Where can I find your product? You mentioned DigitalOcean, for example, and all of the providers that you mentioned have a marketplace. Are you available on these marketplaces?
Yes, we have a community edition that's currently available on the DigitalOcean and Azure marketplaces. We are working on getting onto more marketplaces, but for larger deployments, we recommend using our SaaS offering at Flexify.IO or working with us on a customized installation in the customer's own account or on the customer's own servers.
You mentioned a SaaS version of the product. Does that mean the user interface, the management layer, is hosted on a cloud and you pay per use, and then the engine is deployed in a virtual machine or in [23:35] or whatever, and you can have many of them?
Yes, that is right. The SaaS version is the easiest for users. All a user needs to do is create an account, add storage, and click the Migrate button. Customers don't need to think about the location of engines or any other architectural issues; we solve that for them. However, sometimes there are trust issues or unique needs, and SaaS won't be the best fit for this or that customer. In that case, we also support on-premise or customer-owned installations. For small projects that don't need any kind of scalability or a distributed system, there are our offerings in the marketplaces.
How does the pricing of the product work?
We usually charge per gigabyte for the SaaS solution. For the marketplace offerings, there are per-hour prices, and for on-premise or customer-owned installations, we are ready to sell special licenses or subscriptions.
It's pretty flexible.
Yeah, it's absolutely flexible.
At the end of the day, the pricing model looks very interesting for migration. That was the first use case that you mentioned at the beginning, right?
Yes. We see a number of requirements for a complete multi-cloud storage solution. First of all, it's being able to move data between clouds whenever the data owner needs to. The second is being able to combine data from multiple clouds; that allows for true multi-cloud storage and eliminates downtime during migration. The third requirement is being able to translate APIs, so that the customer does not need to change application code when data and applications are migrated to a cloud that doesn't support, let's say, the S3 API.
Yes, in fact, at the end of the day, we are talking about pretty complex work. At the beginning, it was migrating data to the cloud. Then it's cloud-to-cloud. Sometimes it's cloud back to on-premises for data repatriation. So having a solution that allows you to do all of it transparently and seamlessly is now, I think, mandatory.
Even those customers that are still not thinking in a multi-cloud way, I think, need to be aware of the existence of this kind of solution, because it gives much more flexibility. Even if they are not ready to adopt it today, if they use standard APIs, it will be easier for them to put this abstraction layer in the middle.
Absolutely. It's even more important to have flexibility than to use that flexibility. Even if a company that's using Amazon Web Services is not planning to migrate somewhere else right now, it is always good to have the option to migrate whenever the company deems it necessary.
Okay, this is fantastic. I think it's time to wrap up the episode, but maybe we can give a few links about Flexify.IO. I don't know if you are on Twitter or LinkedIn, but maybe somebody wants to continue the conversation online. Sergey, where can we find Flexify.IO? I mean, the name probably speaks for itself. Where can we find you on Twitter or LinkedIn?
Yeah, you're absolutely right. That is Flexify.IO and on this webpage, you can find our Twitter or LinkedIn profiles, which is also Flexify.IO.
Fantastic. Thank you very much for your time today and talk to you soon. Bye-bye.