Voices in Data Storage – Episode 1: A Conversation with Paul Speciale and Stefano Maffulli of Scality


In this episode Enrico Signoretti talks with Paul Speciale and Stefano Maffulli about cloud storage, multi-cloud, and the benefits and challenges facing cloud users today.


Paul leads Product Management for Scality, where he is responsible for defining RING functionality, solutions and roadmaps. Before Scality, he was fortunate to have been part of several exciting cloud computing and early-stage storage companies, including Appcara, where he was focused on cloud application automation solutions; Q-layer, one of the first cloud orchestration companies (the last company acquired by Sun Microsystems); and Savvis, where he led the launch of the Savvis VPDC cloud service. In the storage space, Paul was VP of Products for Amplidata focused on object storage, and Agami Systems, building scalable, high-performance NAS solutions. Paul has over 20 years of industry experience that spans both Fortune 500 companies such as IBM (twice) and Oracle, as well as venture-funded startups, and has been a speaker and panelist at many industry conferences.

Stefano is Director of Community Marketing at Scality where he is leading the efforts to bring Zenko, the open source multi-cloud controller, to developers around the world. Stefano built his career around Free Software and open source: from pre-sales engineer and product manager at Italian GNU/Linux distribution MadeInLinux to Italian Chancellor of the Free Software Foundation Europe, where he also created the FSFE Fellowship participation program. Later as community manager of leading mobile open source sync solution Funambol, he boosted downloads and bolstered enterprise contributions. For Twitter, he led the expansion in the Italian market, recruiting TV stars and astronauts to tweet. Under his watch as community manager, OpenStack became the fastest growing open source project. As Director of Cloud Marketing at DreamHost he successfully managed new product launches and increased the adoption of the cloud products. In his spare time, he teaches sailing in the San Francisco Bay.


Enrico Signoretti: Welcome everybody, this is Voices in Data Storage, brought to you by GigaOm. I'm your host, Enrico Signoretti, and today we'll talk about multi-cloud storage and multi-cloud data controllers. My guests for this episode are Paul Speciale, Chief Product Officer and Stefano Maffulli, Director of Community Marketing at Scality. Hi guys, how are you today?

Paul Speciale: Doing great. Hi Enrico, this is Paul here, it's a pleasure to be with you and looking forward to this chat.

Stefano Maffulli: Thank you. It's great to see all of you.

As I mentioned before [about] multi-cloud storage, there is nothing wrong with cloud storage, it works great, there are several options around every single provider as object storage for example, as well as other forms of storage. But when you have applications running on different clouds and you need to access the same data or the same data subset, everything has become a little bit more complicated, hasn't it?

Paul: Yes, I totally agree, Enrico. So, we've seen this trend now for a decade, where the big cloud vendors are starting to offer so many services. It's not just storage, although we all know that Amazon started this ball rolling about 10 or 12 years ago with Amazon S3. But now they have, I think, over 90 services for compute, and of course that presents a lot of problems. You need to have the data near the application or the service, you're going to have some costs if you aren't careful about reading the data out, and of course for the performance and the latency, you want to be kind of co-located. And then I hear people saying, well it's not just about one cloud, it's like you said, it's about Google and Azure and all these other clouds that are coming at us, there's so many choices. So, you don't want to be locked in, but you really want to look carefully at what the use cases are. What are you trying to achieve, are you trying to get some kind of low-cost storage or do disaster recovery? Let me say that we hear 10 or 12 use cases in almost every multi-cloud conversation we have.

So, just to dig a little bit deeper on this, latency is probably one of the biggest issues, right, especially now that you have these niche players, sometimes with wonderful services, or even with the big players. I was with a customer the other day and they told me, look, we really like Microsoft Azure, which does big data analytics, but our data is in another cloud. That gets complicated because we pay a lot for the egress costs and it looks like they don't talk very well with each other, so we get a lot of latency. So, these could be the two major points, as you mentioned: latency and egress cost. I don't know if you see more of these issues in the field.

Paul: It's very familiar to us, because even as an object storage vendor, of course the conversation always turns to performance, latency is a very important aspect of that. I think people looked even at object storage in general as something for high latency or non-latency sensitive applications years ago. And it's no less true with cloud, you have analytic services, you have compute burst, now we see media companies that want to do things like transcoding and rendering in the cloud. Clearly this is a place where you can't have the data be across a wide area network or hundreds of miles away.  The performance is going to cost too much to do these interactions across a high latency link. So it's very much part of the conversation.

And what about data protection? Sometimes there are these nice, very cheap cloud service providers that offer S3 compatible storage, for example, but they are not available all across the world, so you maybe want more than one copy.

Paul: Yeah, absolutely. So, I think when we started hearing about the idea of using the cloud and using it in combination with on-prem storage, one of people’s first thoughts was, let's just use these low-cost services, right? There are cold storage services, they look very appealing in terms of cost on paper. There is a bit of a hidden cost there: if you need the data out later, there's going to be egress charges and some SLAs, but I think this is very important for people -- protecting their data. And let me say that one trend we see is there are customers that don't even have a second physical data center. You might be talking about a hospital, it's not a professional kind of IT data center, and for them the thought of using the cloud for some kind of backup or DR purpose is very appealing. They just need to understand what they're getting themselves into in terms of performance, and ultimately the TCO, it's not just about the per gigabyte, per month cost.

Yes. So, you introduced another variable in this big scenario, which is the on-premises installation that has to talk with the cloud somehow. You mentioned this thing about data protection and disaster recovery, but actually the same goes for on-premises installations. If you want to run a big data workload, most of the time you don't want to install Hadoop in your infrastructure. It's very expensive and maybe you don't need to run jobs every day.

Paul: Absolutely. We've talked to so many customers now, let's say in bioinformatics, who want to do these drug simulation runs. Clearly, they don't want to have thousands or even tens of thousands of CPU instances standing by for these occasional runs; they're probably fairly regular, but the capacity would be to some level underutilized. We hear it now in media rendering, they used to do this all on-prem, now they're starting to burst into the cloud to have this kind of spillover capacity. So I think there's this combination now of data protection to make sure that data is durable in the scenario of both on-prem and cloud, but to really look at the fact that they want to minimize the infrastructure costs. So, I don't want to reproduce a huge Hadoop cluster on-prem if I don't have to.

From my point of view, there are a couple of challenges when you deploy multi-cloud storage. One is the metadata management, because you want to be sure that this layer answers the same way across several clouds. And the other one probably is to have the tool, to have a management system that understands what is available. There are several APIs, there are several different tiers from different service providers, so there is a complexity in managing all these things.

Paul: Yeah, there's no doubt about it. We've been looking at this for years now and let me say that in the object storage world, we were all pretty happy that we finally adopted Amazon's S3 as sort of a de facto object storage protocol. I think in the world of NAS, we always had NFS and SMB, we all felt good that those were standards. Now it's not the case, even though we have this very popular one. You're right, the different clouds have different dialects, they have a different API, it's a complexity. I wouldn't want my team of developers to have to learn two or three different dialects. So, I think that's one complexity that has to be neutralized. But you're right, just having visibility -- where's my data, how do I find it? These are the kinds of challenges that are going to present themselves: the minute you have data in two or three places, you're going to incur this complexity.

Let's introduce a little bit of your work in this regard. I know that you developed it for years now, so you started a long time ago to develop some open source solutions, and one of them is Zenko, right?

Paul: Yes, that's right. We started the thinking, I would say that the input from the customer started coming at us probably three years ago. The cloud has been around a while and people were saying, there's the emergence of these services that they want to use. But I would say in reality, about 2 years ago, people started saying it's not just going to be about on-prem anymore, it's going to be about intelligently using the things that I have on-prem and also the things that are coming at me in the cloud. We mention this idea of having some extra capacity in the cloud for compute or for some kind of application processing. This is really the request that came to us discreetly and I would say that that's what formed the idea behind our Zenko product. It's an open source and enterprise data management controller for multi-cloud, but you're right, we really did start it in the open source.

You mentioned the magic word, ‘multi-cloud data controller.’  What does it mean in practice? 

Paul: So, I can start and then Stefano can chime in a little bit here about the open source angle. So, a multi-cloud controller is really all about simplifying the use of not only your on-premises storage, but one or more public clouds. So, the idea is to intelligently manage data that lives across these different areas, give you visibility, and sort of unify them from a presentation perspective. One of the things we wanted to do, though, was to make sure that it was shaped by our users and our community, and that's really what sparked this idea of developing it in the open source. So maybe I can have you, Steph, comment on the open source.

Stefano: With pleasure, thank you Paul. So, the idea came from customers, they were giving us comments and requests, and therefore the very simple solution was to work together with customers and potential users, early adopters, to not only get the code running, but also get their specifications defined early on. And Zenko has been developed since the very first input on GitHub as a participatory project. It's a cloud native application that is made of multiple components and micro services, and each of them has value on its own.

So, one of the most popular and the very first micro service that we released as open source is CloudServer, and it's the S3 API compatibility layer that is capable of translating calls from the S3 protocol into Google Cloud and Azure Cloud and any other S3 compatible storage solution. So, right now, for example, we also support Wasabi and DigitalOcean and many others, and of course, the RING S3 connector. And from that, we then add the other services, like the metadata ingestion service, which gives you visibility of where your data is being stored and with what metadata. So you can search and find any object, whether it's stored in Google or on-prem, on the RING or in an S3 bucket; you can find everything through one unified API endpoint.
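[Editor's sketch] The compatibility layer Stefano describes can be pictured as one S3-style put/get interface that routes each bucket to a backend adapter. The Python below is only an illustration of that idea under stated assumptions: `Backend`, `InMemoryBackend` and `S3StyleFacade` are hypothetical names invented here, the backends are in-memory stand-ins rather than real cloud clients, and none of this is taken from the CloudServer code base.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """A storage location that the facade can route objects to (hypothetical)."""

    @abstractmethod
    def put(self, key, data):
        ...

    @abstractmethod
    def get(self, key):
        ...


class InMemoryBackend(Backend):
    """Stand-in for a cloud adapter; a real one would call that cloud's own API."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class S3StyleFacade:
    """One put/get interface; each bucket is bound to exactly one backend."""

    def __init__(self):
        self._bucket_backend = {}

    def create_bucket(self, bucket, backend):
        self._bucket_backend[bucket] = backend

    def put_object(self, bucket, key, data):
        self._bucket_backend[bucket].put(key, data)

    def get_object(self, bucket, key):
        return self._bucket_backend[bucket].get(key)

    def location_of(self, bucket):
        return self._bucket_backend[bucket].name


# The caller uses the same two calls regardless of where the bytes end up.
facade = S3StyleFacade()
facade.create_bucket("raw-video", InMemoryBackend("azure-stand-in"))
facade.create_bucket("thumbnails", InMemoryBackend("google-stand-in"))
facade.put_object("raw-video", "clip.mov", b"video-bytes")
facade.put_object("thumbnails", "clip.png", b"image-bytes")
```

The point of the design is that application code never changes when a bucket is bound to a different location; only the adapter behind the facade does.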

And everything is deployed on any Kubernetes cluster, published as Helm charts, so it's very easy to get started. And on top of the open source engine, we have also developed a configuration management tool that we call Orbit, that you can use to connect your instances of Zenko to this software as a service (SaaS) offering that is part of the enterprise edition of Zenko, and you can manage all of your data through that.

Let's try to recap a little: Zenko is a multi-cloud data controller that can be installed on any Kubernetes cluster. It's a set of micro services, including CloudServer, which you developed as the first component, and thanks to the other micro services, it is possible to have a complete view of all the data because you have metadata search capabilities, for example, and you can give your customer the exact location of every piece of data on every cloud, right?

Paul: That's right. And a little bit about the metadata management which you made a point of earlier, Enrico: the tough part here is indeed managing all this metadata, and ultimately that's what Zenko is all about. It's taking in these API calls that Steph mentioned through CloudServer, but sort of mapping them so that we keep a namespace, a view of the metadata across all these different clouds. And really the trick here is to separate the metadata from the data. So, what Zenko does is keep track of all the system metadata -- your object names, their locations, their size, who the owner is -- but it decouples that from the physical storage location, which could be in any of the clouds we talked about. So, I think that's kind of the heart and the hard problem that had to be solved as part of Zenko.
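[Editor's sketch] One way to picture the separation Paul describes is a namespace that stores only system metadata (name, size, owner, location) and never the object bytes. The Python below is a minimal illustration under that assumption; `ObjectRecord` and `Namespace` are hypothetical names, and the location strings are placeholders, not Zenko identifiers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObjectRecord:
    key: str
    size: int
    owner: str
    location: str  # which cloud or on-prem system currently holds the bytes


class Namespace:
    """Metadata-only index: the object data itself lives elsewhere."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.key] = record

    def move(self, key, new_location):
        # Relocating data only rewrites the location field; name, size and owner stay.
        self._records[key] = replace(self._records[key], location=new_location)

    def search(self, **criteria):
        # Return the sorted keys of all records matching every given field value.
        return sorted(
            record.key
            for record in self._records.values()
            if all(getattr(record, field) == value for field, value in criteria.items())
        )


ns = Namespace()
ns.register(ObjectRecord("raw/clip.mov", 4000, "media-team", "on-prem"))
ns.register(ObjectRecord("raw/intro.mov", 2000, "media-team", "azure"))
ns.register(ObjectRecord("logs/app.log", 100, "ops", "azure"))

by_owner = ns.search(owner="media-team")
in_azure = ns.search(location="azure")
ns.move("raw/clip.mov", "google")
in_google = ns.search(location="google")
```

Because the index is decoupled from storage, a search by owner gives the same answer before and after an object moves between locations.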

And you said, this is an open source project, so everybody can go into the repository and download it, compile it and then start a new environment.

Stefano: Absolutely. And more than that, we also do design in the open. So we have a repository where we keep all the design improvements and the proposals for improvements, and we have them discussed openly among our architects and developers. So, we have two major pieces being discussed right now, and one is a workflow manager, which is basically another micro service that allows you to do operations based on the metadata. So, every file, every object that is known to Zenko can be processed and analyzed differently based on metadata.

So, we have this vision that we want comments on and we want people to help us design this feature. Another new micro service allows the ingestion of metadata from the file systems, so any NAS based application service can be plugged into Zenko, and Zenko can ingest the metadata from those files through a mount point that is NFS or Samba. And this is being discussed, we have also a prototype, we're probably going to demo this prototype very soon, like as early as next week.
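[Editor's sketch] Ingesting metadata through a mount point can be pictured as walking the mounted directory tree and recording facts about each file without reading its contents. The Python below illustrates that general technique with the standard library only; `ingest_metadata` is a hypothetical helper, a temporary directory stands in for an NFS or Samba mount, and this is not a description of how the Zenko prototype works.

```python
import os
import tempfile


def ingest_metadata(mount_point):
    """Walk a mounted file system and collect per-file metadata, never the contents."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                # Path relative to the mount, with forward slashes, used as the object key.
                "key": os.path.relpath(path, mount_point).replace(os.sep, "/"),
                "size": info.st_size,
                "modified": info.st_mtime,
            })
    return sorted(records, key=lambda record: record["key"])


# A temporary directory plays the role of the mount point in this demo.
with tempfile.TemporaryDirectory() as mount:
    os.makedirs(os.path.join(mount, "projects"))
    with open(os.path.join(mount, "projects", "report.txt"), "wb") as handle:
        handle.write(b"hello")
    with open(os.path.join(mount, "notes.txt"), "wb") as handle:
        handle.write(b"hi")
    records = ingest_metadata(mount)
```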

That's really interesting, especially the metadata augmentation part, if I understood well, so the data and metadata augmentation you were talking about, right?

Stefano: Absolutely.

I think it's one of those features that can really change from a storage platform to a data platform, [from] the point of view of the user and the way you design your infrastructure, of course. Do you also have customers in production with Zenko? Do you have use cases for this platform?

Paul: Yeah, it's been a very interesting journey on that front. As I mentioned, a couple of years ago people were starting to express this need, but of course, they weren't ready to shape it into exact use cases. I would say now that that's happening more and more, and honestly customers can think of a wide variety of them. But what we hear is sort of a few central themes. One of the themes is that the data needs to move near the service, it needs to move near the processing. One of the overtones that that implies is that the data must always be stored in the cloud’s native format.

So, an important aspect of making these use cases work is that when we store the data in Amazon or in Azure or in Google, it can be accessed, it can be read natively by applications and services. Let me give you one discrete use case: in the media space, these customers would like to start using the cloud for content distribution. What does it mean, content distribution? It means that [when] they capture their video assets, they need to transfer them to one or more public clouds, and then they use the compute capabilities for example, to do transcoding or they use the native CDN, the content delivery networks... All of these clouds offer some form of that to do the actual distribution of the content to the end user. But what that implies is that all of the data is always usable at every stage in that workflow, so these media workflows are one of the things that helped shape that.

And I would say the other thing we hear now is this disaster recovery and just protection of your data. For customers that are used to using one cloud, they don't want to be susceptible to that cloud being down or unavailable, even if it's for a few minutes or an hour, it costs them in their business. This is also true for on-prem, you can't have a data center be inaccessible and shut down your application or your service, so having more than one copy of the data comes up a lot as a use case. But I would say that we're very focused now on this M&E, media and entertainment workflow aspect.

Especially because these kinds of organizations have a lot of data. 

Paul: They have a lot of data -- we're hearing about multi-terabyte video files that need to flow. And it's very interesting if you ask them why they need multiple clouds, because truthfully, you could do it with one cloud. What you hear are arguments that services are certainly better in one cloud; for example, Azure now has Video Indexer, and that's something that's very appealing to them. They just want to avoid lock in, they want to have freedom of choice, that's really the ultimate reason they want to take advantage of multiple clouds. And I think some customers are getting smart. They realize that putting data into the cloud is often free (ingress), but the reading out, the egress, is what costs, so they feel like they might as well put it there and then figure out if they can use it intelligently.

And you mentioned in this discussion that Zenko also has an enterprise edition, so there is not only this open source part?

Paul: Yeah, there's two parts. So, the first part is that we introduced the open source edition first in the middle of 2016; now, about a month ago in September of this year, we did a launch of the 1.0 as both the open source and an enterprise edition. The two are actually the same product. The differences come in terms of how we license them, so maybe Steph, can you comment a little bit about the open source from a licensing perspective?

Stefano: The open source software is licensed under the Apache version 2 software license, and it's available for anybody to use. It can be easily deployed in a Kubernetes cluster, like I mentioned before, and it can be operated easily through configuration locally. The alternative is to connect it to the Zenko administration software as a service called Orbit. Put the ID of the cluster inside Zenko's Orbit and the two are connected, and configuration and management can be done in a visual fashion. You get a very easy to use dashboard where you can see your replication between different clouds, how it's going, how much data is being consumed from one place to another, and do the metadata searches graphically, which makes the whole experience a lot easier than running through the command line interface.

And the switch between one and the other depends mostly on how much data you're going to manage through Zenko. We have a concept of total data managed. And it's always going to be free to use the Orbit administration for less than one terabyte. Over one terabyte, we want to talk to you. We don't have pricing already set, we just want to offer a license if you have more than one terabyte of data being managed through Zenko's Orbit.

Paul: And let me chime in, so the licensing model, one of the things that we realized here is being more in the cloud now we needed to be closer to what people understand as cloud pricing, and that really means going away from the traditional enterprise software licenses to a subscription. And that's the model with Zenko. So, if you are interested in licensing Zenko as the enterprise edition, as Steph said, we have the idea of TDM, Total Data Managed, and there's a subscription associated with that. So, an easy way to think about it is (let's use round numbers) if you have a petabyte in Amazon and a half a petabyte in Azure, you have 1.5 petabytes of TDM, and then there's a monthly charge associated with that on each terabyte. But it's really done now more on a subscription basis, and you always have visibility into how much your TDM consumption is.
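[Editor's sketch] Paul's round-number example can be worked through in a few lines of Python. Only the addition (a petabyte plus half a petabyte gives 1.5 petabytes of TDM), the per-terabyte monthly basis and the free use below one terabyte come from the conversation. The rate of 2.0 per terabyte is a made-up placeholder, since no price is quoted, and charging every terabyte once the free threshold is passed is an assumption, as is the decimal conversion of 1 PB to 1000 TB.

```python
FREE_TIER_TB = 1.0   # "free ... for less than one terabyte" per the conversation
TB_PER_PB = 1000.0   # decimal units, an assumption for the round-number example


def total_data_managed_tb(per_location_tb):
    """TDM is the sum of the data managed in every location."""
    return sum(per_location_tb.values())


def monthly_charge(tdm_tb, rate_per_tb):
    """Hypothetical subscription: nothing below the free tier, else rate times TDM."""
    if tdm_tb < FREE_TIER_TB:
        return 0.0
    return tdm_tb * rate_per_tb


managed = {
    "amazon": 1.0 * TB_PER_PB,  # a petabyte in Amazon
    "azure": 0.5 * TB_PER_PB,   # half a petabyte in Azure
}
tdm_tb = total_data_managed_tb(managed)
tdm_pb = tdm_tb / TB_PER_PB

HYPOTHETICAL_RATE = 2.0  # placeholder currency units per TB per month, not a real price
charge = monthly_charge(tdm_tb, HYPOTHETICAL_RATE)
small_charge = monthly_charge(0.5, HYPOTHETICAL_RATE)
```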

It is very nice, and also, at the same time very aligned with the expectation probably because they want a ‘pay as you go’ model for all these kinds of services. So, we talked a little bit about the architecture, the features, as well as the licensing.  I’m just curious to know if you can share with us a little bit more on what is going to happen next, so what are you working on?

Paul: Yeah, absolutely. So, Steph and I can kind of combine on this one. I think one of the key things about Zenko that we made a determination on early is that it's agnostic to the underlying cloud or storage solution. So, Steph mentioned the fact that today we support RING and a few public clouds, but that is certainly not the end of the story. One of the things we want to do is expand the number of public clouds that we support.

We're getting demands almost weekly for [name your cloud] -- some of them are regional, some of them are more specialized vertical clouds, but certainly the idea of expanding the backend clouds that we support is one part of the strategy. Zenko is also usable with or without the Scality RING, so it is not mandatory to use the RING with Zenko, and as Steph mentioned, we also have plans to support other object stores and NAS systems. So, I would say very strategically one of our aims here is this idea of increasing the breadth of what we support in terms of storage repositories. Steph, you also mentioned the workflow manager, so the idea of more customized workflows is something that we're hearing about more and more. Is that right?

Stefano: Right, it's something that we have heard from many of our customers. They're already doing operations on files, so we may as well help them have the operation triggered by simply changing a tag on a file. So picture for example, a system where you can say, as soon as you see this file changing from being work in progress to ready for production, then take this file, transcode it from one format to five different formats, and then drop the different formats in different clouds with different buckets with different CDN properties, for example. And that's something that right now can be scripted on an ad hoc basis, but if it could be done directly inside the system that is agnostic, then applications can become a lot more lean and easy to maintain.

That looks very much like a sort of messaging system and maybe a serverless computing framework on top of it. Am I right?

Stefano: Yeah, something like that, in terms of functionality, it's another API that you can call and hook your system and have those dual operations based on logic and on triggers.

Paul: Yeah, exactly, the philosophy here is that a workflow is really a set of data inputs and outputs connected through a series of actions, and if we can allow the user to insert their own custom logic or their custom actions, that's ultimately an easy way to express a workflow, and we're even thinking to provide some kind of visual editor capability for that.
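[Editor's sketch] The workflow idea discussed above, a tag change that triggers a chain of user-supplied actions, can be illustrated with a small Python example. Everything here is hypothetical: `WorkflowEngine`, the tag values, the two output formats and the target names are invented for illustration, and the workflow manager was still at the proposal stage at the time of this conversation.

```python
class WorkflowEngine:
    """Runs a chain of actions when a tag changes from one value to another."""

    def __init__(self):
        self._rules = []  # each rule: (tag, old_value, new_value, [actions])

    def on_tag_change(self, tag, old_value, new_value, actions):
        self._rules.append((tag, old_value, new_value, list(actions)))

    def set_tag(self, obj, tag, value):
        previous = obj["tags"].get(tag)
        obj["tags"][tag] = value
        outputs = []
        for rule_tag, old_value, new_value, actions in self._rules:
            if rule_tag == tag and previous == old_value and value == new_value:
                data = obj
                for action in actions:  # the output of one action feeds the next
                    data = action(data)
                outputs.append(data)
        return outputs


def transcode(obj):
    # Placeholder for real transcoding: just name one rendition per format.
    return [f"{obj['key']}.{fmt}" for fmt in ("mp4", "webm")]


def place(renditions):
    # Placeholder for placement: pair each rendition with a made-up target.
    return dict(zip(renditions, ("cloud-a", "cloud-b")))


engine = WorkflowEngine()
engine.on_tag_change("status", "work-in-progress", "ready-for-production",
                     [transcode, place])

clip = {"key": "clip", "tags": {"status": "work-in-progress"}}
first_run = engine.set_tag(clip, "status", "ready-for-production")
second_run = engine.set_tag(clip, "status", "ready-for-production")
```

The second call returns nothing because the tag did not transition from the watched old value, which is what keeps the workflow from firing repeatedly on an unchanged object.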

This discussion was really interesting, but I'm sure that our listeners are curious to know more now. Do you have any link from which I can start learning about Zenko and communicate maybe or where I can reach you guys if I have any questions?

Stefano: The main contact point is the website - zenko.io, and from there you can reach the community site - forum.zenko.io. And we have the product projects on github.com/scality, and inside that, there is Zenko CloudServer and the other components, and of course the Helm chart, it's all in there. And also, code to do original coding very fast. We have the RFCs, the Requests for Comments, for improvements to Zenko itself. Lots of ways for people to get involved with the community.

Great. Thank you very much again for your time today guys, and let's keep in touch to learn more about Zenko and this evolution. 

Stefano: Thank you.

Paul: Thank you very much, Enrico.
