Voices in Data Storage – Episode 2: A Conversation with Adrian J. Herrera of Caringo


In this episode, Enrico Signoretti talks with Adrian J. Herrera about on-premises object storage, how performance requirements for interacting with data are changing, and the benefits of storage tools that offer features as well as performance.

Guest

Adrian “AJ” Herrera has more than 20 years of experience bringing innovative storage, cloud, and media software and services to market. Prior to Caringo, Adrian was on the founding team and headed marketing at Nirvanix, one of the first enterprise cloud storage services, where he helped grow the company from pre-funding to over 700 customers ranging in size from start-ups to Fortune 10 organizations. He has held marketing, business development, and strategy positions at Yahoo, Musicmatch, and Xing Technology Corporation. He holds an MBA, specializing in Entrepreneurship, from San Diego State University and a BS in Information Systems from Chapman University.

Transcript

Enrico Signoretti: Welcome everybody, this is Voices In Data Storage brought to you by GigaOm.  I'm your host Enrico Signoretti and my guest for this afternoon is Adrian Herrera. He is VP of marketing at Caringo, a leading solution provider in the object storage space. Hi Adrian, how are you today?

Adrian Herrera: Great, Enrico, thank you for having me.

Thank you for your time. This is the first episode of GigaOm Voices in Data Storage, so I'm quite excited, I would say. And for this episode, I wanted you on because I want to talk about the evolution of object storage in the last twelve to eighteen months. We will also take some time to talk a little bit about the latest version of Swarm, your solution. As I said, Caringo is a market leader in object storage. Today, I was thinking a little bit about the evolution of object storage in the last year, year and a half, and I spotted a few trends: more cloud, more performance, and also huge growth in the market, so more customers with use cases that were unthinkable a couple of years ago. What do you see in the market from the Caringo perspective?

Yeah, I agree with all the points you made. We've certainly been around a while, we've seen a lot of trends, and every year we think it's the year object storage finally takes off. I will say, this year I really think it is, because of what you've just pointed out. So, starting with cloud: beyond the hype, we see one real use case and one use case that's getting a lot of hype right now. The real use case we see is utilizing the cloud for your cold copies or to satisfy some offsite requirement, because you can select any number of data centers anywhere around the globe, satisfy your off-site requirements, and have your cold archive, your cold copy, somewhere easily accessible.

And then the use case we're seeing that's getting a lot of hype right now is to utilize some of the cloud computing features like transcoding or some sort of artificial intelligence analysis, machine learning, that kind of thing: analyzing images, analyzing video footage. We see a lot of potential in those use cases, and we think eventually there will be a lot of organizations leveraging the cloud for those specific purposes, but today we only see a few organizations, at least a few of our customers, doing that. The vast majority of cloud use cases are using the cloud for offsite copies or disaster recovery.

I have to say that, also in my experience, it's very easy now to have an on-premises object store that replicates data to the cloud, so you don't have to buy a second object store in a secondary data center, which matters especially for small and medium-sized companies without a secondary data center. And you pay only the per-gigabyte storage fee, which is very low with these cloud providers, because it's disaster recovery: the data flows one way most of the time, so you don't pay egress costs. It's very convenient.

Yeah, absolutely, we agree with that. The cloud is always advertised as saving money, and sure, it saves you money if you don't have existing facilities, as you pointed out. But if you have existing facilities, then you should maximize the use of those resources: your data center, your power, your cooling. Sending data up to the cloud, you're going to be paying for that. It's going to be hidden in the dollars-per-gigabyte fee, but you're going to be paying for it in perpetuity, forever. And those costs compound. As you mentioned, egress fees are also something that organizations have a very hard time predicting and forecasting. I don't think they really have a good handle on how much data they actually access, or on how cloud services charge. For instance, if you're trying to find a single file in a few hundred terabytes, how many calls is that really going to take, and how much is that really going to cost you?
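To make that question concrete, here is a back-of-the-envelope sketch in Python. All the rates, object counts, and file sizes are illustrative assumptions, not any provider's actual price sheet; the point is only that retrieval cost depends on request counts and egress, which are hard to predict up front.

```python
# Back-of-the-envelope estimate of what it can cost to locate and pull
# one file out of a few hundred terabytes in a cloud object store.
# All rates below are illustrative assumptions, not real pricing.

EGRESS_USD_PER_GB = 0.09      # assumed egress fee per GB transferred out
LIST_USD_PER_1K = 0.005       # assumed fee per 1,000 LIST requests
KEYS_PER_LIST_CALL = 1000     # typical max keys returned per LIST call

def cost_to_find_and_fetch(total_objects: int, file_size_gb: float) -> float:
    """Scan a flat namespace with LIST calls, then download one file."""
    list_calls = total_objects / KEYS_PER_LIST_CALL
    list_cost = list_calls / 1000 * LIST_USD_PER_1K
    egress_cost = file_size_gb * EGRESS_USD_PER_GB
    return list_cost + egress_cost

# 200 TB stored as ~100 MB objects is ~2 million keys; pulling one 50 GB file:
print(f"${cost_to_find_and_fetch(2_000_000, 50):.2f}")  # egress dominates
```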

Exactly. Part of my work at GigaOm is research on alternatives to Amazon S3, just because the cost is unpredictable. And it's really tough: not that they are not clear on the costs, it's just that it's impossible for you to have clarity, because when you start with an application, you don't know how successful the application will be, you don't know the data growth. You can do some analysis, but only when you're in production do you get the real numbers. So sometimes it's really tough.

Yes, I agree. I think Amazon and Microsoft and Google have done a tremendous job and a service to everyone in the IT industry with what they have been able to do; it's a huge capital and infrastructure investment that not many organizations can make. But that being said, a lot of the value in cloud is on the compute side. On the storage side, you can forecast capacity usage and plan effectively to keep everything on premises. And you mentioned ‘on premises,’ and that's really our focus. Even though we do tiering to the cloud and migration to the cloud, our focus is always going to be on developing the best on-premises object store, or scale-out storage platform, possible.

Also, to your point, some customers understand that it's better to have a copy on premises and move data only when they need to, or sync data according to their compute needs. So maybe they have a data set that they want to move to the cloud for a big data application, get the results, and delete everything, including the data, not just the instances.

Yeah, it's interesting. I was with one of our partners in the media and entertainment space last week, talking to a lot of their people who are experts on the production side. We always thought the cloud was great for transcoding: send all your projects up into the cloud, transcode them, and send them back. And they sat me down and said, ‘Hey Adrian, this isn't that realistic a use case, because moving data sets that large in and out of the cloud as efficiently as post-production houses and studios need just isn't practical.’ But they did mention rendering: when you're dealing with mathematical equations and putting frames together, sending those calculations up into the cloud is extremely reasonable and very practical. So I think the market still needs a lot of education, and that's why I classified the compute side as hype.

This need for performance somehow comes from the cloud, but it's actually also part of what I see on premises: there are more and more applications that need, for example, to modify data, especially in the IoT space, and these object stores are becoming more and more efficient. Now it's way easier to find object storage vendors talking about performance. A few years ago, it was all about ‘we give you the best dollar per gigabyte, we give you durability, we give you scalability,’ but performance was always off the table.

And the reason, I think, is that there had to be a change in the discussion around performance. A few years ago, everyone was still talking IOPS, your traditional speeds and feeds in the storage world. Now I think organizations are a lot more comfortable talking about throughput, and that's really where object storage comes in. What we like to say here at Caringo is that there's a time and a place for all storage tiers, from tier zero all the way through your archival tier. We think you need to select the right storage and performance characteristics based upon your business requirements.
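For readers translating between the two vocabularies, the relationship is simply throughput = IOPS × I/O size. A quick sketch, with made-up numbers, shows why large-object workloads are better described in GB/s:

```python
# Throughput and IOPS are two views of the same workload, related by
# I/O size. The figures below are made up purely for illustration.
iops = 4_000            # e.g., an array quoted at 4,000 operations/second
io_size_mb = 1.0        # object workloads tend to move large I/Os
throughput_gb_s = iops * io_size_mb / 1000
print(f"{throughput_gb_s:.1f} GB/s")  # 4.0 GB/s for this combination
```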

That being said, not all object stores are created equal. So you really need to take a look at what your requirements are, what your use cases are, the hardware that's running it, and you need to test everything out within your environment. And one thing we like to point out is that you really need to look at the architecture different vendors put in place. For instance, a lot of vendors in the object storage space use some sort of front-side cache with their object store behind the scenes, and a lot of the time, when they measure performance, they're just measuring the write to that cache, not the complete write all the way to the object store.

At Caringo, we take a pure object storage approach; we don't have a front-side cache, so whenever you see performance figures from us, it's writing all the way down to the object store. I just like to point that out. We do have some pretty impressive performance figures that we recently got in a very large deployment. One is 35 gigabytes per second of S3 read throughput. That's in aggregate, on a very fast network. To put that in perspective, that's close to parallel file system read throughput on one end, and it's also as fast as some near-line options, like NVMe. That being said, we're not replacing either of those solutions; there's a time and a place for both.
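As a minimal illustration of the "measure the full write, not the cache" point, here is a sketch of a single-threaded round-trip timing against any S3-compatible endpoint, using boto3. The endpoint URL, bucket name, and credentials are placeholders; a real evaluation would run many parallel clients on much larger datasets to measure aggregate throughput.

```python
# Minimal sketch of an end-to-end throughput probe against any
# S3-compatible endpoint. Endpoint, bucket, and credentials are
# placeholders. Single-threaded; a real evaluation would run many
# parallel clients to measure aggregate throughput and to avoid
# being flattered by any caching layer.
import time
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.local",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

payload = b"x" * (64 * 1024 * 1024)  # one 64 MiB test object

start = time.time()
s3.put_object(Bucket="bench", Key="probe", Body=payload)
# Reading the object back times a full round trip to the store,
# not just an acknowledgement from a front-side cache.
body = s3.get_object(Bucket="bench", Key="probe")["Body"].read()
elapsed = time.time() - start

print(f"{(2 * len(payload)) / elapsed / 1e6:.1f} MB/s round trip")
```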

Well, actually, when I think about performance and object stores, I also think about the fact that a few years ago an object store was only put, get, delete. Now we have tons of metadata and metadata searches. All these integrations are popping up now; you had one of the first integrations with Elasticsearch, for example. The idea of integrating external tools that are particularly fast at searching huge amounts of data, meaning getting your answer immediately, means it's not just an archiving workload anymore. Even if you store pictures, you want to access them, because sometimes you want a collection of pictures that share a characteristic, and you want them immediately.

That's exactly it. I think the conversation around object storage initially always went to archive use cases. What we're seeing now is that archive, or active archive, almost needs to be redefined as online. I mean, object storage is really online storage. I would classify tape more as archive now, cold archive. But there are a lot of use cases now, in this new on-demand, distributed-workforce world, where your content always needs to remain online and accessible.

It's no longer acceptable to keep your primary archive copy on tape. It just takes too long to recall, and there are too many manual processes to call back that information. It really needs to remain instantly accessible and searchable. And, as you mentioned, not just by the object store, but by an ecosystem like the ELK stack: Elasticsearch, Logstash, and Kibana. That way you can plug it into visualization applications and your compute applications, so we are believers in that approach.
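To make the search side concrete, here is a minimal sketch of querying an Elasticsearch index of object metadata, for example to find all JPEGs created after a given date. The host, index name, and field names are hypothetical stand-ins, not Swarm's actual metadata schema.

```python
# Hedged sketch: query an Elasticsearch index of object metadata.
# The host, index name, and field names below are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://search.example.local:9200")  # placeholder host

resp = es.search(
    index="object-metadata",  # hypothetical index of object metadata
    query={
        "bool": {
            "filter": [
                {"term": {"content_type": "image/jpeg"}},
                {"range": {"created": {"gte": "2019-01-01"}}},
            ]
        }
    },
    size=100,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("name"))  # matching object names, immediately
```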

When you talk about performance, you need more performance because you have more applications insisting on the object store. Again, at the beginning it was just archiving, and I'm sure that you, like many others, started with partners and customers asking just for repositories, maybe a large repository, but that's one application accessing the objects. Now we have so many S3-compatible applications working at the same time on the objects that it's really important to get performance out of the system.

Yeah, absolutely, we agree with that. And I think one of the biggest misnomers around object storage comes from the fact that it just has ‘storage’ in its name. What most people don't realize is that when you talk about object storage solutions, there is an integrated web server, there's integrated load balancing, a lot of the time there's integrated tenant management, there's integrated content management. These are all things that aren't necessarily present in other tiers of storage.

I have to say that I totally agree because I started to reference object storage as an infrastructure and not a system. I always talk about object storage infrastructure. So you find me very aligned with your way of thinking here.

And what it really is, is a private cloud; it's like standing up your own Amazon S3 within your data center. And because of that, you start to think about what's going on underneath and about the performance requirements needed to satisfy it. Yes, you mentioned a lot of applications, but it's about the efficient movement of data throughout the entire infrastructure, and that's why aggregate throughput is so important: you are moving very large datasets around for analysis or to different tenants, or you're using some sort of parallel upload to ingest data in a very rapid way and also distribute data in a very rapid way.

And a few examples of that, again from M&E, media and entertainment: we have service providers who need to utilize every bit of their bandwidth, because that's how they make money. They make money by bringing projects in, working on them on tier zero storage, and then they need to bring them back to the object store and distribute them in a very efficient way, again utilizing the maximum bandwidth available to them. So object storage has shifted, at least from our perspective, from a capacity-and-scale conversation to the efficient utilization of resources, and a lot of the time that means the network.

Also, I think the huge growth in adoption by end users is also driven by the fact that there are more solutions now. So it's easier to think about object storage as a possible target for many applications.

Yeah, absolutely. And I think this may segue us into the next discussion, but again, this is where Amazon, I think, has done the whole industry a great service, because a lot of organizations, a lot of applications, a lot of storage solutions are now leveraging the Amazon S3 API. And what that's done is unlock the whole ecosystem for organizations to utilize different types of storage. We're seeing that wide adoption, and I think what all of us object storage vendors have done is come out with our own interfaces for POSIX-compliant file systems. So I think we all have our solutions for SMB and NFS, and I think we're all going to continue to innovate there. It's just about making the transition to object storage and new RESTful workflows as seamless and as easy as possible for organizations.
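A small sketch of what that S3 API compatibility buys in practice: the same client code can target the public cloud or an on-premises store just by swapping the endpoint. The on-prem URL and the assumption that credentials are configured in the environment are placeholders.

```python
# Sketch: identical S3 client code against two backends. Only the
# endpoint changes; the on-prem URL below is hypothetical.
import boto3

def make_client(endpoint_url=None):
    # endpoint_url=None talks to AWS itself; an S3-compatible on-prem
    # store is reached through the exact same client and calls.
    return boto3.client("s3", endpoint_url=endpoint_url)

aws = make_client()
onprem = make_client("https://objectstore.example.local")  # hypothetical

for client in (aws, onprem):
    print(client.list_buckets()["Buckets"])  # same call, either backend
```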

That's a good introduction, maybe, to what you are doing with the options on top of your object storage. At the beginning, I mentioned that you have a new version of your object storage, but actually I'm quite impressed by a couple of options that you already mentioned, which are FileFly and Swarm NFS, both from the performance point of view and from the feature point of view. I don't know if you want to introduce them a little bit, just to give the listeners an idea of what they do.

Yeah, definitely. First I'll level set: Swarm is what we call our core object store. About two years ago, we came out with a product called Swarm NFS, and what Swarm NFS is, it's scalable object NFS for sustained data streaming. Let me say that in layman's terms: it's a way to go from the POSIX world, the NFS world, and stream data into the object world, into Swarm, in a sustained way, without using any sort of file system emulation, which a lot of other organizations do; they use Samba, they use FUSE. Those solutions are great to a point, but they don't really scale well and they create namespace silos.

What Swarm NFS is doing, it's actually converting data on the fly from NFS to object. And you mentioned performance before. In a recent deployment, we were actually able to get 1.6 GB per second of sustained streaming. This is without cache, without spool; this is sustained streaming to the object store. And this is a single instance, again a stateless converter. So, if something happens, if your Swarm NFS instance goes down, it doesn't matter; spin up another one. And if you need more than 1.6 GB per second, add another instance. This is some of the value that we bring as Caringo, utilizing the underlying parallel infrastructure where all core object storage nodes do the same thing. That underlying architecture can support this type of ingest. And if you extrapolate that 1.6 GB per second, that's over 3 petabytes per month of data transferred into the object store per single instance.
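A quick worked check of that extrapolation, which holds in either decimal or binary units:

```python
# 1.6 GB/s of sustained ingest per Swarm NFS instance, carried over a
# 30-day month, as claimed in the paragraph above.
gb_per_sec = 1.6
seconds_per_month = 60 * 60 * 24 * 30
gb_per_month = gb_per_sec * seconds_per_month      # 4,147,200 GB
print(f"{gb_per_month / 1e6:.2f} PB/month")        # ~4.15 decimal PB
print(f"{gb_per_month / 1024**2:.2f} PiB/month")   # ~3.96 binary PiB
```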

It's quite impressive. But FileFly is a different story, right? It's not only about performance.

Yeah. FileFly, again, just to make it easier to understand, is more similar to HSM, hierarchical storage management. We're targeting Windows filers and NetApp filers. What you can do is migrate data on your Windows and NetApp filers to Swarm based upon policies, and you base the policies on any file system attribute that you have on Windows or your NetApp. So: file size, file type, owner, last access, that kind of thing. You can move that data and leave a stub in the file system. Historically you could only do that to Swarm, but with this latest release, FileFly 3, you can now do it to Amazon S3, to Google, to Azure. So now you can take your data from your Windows or NetApp filers and send it to any major cloud or to Swarm; there's no longer a dependency on Swarm.
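To make the attribute-driven selection concrete, here is an illustrative sketch of the kind of policy being described: pick files on a filer by size and last-access time and nominate them for migration. This is not FileFly's actual API or configuration format; the thresholds and share path are hypothetical.

```python
# Illustrative sketch of an attribute-driven tiering rule, not
# FileFly's real configuration. Thresholds and path are hypothetical.
import time
from pathlib import Path

AGE_DAYS = 90              # assumed rule: not accessed in 90 days
MIN_SIZE = 100 * 2**20     # assumed rule: larger than 100 MiB

def migration_candidates(root: str):
    cutoff = time.time() - AGE_DAYS * 86400
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        # last-access time and size are the file system attributes
        # mentioned in the conversation above
        if st.st_atime < cutoff and st.st_size > MIN_SIZE:
            yield path

for f in migration_candidates("/mnt/filer/projects"):  # hypothetical share
    print("would migrate and leave a stub for:", f)
```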

That's really nice. It opens up a lot of possibilities for your customers, of course.

It does. And FileFly is actually an application that runs on Windows Server. So let's say you have your data up in the cloud, and for some reason your main FileFly server goes down: you can actually spin up a Windows Server image in the cloud, install FileFly there, and rebuild everything. So it's resilient in that way, and that ties into the vision for disaster recovery and cold copies that we spoke about earlier.

But you also had an announcement about hardware, not just software, right?

Yes, we did. And we're really excited about this. Again, when object storage came out, the initial conversations were always in petabytes. Everyone was talking about how many petabytes you have, or could scale to; I think I heard zettabytes at one point. This is a huge, unimaginable scale. And the truth is, most organizations don't have petabytes of data; only a few do, if you take a look at all the organizations out there. A lot of the time you talk to object storage vendors and they say, ‘Well, okay, for the initial deployment you need four servers, this is what the servers need to do, this is how you start, and it can scale to hundreds of petabytes.’ And that's just a non-starter for the vast majority of organizations out there. So we took a look and said, okay, what can we do to solve this?

And again, because of the uniqueness of our architecture, how we don't rely on cache and we don't rely on metadata databases and other things that other solutions have, we're able to put our entire software stack in one box, and we're calling that the single server appliance. And if you take a look at what it is, it is just a single box, but it wasn't designed simply to be compact. It was really designed to offer the price/performance characteristics needed to have your own private cloud within your organization.

It delivers 60 terabytes of usable capacity, 96 terabytes of raw capacity, and we're targeting it at very specific use cases in the M&E sector: there are a lot of post-production houses, and small-to-medium use cases with remote offices. So if you do want to have your archive data and S3-accessible data in a remote office, you can go ahead and stand one of these up. It's very easy to expand: just plug them in together, make sure they can see each other on the network, and you increase both the compute resources and the capacity. But you also get everything we spoke about earlier.

There's integrated content management, there's integrated tenant management, and there's the ability to share files via a URL in a public or private way. One product that we have that wasn't part of the Swarm 10 launch is Caringo Drive. Caringo Drive installs on any Windows PC or macOS machine, and it allows you to mount Swarm as a regular drive. So with that, you can drag and drop files from laptops or PCs to this single server appliance, which is your on-premises private cloud. And everything's included in the box.

This is really nice, especially because for this kind of customer, cloud tiering is probably also a good option for disaster recovery, instead of adding another data center or another location. So if you get the full Swarm 10 box, that's a really nice solution.

Yeah, absolutely. And again, you don't have the normal barriers to entry that you do with other object stores. A lot of the time it's software-based; most vendors are software-based. So you have to talk about software, then you have to talk about hardware, then you have to go and procure the hardware and find some time to install the software. There are a lot of processes you have to go through to evaluate most object storage solutions. With this, everything's pre-installed and preconfigured. You plug in a box and off you go.

It is very good indeed. Do you have a link if our listeners want to learn more about Caringo, Swarm 10, or the single server Swarm solution?

Yeah, absolutely. They can go to our website, caringo.com. Everything is advertised on our homepage; it's very easy to find. I do recommend going to our Resources section and signing up for our webinars, or viewing our library of webinars. We spend a lot of time on our webinars; we get a lot of our technologists, our developers, doing them. We have a great series called Tech Tuesday, and one Tuesday of every month we do a deep dive on a specific part of object storage technology, stuff like how to evaluate hardware. We had one about how to migrate data, and just this week we had one that goes into how you run object storage on a single server. So that's something that I would also like to point out. And of course, everyone can always follow us on Twitter at @caringostorage.

That's great. Adrian, thank you again for your time today and I hope to keep the conversation going in the future.

Likewise, Enrico, and GigaOm; thank you so much.
