Who Protects Your Cloud Data?
Back in April, we speculated about one of the hidden dangers of depending on web services to store your data: the possibility that no one was doing backups. Now that possibility may have turned to reality for users of Omnidrive (once touted as the “clear leader” in the online storage field by TechCrunch). The service has been offline for some days, with its servers currently not responding at all. A December article at ReadWriteWeb contains serious allegations of fraud from the company’s ex-CTO (as well as a defense from the CEO).
My sympathies at this point are with Omnidrive’s users, particularly those who have their only copies of documents on an unreachable server. I can think of plenty of times when a days-long outage (let alone a permanent loss) of my own document storage would be devastating. The larger question, though, is what you as a user can (or should) do about this. Online document storage is certainly attractive to the web worker; being able to access and share your work easily in any browser is definitely a killer feature. But how do you balance that against the fact that your documents could simply vanish overnight?
One possible approach is simply to choose your storage vendor very carefully. Backup vendor Mozy, for example, is owned by giant EMC; Jungle Disk uses your Amazon S3 account for storage (so your data will be available even if Jungle Disk itself goes under); and Google Documents is, well, Google. Some smaller vendors have their own serious backup policies to guard against hardware failures.
Yet in a world of imperfect hardware and software, as well as regulatory and legal issues, choosing one company for storage is still ultimately a gamble. It may be unthinkable that an EMC or Amazon or Google could fail, but it’s not impossible. No matter how carefully you choose, entrusting your data to a single online storage vendor is the equivalent of storing it on a single hard drive: it introduces a single point of failure into the system.
For hard drives, of course, we’ve long had several answers to this problem: backups or RAID. If disks are unreliable, make a copy of the data elsewhere. If one disk is unreliable, store your data on three or five or seven disks, with a scheme that allows perfect data recovery even if one or two disks should suddenly be reduced to iron filings by hardware failures. What the disappearance of Omnidrive suggests to me is that it’s time for the next step in the evolution of online file storage, now that there is more than enough competition in the market for simple storage. We need the online equivalent of backups and RAID.
This doesn’t mean that the online storage services need to use backups and RAID on their servers; that’s irrelevant to me as a consumer in providing protection against vendor failure. Rather, I’d like to see products that automatically back up, say, a Box.Net account to Amazon S3 storage. Or an API that writes copies of my data simultaneously to Amazon and the fabled GDrive, and allows retrieval from either service if the other is missing. Or even a way to mirror my online storage, overnight, down to a desktop drive for safekeeping.
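To make the mirroring idea concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's real API: the `DirectoryBackend` and `MirroredStore` names are hypothetical, and each "vendor" is simulated by a local directory so the example runs without network access. A real version would wrap each service's own client library behind the same two methods.

```python
from pathlib import Path
import tempfile


class DirectoryBackend:
    """Hypothetical stand-in for one storage vendor: one file per document."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.online = True  # set to False to simulate a vendor outage

    def put(self, name, data):
        if not self.online:
            raise ConnectionError(f"{self.root} is unreachable")
        (self.root / name).write_bytes(data)

    def get(self, name):
        if not self.online:
            raise ConnectionError(f"{self.root} is unreachable")
        return (self.root / name).read_bytes()


class MirroredStore:
    """Writes every document to all backends and reads from the first that answers."""

    def __init__(self, backends):
        self.backends = list(backends)

    def put(self, name, data):
        """Store a copy on each reachable backend; return the number of copies made."""
        stored = 0
        for backend in self.backends:
            try:
                backend.put(name, data)
                stored += 1
            except OSError:  # ConnectionError is a subclass of OSError
                continue
        if stored == 0:
            raise ConnectionError("no backend accepted the write")
        return stored

    def get(self, name):
        """Return the document from the first backend that can supply it."""
        for backend in self.backends:
            try:
                return backend.get(name)
            except OSError:  # unreachable backend or missing file
                continue
        raise ConnectionError(f"no backend could return {name!r}")


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    primary = DirectoryBackend(base / "vendor_a")
    secondary = DirectoryBackend(base / "vendor_b")
    store = MirroredStore([primary, secondary])

    store.put("report.txt", b"quarterly draft")
    primary.online = False  # the first vendor disappears overnight
    print(store.get("report.txt"))  # still served by the second vendor
```

The sketch treats a write as successful if at least one copy lands. A stricter policy might refuse the write unless every backend accepts it, at the cost of being unable to save anything during a single vendor's outage.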
Until products like these are available (and if I’ve just missed them, please let me know in the comments), storing your documents online will remain a gamble. Perhaps a safe gamble, but it could be made far safer with more vendor independence.
This is the Achilles’ Heel of online data storage. Once it’s off my computer (or my network) I no longer have control over my data.
Irate ex-worker decides to take a few servers down on the way out the door? Local internet provider suffers a service outage? Even planned maintenance downtime can interfere with our ability to retrieve our files and get our work done. And if we depend on web-based apps we can lose the very tools needed to work with our documents.
As convenient as online data storage and web apps can be, they introduce too many (uncontrollable) points of failure to be relied upon as a primary solution.
“Cloud RAID” is a good idea and I would certainly look with interest at someone offering such a solution.
One thing to be very careful about, however, is which vendor is supplying the physical backend storage service for the multiple RAID providers.
Why? The company you are paying for your web hosting is probably outsourcing the actual hardware infrastructure to another (larger, more efficient) back-end supplier. This will increase in future as economies of scale lead to fewer, but larger, utility computing centers.
In the “Cloud RAID” model you must check carefully that the storage providers you contract with are actually independent. They may appear to be separate companies, but when we look at the physical implementations it may turn out that two or more of them are hosted by the same back-end storage utility.
We would then be running the same risk as before! One hardware failure (or network outage) could lead to data loss. What is worse, though, is that by using the RAID approach we have a false sense of security about our data.
I have been using Jungle Disk for offsite backups since November 2006 without any problems. I am confident in the reliability of Amazon S3 for this. Even so, it would be nice to have at least some idea of:
(1) How many separate copies Amazon is storing of my data
(2) Where my data is (roughly) located.
At the moment my offsite backups are being stored somewhere in the “Amazon cloud”. Where, exactly, I have no idea …
So far, Amazon S3 has been most reliable.
ElasticDrive allows you to configure a “cloud raid”, where data can be written to several remote storage systems at the same time (S3, Nirvanix, etc.) as well as your local disk. Check out http://www.elasticdrive.com
You don’t need to look at a relatively small vendor like OmniDrive. Remember the repeated outages Salesforce.com had a few months back? This is mission-critical information for a lot of people.
That said I would still argue that keeping your data on the cloud is many times safer than on your local hard disk.
Perhaps we need services such as Pingdom that measure and rank the various cloud storage providers in terms of reliability and up-time.
YDRIVE will let you select your own and/or preferred backend storage provider. Available soon.
There’s a difference between a backup storage provider and one where you’re creating the master copies of your documents in the cloud. If the online storage service is merely your backup provider, you’ve lost nothing if they go away – you just have to find a new online backup service. A hassle, maybe, but the risk is minimal if you go a week or so without backing up, and there are several other options out there.
If you’re creating master copies of your documents in the cloud… well you STILL need to have a backup strategy. Not so much because your files might be stored on one drive and lost, but because they might become inaccessible. It’s the exact same issue as backing up local data – your data is all in one place, what if that place suddenly is not accessible?
I use Mozy as well as daily (or more often) backups to a removable disk drive, stored onsite. It’s doubtful (though not impossible!) that the computer, onsite backups, and offsite backups would all be inaccessible at the same time.
Using online storage providers for primary storage is asking for trouble. Even if it is backed by a large company, priorities change and that large company may decide to close down the service. I use online storage providers to store encrypted off-site backups and nothing else.
This issue applies more broadly to Web 2.0 applications in general. If my business relies on a web-based service, what happens if that service goes out of business? Always create a contingency plan and always create your own (local!) backups of any data stored on the web.
Amazon doesn’t disclose a lot of internals about S3 for security and competitive reasons, but they have stated before that all data is stored in at least 3 different datacenters in at least 2 geographic areas (e.g. east/west/central). They are pretty serious about data security as well as availability.