As we increasingly struggle to manage our data, spread across our devices and the cloud, even Google is going offline: remote access and online storage aren’t enough. There is a need for sync with local duplicates, one the likes of SugarSync, Dropbox, Apple’s MobileMe and Microsoft’s Live Mesh are aiming to fulfill.

I worked at SugarSync for almost three years (I no longer have any financial ties to the company), so I know the double-edged sword of sync firsthand. When you sell a sync product, you sell magic. (We free your data from its physical devices; just forget where you last edited the file, it will be everywhere.) But once the product is implemented, there’s no magic anymore, and the engineer is left to deal with asynchrony, slow bandwidth, third-party applications, and file systems with different semantics.

Here’s why:

Push sync is deeply asynchronous

Conceptually, you have redundant copies of your data on various devices, and a service that keeps the copies in sync. When a change is made on one device, the service replays it on the others: a file edited here must be updated there. In the meantime, you can’t guarantee that the old version of the file remains untouched on the other devices. It could be edited, moved or deleted, and that is when conflicts arise. It’s well known that concurrent programming is difficult; sync is just an extreme example.
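
The core of the problem can be sketched with a simple version check (a minimal, hypothetical model; real sync engines track far richer per-file metadata): an incoming remote change is only safe to apply if the local copy hasn’t moved since the last successful sync.

```python
# Minimal conflict detection for one file shared between two replicas.
# A remote update is only safe to apply ("fast-forward") if the local
# copy is still at the version we last synced; otherwise both sides
# diverged concurrently and we have a conflict to resolve.

def classify_update(local_version, last_synced_version, remote_version):
    """Decide what to do with an update arriving from another device."""
    if remote_version == local_version:
        return "in-sync"        # nothing to do
    if local_version == last_synced_version:
        return "fast-forward"   # local copy untouched: apply the remote version
    return "conflict"           # concurrent edits: needs resolution

# The file was edited locally (v2 -> v3) while a remote edit (v4) was in flight:
print(classify_update(local_version=3, last_synced_version=2, remote_version=4))
# -> conflict
```

The whole time between a remote edit being made and it reaching a device is a window in which the local copy can change, which is why the conflict branch can never be engineered away.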

Sync matches different data models

You can’t actually sync identical duplicates of your data because they live on different devices. So you have to translate the data to the local models. File names alone don’t sync properly from a Mac to Windows without careful Unicode transformations, so imagine what becomes of extended attributes, resource forks, ACLs, etc. Even if you’re not cross-platform, most file systems can be configured as either case-sensitive or case-insensitive. So you have to come up with an extensible strategy to deal with the different models, and the testing involved is effectively endless.
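
To make the file-name issue concrete, here is a small Python sketch (the `same_name` helper is purely illustrative): macOS historically stores names in a decomposed Unicode form while Windows applications usually write the composed form, and case sensitivity changes what counts as a collision.

```python
import unicodedata

# macOS (HFS+) stores file names in a decomposed form (close to NFD),
# while Windows keeps what the application wrote (usually NFC).
# Comparing raw strings therefore misses that these are the "same" name.
mac_name = unicodedata.normalize("NFD", "Résumé.txt")
win_name = unicodedata.normalize("NFC", "Résumé.txt")

assert mac_name != win_name                                 # raw comparison fails
assert unicodedata.normalize("NFC", mac_name) == win_name   # normalized match

def same_name(a, b, case_sensitive=False):
    """Compare file names the way the target file system would."""
    a, b = unicodedata.normalize("NFC", a), unicodedata.normalize("NFC", b)
    if not case_sensitive:
        a, b = a.casefold(), b.casefold()
    return a == b

# "Report.TXT" and "report.txt" collide on a case-insensitive volume:
print(same_name("Report.TXT", "report.txt"))                 # True
print(same_name("Report.TXT", "report.txt", True))           # False
```

And this is only file names; each additional piece of metadata needs its own translation strategy for every pair of platforms.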

Sync messes with third-party applications

Applications tend to misbehave in various ways with their documents. When a sync product attempts to update a file with a newer version from another device, it can’t always know whether the file is currently open for reading or editing. If it is, the application may become unstable. Syncing application data is also dangerous.
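
One common mitigation, sketched below in Python (a generic technique, not any particular product’s implementation), is to never write into the target file in place: write the new version to a temporary file in the same directory, then atomically swap it in, so a reader never observes a half-written file.

```python
import os
import tempfile

def atomic_update(path, new_bytes):
    """Replace `path` with `new_bytes` without exposing a partial write.

    The new version is written next to the target, flushed to disk, then
    swapped in with an atomic rename. Readers see either the complete old
    file or the complete new one; on POSIX systems, an application that
    already holds the file open keeps reading the old contents.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".sync-tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_bytes)
            f.flush()
            os.fsync(f.fileno())    # make sure the bytes hit the disk
        os.replace(tmp, path)       # atomic on the same file system
    except BaseException:
        os.unlink(tmp)              # clean up the temporary on failure
        raise
```

This protects readers from torn files, but it still can’t tell the sync engine whether an editor has unsaved changes in memory, so the conflict problem remains.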

Sync is hard to test

Sync maintains redundant copies of the user’s data through incremental updates. The devices are in sync as long as the redundant copies are consistent. Developers and testers will usually assume in their testing that the initial state is in sync; they will make a change on one side and see that the other side changes accordingly. So as soon as an error is introduced into the system (you’re out of sync to start with), you’re in an untested scenario. The system needs to recover from its own errors.
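
This is why robust sync engines complement incremental updates with a full-state reconciliation pass that can repair divergence however it arose. A toy Python sketch (the `reconcile` helper is hypothetical, and repairs in one direction only):

```python
import hashlib

def state(replica):
    """Map each path in a replica (a dict of path -> bytes) to a content digest."""
    return {p: hashlib.sha256(data).hexdigest() for p, data in replica.items()}

def reconcile(source, target):
    """Repair `target` from `source` by comparing full states.

    Unlike replaying incremental changes, this converges no matter how
    the replicas got out of sync, so it doubles as error recovery.
    """
    src, dst = state(source), state(target)
    for path, digest in src.items():
        if dst.get(path) != digest:
            target[path] = source[path]   # missing or divergent: copy over
    for path in dst.keys() - src.keys():
        del target[path]                  # absent upstream: remove

a = {"notes.txt": b"v2", "todo.txt": b"x"}
b = {"notes.txt": b"v1", "stale.txt": b"y"}   # an inconsistent starting state
reconcile(a, b)
assert state(a) == state(b)                   # converged regardless of history
```

Real engines do the same thing far more carefully (two-way, with conflict handling, and without hashing every file on every pass), but the principle is the same: test and design for recovery from an arbitrary bad state, not just for the happy path.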

Several of these problems are on the client side, which is why building a sync client is hard — even on the best sync platforms out there, such as Sync Services or Live Framework.

Some of these difficulties are just inherent to sync, but the technology is maturing. Sync so far has meant something different to everybody. Even the industry’s main players still sell byproducts of sync rather than sync itself, which then compete with backup, online storage, photo-sharing or music-streaming services. Sync is a great enabler for all these connected services, one that’s becoming central to the personal cloud story of more and more companies.

Jean-Gabriel Morard is a software engineer living in Paris.

  1. Timely article as I was just bemoaning the slow progress of Live Mesh.

  2. Richard Farleigh Sunday, May 10, 2009

    You can solve this all quite easily.

    Don’t have multiple places to store your data, just have one cloud/server type storage that all your devices can access.

    Eliminate local storage and store everything in the cloud.

    1. To go along with the trust and privacy issues associated with keeping all your data “in the cloud”, there is also the real problem that bandwidth, whether it be cable/DSL, 3G, 4G, etc., is nowhere near ready to replicate local load times, particularly as it relates to working with images and video.

    2. Yes, this works! I am developing software with a centralized server storing all the data. The data can be accessed from mobile, web or desktop.

  3. We solved this problem by doing away with sync altogether. Version control is a far better way to go. We have developed a way to version contacts and tasks, therefore eliminating the need for sync between Outlook and the cloud. We have also integrated this into our new iPhone application which is being sent to Apple this week. Please check us out at http://www.cosential.com

    1. @Dan Cornish Version control is actually a very interesting approach. It’s a mature technology, it’s very safe with a very granular control left to users. But it has its limits as well: it pushes the complexity of dealing with conflicts onto the users. While version control is perfectly adapted to advanced users, it isn’t a great fit for people who are mostly looking for a seamless productivity tool.

  4. I have been using rsync for the last 4 years with no issues.

  5. [...] Why Sync Is So Difficult — 18:30 via Google [...]

  6. As far as I can tell, Dropbox has actually solved the desktop file-sync problem. It’s the first sync product of its kind that I’ve used that works out-of-box.

    Yes, sync is hard. Where I think things fail is mainly around user experience. You just have to make smart decisions and let users recover if they see something they don’t expect. Dropbox does this quite well.

    I learned this when I was at Microsoft on the ActiveSync team — we actually were trounced by RIM not because our sync was worse (it was probably superior) but because we initially made the mistake of OVER-reporting status.

    Sync should be a silent, no/very little UI experience — a utility that just works in the background. Any attempt to make it more than that will cause the product to fail miserably.

    1. @Garry Tan I totally agree that users should not need to know that we’re solving these problems for them. Ideally you’d have no UI at all, but when people rely on the sync product to push a file onto a device before they hit the road, they also want to know that it made it there before they turn off their computer. So some feedback is needed. File sync is also expensive in bandwidth and CPU, and you need to account for that to the user. It doesn’t have to jump out of the screen, but it should be there for reassurance to users who do want to know.

      Again, syncing files just makes it more likely to run into any of the problems that you may have when you sync PIM data. Indeed, you can assume that PIM-data sync is almost always carried through to completion; but file sync is very often interrupted (because transferring files across the network is slow!), and people will then wonder where their data is. Communication is important in that case.

  7. Don’t forget about Livedrive; they seem to have syncing down to a fine art. Works perfectly.

  8. Hello,
    Great article. I just wanted to add that SpiderOak Inc (https://spideroak.com) offers a FREE (2GB) online backup and smart sync solution for both companies and consumers.

    SpiderOak is available for Linux, Mac and Windows and incorporates 100% zero-knowledge online backup, sync, storage, access and sharing.

    Try out SpiderOak today at https://spideroak.com

    Daniel Larsson
    SpiderOak Inc

  9. @Richard Farleigh wrote: “Eliminate local storage and store everything in the cloud.”

    And when the backhoe accidentally cuts multiple fiber cables, suddenly you have no access to your data. FAIL.

  10. @Richard Farleigh You’re right that a remote-access based system works around these problems.

    Online storage is great if you live in a fully connected world with unlimited bandwidth, or with a small number of small files, so that you can always find network access and download them on demand. But as soon as you try to actually solve the problem of the multiplication of devices, where people really want read/write access to all their files all the time, it’s too limited. Also, with remote access-based solutions, users have to think ahead and upload the files they are going to need on the road.

    One could come up with a hybrid system, such as remote access with a cache. If you cache some of your files locally, then you solve some of the network issues (flaky, slow or no network). But if you give people write access to the files in their cache, then you introduce the problem of merging the local and the remote edits when the user goes back online, and you’re back in sync land.


Comments have been disabled for this post