
Summary:

Google, which is notoriously secretive about technology advances, has opened up the vault and spit out a research paper on its Spanner database. And like other Google innovations, this one is hot. It’s a database that scales to millions of machines and trillions of rows.


Google has made public the details of its Spanner database technology, which allows a database to store data across multiple data centers, millions of machines and trillions of rows. But Spanner is not just larger than the average database. It also lets applications that use the database dictate where specific data is stored, which reduces latency when retrieving it.

Making this whole concept work is what Google calls its TrueTime API, which draws on atomic clocks and GPS clocks to timestamp data so it can then be synchronized across as many data centers and machines as needed. From the Google paper:

…Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner supports general-purpose transactions, and provides a SQL-based query language.
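The versioning behavior in that passage can be illustrated with a toy in-memory store. This is only a sketch of the general multi-version idea, not Google's implementation: the class name, method names and integer timestamps are all invented for the example.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Toy multi-version key-value store (illustrative only, not Spanner's design)."""

    def __init__(self):
        # key -> parallel lists: ascending commit timestamps and their values
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, value, commit_ts):
        """Record a new version stamped with its commit time."""
        ts_list = self._timestamps[key]
        if ts_list and commit_ts <= ts_list[-1]:
            raise ValueError("commit timestamps must be strictly increasing")
        ts_list.append(commit_ts)
        self._values[key].append(value)

    def read(self, key, at_ts):
        """Return the newest version committed at or before at_ts, else None."""
        ts_list = self._timestamps.get(key, [])
        idx = bisect.bisect_right(ts_list, at_ts)
        return self._values[key][idx - 1] if idx else None

    def garbage_collect(self, cutoff_ts):
        """Drop versions that no read at or after cutoff_ts could observe."""
        for key, ts_list in self._timestamps.items():
            idx = bisect.bisect_right(ts_list, cutoff_ts)
            # Keep the newest version at or before the cutoff; drop older ones.
            drop = max(idx - 1, 0)
            del ts_list[:drop]
            del self._values[key][:drop]


store = VersionedStore()
store.write("user:1", "old address", commit_ts=10)
store.write("user:1", "new address", commit_ts=20)
print(store.read("user:1", at_ts=15))  # reads the version committed at 10
```

A read at an old timestamp simply finds the last version committed at or before that time, and the garbage-collection policy decides how far back such reads remain possible.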

Because of the importance of the TrueTime API, Google has installed GPS antennas and atomic clocks on dedicated time-master machines in the data centers running Spanner. The approach is fairly unusual, but Google’s innovations have a way of spreading once they are publicized.
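The paper describes TrueTime as reporting the current time as an interval with bounded uncertainty rather than as a single instant. A rough sketch of that idea follows. The class names and the 7 ms error bound are assumptions made for illustration, and the system clock stands in for the GPS and atomic-clock references.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # true time is no earlier than this
    latest: float    # true time is no later than this


class ToyTrueTime:
    """Illustrative stand-in for an interval clock; not Google's implementation."""

    def __init__(self, epsilon_s=0.007, clock=time.time):
        # epsilon_s is a hypothetical uncertainty bound chosen for the example.
        self.epsilon_s = epsilon_s
        self._clock = clock

    def now(self):
        """Return an interval assumed to contain the true current time."""
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t):
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True only if t has definitely not arrived yet."""
        return self.now().latest < t


tt = ToyTrueTime()
interval = tt.now()
print(interval.latest - interval.earliest)  # width is roughly 2 * epsilon_s
```

The tighter the clock hardware keeps that interval, the less time a system has to wait before it can be sure one timestamp really precedes another. That is the case for dedicated time hardware.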

For the full walk-through on Spanner, Google’s paper delves into the specifics. Here are a few tidbits to help determine if Spanner is something you’d care about.

  • Spanner automatically reshards data across machines, and it automatically migrates data across machines and data centers to balance load and in response to failures.
  • This makes Spanner a good fit for applications that need high availability, as well as for those that need a semi-relational database that handles reads and writes faster than Google’s Megastore option.
  • Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions.
  • Spanner’s data model is not purely relational, in that every row must have a name. Each table needs an ordered set of one or more primary-key columns, a requirement that will be familiar to people who work with key-value stores. The primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns. The paper says imposing this structure is useful because it lets applications control data locality through their choices of keys.
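The data-model point in the last bullet can be sketched in a few lines. In this hypothetical example (the class, table and column names are all made up), a table maps an ordered primary-key tuple to its non-key columns, and rows that share a key prefix sit next to each other in key order.

```python
import bisect


class KeyedTable:
    """Toy table: ordered primary-key tuple -> non-key columns (illustrative only)."""

    def __init__(self, key_columns, value_columns):
        self.key_columns = tuple(key_columns)
        self.value_columns = tuple(value_columns)
        self._keys = []  # primary-key tuples, kept sorted
        self._rows = {}  # primary-key tuple -> dict of non-key columns

    def upsert(self, key, **values):
        """Insert or replace the row named by its primary key."""
        if len(key) != len(self.key_columns):
            raise ValueError("key must supply every primary-key column")
        unknown = set(values) - set(self.value_columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = values

    def scan_prefix(self, prefix):
        """Yield (key, row) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            yield key, self._rows[key]


albums = KeyedTable(("user_id", "album_id"), ("name",))
albums.upsert((2, 1), name="Trip")
albums.upsert((1, 2), name="Pets")
albums.upsert((1, 1), name="Family")
for key, row in albums.scan_prefix((1,)):
    print(key, row)  # user 1's albums come out together, in key order
```

Because keys are ordered, putting the user ID first in the key keeps one user's albums contiguous. This is the kind of locality control through key choice that the paper describes.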

Spanner is cool as a database tool for the current era of real-time data, but it also indicates how Google is thinking about building a compute infrastructure designed to run in a dynamic environment where the hardware, the software and the data being processed are all constantly changing.


  1. This is very clever work, but unless that Data Center is mobile, I doubt the location changes appreciably. Maybe just put Lat/Long in a config file perhaps?…or even better, just use the Location API in Google Maps because it can pin me down to a 2500 foot radius just based on WiFi networks in the vicinity and I don’t think the time varies that much in half a mile.

    1. They’re using GPS for its accurate time base, not for location.

      1. Right, time synchronization is a really important feature for large distributed systems, and it’s a pain to get right. This might look like overkill, but it guarantees consistent performance and I’m not surprised Google put in this “overhead” to make sure that the time is synchronized without human intervention.

      2. “Guarantees” is a bit of a strong word, considering there are a fair number of things that can interrupt GPS, but I agree. Plus, GPS and atomic clocks are cheap hardware and they build their machines en masse so it’s probably not “overhead,” it’s probably cheaper than the man hours to configure them. (Hope this doesn’t repost; the login was not intuitive)

  2. Sounds like an interesting combination of similar functionality in MongoDB (2.2) which introduced the ability to control which shards data sits on and versioning of documents in Cassandra. It’ll be interesting to see if this gives birth to an open source project to implement it, although the need for hardware atomic clocks makes it more difficult. I wonder if NTP based time sync would be sufficient for most non-Google-scale setups.

  3. Anyone who has done modelling in a real time system will understand the importance of having data timestamped to at least millisecond accuracy, and will also understand that when you are timestamping data on multiple servers that each machine has time creep and they all need to be synchronized somewhat fanatically in order to ensure your data’s timestamps are accurate.

    While this may seem like overkill from the outside, it’s the bare minimum for a distributed system that takes real-time data seriously.

  4. I do not get it. When you require tables to have keys, they become relations; thus, Spanner should be more relational than SQL, not less.

  5. truetime is what is really going to make shit change in the long run. solid date time support across timezones, systems, and everything else. no more date time shenanigans is heaven.

  6. Good for google to disclose some of its “vault” of information. Google has been silent with its technological advances, it’s their right of course. But sometimes it’s best if little details will also be shared with the people. I agree, that is a very clever work.

    http://www.bio-office.com/
