Google’s Spanner: A database that knows what time it is

Google has made public the details of its Spanner database technology, which allows a database to store data across multiple data centers, millions of machines and trillions of rows. But Spanner is not just larger than the average database; it also allows applications that use the database to dictate where specific data is stored so as to reduce latency when retrieving it.

Making this whole concept work is what Google calls its TrueTime API, which combines atomic clocks and GPS clocks to timestamp data so it can then be synced across as many data centers and machines as needed. From the Google paper:

…Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner supports general-purpose transactions, and provides a SQL-based query language.
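
To make the "multi-version" idea in that passage concrete, here is a minimal single-machine sketch in Python. It is not Spanner's implementation; the class name `MultiVersionStore` and its methods are hypothetical, and commit timestamps are passed in by the caller rather than assigned by the system. It only shows how timestamped versions allow reads "in the past" and how a garbage-collection policy can trim old versions.

```python
import bisect
from collections import defaultdict


class MultiVersionStore:
    """Toy multi-version key-value store (illustrative names, not Spanner's API)."""

    def __init__(self):
        # key -> list of (commit_ts, value), kept sorted by commit_ts
        self._versions = defaultdict(list)

    def write(self, key, value, commit_ts):
        versions = self._versions[key]
        if versions and commit_ts <= versions[-1][0]:
            raise ValueError("commit timestamps must increase per key")
        versions.append((commit_ts, value))

    def read(self, key, at_ts):
        """Return the newest value committed at or before at_ts, or None."""
        versions = self._versions.get(key, [])
        timestamps = [ts for ts, _ in versions]
        idx = bisect.bisect_right(timestamps, at_ts)
        return versions[idx - 1][1] if idx else None

    def garbage_collect(self, keep_after_ts):
        """Drop versions no read at keep_after_ts or later could observe."""
        for versions in self._versions.values():
            timestamps = [ts for ts, _ in versions]
            idx = bisect.bisect_right(timestamps, keep_after_ts)
            # Keep the newest version at or before the cutoff, drop older ones.
            if idx > 1:
                del versions[: idx - 1]


store = MultiVersionStore()
store.write("user:1", "alice@old.example", commit_ts=10)
store.write("user:1", "alice@new.example", commit_ts=20)
print(store.read("user:1", at_ts=15))  # value as of timestamp 15
print(store.read("user:1", at_ts=25))  # latest value
```

After `garbage_collect(20)` in this sketch, a read at timestamp 15 would find nothing, which is the trade-off a garbage-collection policy makes: less storage in exchange for a shorter readable history.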

Because of the importance of the TrueTime API, Google has GPS antennas and atomic clocks on the servers in the data centers running Spanner technology. The approach is fairly unusual, but Google’s innovations have a way of spreading once they are publicized.
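
As the paper describes it, TrueTime reports the current time as an interval that is guaranteed to contain the true time, rather than as a single instant, and offers checks for whether a timestamp has definitely passed or definitely not yet arrived. The Python sketch below illustrates only that shape. The class names, the fixed `epsilon_seconds` uncertainty bound and the injectable `source` clock are assumptions for illustration; a real deployment derives the bound from its GPS and atomic-clock hardware.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # true time is assumed to be no earlier than this
    latest: float    # true time is assumed to be no later than this


class UncertainClock:
    """Toy interval clock (hypothetical; not Google's TrueTime implementation)."""

    def __init__(self, epsilon_seconds, source=time.time):
        self.epsilon = epsilon_seconds  # assumed worst-case clock error
        self._source = source           # local clock reading, injectable for tests

    def now(self):
        t = self._source()
        return TimeInterval(t - self.epsilon, t + self.epsilon)

    def after(self, ts):
        """True only if ts has definitely passed."""
        return self.now().earliest > ts

    def before(self, ts):
        """True only if ts has definitely not arrived yet."""
        return self.now().latest < ts


clock = UncertainClock(epsilon_seconds=0.005)
interval = clock.now()
print(interval.latest - interval.earliest)  # width is roughly 2 * epsilon
```

The smaller the uncertainty bound, the sooner a system can be sure one event happened before another, which is one way to read the investment in dedicated time hardware.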

For the full walk-through on Spanner, Google’s paper delves into the specifics. Here are a few tidbits to help determine if Spanner is something you’d care about.

  • Spanner automatically reshards data across machines, and it automatically migrates data across machines and data centers to balance load and in case of failures.
  • This makes Spanner good for high availability as well as applications that need a semi-relational database that handles reads and writes faster than Google’s Megastore option.
  • Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions.
  • Spanner’s data model is not purely relational. Rows must have names: every table is required to have an ordered set of one or more primary-key columns, a requirement familiar to people who work with key-value stores. The primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns. The paper says imposing this structure is useful because it lets applications control data locality through their choices of keys.
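
The last point, a table as a mapping from primary-key columns to the remaining columns, can be sketched in a few lines of Python. This is an in-memory illustration only; `KeyedTable`, `scan_prefix` and the `user_id`/`album_id` columns are made-up names, and real Spanner spreads key ranges across machines. The sketch shows why key choice affects locality: rows whose keys share a prefix sort next to each other.

```python
import bisect


class KeyedTable:
    """Toy table: primary-key tuple -> non-key columns, kept in key order."""

    def __init__(self, key_columns):
        self.key_columns = tuple(key_columns)
        self._keys = []  # sorted list of primary-key tuples
        self._rows = {}  # primary-key tuple -> dict of non-key columns

    def insert(self, row):
        key = tuple(row[c] for c in self.key_columns)
        values = {c: v for c, v in row.items() if c not in self.key_columns}
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = values

    def get(self, *key):
        return self._rows.get(tuple(key))

    def scan_prefix(self, *prefix):
        """Return all rows whose key starts with prefix, in key order."""
        prefix = tuple(prefix)
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if key[: len(prefix)] != prefix:
                break
            result.append((key, self._rows[key]))
        return result


albums = KeyedTable(key_columns=["user_id", "album_id"])
albums.insert({"user_id": 1, "album_id": 2, "name": "Holiday"})
albums.insert({"user_id": 2, "album_id": 1, "name": "Work"})
albums.insert({"user_id": 1, "album_id": 1, "name": "Family"})
print(albums.scan_prefix(1))  # both albums for user 1, adjacent in key order
```

Putting `user_id` first in the key means one user's albums form a contiguous range, so a store that places contiguous key ranges together would keep them close to each other.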

Spanner is cool as a database tool for the current era of real-time data, but it also indicates how Google is thinking about building a compute infrastructure that is designed to run in a dynamic environment where the hardware, the software and the data being processed are constantly changing.

11 Responses to “Google’s Spanner: A database that knows what time it is”

  1. truetime is what is really going to make shit change in the long run. solid date time support across timezones, systems, and everything else. no more date time shenanigans is heaven.

  2. Scott Sommer

    Anyone who has done modelling in a real time system will understand the importance of having data timestamped to at least millisecond accuracy, and will also understand that when you are timestamping data on multiple servers that each machine has time creep and they all need to be synchronized somewhat fanatically in order to ensure your data’s timestamps are accurate.

    While this may seem like overkill from the outside, it’s the bare minimum for a distributed system that takes real-time data seriously.

  3. Sounds like an interesting combination of similar functionality in MongoDB (2.2) which introduced the ability to control which shards data sits on and versioning of documents in Cassandra. It’ll be interesting to see if this gives birth to an open source project to implement it, although the need for hardware atomic clocks makes it more difficult. I wonder if NTP based time sync would be sufficient for most non-Google-scale setups.

  4. This is very clever work, but unless that Data Center is mobile, I doubt the location changes appreciably. Maybe just put Lat/Long in a config file perhaps?…or even better, just use the Location API in Google Maps because it can pin me down to a 2500 foot radius just based on WiFi networks in the vicinity and I don’t think the time varies that much in half a mile.

      • Suman Srinivasan

        Right, time synchronization is a really important feature for large distributed systems, and it’s a pain to get right. This might look like overkill, but it guarantees consistent performance and I’m not surprised Google put in this “overhead” to make sure that the time is synchronized without human intervention.

      • TheXenocide

        “Guarantees” is a bit of a strong word, considering there are a fair number of things that can interrupt GPS, but I agree. Plus, GPS and atomic clocks are cheap hardware and they build their machines en masse so it’s probably not “overhead,” it’s probably cheaper than the man hours to configure them. (Hope this doesn’t repost; the login was not intuitive)