Vladimir Sedach

Have Emacs - Will Hack

March 9, 2009

Facebook Cassandra database overview part 2: algorithms and data structures

This is the second part of my examination of the Cassandra distributed database. Part 1 described the problem that Cassandra is trying to address; this post will describe the distributed systems techniques that it utilizes to that end.

The basis for this post is a presentation describing Cassandra by Prashant Malik, Karthnik Ranganathan, and Avinash Lakshman given to the Facebook staff and available online: http://www.new.facebook.com/video/video.php?v=540974400803#/video/video.php?v=540974400803 (hosted on Facebook, an account is required for viewing). James Hamilton provides a summary of the presentation on his blog.

Underlying Cassandra is the technique of consistent hashing. Using consistent hashing, the key space can be modeled as a circle of the keys' hash values, and nodes assigned to points on that circle, so that a node is responsible for an interval of the circle of key hashes between itself and its next neighbor. Nodes leaving or joining only affect their previous neighbor, and lookup has a very low communication cost. Due to its great simplicity and robustness compared to other mechanisms, consistent hashing is probably the most significant development in distributed systems in the 1990s - virtually all current peer to peer systems are based on it.

Cassandra uses a gossip protocol for determining cluster membership and for failure detection (and for load-balancing, discussed in the next blog post). Failure detection in this context means the various failure detector mechanisms that can be used to achieve consensus in asynchronous distributed systems with unreliable nodes. Given unlimited time to finish a task, it is impossible to distinguish a dead node from one that is slow - failure detectors use timing assumptions and heartbeat messages between nodes to provide an answer (with some degree of uncertainty), something that would otherwise require a synchronous network between nodes. Cassandra implements a failure detector based on Hayashibara et alia's The φ Accrual Failure Detector; the presenter points out that some "ghetto" failure detectors with simpler mechanisms do not work very well in practice.

Each interval of the hash circle is replicated on a set number N of nodes. Cassandra provides a choice of replication scheme: optimistic replication, which gives eventual consistency, or stricter approaches for writes and reads: quorum write and sync on read. Optimistic replication allows writes even if all nodes that are responsible for storing the particular key are down: the node receiving the request logs it and later hands it to the appropriate nodes when they come back up. Under the quorum write scheme, a write is not successful until at least a specified x number of the N total replicas have performed the write. It would seem that if x=N, then we get both monotonic read and write consistency. Sync on read synchronizes all replicas on each read request, which gives monotonic write consistency.