Tuesday, March 27, 2018

An analysis of the strengths and weaknesses of Apache Arrow

In my previous blog post, I discussed the relatively new Apache Arrow project, and compared it with two similar column-oriented storage formats in ORC and Parquet. In particular, I explained how storage formats targeted for main memory have fundamental differences from storage formats targeted for disk-resident data. There was a surprising amount of activity surrounding the post --- the post received 28,000 visits (making it the 7th most popular post on my blog all time), and 86 comments in a HackerNews thread discussing the post. Given this clear interest of my readers in Apache Arrow, I would like to take a deeper look into the project in this post, and present my analysis of both some specific decisions that were made regarding the format itself, and also my personal experience with installing and running experiments on the code base.

A quick caveat before we begin: Many of the comments on the HackerNews thread revolved around a back-and-forth between fans and contributors to the Apache Arrow project who went ballistic when they read my title with a sarcastic tone (the title was: “Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?”) and more thoughtful and thorough readers who tried to calm them down and explain that the entire post was there to explain precisely why it makes sense to have Arrow as a separate project. However, one common point that was brought up by the pro-Arrow crowd was that my post was narrow in the sense that I only looked at Arrow from the perspective of using it as a storage format in the context of database and data analytics engines, whereas Arrow, as a general standard for representing data in main memory could also be used outside of this space. I should have been clearer about the scope of my analysis in that post, so this time around I want to be more clear: the scope of my analysis in this post is solely from the perspective of using Apache Arrow as a storage format in the context of database and data analytics engines and tools. I limit the scope to this context for two reasons: (1) I predict that the majority of Arrow’s use cases will be in that context (where I define data analytics tools broadly enough to include projects like Pandas) (2) As someone who has spent his entire career as a database system researcher, this is the only context in which I am qualified to present my opinion.

What exactly is Apache Arrow?

Arrow’s homepage self-describes in the following way: “Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.”

In other words, the creators of Arrow envision the project having impact in three ways: (1) as a development platform, (2) as a columnar memory format standard, and (3) as a set of useful libraries.

In practice, the majority of the code in the Github repository at the time of my interactions with the codebase was for constructing, accessing, and testing data structures using the Arrow standard. So my analysis in this article will just focus on the Arrow standard and the set of code that is provided to help in the implementation of this standard.

Is it even possible to have everybody agree on a data representation standard?

For decades, it was impossible to fathom that there could be a standard representation for data in a database system. Database systems have historically been horribly monolithic and complex pieces of software. The many components of the system --- the storage layer, the transaction manager, the access manager, the recovery manager, the optimizer, etc. --- were significantly intertwined, and designed assuming particular architectural choices for the other components. Therefore, a “page” of data on storage was not a simple block of data, but also contained information critical to the recovery manager (e.g. the identifier of the log record that most recently wrote to this page), the transaction manager (e.g. timestamps required by multi-version concurrency control schemes), access manager, and so on. Each unique database system had different concurrency control schemes, different logging structures, and different index implementations; therefore a page of data in one system looked vastly different than a page of data in other system.

Therefore, if you wanted to move data from one system to another one, you would have to “export” the data, which involved rewriting the data from the complex page format stored inside the system to a simpler representation of the data. This simple representation would be passed to the other system which would then rewrite the simple representation into its own proprietary standard. This process of rewriting the data before export is called “serialization” and rewriting it back before import is called “deserialization”. Serialization and deserialization costs when transferring data between systems have historically been necessary and unavoidable.

Over the past few decades, the database system world has changed significantly. First, people stopped believing that one size fits all for database systems, and different systems started being used for different workloads. Most notably, systems that specialized in data analysis separated from systems that specialized in transactional processing. Systems that specialized in data analysis tend to either be read-only or read-mostly systems, and therefore generally have far simpler concurrency control and recovery logic. Second, as the price of memory has rapidly declined, a larger percentage of database applications fit entirely in main memory. This also resulted in simpler recovery and buffer manager logic, which further simplified data representation. Finally, as open source database systems started to proliferate, a greater emphasis was placed on modular design and clean interfaces between the components of the system, in order to accommodate the typical distributed development of open source projects.

All of this has lead to much simpler data representations in main memory and analytical database engines, especially those in the open source sphere. Fixed width data types are often just represented in arrays, and variable-width data types in only slightly more complicated data structures. All of a sudden, the prospect of standardizing the data representation across main memory analytical database systems has become a realistic goal, thereby enabling the transfer of data between systems without having to pay serialization and deserialization costs.

This is exactly the goal of Apache Arrow. Arrow is, in its essence, a data representation specification --- a standard that can be implemented by any engine that processes data in main memory. Engines that use this standard internally can avoid any kind of serialization and deserialization costs when moving data between each other, which several other blog posts (e.g. here and here) have shown to result in significant performance gains. 13 major open source projects, including Pandas, Spark, Hadoop and Dremio have already embraced the standard, which I believe is enough critical mass for the Arrow standard to become ubiquitous in the data analytics industry. Even if existing systems do not adopt the standard for their own internal data representation, I expect they will at least support data exports in Arrow. This increases the motivation for any new main memory analytics engine being designed to adopt it.

While ubiquity is usually a good indicator of quality, there are plenty of languages, APIs, and pieces of software that become ubiquitous for reasons entirely unrelated to their quality. For example, the SQL interface to database systems took off due to the business dominance of the systems that used SQL, even though there were arguably better interfaces to database systems that had been proposed prior to SQL’s take-over. Furthermore, even high quality things are often optimized for certain scenarios, and yield suboptimal performance in scenarios outside of the intended sweet spot. Therefore, I took a deeper look at Apache Arrow without any preconceived biases. Below, I present my analysis and experience with playing with the code of the project, and discuss some of the design decisions of Arrow, and the tradeoffs associated with those decisions.

Columnar

It would be easy for someone who sees Arrow’s self-description of being “columnar” to mistakenly assume that Arrow’s scope is limited to two dimensional data structures that have rows and columns, and that by being “columnar”, it is possible to derive that Arrow stores data column-by-column instead of row-by-row. In fact, Arrow is actually a more general standard --- including specification for one-dimensional data structures such as arrays and lists, and also data structures with more than two dimensions through its support for nesting. Nonetheless, we have all become accustomed to interacting with data through relational database systems and spreadsheets, both of which store data in two dimensional objects, where each row corresponds to an entity, and each column an attribute about that entity. By being columnar, Apache Arrow stores such two dimensional objects attribute by attribute instead of entity by entity. In other words, the first attribute of each entity is stored contiguously, then the second attribute of every entity, and so on.

This columnar representation means that if you want to extract all attributes for a particular entity, the system must jump around memory, finding the data for that entity from each of the separately-stored attributes. This results in a random memory access pattern which results in slower access times than sequential access patterns (my previous post discusses this point in more detail). Therefore, Arrow is not ideal for workloads that tend to read and write a small number of entire entities at once, such as OLTP (transactional) workloads. On the other hand, data analytics workloads tend to focus on only a subset of the attributes at once; scanning through large quantities of entities to aggregate values of these attributes. Therefore, storing data in a columnar fashion results in sequential, high performance access patterns for these workloads.

Storing data column-by-column instead of row-by-row has other advantages for analytical workloads as well --- for example it enables SIMD acceleration and potentially increases compression rates (my previous post goes into more detail on these subjects). The bottom line is that by choosing a columnar representation for two-dimensional data structures, Arrow is clearly positioning itself for adoption by data analytics workloads that do not access individual data items, but rather access a subset of the attributes (columns) from many data items. Indeed, many of the open source projects that have been built natively on Arrow, such as Dremio and NVIDIA’s Open GPU Accelerated Analytics (GOAI), are focused on analytics.

Fixed-width data types

Note that attributes of entities tend to have a uniform data type. Therefore, by choosing a columnar data representation, Arrow can store columns of two dimensional tables in an identical way to how it stores data in one dimension of uniform type. In particular, fixed-width data types such as integers and floats can simply be stored in arrays. Arrow is little endian by default and pads arrays to 64-byte boundaries. Aside from the extra padding, Arrow arrays store data in memory in an equivalent fashion to arrays in C, except that Arrow arrays have three extra pieces of metadata that are not present in C arrays: (1) the length of the array, (2) the number of null elements in the array, and (3) a bitmap indicating which elements of the array are null. One interesting design decision made by Arrow is that null elements of the array take up an identical amount of space as non-null elements --- the only way to know if an element is null is by checking to see if there is a 1-bit in the associated bit for that element in null-bitmap that is part of the metadata for the array. The alternative design would have been to not waste storage at all on the null elements, and instead derive the location of null elements by inspection of the null bitmap. The tradeoff here is storage space vs random access performance. By expending space on null elements, the nth element of the array can be quickly located by simply multiplying n by the fixed-width size of each element in the array. However, if the null elements are removed from the array (and their location derived from the null bitmap), the amount of space needed to store the array will be smaller, but additional calculations and bit counting must occur before finding the value for an element in the array at a particular index. On the other hand, sequential scans of the entire array may be faster if the system is bottlenecked by memory bandwidth, since the array is smaller.

Since Arrow’s design decision was made to optimize for random array element access, I ran a simple benchmark where I created an array of size 100,000,000 32-bit integers, put random values in each element of the array, and then searched for the value at 50,000 different locations in the array. I first tried this experiment in a regular C array that allowed null elements to take up an identical amount of space as non-null elements (similar to Arrow). I then tried a different C array where nulls take up no space, and a high performance index is used to speed up random access of the array . I then installed Apache Arrow, built the same array using the Int32Builder in the Arrow C++ API and accessed it through the Arrow Int32Array API. I ran these experiments on an EC2 t2.medium instance. The results are shown below:




As expected, the version of the C array where nulls take up no space was much slower than the other options for this random access workload. Even though we used a high performance index, direct address offset calculations are always faster. Accessing data through the API that comes with the Arrow codebase was slightly slower than accessing data from an array directly. However, this is not because of the Arrow data format itself. When, after building the Arrow Array, instead of accessing the array through the Arrow API, I instead accessed a pointer to the raw data in the array, cast it as a const int*, and then proceeded to access this raw data directly in C, I saw equivalent performance to a normal C array. This cause of the slowdown from accessing data through the Arrow API is presumably from the C++ compiler failing to inline all of the extra function calls (despite the -O3 flag). I therefore conclude that for applications the are extremely performance sensitive, it is better to access raw data created in accordance to the Arrow specification than to use the API to access the data. But for most cases, using the API will be sufficient. As far as the decision to allow nulls to take up space, that was certainly a win for this random-access workload. But for a workload that scans through the entire dataset, it would have been up to 10% faster for the C array in which nulls take up no space, since in our experiment, 10% of all the values were null, and thus that version of the C array was 10% smaller than for the Arrow-specified arrays.

Variable-width data types

For variable width data types, Arrow stores the data for each element contiguously in memory without any separator bytes between the elements. In order to determine where one element ends and the next one begins, Arrow stores the byte offset of the first byte of each element of the array inside an integer array next to the raw data (there is an example in the next section below). In order to access a random element of the variable-width array, the integer array is accessed to find out the starting position of this and the next element in the raw data (the difference between these positions is the length of the element), and then the raw data is accessed.

The decision not to include separator bytes in the raw data between the elements makes the solution more general --- you don’t have to reserve special byte values for these separators. However, it slows down certain types of sequential access patterns. For example, I ran an experiment where I created an array of 12,500,000 variable-sized strings (average of 8 characters per string) using the StringBuilder API, and searched for a substring of size two characters within all elements of the array (extracting the index of all elements that contain the substring). I measured how long this query took when accessing the array both through the Arrow StringArray C++ API, and also over the raw Arrow data directly. Thirdly, I measured how long the same query took over a string array that included a separator byte between each element. The results are shown below:



In this case, the best performance was the array that was not created according to the Arrow specification. The reason for this is that the raw data could not be searched directly for the two-byte substring in the dataset created according to the Arrow specification, because the companion integer array containing the list of element boundaries needed to be repeatedly consulted to ensure that substring matches did not span multiple elements. However, when seperator bytes were located inside the array itself, no secondary array needed to be scanned.

It should be noted that string separators only accelerate certain types of queries, and I purposely chose one such query for this example. For queries that they do not accelerate, they tend to have to opposite effect, decreasing performance by bloating the size of the array. Furthermore, it should be reiterated at this point that reserving byte values for string separators would have prevented any application that do not reserve the same byte values from using Arrow, thereby limiting the scope of Arrow’s utility. In addition, many other queries can actually benefit from having the companion integer array. For example, an equality comparison (name == "johndoe") can utilize the integer array to ignore any value that has a different length. It should also be noted that any application that wishes to have string separators can simply add them to their strings directly, before creating the array using the StringBuilder API. So this experiment does not show a fundamental weakness of the Arrow standard --- it just indicates that in some cases you may have to add to the raw data in order to get optimal performance.

Nested Data

As self-describing data formats such as JSON become more popular, users are increasingly dropping the two-dimensional restrictions of relational tables and spreadsheets, and instead using nested models for their data. Arrow elegantly deals with nested data without requiring conceptual additions to the basic layout principles described above, where raw data is stored contiguously and offset arrays are used to quickly find particular data elements. For example, in a data set describing classes at the University of Maryland, I may want to nest the list of students in each class. For example, the data set:

Classes:
  Name: Introduction to Database Systems
  Instructor: Daniel Abadi
  Students: Alice
                    Bob
                    Charlie
  Name: Advanced Topics in Database Systems
  Instructor: Daniel Abadi
  Students: Andrew
                    Beatrice

could be stored as follows:

Name offsets: 0, 32, 67
Name values: Introduction to Database SystemsAdvanced Topics in Database Systems

Instructor offsets: 0, 12, 24
Instructor values: Daniel AbadiDaniel Abadi

Nested student list offsets: 0, 3, 5
Student offsets: 0, 5, 8, 15, 21, 29
Student values: AliceBobCharlieAndrewBeatrice

Note that the nested student attribute required two different offset lists: (1) Students are variable length and thus we need one offset list to specify where one student ends and the next one begins, just as for any variable length attribute; and (2) We need second offset list to indicate how many students exist per class. The "Nested student list offsets" accomplish this second goal --- it indicates that the first class has (3-0) students, and the second class has (5-3) students, etc.

Arrow currently allows list, struct, and various types of union type data to be nested in an attribute value.

Conclusion

It is important to separate out the specification of a standard from the tools and libraries that are provided in the current codebase that help developers with implementing this standard. As long as you are performing in-memory analytics where your workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard. The time is right for database systems architects to agree on and adhere to a main memory data representation standard. The proposed Arrow standard fits the bill, and I would encourage designers of main memory data analytics systems to adopt the standard by default unless they can find a compelling reason that representing their data in a different way will result in a significantly different performance profile (for example, Arrow’s attribute-contiguous memory layout is not ideal if your workloads are typically accessing multiple attributes from only a single entity, as is common in OLTP workloads). I also found the tools available in the codebase to read and write data using this standard to be easy to use and quick to get started with. However, I did find that at times, the code was slightly slower than the raw (and less general) implementation of the standard I wrote myself. Nonetheless, the existing codebase is good enough for most use cases and will likely help to further the acceleration of the adoption of the standard. Furthermore, additional performance enhancements to the codebase appear to be on their way, such as optimized LLVM-based processing modules.

Tuesday, October 31, 2017

Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?

Apache Parquet and Apache ORC have become a popular file formats for storing data in the Hadoop ecosystem. Their primary value proposition revolves around their “columnar data representation format”. To quickly explain what this means: many people model their data in a set of two dimensional tables where each row corresponds to an entity, and each column an attribute about that entity. However, storage is one-dimensional --- you can only read data sequentially from memory or disk in one dimension. Therefore, there are two primary options for storing tables on storage: Store one row sequentially, followed by the next row, and then the next one, etc; or store the first column sequentially, followed by the next column, and then the next one, etc.


Storage layout difference between row- and column-oriented formats


For decades, the vast majority of data engines used row-oriented storage formats. This is because many early data application workloads revolved around reading, writing, and updating single entities at a time. If you store data using a columnar format and you want to extract all attributes for a particular entity, the system must jump around, finding the data for that entity from each of the separately-stored attributes. This results in a random access pattern, which results in slower access times than sequential access patterns. Therefore, columnar storage formats are a poor fit for workloads that tend to read and write entire entities at once, such as OLTP (transactional) workloads. Over time, workloads became more complex, and data analytics workloads emerged that tended to focus on only a few attributes at once; scanning through large quantities of entities to aggregate and/or process values of these attributes. Thus, storing data in a columnar fashion became more viable, and columnar formats resulted in sequential, high performance access patterns for these workloads.


Apache Arrow has recently been released with seemingly an identical value proposition as Apache Parquet and Apache ORC: it is a columnar data representation format that accelerates data analytics workloads. Yes, it is true that Parquet and ORC are designed to be used for storage on disk and Arrow is designed to be used for storage in memory. But disk and memory share the fundamental similarity that sequential access is faster than random access, and therefore the analytics workloads which tend to scan through attributes of data will perform more optimally if data is stored in columnar format no matter where data is stored --- in memory or on disk.  And if that’s the case, the workloads for which Parquet and ORC are a good fit will be the identical set of workloads for which Arrow is a good fit. If so, why do we need two different Apache projects?


Before we answer this question, let us run a simple experiment to validate the claimed advantages of column-stores. On an Amazon EC2 t2.medium instance, I created a table with 60,000,000 rows (entities) in main memory. Each row contained six attributes (columns), all of them 32-bit integers. Each row is therefore 24 bytes, and the entire table is almost 1.5GB. I created both row-oriented and column-oriented versions of this table, where the row-oriented version stores the first 24-byte row, followed by the next one, etc; and the column-oriented version stores the entire first column, followed by the next one, etc. I then ran a simple query ideally suited for column-stores --- I simply search the entire first column for a particular value. The column-oriented version of my table should therefore have to scan though just the first column and will never need to access the other five columns. Therefore it will need to scan through 60,000,000 values * 4 bytes per value = almost 0.25GB. Meanwhile the row-store will need to scan through the entire 1.5GB table because the granularity with which data can be passed from memory to the CPU (a cache line) is larger than the 24-byte tuple. Therefore, it is impossible to read just the relevant first attribute from memory without reading the other five attributes as well. So if the column-store has to read 0.25GB of data and the row-store has to read 1.5GB of data, you might expect the column-store to be 6 times faster than the row-store. However, the actual results are presented in the table below:




Surprisingly, the row-store and the column-store perform almost identically, despite the query being ideally suited for a column-store. The reason why this is the case is that I turned off all CPU optimizations (such as vectorization / SIMD processing) for this query. This resulted in the query being bottlenecked by CPU processing, despite the tiny amount of CPU work that has to happen for this query (just a simple integer comparison operation per row). To understand how it is possible for such a simple query to be bottlenecked by the CPU, we need to understand some basic performance specifications of typical machines. As a rough rule of thumb, sequential scans through memory can feed data from memory to the CPU at a rate of around 30GB a second. However, a modern CPU processor runs at approximately 3 GHz --- in other words they can process around 3 billion instructions a second. So even if the processor is doing a 4-byte integer comparison every single cycle, it is processing no more than 12GB a second --- a far smaller rate than the 30GB a second of data that can be sent to it. Therefore, CPU is the bottleneck, and it does not matter that the column-store only needs to send one sixth of the data from memory to the CPU relative to a row-store.


On the other hand, if I turn on CPU optimizations (by adding the ‘-O3’ compiler flag), the equation is very different. Most notably, the compiler can vectorize simple operations such as the comparison operation from our query. What this means is that originally each of the 60,000,000 integer comparisons that are required for this query happened sequentially --- each one occurring in a separate instruction from the previous. However, once vectorization is turned on, most processors can actually take four (or more) contiguous elements of our column, and do the comparison operation for all four of these elements in parallel --- in a single instruction. This effectively makes the CPU go 4 times faster (or more if it can do more than 4 elements in parallel). However, this vectorization optimization only works if each of the four elements fit in the processor register at once, which roughly means that they have to be contiguous in memory. Therefore, the column-store is able to take advantage of vectorization, while the row-store is not. Thus, when I run the same query with CPU optimizations turned on, I get the following result:




As can be seen, the EC2 processor appears to be vectorising 4 values per instruction, and therefore the column-store is 4 times faster than the row-store. However, the CPU still appears to be the bottleneck (if memory was the bottleneck, we would expect the column-store to be 6 times faster than the row-store).


We can thus conclude from this experiment that column-stores are still better than row-stores for attribute-limited, sequential scan queries like the one in our example and similar queries typically found in data analytics workloads. So indeed, it does not matter whether the data is stored on disk or in memory --- column-stores are a win for these types of workloads. However, the reason is totally different. When the table is stored on disk, the CPU is much faster than the bandwidth with which data can be transferred from disk to the CPU. Therefore, column-stores are a win because they require less data to be transferred for these workloads. On the other hand, when the table is stored in memory, the amount of data that needs to be transferred is less relevant. Instead, column-stores are a win because they are better suited to vectorized processing.  


The reason why it is so important to understand the difference in bottleneck (even though the bottom line is the same) is that certain decisions for how to organize data into storage formats will look different depending on the bottleneck. Most notably, compression decisions will look very different. In particular, for data stored on disk, where the bandwidth of getting data from disk to CPU is the bottleneck, compression is almost always a good idea. When you compress data, the total size of the data is decreased, and therefore less data needs to be transferred. However, you may have to pay additional CPU processing costs to do the decompression upon arrival. But if CPU is not the bottleneck, this is a great tradeoff to make. On the other hand, if CPU is the bottleneck, such as our experiments above where the data was located in main memory, the additional CPU cost of decompression is only going to slow down the query.


Now we can understand some of the key differences between Apache Parquet/ORC and Apache Arrow. Parquet and ORC, since they are designed for disk-resident data, support high-ratio compression algorithms such as snappy (both), gzip (Parquet), and zlib (ORC) all of which typically require decompression before data processing (and the associated CPU costs). Meanwhile, Arrow, which is designed for memory-resident-data, does not support these algorithms. The only compression currently supported by Arrow is dictionary compression, a scheme that usually does not require decompression before data processing. For example, if you want to find a particular value in a data set, you can simply search for the associated dictionary-encoded value instead. I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression). I also expect that they will read the X100 compression paper which includes schemes which can be decompressed using vectorized processing. Thus, I expect that Arrow will eventually support an expanded set of compression options beyond just dictionary compression. But it is far less likely that we will see heavier-weight schemes like gzip and snappy in the Apache Arrow library any time soon.


Another difference between optimizing for main memory and optimizing for disk is that the relative difference between random reads and sequential reads is much smaller for memory than for disk. For magnetic disk, a sequential read can be 2-3 orders of magnitude faster than a random read. However, for memory, the difference is usually less than an order of magnitude. In other words, it might take hundreds or even thousands of sequential reads on disk in order to amortize the cost of the original random read to get to the beginning of the sequence. But for main memory, it takes less than ten sequential reads to amortize the cost of the original random read. This enables the batch size of Apache Arrow data to be much smaller than batch sizes of disk-oriented storage formats. Apache Arrow actually fixes batches to be no more 64K records.


So to return back to the original question: do we really need a third column-store Apache project? I would say that there are fundamental differences between main-memory column-stores and disk-resident column-stores. Main-memory column-stores, like Arrow, need to be CPU optimized and focused on vectorized processing (Arrow aligns data to 64-byte boundaries for this reason) and low-overhead compression algorithms. Disk-resident column-stores need to be focused transfer-bandwidth and support higher-compression ratio algorithms. Therefore, it makes sense to keep Arrow and Parquet/ORC as separate projects, while also continuing to maintain tight integration.

Sunday, October 8, 2017

Hazelcast and the Mythical PA/EC System

(Editor’s note: I was unaware that Kyle Kingsbury was doing a linearizability analysis of Hazelcast when I was writing this post. Kyle’s analysis resulted in Greg Luck, Hazelcast’s CEO, to write a blog post where he cited the PACELC theorem, and came to some of the same conclusions that I came to in writing this post. This post, however, was 98% written before both Kyle’s and Greg’s posts, but their posts got me to accelerate the completion of my analysis and publish it now.)

Seven years ago, I introduced the PACELC theorem as a mechanism to more clearly explain the consistency tradeoffs in building distributed systems. At that time, many people were familiar with the consistency vs. availability trade-off that was made well-known by the CAP theorem. However, it was common amongst people unfamiliar with the details of CAP theorem to believe that this tradeoff is always present in a distributed system. However, the truth is that the CAP consistency-availability tradeoff actually describes a very rare corner case. Only when there is an actual network partition --- an increasingly unusual event in modern day infrastructures ---  does the consistency-availability tradeoff present itself. At all other times, it is possible to be both available and consistent. Nonetheless, many systems choose not to be fully consistent at all times. The reason for this has nothing to do with the CAP tradeoff. Instead, there is a separate latency vs. consistency tradeoff. Enforcing consistency requires either (1) synchronization messages between machines that must remain consistent with each other or (2) all requests involving a particular data item to be served by a single master for that data item instead of the closest replica to the location where the request originates. Both of these options come with a latency cost. By relaxing consistency and serving reads and writes directly from a closest replica (without synchronization with other replicas), latency can be improved --- sometimes by an order of magnitude.
Therefore, I felt that it was important to clearly tease apart these separate consistency tradeoffs. This led to the PACELC theorem: if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?
In general, the PACELC theorem leads to four categories of systems: PC/EC, PA/EL, PC/EL, and PA/EC. However, in practice, an application will either go to the effort of building on top of a reduced consistency system or it will not. If it goes to this effort, it stands to benefit in two areas: availability upon a partition, and latency in everyday operation. It is unusual for a system to go to this effort and choose only to attain benefit in one area. Hence, two of these four categories are more common than the other two: PC/EC systems designed for applications that can never sacrifice consistency, and PA/EL systems that are designed for applications that are capable of being built over a reduced consistency system. Despite being less common, PACELC nonetheless theorizes about the existence of PC/EL and PA/EC systems. At the time when I originally introduced PACELC, I gave the example of PNUTS as a PC/EL system. However, I could not think of any good examples of PA/EC systems. Even in my extended article on PACELC in the CAP-anniversary edition of IEEE Computer, I only gave a somewhat hand-wavey example of a PA/EC system.
The basic problem with PA/EC systems is the following: although partitions are a rare event, they are not impossible. Any application built on top of a PA system must have mechanisms in place to deal with inconsistencies that arise during these partition events. But once they have these mechanisms in place, why not benefit during normal operation and get better latency as well?
Over the past few weeks, I have been looking more deeply at the In-Memory Data Grid (“IMDG”) market, and took an especially deep dive into Hazelcast, a ubiquitous open source implementation of a IMDG, with hundreds of thousands of in production deployments. It turns out that Hazelcast (and, indeed, most of the in-memory data grid industry) is a real implementation of the mythical PA/EC system.
In order to understand why PA/EC makes sense for Hazelcast and other IMDGs, we need to first discuss some background material on (1) Hazelcast use cases, (2) data replication and (3) PACELC.

Hazelcast use cases
The most common use case for Hazelcast is the following. Let’s say that you write a Java program that stores and manipulates data inside popular Java collections and data structures, e.g., Queue, Map, AtomicLong, or Multimap. You may want to run this program in multiple different clients, all accessing the same Java collections and data structures. Furthermore, these data structures may get so large that they cannot fit in memory on a single server. Hazelcast comes to the rescue --- it provides a distributed implementation of these Java data structures, thereby enabling scalable utilization of them. Users interact with Hazelcast the same way that they interacted with their local data structures, but behind the scenes, Hazelcast is distributing and replicating them across a cluster of machines.
The vast majority of Hazelcast use cases are within a single computing cluster. Both the client programs and the Hazelcast data structures are located in the same physical region.

Data replication
In general, any arbitrary system may choose to replicate data for one of two primary reasons: Either they want to improve fault tolerance (if a server containing some of the data fails, a replica server can be accessed instead), or they want to improve request latency (messages that have to travel farther distances take longer to transmit; therefore, having a replica of the data “near” locations from which they are typically accessed can improve request latency).
As mentioned above, in-memory data grids are typically running in the same region as the clients which access them. Therefore, only the first reason to replicate data (fault tolerance) applies. (This reason alone is enough to justify the costs of replication in any scalable system. The more physical machines that exist in the system, the more likely it is that at least one machine will fail at any given point in time. Therefore, the bigger the system, the more you need to replicate for fault tolerance).
If the replicas only exist for fault tolerance and not for performance, there is no fundamental requirement to ever access them except in the case of a failure. All reads and writes can be directed to the primary copy of any data item, with the replicas only ever accessed if the primary is not available. (In such a scenario, it is a good idea to mix primary and replica partitions on servers across the cluster, in order to prevent underutilization of server resources.) If all reads and writes go to the same location, this leads to 100% consistency and linearizability (in the absence of failures) since it is easy for a single server to ensure that reads reflect the most recent writes.

What this means for PACELC
Recall what I wrote above about the latency vs. consistency tradeoff: “Enforcing consistency requires either (1) synchronization messages between machines that must remain consistent with each other or (2) all requests involving a particular data item to be served by a single master for that data item instead of the closest replica to the location where the request originates. Both of these options come with a latency cost.” In truth, option (2) does not come with a latency cost when all requests originate from a location closest to the master replica. It’s only when messages travel for longer than the distance to the nearest replica where a cost materializes. In order words, there is no consistency vs. latency tradeoff in the typical Hazelcast use case.
Thus, we should clarify at this point that the PACELC theorem assumes that requests may originate from any arbitrary location. The ELC part of PACELC disappears if all requests come from the same location. I would argue that the CAP theorem makes the same assumption, but such an argument is not as straightforward, and requires a refined discussion about the CAP theorem which is outside scope of this particular blog post.


Failures and partitions
Up until now, we have said that as long as the master node does not fail, if it serves all reads and writes, then full consistency is achieved. The obvious next question is: what happens if the master node fails and a new master takes over? In such a scenario, the ability of the system to maintain consistency depends on how replication is performed. If replication was asynchronous, then consistency cannot be guaranteed, since some updates may have been performed on the old master, but had not yet been replicated to the new master before the old master failed. If all data had been synchronously replicated to the new master, then full consistency can still be guaranteed.
A failed node is logically equivalent to a partition where the failed node is located in one partition and every other node is in the other partition, and all client requests can reach the second partition but not the first. If the failed node is the master node, and replication was asynchronous, then both the CAP theorem and the PAC part of PACELC state that there are only two choices: quiesce the entire system since the only consistent copy is not accessible (i.e. choosing consistency over availability), or serve reads and writes from nodes in the available partition (i.e. choosing availability over consistency).  
Hazelcast by default uses “synchronous” replication, which is actually an interesting hybrid between asynchronous and synchronous replication. The master asynchronously sends the writes to the replicas, and each replica acknowledges these writes to the client. The client synchronously waits for these acknowledgments before returning from the call. However, if the requisite number of acknowledgments do not arrive before the end of a time out period, the call either returns with the write succeeding or throws an exception, depending on configuration. If Hazelcast had been configured to throw an exception, the client can retry the operation.  Hazelcast also has an anti-entropy algorithm that works offline to re-synchronize replicas with the master to repair missed replications. However, either way --- until the point where the missed replication has been repaired either through the anti-entropy algorithm or through a client retry, the system is temporarily in a state where the write has succeeded on the master but not on at least one replica.
In addition to the hybrid synchronous algorithm described above, Hazelcast also can be configured to use standard asynchronous replication. When configured in this way, the client does not wait for acknowledgments from the replicas before returning from the call. Thus, updates that failed to get replicated will go undetected until the anti-entropy algorithm identifies and repairs the missed replication.  
Thus, either way --- whether Hazelcast is configured to use standard asynchronous replication or to use the default hybrid “synchronous” model --- it is possible for the write call to return with the write only succeeding on the master.
If the master node fails, Hazelcast selects a new master to serve reads and writes, even though (as we just mentioned) it is possible that the new master does not have all the writes from the original master. If there is a network partition, the original master will remain the master for its partition, but the other partition will select its own master. Again, this second master may not have all the writes from the original master. Furthermore, a full split brain situation may occur, where the masters for the two different partitions independently accept writes to their partition, thereby causing the partitions to diverge further. However, Hazelcast does have a “split brain protection” feature that prevents significant divergence. The way this feature works is that the system can be configured to define a minimum size for read and write operations. If this minimum size is set to be larger than half of the size of the cluster, then the smaller partition will not accept reads and writes, which prevents further divergence from the larger partition. However, it can take 10s of seconds for the smaller partition to realize how small it is (although Hazelcast claims it will be much faster than this in 3.9.1 and 3.10). Thus there is a delay before the split brain protection kicks in, and the partitions can diverge during this delay period.
The bottom line here is that both if the master fails and also in the (rare) case of a network partition, a new master is selected that may not have all the updates from the original master. The system always remains available, but the second master is allowed to temporarily diverge from the original master. Thus, Hazelcast is PA/EC in PACELC. If the master has failed or partitioned, Hazelcast choses availability over consistency. However, in the absence of failures or partitions, Hazelcast is fully consistent. (As mentioned above, Hazelcast also achieves low latency in the absence of failures or partitions in their primary use case. However, it is appropriate to label Hazelcast EC rather than EL since if a request were to theoretically originate in a location that is far from the master, it would still choose consistency over latency and serve the request from the master.)
Indeed, any system that that serves reads and writes from the master, but elects a new master upon a failure, where this new master is not 100% guaranteed to have seen all of the writes from the original master, will be PA/EC in PACELC. So the PA/EC category is larger than I originally had expected.
I would still argue, however, that PA/EC systems are fundamentally confusing to the end user. If the system cannot guarantee consistency in all cases, then the end user is forced to handle cases of inconsistency in application logic. And once they have the code written to handle these cases (e.g., by including merge functions that resolve situations where replicas may diverge), then the value of the system being consistent in the absence of failures or partitions is significantly reduced. PA/EC systems thus only make sense for applications for which availability takes priority over consistency, but where the code that handles inconsistencies needs to be run as infrequently as possible --- e.g. when the code involves a real world charge (such as refunding a customer’s account) or significant performance costs.
Since not all applications fit into the above category, I suspect that many PA/EC systems will have settings to either increase consistency in order to become fully consistent (i.e. become PC instead of PA) or reduce consistency guarantees in the “else case” (i.e., become EL instead of EC).
Indeed, Hazelcast is such a system and can be configured to be EL rather than EC. There are several ways to accomplish this, but the primary mechanism is through their Near Cache feature. Near Cache is a client side cache of recently accessed data items. If the data items stored in the Near Cache are updated by a different client, these changes are not synchronously replicated to the first client’s Near Cache. Hence, the Near Cache is not kept consistent with the master version of the data (instead it is “eventually consistent”). However, reads by the client are served by its Near Cache if a copy of the data item to be read is stored there. Therefore, excellent latency (less than one microsecond) can be achieved at the cost of consistency --- EL in PACELC.
Furthermore, Hazelcast also supports replication of clusters over a WAN. For example, in a disaster recovery use case, all writes go to the primary cluster, and they are asynchronously replicated to a backup cluster. Alternatively, both clusters can accept writes, and they are asynchronously replicated to the other cluster (the application is responsible for resolving conflicts of conflicting writes to the different clusters using a conflict resolution strategy registered with Hazelcast). Unlike what we discussed earlier, in this case read requests may originate from arbitrary locations rather than always from a location near the master. Hazelcast serves these reads from the closest location, even though it may not have the most up to date copy of the data. Thus, Hazelcast is EL by default for WAN replication.  
In summary, through my investigation of Hazelcast (and in-memory data grids in general), I have discovered a new category of PA/EC systems. However, due to the confusing nature of PA/EC systems, it is no surprise that Hazelcast can be configured to be PA/EL in addition to its PA/EC default.

Thursday, April 6, 2017

Distributed consistency at scale: Spanner vs. Calvin

Introduction

In 2012, two research papers were published that described the design of geographically replicated, consistent, ACID compliant, transactional database systems. Both papers criticized the proliferation of NoSQL database systems that compromise replication consistency and transactional support, and argue that it is very possible to build extremely scalable, geographically replicated systems without giving up on consistency and transactional support. The first of these papers was the Calvin paper, published in SIGMOD 2012. A few months later, Google published their Spanner paper in OSDI 2012. Both of these papers have been cited many hundreds of times and have influenced the design of several modern “NewSQL” systems, including FaunaDB (where this post is also being published).
Recently, Google released a beta version of their Spanner implementation, available to customers who use Google Cloud Platform. This development has excited many users seeking to build scalable apps on Google’s cloud, since they now have a reliably scalable and consistent transactional database system to use as a foundation. However, the availability of Spanner outside of Google has also brought it more scrutiny --- what are its technical advantages in practice, and what are its costs? Even though it has been five years since the Calvin paper was published, it is only now that the database community is asking me to directly compare and contrast the technical designs of Calvin and Spanner.
The goal of this post is to do exactly that --- compare the architectural design decisions made in these two systems, and specifically focus on the advantages and disadvantages of these decisions against each other as they relate to performance and scalability. This post is focused on the protocols described in the original papers from 2012. Although the publicly available versions of these systems likely have deviated from the original papers, the core architectural distinctions remain the same.

The CAP theorem in context

Before we get started, allow me to suggest the following: Ignore the CAP theorem in the context of this discussion. Just forget about it. It’s not relevant for the type of modern architectural deployments discussed in this post where network partitions are rare.
Both Spanner and Calvin replicate data across independent regions for high availability. And both Spanner and Calvin are technically CP systems from CAP: they guarantee 100% consistency (serializability, linearizability, etc.) across the entire system. Yes, when there is a network partition, both systems make slight compromises on availability, but partitions are rare enough in practice that developers on top of both systems can assume a fully-available system to many 9s of availability.
(BTW: If you didn’t believe me in 2010 when I explained the shortfalls of using CAP to understand the practical consistency and availability properties of modern systems, maybe you will believe the author of the CAP theorem himself, Eric Brewer, who recommends against analyzing Spanner through the lens of the CAP theorem.)

Ordering transactions in time

Let us start our comparison of Calvin and Spanner with the most obvious difference between them: Spanner’s use of “TrueTime” vs. Calvin’s use of “preprocessing” (or “sequencing” in the language of the original paper) for transaction ordering. In fact, most of the other differences between Spanner and Calvin stem from this fundamental choice.
A serializable system provides a notion of transactional ordering. Even though many transactions may be executed in parallel across many CPUs and many servers in a large distributed system, the final state (and all observable intermediate states) must be as if each transaction was processed one-by-one. If no transactions touch the same data, it is trivial to process them in parallel and maintain this guarantee. However, if the transactions read or write each other’s data, then they must be ordered against each other, with one considered earlier than the other. The one considered “later” must be processed against a version of the database state that includes the writes of the earlier one. In addition, the one considered “earlier” must be processed against a version of the database state that excludes the writes of the later one.

Locking and logging

Spanner uses TrueTime for this transaction ordering. Google famously uses a combination of GPS and atomic clocks in all of their regions to synchronize time to within a known uncertainty bound. If two transactions are processed during time periods that do not have overlapping uncertainty bounds, Spanner can be certain that the later transaction will see all the writes of the earlier transaction.
Spanner obtains write locks within the data replicas on all the data it will write before performing any write. If it obtains all the locks it needs, it proceeds with all of its writes and then assigns the transaction a timestamp at the end of the uncertainty range of the coordinator server for that transaction. It then waits until this later timestamp has definitely passed for all servers in the system (which is the entire length of the uncertainty range) and then releases locks and commits the transaction. Future transactions will get later timestamps and see all the writes of this earlier transaction. Thus, in Spanner, every transaction receives a timestamp based on the actual time that it committed, and this timestamp is used to order transactions. Transactions with later timestamps see all the writes of transactions with earlier timestamps, with locking used to enforce this guarantee.
In contrast, Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed. In more detail: clients submit transactions to the preprocessing layer of their local region, which then submits these transactions to the global log via a cross-region consensus process like Paxos. This is similar to a write-ahead log in a traditional, non-distributed database. The order that the transactions appear in this log is the official transaction ordering. Every replica reads from their local copy of this replicated log and processes transactions in a way that guarantees that their final state is equivalent to what it would have been had every transaction in the log been executed one-by-one.

Replication overhead

The design difference between TrueTime vs. preprocessing directly leads to a difference in how the systems perform replication. In Cavlin, the replication of transactional input during preprocessing is the only replication that is needed. Calvin uses a deterministic execution framework to avoid *all* cross-replication communication during normal (non-recovery mode) execution aside from preprocessing. Every replica sees the same log of transactions and guarantees not only a final state equivalent to executing the transactions in this log one-by-one, but also a final state equivalent to every other replica.
This requires the preprocessor to analyze the transaction code and “pre-execute” any nondeterministic code (e.g. calls to sys.random() or time.now()). (The implication of this in terms of what types of transactions are supported by Calvin are discussed at the end of this post.) Once all code within a transaction is deterministic, a replica can safely focus on just processing the transactions in the log in the correct order without concern for diverging with the other replicas.  
In contrast, since Spanner does not do any transaction preprocessing, it can only perform replication after transaction execution. Spanner performs this replication via a cross-region Paxos process.

The cost of two-phase commit

Another key difference between Spanner and Calvin is how they commit multi-partitioned transactions. Both Calvin and Spanner partition data into separate shards that may be stored on separate machines that fail independently from each other. In order to guarantee transaction atomicity and durability, any transaction that accesses data in multiple partitions must go through a commit procedure that ensures that every partition successfully processed the part of the transaction that accessed data in that partition. Since machines may fail at any time, including during the commit procedure, this process generally takes two rounds of communication between the partitions involved in the transaction. This two-round commit protocol is called “two phase commit” and is used in almost every ACID-compliant distributed database system, including Spanner. This two phase commit protocol can often consume the majority of latency for short, simple transactions since the actual processing time of the transaction is much less than the delays involved in sending and receiving two rounds of messages over the network.
The cost of two phase commit is particularly high in Spanner because the protocol involves three forced writes to a log that cannot be overlapped with each other. In Spanner, every force write to a log involves a cross-region Paxos agreement, so the latency of two phase commit in Spanner is at least equal to three times the latency of cross-region Paxos.

Determinism is durability

In contrast to Spanner, Calvin leverages deterministic execution to avoid two-phase commit. Machine failures do not cause transactional aborts in Calvin. Instead, after a failure, the machine that failed in Calvin re-reads the input transaction log from a checkpoint, and deterministically replays it to recover its state at the time of the failure. It can then continue on from there as if nothing happened. As a result, the commit protocol does not need to worry about machine failures during the protocol, and can be performed in a single round of communication (and in some cases, zero rounds of communication --- see the original paper for more details).

Performance

At this point, I think I have provided enough details to make it possible to present a theoretical comparison of the bottom line performance of Calvin vs. Spanner for a variety of different types of requests.  This comparison assumes a perfectly optimized and implemented version of each system.

Transactional write latency

A transaction that is “non-read-only” writes at least one value to the database state. In Calvin, such a transaction must pay the latency cost of preprocessing, which is roughly the cost of running cross-region Paxos to agree to append the transaction to the log. After this is complete, the remaining latency is the cost of processing the transaction itself, which includes the zero or one-phase commit protocol for distributed transactions.
In Spanner, there is no preprocessing latency, but it still has to pay the cost of cross-region Paxos replication at commit time, which is roughly equivalent to the Calvin preprocessing latency. Spanner also has to pay the commit wait latency discussed above (which is the size of the time uncertainty window), but this can be overlapped with replication. It also pays the latency of two phase commit for multi-partition transactions.
Thus, Spanner and Calvin have roughly equivalent latency for single-partition transactions, but Spanner has worse latency than Calvin for multi-partition transactions due to the extra phases in the transaction commit protocol.

Snapshot read latency

Both Calvin and Spanner keep around older versions of data and read data at a requested earlier timestamp from a local replica without any Paxos-communication with the other replicas.
Thus, both Calvin and Spanner can achieve very low snapshot-read latency.

Transactional read latency

Read-only transactions do not write any data, but they must be linearizable with respect to other transactions that write data. In practice, Calvin accomplishes this via placing the read-only transaction in the preprocessor log. This means that a read-only transaction in Calvin must pay the cross-region replication latency. In contrast, Spanner only needs to submit the read-only transaction to the leader replica(s) for the partition(s) that are accessed in order to get a global timestamp (and therefore be ordered relative to concurrent transactions). Therefore, there is no cross-region Paxos latency --- only the commit time (uncertainty window) latency.
Thus, Spanner has better latency than Calvin for read-only transactions submitted by clients that are physically close to the location of the leader servers for the partitions accessed by that transaction.

Scalability

Both Spanner and Calvin are both (theoretically) roughly-linearly scalable for transactional workloads for which it is rare for concurrent transactions to be accessing the same data. However, major differences begin to present themselves as the conflict rate between concurrent transactions starts to increase.
Both Spanner and Calvin, as presented in the paper, use locks to prevent concurrent transactions from interfering with each other in impermissible ways. However, the amount of time they hold locks for an identical transaction is substantially different. Both systems need to hold locks during the commit protocol. However, since Calvin’s commit protocol is shorter than Spanner’s, Calvin reduces the lock hold time at the end of the transaction. On the flip side, Calvin acquires all locks that it will need at the beginning of the transaction, whereas Spanner performs all reads for a transaction before acquiring write locks. Therefore, Spanner reduces lock time at the beginning of the transaction.
However, this latter advantage for Spanner is generally outweighed by the former (extra lock-hold time at the end of the transactions) disadvantage, since, as discussed above, the latency of two-phase commit in Spanner involves at least three iterations of cross-region Paxos. Furthermore, Spanner has an additional major disadvantage relative to Calvin in lock-hold time: Spanner must also hold locks during replication (which, as mentioned above, is also a cross-region Paxos process). The farther apart the regions, the larger the latency of this replication, and therefore, the longer Spanner must hold locks.
In contrast, Calvin does its replication during preprocessing, and therefore does not need to hold locks during replication. This leads to Calvin holding locks for much shorter periods of time than Spanner, and therefore being able to process more concurrent transactions in parallel that conflict with each other.
A second difference that can affect scalability is the following: Calvin requires only a single Paxos group for replicating the input log. In contrast, Spanner requires one independent Paxos group per shard, with proportionally higher overhead.
Overall, Calvin has higher throughput scalability than Spanner for transactional workloads where concurrent transactions access the same data. This advantage increases with the distance between datacenters.

Limitations

In order to implement deterministic transaction processing, Calvin requires the preprocessor to analyze transactions and potentially “pre-execute” any non-deterministic code to ensure that replicas do not diverge. This implies that the preprocessor requires the entire transaction to be submitted at once. This highlights another difference between Calvin and Spanner --- while Spanner theoretically allows arbitrary client-side interactive transactions (that may include external communication), Calvin supports a more limited transaction model.
There are some subtle, but interesting differences between Calvin and Spanner in rare situations where every single replica for a shard is unavailable, or if all but one are unavailable, but these differences are out of scope for this post.

Conclusion

I’m obviously biased in favor of Calvin, but in going through this exercise, I found it very difficult to find cases where an ideal implementation of Spanner theoretically outperforms an ideal implementation of Calvin. The only place where I could find that Spanner has a clear performance advantage over Calvin is for latency of read-only transactions submitted by clients that are physically close to the location of the leader servers for the partitions accessed by that transaction. Since any complex transaction is likely to touch multiple partitions, this is almost impossible to guarantee in a real-world setting.
However, many real-world workloads do not require client-side interactive transactions, and furthermore only need transactional support for writes and are satisfied with performing reads against a snapshots (after all, this is the default isolation model of many SQL systems). It seems to me that Calvin is the better fit for a large class of modern applications.