Tuesday, July 15, 2014

Optimizing Non-Sequential Independent Loads with Software Prefetching

Memory accesses on computers can be described in many ways, but I'd like to start off by grossly oversimplifying them into two categories:
  • Sequential or Non-Sequential
  • Independent or Dependent
If you don't know the difference between Sequential and Non-Sequential access then I'm sorry but you are not the target audience of this blog post and you should stop reading now. Independent/Dependent access is terminology that I made up so I have to define it. A load is dependent on a preceding load if the value read in the preceding load is used to determine the address to be read in the later load. If this is not the case between two loads, those loads are independent of each other.

Examples of Dependent Loads:
  • Linked List Traversal
  • Binary Search Tree Traversal
  • B+Tree Traversal
It turns out that most data structures built from nodes and pointers require lots of dependent loads.

Examples of Independent Loads:
  • Looking up a key in an open addressed hash table using double hashing for collision resolution
  • Looking up the document values in an array for a list of integer doc ids matching a query in an inverted index
Most independent loads result from looking up a known or cheaply calculated list of indexes in a larger array.

It doesn't really make sense for sequential accesses to be dependent, so this really gives us three classifications of access patterns:
  • Sequential
  • Independent Non-Sequential
  • Dependent Non-Sequential
Modern computers are best at sequential access, OK at independent access, and awful at dependent access. Here is a table containing time taken to sum 1 billion elements from a table containing 16 million 32 bit integers with each access pattern:

Sequential: 1.5 seconds
Independent: 13 seconds
Dependent: 92 seconds

The difference between sequential access and dependent non-sequential access is enormous, but most people reading a blog post about software prefetching would expect that. The weird case is independent access, lying right between sequential and dependent on a logarithmic scale. Why is independent access so much faster than dependent access? The answer is instruction level parallelism. Your processor can do 5-10 independent loads in parallel while it can only do one dependent load at a time. Each load occupies a line fill buffer in the L1D cache which is not freed until that load completes, which is why the improvement is only ~10x and not higher.

Let's take a look at the program I ran for the independent non-sequential benchmark:

As you can see it uses a linear congruential random number generator to generate random addresses to read and then reads the data at those addresses. Linear congruential generators are extremely fast on modern processors so the dominant factor in the runtime here is loads from memory. Since it doesn't really matter where in the cache line the cache miss happens, i've taken the liberty to align all accesses to the beginning of the cache line:

After doing this the program still takes exactly the same amount of time to run. Next let's try summing not just the first number in the cache line, but all the numbers in the cache line. Since we've already taken the penalty for the cache miss, in theory this should be essentially free. Here is the code to do this:

Let's run this benchmark just to make sure our assumptions are true. Wait, this takes 48 seconds now? Why does it take 4 times longer when we're doing exactly the same number of random access loads? It turns out that current Intel x86_64 processors are not so great at optimizing this access pattern except in the most basic of cases. Each extra addition in the loop requires a load, and each of those loads occupies one of our precious line fill buffers even though there are already a ton of other outstanding loads for the same cache line! This removes most of the gains we were getting from instruction level parallelism.

If we want to ensure that the data we need is in the cache at the time that we need it, we need to issue a software prefetch instruction well in advance of then. This next code sample shows prefetching the first 16 cache lines then as we sum them we prefetch the cache line from 16 lines in the future.

This code produces exactly the same output yet runs in 11 seconds! This is faster than our original code only loading one value! Software prefetching was a huge performance win!

Next time you're dealing with a non-sequential access pattern, ask yourself if it's dependent or independent. If it's independent, software prefetching may be able to speed it up greatly.