Tuesday, July 15, 2014

Optimizing Non-Sequential Independent Loads with Software Prefetching

Memory accesses on computers can be described in many ways, but I'd like to start off by grossly oversimplifying them into two categories:
  • Sequential or Non-Sequential
  • Independent or Dependent
If you don't know the difference between Sequential and Non-Sequential access then I'm sorry but you are not the target audience of this blog post and you should stop reading now. Independent/Dependent access is terminology that I made up so I have to define it. A load is dependent on a preceding load if the value read in the preceding load is used to determine the address to be read in the later load. If this is not the case between two loads, those loads are independent of each other.

Examples of Dependent Loads:
  • Linked List Traversal
  • Binary Search Tree Traversal
  • B+Tree Traversal
It turns out that most data structures built from nodes and pointers require lots of dependent loads.

Examples of Independent Loads:
  • Looking up a key in an open addressed hash table using double hashing for collision resolution
  • Looking up the document values in an array for a list of integer doc ids matching a query in an inverted index
Most independent loads result from looking up a known or cheaply calculated list of indexes in a larger array.
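
To make the distinction concrete, here is a minimal sketch in C. The `struct node` type and both function names are hypothetical, chosen only for illustration. In the first loop the address of each load comes out of the previous load; in the second, every address can be computed from `idx[]` without waiting on any earlier load from `values`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type, for illustration only. */
struct node {
    struct node *next;
    int32_t value;
};

/* Dependent loads: n->next must arrive before the next node can be read. */
int64_t sum_list(const struct node *head) {
    int64_t sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        sum += n->value;
    }
    return sum;
}

/* Independent loads: no read of values[] feeds the address of another. */
int64_t sum_indexed(const int32_t *values, const uint32_t *idx, size_t count) {
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += values[idx[i]];
    }
    return sum;
}
```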

It doesn't really make sense for sequential accesses to be dependent, so this really gives us three classifications of access patterns:
  • Sequential
  • Independent Non-Sequential
  • Dependent Non-Sequential
Modern computers are best at sequential access, OK at independent access, and awful at dependent access. Here is the time taken to sum 1 billion elements from an array of 16 million 32-bit integers with each access pattern:

Sequential: 1.5 seconds
Independent: 13 seconds
Dependent: 92 seconds

The difference between sequential access and dependent non-sequential access is enormous, but most people reading a blog post about software prefetching would expect that. The weird case is independent access, lying right between sequential and dependent on a logarithmic scale. Why is independent access so much faster than dependent access? The answer is instruction level parallelism (more specifically, the memory level parallelism it exposes). Your processor can have 5-10 independent loads in flight at once, while it can only do one dependent load at a time because each address is not known until the previous load finishes. Each load that misses the L1D cache occupies a line fill buffer, which is not freed until that load completes, which is why the improvement is only ~10x and not higher.

Let's take a look at the program I ran for the independent non-sequential benchmark:

As you can see it uses a linear congruential random number generator to generate random addresses to read and then reads the data at those addresses. Linear congruential generators are extremely fast on modern processors, so the dominant factor in the runtime here is loads from memory. Since it doesn't really matter where in the cache line the cache miss happens, I've taken the liberty of aligning all accesses to the beginning of the cache line:
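
A minimal sketch of a benchmark along these lines, not the original program. The function names, the LCG constants (Knuth's MMIX values), the 64-byte cache line size, and the requirement that the table length be a power of two are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define INTS_PER_LINE 16u /* assumes 64-byte cache lines and 4-byte ints */

/* One step of a 64-bit linear congruential generator (Knuth's MMIX constants). */
static inline uint64_t lcg_next(uint64_t state) {
    return state * 6364136223846793005ULL + 1442695040888963407ULL;
}

/*
 * Sum `count` pseudo-randomly chosen elements of `table`.
 * `table_len` must be a power of two and a multiple of INTS_PER_LINE.
 */
int64_t sum_random(const int32_t *table, size_t table_len,
                   uint64_t seed, size_t count) {
    uint64_t state = seed;
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        state = lcg_next(state);
        /* Take the high bits of the LCG state, wrap into the table, then
           round down to the first int of a 16-int group. */
        size_t index = (size_t)(state >> 32) & (table_len - 1)
                       & ~(size_t)(INTS_PER_LINE - 1);
        sum += table[index];
    }
    return sum;
}
```

No load address depends on a value read from `table`, so the loads are independent. The rounding only lands on a true cache line boundary if `table` itself is 64-byte aligned, for example when allocated with `aligned_alloc`.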

After doing this the program still takes exactly the same amount of time to run. Next let's try summing not just the first number in the cache line, but all the numbers in the cache line. Since we've already taken the penalty for the cache miss, in theory this should be essentially free. Here is the code to do this:

Let's run this benchmark just to make sure our assumptions are true. Wait, this takes 48 seconds now? Why does it take 4 times longer when we're doing exactly the same number of random access loads? It turns out that current Intel x86_64 processors are not so great at optimizing this access pattern except in the most basic of cases. Each extra addition in the loop requires a load, and each of those loads occupies one of our precious line fill buffers even though there are already a ton of other outstanding loads for the same cache line! This removes most of the gains we were getting from instruction level parallelism.

If we want to ensure that the data we need is in the cache at the time that we need it, we need to issue a software prefetch instruction well before that point. This next code sample prefetches the first 16 cache lines; then, as we sum each line, it prefetches the cache line we will need 16 lines in the future.
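
A minimal sketch of that prefetching scheme, assuming GCC or Clang (for `__builtin_prefetch`), 64-byte cache lines, and a power-of-two table length. The function names, the LCG constants, and the small ring buffer of pending indexes are illustrative assumptions rather than the original listing.

```c
#include <stddef.h>
#include <stdint.h>

#define INTS_PER_LINE 16u      /* assumes 64-byte cache lines and 4-byte ints */
#define PREFETCH_DISTANCE 16u  /* how many lines ahead we prefetch */

static inline uint64_t lcg_step(uint64_t state) {
    return state * 6364136223846793005ULL + 1442695040888963407ULL;
}

/* Map an LCG state to the index of the first int of a 16-int group. */
static inline size_t line_start(uint64_t state, size_t table_len) {
    return (size_t)(state >> 32) & (table_len - 1)
           & ~(size_t)(INTS_PER_LINE - 1);
}

/*
 * Sum all 16 ints in each of `count` pseudo-random lines of `table`.
 * `table_len` must be a power of two and a multiple of INTS_PER_LINE.
 */
int64_t sum_lines_prefetched(const int32_t *table, size_t table_len,
                             uint64_t seed, size_t count) {
    size_t pending[PREFETCH_DISTANCE];
    uint64_t state = seed;
    int64_t sum = 0;

    /* Prime the pipeline: request the first PREFETCH_DISTANCE lines. */
    for (size_t i = 0; i < PREFETCH_DISTANCE; i++) {
        state = lcg_step(state);
        pending[i] = line_start(state, table_len);
        __builtin_prefetch(&table[pending[i]], 0, 3);
    }

    for (size_t i = 0; i < count; i++) {
        size_t slot = i % PREFETCH_DISTANCE;
        size_t index = pending[slot];

        /* Request the line we will sum PREFETCH_DISTANCE iterations from now. */
        state = lcg_step(state);
        pending[slot] = line_start(state, table_len);
        __builtin_prefetch(&table[pending[slot]], 0, 3);

        /* Sum the line that was requested PREFETCH_DISTANCE iterations ago. */
        for (size_t j = 0; j < INTS_PER_LINE; j++) {
            sum += table[index + j];
        }
    }
    return sum;
}
```

The loop issues up to 16 prefetches past the last line it sums. That is harmless here because every index stays inside the table and a prefetch never faults. The best prefetch distance depends on the machine, so 16 should be treated as a starting point to tune.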

This code produces exactly the same output yet runs in 11 seconds! This is faster than our original code only loading one value! Software prefetching was a huge performance win!

Next time you're dealing with a non-sequential access pattern, ask yourself if it's dependent or independent. If it's independent, software prefetching may be able to speed it up greatly.

Thursday, October 31, 2013

How To Write A File

You probably think that you know how to write to a file. After all, it's pretty simple and you do it many times per day. You open the file, write some data to it, close the file, and you're done. Wrong. This is how you write a file:

  • Create and open a temporary file in the directory you are writing your new file to.
  • Write your data to the temporary file.
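  • fsync the temporary file. (fsync is the conservative choice: fdatasync may skip metadata, although on Linux it does flush a file size change when that is needed to read the data back)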
  • Close the temporary file.
  • If any errors happen before this point, delete the temporary file and handle the error as appropriate.
  • Rename your temporary file to the name you want your new file to have.
  • Open the directory containing both files.
  • fsync the directory's file descriptor.
  • Close the directory's file descriptor.
If you open the file that you are writing to directly, several problems happen. If your program crashes before you've finished writing your data, your system is in an inconsistent state. Worse yet, if you overwrote the contents of a previous file with the same name, you have lost that file and you are left with only part of your new file. Any other program that had the old file open will get weird errors due to the unexpected truncation to zero length. By creating a temporary file your new file has its own inode and your old file is left alone.

If you don't fsync the file before you rename it you can still get into inconsistent states. Many journaling filesystems, in their default configuration, journal metadata operations but not data operations. That means that if you lose power after the rename has been issued, it is possible that the rename will be replayed from the journal yet the data itself was never written to disk, so it's gone forever along with your old file of the same name.

If you write to a temporary file, fsync, and rename you know that:
  • Writing to your temporary file leaves the old version of the file alone (if it exists)
  • After the fsync is complete your data has safely been written all the way through to your disk (unless you have one of those crappy disks that lie about sync, in which case you should throw it away and buy a new disk)
  • At any point immediately before, during, or after the rename operation you have exactly one version of your file and it is complete.
The only remaining problem is what happens if the system crashes before the directory entry edits from the rename get sync'ed to disk. To guarantee that the directory contents are sync'ed you must open the directory and call fsync on the returned file descriptor.
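
The whole sequence can be sketched in C on a POSIX system roughly as follows. The function name `write_file_atomically`, the hidden temporary file name pattern, and the fixed 4096-byte path buffers are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Replace dir/name with `len` bytes of `data`. Returns 0 on success, -1 on error. */
int write_file_atomically(const char *dir, const char *name,
                          const void *data, size_t len) {
    char tmp_path[4096];
    char final_path[4096];
    const char *p = data;
    size_t left = len;
    int fd, dir_fd, rc, saved_errno;

    if (snprintf(tmp_path, sizeof tmp_path, "%s/.%s.XXXXXX", dir, name)
            >= (int)sizeof tmp_path ||
        snprintf(final_path, sizeof final_path, "%s/%s", dir, name)
            >= (int)sizeof final_path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Create and open a temporary file in the destination directory. */
    fd = mkstemp(tmp_path);
    if (fd < 0) return -1;

    /* Write all the data, retrying on short writes and EINTR. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) continue;
            goto fail;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Force the data to disk, then close. */
    if (fsync(fd) != 0) goto fail;
    rc = close(fd);
    fd = -1;
    if (rc != 0) goto fail;

    /* Atomically give the file its final name. */
    if (rename(tmp_path, final_path) != 0) goto fail;

    /* Sync the directory so the rename itself is durable. */
    dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0) return -1;
    rc = fsync(dir_fd);
    if (close(dir_fd) != 0) rc = -1;
    return rc;

fail:
    /* Any error before the rename succeeded: remove the temporary file. */
    saved_errno = errno;
    if (fd >= 0) close(fd);
    unlink(tmp_path);
    errno = saved_errno;
    return -1;
}
```

A call might look like `write_file_atomically("/var/lib/myapp", "state.json", buf, buf_len)`, where the path and file name are placeholders. Note that `mkstemp` creates the file with mode 0600, so a real implementation would probably `fchmod` it to the intended permissions before the rename.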

Now you know how to write a file. Go tell all your friends, I'm tired of using programs that get this wrong.

Tuesday, December 14, 2010

building arm toolchain, compiling, and loading into flash on stm32 discovery

you need macports for this to work

to build the arm toolchain go here:

in a terminal do:
git clone https://github.com/esden/summon-arm-toolchain.git

edit summon-arm-toolchain
set SUDO=sudo
set LIBSTM32_EN=1
save and go back to terminal

follow the instructions in the README

this will take a long time and you have to type in your password a few times

add ~/sat/bin to PATH in your .bash_profile

download this: http://www.robsons.org.uk/blinky.zip

unzip it, edit Makefile and everywhere it says arm-elf change it to arm-none-eabi
now run openocd and telnet localhost 4444

run these commands in telnet:
stm32x unlock
flash erase_sector 0 0 last
flash write_bank 0 ~/Downloads/blinky/blinky.bin 0
reset init

the green led should blink, the blue led should blink twice as fast, and when you hold the usr button the green led should stop blinking.


OpenOCD on OS X using flyswatter and stm32 discovery

Making OpenOCD talk to my flyswatter and stm32 discovery using jtag was kind of a pain. I'm going to write up some instructions on how i made it work in case someone else needs it in the future (or i have to do it again).

You need:
a breadboard
an stm32 discovery
a tin can tools flyswatter
OpenOCD from git (0.4.0 doesn't work)
most recent commit on version i'm using is on 12/10/2010, cbf48bed6a26279900ad00e6d6462a7f29676175
libftdi (0.18)

0) i removed solder bridges 11, 16, 17, and 18 but i'm not sure if that's necessary. i wasn't sure what the ST Link does when you power the board so i just disconnected sb11, sb17, and sb18. i shouldn't have removed sb16 so now i have to put a 510 ohm resistor between BOOT and ground.

1) libusb-compat:
sudo make install

2) libftdi:
sudo make install

3) edit openocd/tcl/target/stm32.cfg
after the line that says:
set _BSTAPID5 0x06418041
add a new line that says:
set _BSTAPID6 0x06420041
change the line that says:
-expected-id $_BSTAPID4 -expected-id $_BSTAPID5
to:
-expected-id $_BSTAPID4 -expected-id $_BSTAPID5 -expected-id $_BSTAPID6

4) OpenOCD:
./autoreconf -i -f
./configure --enable-maintainer-mode --enable-ft2232_libftdi --enable-usbprog
sudo make install

5) create a file called openocd.cfg with these contents:

#daemon configuration
telnet_port 4444
gdb_port 3333

source [find interface/flyswatter.cfg]
source [find target/stm32.cfg]

#i'm not sure if setting WORKAREASIZE is essential, but the
#default is 16k (0x4000) which is more ram than the stm32
#discovery has so i lowered it

6) to run (in same directory as openocd.cfg):

7) to telnet in:
telnet localhost 4444

8) to probe flash:
flash probe 0

i haven't changed what's on the flash yet but i'll update once i get gcc compiled and have something to flash it with.

here's a picture of my flyswatter setup/breadboard/power supply

here's a picture of my crazy jtag octopus connection

Monday, June 21, 2010

automatically balancing data when growing a cluster

I've been thinking about load balancing lately and how to dynamically grow a cluster and its data set. I'd like to define "balance factor" as follows:

Suppose the number of nodes in the cluster grows by a factor of x. At the future point when the data set has also grown by a factor of x, the balance factor is the ratio of the size of the smallest node to the size of the largest node.

In my use case it is impractical for any node to know the size of all the nodes. Load balancing decisions must be made probabilistically on limited data. Today I happened across this blog post, which is very applicable to my scenario. It presents a very good solution for load balancing between a fixed number of bins, but when adding nodes without taking the system offline it is useful to look at more than 2 random points to maximize balance factor.

I did some experiments on different values for x and n (where n is the number of random points examined), and experimentally determined an approximate equation for balance factor b:

b = 1-(1/x)^(n-1)

(by approximate I mean that it is close enough for the range of values I care about, which is x between 1.1 and 10 and n between 2 and 25. It might be exactly right but I don't have time to do a proof before band practice.)

If you actually want to use this though, you need to determine n given x and the desired b. With some manipulation we get:

n = 1 + log(1-b)/log(1/x)
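
For reference, the manipulation can be written out step by step, followed by one worked example. The values x = 2 and b = 0.9 are picked arbitrarily for illustration, and the result is only as good as the approximate equation it starts from.

```latex
\begin{align*}
b &= 1 - \left(\tfrac{1}{x}\right)^{n-1} \\
\left(\tfrac{1}{x}\right)^{n-1} &= 1 - b \\
(n-1)\,\log\tfrac{1}{x} &= \log(1-b) \\
n &= 1 + \frac{\log(1-b)}{\log(1/x)}
\end{align*}

% Worked example with x = 2 and a target of b = 0.9:
\[
n = 1 + \frac{\log 0.1}{\log 0.5} \approx 1 + 3.32 = 4.32
\]
% n has to be a whole number of random points, so round up to n = 5:
\[
b = 1 - \left(\tfrac{1}{2}\right)^{4} = 0.9375
\]
```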

Let me know if you found this useful.

Saturday, March 28, 2009

I finally decided that I should do something with jeffplaisance.com