I've been thinking about load balancing lately and how to dynamically grow a cluster and its data set. I'd like to define "balance factor" as follows:

If the number of nodes in the cluster grows by a factor of x, at some future point when the data set has also grown by a factor of x the balance factor is the ratio of the smallest node to the largest node.

In my use case it is impractical for any node to know the size of all the nodes. Load balancing decisions must be made probabilistically on limited data. Today I happened across this blog post, which is very applicable to my scenario. It presents a very good solution for load balancing between a fixed number of bins, but when adding nodes without taking the system offline it is useful to look at more than 2 random points to maximize balance factor.

I did some experiments on different values for x and n (where n is the number of random points examined), and experimentally determined an approximate equation for balance factor b:

b = 1-(1/x)^(n-1)

(by approximate I mean that it is close enough for the range of values I care about, which is x between 1.1 and 10 and n between 2 and 25. It might be exactly right but I don't have time to do a proof before band practice.)

If you actually want to use this though, you need to determine n given x and the desired b. With some manipulation we get:

n = 1 + log(1-b)/log(1/x)

Let me know if you found this useful.