Fastest binary search on 4GB worth of long[]

I had an hour to spare, and decided to write and optimize something that I did not need: an attempt to make Arrays.binarySearch(…) less cache-trashy.

I also changed the specification of my binarySearch algorithm a bit, to make it behave more like a traversable database index: when a value occurs multiple times in the array, the search must return the lowest index at which it occurs. In {5, 7, 7, 7, 9}, for example, searching for 7 must return index 1.

My initial (hierarchical) approach is (on average) up to 3x as fast as a naive implementation:


Creating data-set...
Generating random values...
Sorting data-set...
Generating lookup indices...
Generating lookup tables...
Printing table lengths...
67108864
2097152
65536
2048
64
Performing sanity-checks...
Starting benchmark...

...

norm took 955ms (100%, 1097K ops/s)
rec1 took 610ms (156%, 1718K ops/s)
rec2 took 523ms (182%, 2004K ops/s)
rec3 took 505ms (189%, 2076K ops/s)
rec4 took 388ms (246%, 2702K ops/s)

...


4 GB (die-hard mode):


...

norm took 12564ms (100%, 667K ops/s)
rec1 took 8137ms (154%, 1030K ops/s)
rec2 took 5306ms (236%, 1580K ops/s)
rec3 took 4962ms (253%, 1690K ops/s)
rec4 took 4137ms (303%, 2027K ops/s)

...

Die-hards are supposed to set [icode]isDiehard[/icode] to [icode]true[/icode].
Who feels like wasting some time squeezing even more single-threaded performance out of this puppy?
Low-hanging fruit is frowned upon, but graciously incorporated!
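
Die-hard mode needs a 64-bit JVM and a heap comfortably above the raw 4 GB array; something like this should do (the exact size is a guess with some headroom):

java -Xmx5g QuickBinarySearch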


import java.util.Arrays;
import java.util.Random;

public class QuickBinarySearch {
	public static void main(String[] args) {
		Random rndm = new Random(1235L);

		boolean isDiehard = false; // 4 GB (512 MB for wimps)

		System.out.println("Creating data-set...");
		long[] somedata = new long[(isDiehard ? 512 : 64) * 1024 * 1024];

		System.out.println("Generating random values...");
		for (int i = 0; i < somedata.length; i++) {
			// multiples of 8, so the +/-5 jitter on the lookups below mostly misses
			somedata[i] = (Math.abs(rndm.nextLong()) % somedata.length) * 8;
		}

		System.out.println("Sorting data-set...");
		Arrays.sort(somedata);

		System.out.println("Generating lookup indices...");
		long[] lookups = new long[somedata.length / 64];
		for (int i = 0; i < lookups.length; i++) {
			lookups[i] = somedata[rndm.nextInt(somedata.length)] + rndm.nextInt(10) - 5;
		}

		System.out.println("Generating lookup tables...");

		final int shift = 5;
		long[] reduced1 = new long[somedata.length >> (shift * 1)];
		long[] reduced2 = new long[somedata.length >> (shift * 2)];
		long[] reduced3 = new long[somedata.length >> (shift * 3)];
		long[] reduced4 = new long[somedata.length >> (shift * 4)];

		reduce(somedata, reduced1, shift);
		reduce(reduced1, reduced2, shift);
		reduce(reduced2, reduced3, shift);
		reduce(reduced3, reduced4, shift);

		System.out.println("Printing table lengths...");
		System.out.println(somedata.length);
		System.out.println(reduced1.length);
		System.out.println(reduced2.length);
		System.out.println(reduced3.length);
		System.out.println(reduced4.length);

		System.out.println("Performing sanity-checks...");
		{
			for (int i = -1_000_000; i <= 1_000_000; i++) {
				int i1 = binarySearchLo(somedata, i);
				int i2 = binarySearchLo(somedata, reduced1, shift, i);
				int i3 = binarySearchLo(somedata, reduced1, reduced2, shift, i);
				int i4 = binarySearchLo(somedata, reduced1, reduced2, reduced3, shift, i);
				int i5 = binarySearchLo(somedata, reduced1, reduced2, reduced3, reduced4, shift, i);
				if (i1 != i2)
					throw new IllegalStateException();
				if (i1 != i3)
					throw new IllegalStateException();
				if (i1 != i4)
					throw new IllegalStateException();
				if (i1 != i5)
					throw new IllegalStateException();
			}
		}

		int dummy = 0;
		System.out.println("Starting benchmark...");
		for (int k = 0; k < 8; k++) {
			System.out.println();
			long normTook;
			{
				long t0 = System.nanoTime();
				for (long lookup : lookups) {
					dummy += binarySearchLo(somedata, lookup);
				}
				long t1 = System.nanoTime();
				long took = (t1 - t0) / 1_000_000L;
				System.out.println("norm took " + took + "ms (100%, " + (lookups.length / took) + "K ops/s)");
				normTook = took;
			}
			{
				long t0 = System.nanoTime();
				for (long lookup : lookups) {
					dummy += binarySearchLo(somedata, reduced1, shift, lookup);
				}
				long t1 = System.nanoTime();
				long took = (t1 - t0) / 1_000_000L;
				System.out.println("rec1 took " + took + "ms (" + (int) (normTook * 100.0f / took) + "%, "
						+ (lookups.length / took) + "K ops/s)");
			}
			{
				long t0 = System.nanoTime();
				for (long lookup : lookups) {
					dummy += binarySearchLo(somedata, reduced1, reduced2, shift, lookup);
				}
				long t1 = System.nanoTime();
				long took = (t1 - t0) / 1_000_000L;
				System.out.println("rec2 took " + took + "ms (" + (int) (normTook * 100.0f / took) + "%, "
						+ (lookups.length / took) + "K ops/s)");
			}
			{
				long t0 = System.nanoTime();
				for (long lookup : lookups) {
					dummy += binarySearchLo(somedata, reduced1, reduced2, reduced3, shift, lookup);
				}
				long t1 = System.nanoTime();
				long took = (t1 - t0) / 1_000_000L;
				System.out.println("rec3 took " + took + "ms (" + (int) (normTook * 100.0f / took) + "%, "
						+ (lookups.length / took) + "K ops/s)");
			}
			{
				long t0 = System.nanoTime();
				for (long lookup : lookups) {
					dummy += binarySearchLo(somedata, reduced1, reduced2, reduced3, reduced4, shift, lookup);
				}
				long t1 = System.nanoTime();
				long took = (t1 - t0) / 1_000_000L;
				System.out.println("rec4 took " + took + "ms (" + (int) (normTook * 100.0f / took) + "%, "
						+ (lookups.length / took) + "K ops/s)");
			}
		}
		System.out.println("dummy:" + dummy);
	}

	// Each overload searches the next-smaller table first, then narrows the
	// search in the current table to a tiny window around the result.
	public static int binarySearchLo(long[] data1, long[] data2, long[] data3, long[] data4, long[] data5, int shift,
			long lookup) {
		int idx = binarySearchLo(data2, data3, data4, data5, shift, lookup);
		return binarySearchLoNextLevel(data1, idx, shift, lookup);
	}

	public static int binarySearchLo(long[] data1, long[] data2, long[] data3, long[] data4, int shift, long lookup) {
		int idx = binarySearchLo(data2, data3, data4, shift, lookup);
		return binarySearchLoNextLevel(data1, idx, shift, lookup);
	}

	public static int binarySearchLo(long[] data1, long[] data2, long[] data3, int shift, long lookup) {
		int idx = binarySearchLo(data2, data3, shift, lookup);
		return binarySearchLoNextLevel(data1, idx, shift, lookup);
	}

	public static int binarySearchLo(long[] data1, long[] data2, int shift, long lookup) {
		int idx = binarySearchLo(data2, lookup);
		return binarySearchLoNextLevel(data1, idx, shift, lookup);
	}

	public static long[] reduce(long[] data, int shift) {
		long[] lookup = new long[data.length >> shift];
		reduce(data, lookup, shift);
		return lookup;
	}

	public static void reduce(long[] src, long[] dst, int shift) {
		if (src.length != (dst.length << shift)) {
			throw new IllegalStateException();
		}
		for (int i = 0; i < dst.length; i++) {
			dst[i] = src[i << shift];
		}
	}

	private static int binarySearchLoNextLevel(long[] data, int idx, int shift, long lookup) {
		// branch-free decode of the insertion point: if (idx < 0) idx = -(idx + 1)
		idx = (idx >> 31) ^ idx;

		// branch-free clamp: int min = Math.max(0, (idx - 1) << shift);
		int min = (idx - 1) << shift;
		min &= ~(min >> 31);

		// window of at most (1 << shift) + 1 elements around the expected position
		int max = Math.min(data.length, (idx << shift) + 1);

		return binarySearchLo(data, min, max, lookup);
	}

	private static int binarySearchLo(long[] a, long key) {
		return binarySearchLo(a, 0, a.length, key);
	}

	// Like Arrays.binarySearch, but returns the *lowest* index of key when it
	// occurs multiple times; absent keys yield -(insertionPoint + 1).
	private static int binarySearchLo(long[] a, int fromIndex, int toIndex, long key) {
		int low = fromIndex;
		int high = toIndex - 1;

		while (low <= high) {
			int idx = (low + high) >>> 1;
			long val = a[idx];

			if (val < key)
				low = idx + 1;
			else if (val > key)
				high = idx - 1;
			else if (idx != fromIndex && val == a[idx - 1])
				high = idx - 1; // find lower index
			else
				return idx; // lowest index with value
		}
		return -(low + 1);
	}
}


Ah, it’s not that complex :slight_smile:

Let’s say you do a binary search, and find your cache to be trashed.
Now shrink the ordered data-set, by taking every n-th value.
Perform the binary search on that smaller data-set.
You now have an index at which (after scaling up) you can expect the value to be in the larger data-set.
Now do a binary search on the larger data-set within a very small range.

Rinse and repeat, until the smallest data-set fits into L1 cache and the next one into L2 cache.
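
In code, one level of that boils down to something like this (a distilled sketch of the binarySearchLo overloads above; with shift = 5 the reduced table holds every 32nd value):


	// One level of the hierarchy: search the small (cache-friendly) table,
	// then only a ~33-element window of the big one.
	static int search2(long[] data, long[] reduced, int shift, long key) {
		int idx = binarySearchLo(reduced, key);              // small table stays in cache
		if (idx < 0) idx = -(idx + 1);                       // decode insertion point
		int min = Math.max(0, (idx - 1) << shift);           // window start in big table
		int max = Math.min(data.length, (idx << shift) + 1); // window end (exclusive)
		return binarySearchLo(data, min, max, key);          // cheap, localized search
	}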

I don’t want no stinkin’ medals, I want competition! :-*

Nobody’s up for a challenge? Nobody? Surely your fingers must be itching.

I do this sort of thing all day long at work, and they pay me for it!

But as you’ve got a bit of spare time, how about you give us the raw speed of the naive search algorithm (that is, scanning 4 GB of sorted longs linearly in memory)?

Cas :slight_smile:

What I wouldn’t give for such a job! I’m integrating stuff into other stuff, and mapping stuff to stuff.
Meetings, presentations, politics, getting everybody to agree. Pays big bucks, drains the soul.


norm took 12564ms (100%, 667K ops/s) // <--- naive binarySearch, finding *lowest* index of value
rec1 took 8137ms (154%, 1030K ops/s)
rec2 took 5306ms (236%, 1580K ops/s)
rec3 took 4962ms (253%, 1690K ops/s)
rec4 took 4137ms (303%, 2027K ops/s) // <--- 4 layers of indirection

That’s still a binary-search time for norm, isn’t it? I mean literally this:


for (int i = 0; i < in.length; i++) if (in[i] == findThis) return i;

Cas :slight_smile:

Bah, only got 5GB of physical RAM here at work. Can’t figure out a combination of command-line voodoo that will allow me to allocate 4GB.

Cas :slight_smile:

Arg… I should learn to read. But then again, it’s a mad suggestion! :point:

Are you suspecting that the reduced cache-trashing of a linear search may put it in the same ballpark as a cache-trashy binary search? Madness, I tell you! :stuck_out_tongue:

I just added it, and it already struggles with the sanity check (not with correctness, but with performance). I’ll report back when my poor CPU finishes the benchmark on the 512MB data-set; 4GB may take a few days.

Yes, sheer madness :slight_smile: I wonder at what size of data-set it becomes realistically sensible to do brute-force.

Cas :slight_smile:

[table]
[tr][td]Data-set[/td][td]Linear (ops/s)[/td][td]Naive binary search (ops/s)[/td][td]Hierarchical binary search (ops/s)[/td][/tr]
[tr][td]512 MB[/td][td]50[/td][td]1,097,000[/td][td]2,702,000[/td][/tr]
[tr][td]4096 MB[/td][td]6[/td][td]667,000[/td][td]2,027,000[/td][/tr]
[/table]

I killed the process… would have taken eons.

Science! :o

Cas :slight_smile:

The break-even point for linear vs. binary search is:

drum roll


long[12]

norm took 3.757029ms (100%, 49906K op/s)
linr took 3.698955ms (101%, 50689K op/s)


	// Early-exit scan over sorted data, with the same return convention as
	// binarySearchLo: lowest index of key, or -(insertionPoint + 1).
	private static int linearSearchLo(long[] data, long key) {
		for (int i = 0; i < data.length; i++) {
			if (data[i] == key) {
				return i;
			}
			if (data[i] > key) {
				return -(i + 1);
			}
		}
		return -(data.length + 1);
	}

I would hash it while sorting, and get O(1) performance XD

Big-O notation is about performance degradation over a growing data-set. O(1) can be dreadful.

Hashing is not so usable for 1:N indexes. You’d need a fairly large data-structure behind every hash-key, and how do you optimize that?

Also, you can’t traverse (in-order) a hashed data-structure.

Anyway, I’ll give it a shot, for direct lookups (when time permits).
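
If/when I do, a direct-lookup sketch could look something like this (the boxed HashMap is just a stand-in; at half a billion entries you’d want a primitive-specialized map):


import java.util.HashMap;
import java.util.Map;

	// Build once: value -> lowest index. Iterating backwards means the last
	// write for each duplicate value wins, which is exactly its lowest index.
	static Map<Long, Integer> buildHashIndex(long[] sorted) {
		Map<Long, Integer> index = new HashMap<>();
		for (int i = sorted.length - 1; i >= 0; i--) {
			index.put(sorted[i], i);
		}
		return index;
	}

	// O(1) per lookup, but no insertion point for absent keys, and no in-order
	// traversal -- the limitations mentioned above.
	static int hashLookup(Map<Long, Integer> index, long key) {
		Integer idx = index.get(key);
		return idx != null ? idx : -1;
	}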

That wasn’t actually serious :stuck_out_tongue_winking_eye: but maybe there is some method in the craziness. :wink:

Depending on what the “important” operations are, there are also Bloom filters.
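
A minimal sketch of the idea (sizes and hash mixes are arbitrary, not tuned): the filter answers “definitely absent” cheaply, so the expensive search only runs for keys that might be present.


import java.util.BitSet;

	static final int FILTER_BITS = 1 << 26; // illustrative size

	static BitSet buildFilter(long[] values) {
		BitSet bits = new BitSet(FILTER_BITS);
		for (long v : values) {
			bits.set(hash1(v));
			bits.set(hash2(v));
		}
		return bits;
	}

	// False positives are possible, false negatives are not: a 'false' here
	// means the binary search can be skipped entirely.
	static boolean mightContain(BitSet bits, long v) {
		return bits.get(hash1(v)) && bits.get(hash2(v));
	}

	static int hash1(long v) {
		v *= 0x9E3779B97F4A7C15L; // 64-bit golden-ratio multiplier
		return (int) (v >>> 32) & (FILTER_BITS - 1);
	}

	static int hash2(long v) {
		v ^= v >>> 33;
		v *= 0xC2B2AE3D27D4EB4FL; // 64-bit mixing prime (from xxHash)
		return (int) (v >>> 32) & (FILTER_BITS - 1);
	}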

Interesting concept.

There are no important operations though, as I don’t have a use-case yet where a 3x performance increase in memory access is actually meaningful. Whatever accesses this read-only ‘index’ is going to be the bottleneck anyway.

I’d be curious where the break-even point lies between hierarchical indices and something like binary search on level-ordered storage. It might not need as many levels in the hierarchy.
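
If I understand the suggestion, something like this (a sketch of a level-ordered/Eytzinger layout; note lowerBound returns a slot in the level-ordered array, so recovering the original sorted index would need extra bookkeeping):


	// Level-ordered (Eytzinger/BFS) layout: 1-based array, children of node i
	// sit at 2i and 2i+1. The hot top of the tree is contiguous at the front,
	// so it stays in cache -- much like the reduced tables, without storing them.
	static long[] toLevelOrder(long[] sorted) {
		long[] eyt = new long[sorted.length + 1]; // slot 0 unused
		fill(sorted, eyt, 1, 0);
		return eyt;
	}

	static int fill(long[] sorted, long[] eyt, int i, int src) {
		if (i < eyt.length) {
			src = fill(sorted, eyt, 2 * i, src);     // left subtree
			eyt[i] = sorted[src++];                  // this node
			src = fill(sorted, eyt, 2 * i + 1, src); // right subtree
		}
		return src;
	}

	// Returns the 1-based slot of the first element >= key, or 0 if every
	// element is smaller.
	static int lowerBound(long[] eyt, long key) {
		int i = 1;
		while (i < eyt.length) {
			i = 2 * i + (eyt[i] < key ? 1 : 0); // branch-light descent
		}
		return i >> (Integer.numberOfTrailingZeros(~i) + 1);
	}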

You know, believe it or not, no. Absolutely not :smiley:
You got this!