Multithreaded arithmetic benchmark

BurntPizza · March 17, 2014, 8:36pm

Was curious about scalability/practicality of threading arithmetic over large arrays.
Put whatever you want in op() to measure your own system! Post results!

Also tell me if I did anything dumb in the bench code, wanna make sure I’m doing that right. :point:

import java.util.*;
import java.util.concurrent.*;

public class ArrayThreadingTest {
	
	public static void main(String[] args) throws InterruptedException {
		int numCores = Runtime.getRuntime().availableProcessors();
		ExecutorService exec = Executors.newFixedThreadPool(numCores);
		
		System.out.println("Warming up... " + numCores + " cores detected");
		for (int n = 0; n < 2; n++)
			for (int i = 3; i <= 7; i++) {
				final int numThreads = numCores;
				final int testSize = (int) Math.pow(10, i);
				final int[] a = new int[testSize];
				final int[] b = new int[testSize];
				final int[] c = new int[testSize];
				
				Arrays.fill(a, 56456810);
				Arrays.fill(b, 74779);
				
				List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
				for (int y = 0; y < numThreads; y++) {
					final int h = y;
					tasks.add(new Callable<Void>() {
						@Override
						public Void call() throws Exception {
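							// long math: (h + 1) * testSize overflows int on machines with many cores
							for (int z = (int) ((long) h * testSize / numThreads); z < (int) ((long) (h + 1) * testSize / numThreads); z++) {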
								c[z] = op(a[z], b[z]);
							}
							return null;
						}
					});
				}
				exec.invokeAll(tasks);
			}
		
		System.out.println("Performing tests...\n");
		
		for (int n = 1; n <= numCores; n++) {
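			// Note: at 10^8 elements each of the three int arrays takes about 400 MB; raise -Xmx if this throws OutOfMemoryError.
			for (int i = 3; i <= 8; i++) {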
				final int numThreads = n;
				exec.shutdownNow();
				exec.awaitTermination(1, TimeUnit.SECONDS);
				exec = Executors.newFixedThreadPool(numThreads);
				final int testSize = (int) Math.pow(10, i);
				final int[] a = new int[testSize];
				final int[] b = new int[testSize];
				final int[] c = new int[testSize];
				
				Arrays.fill(a, 56456810);
				Arrays.fill(b, 74779);
				
				List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
				for (int y = 0; y < numThreads; y++) {
					final int h = y;
					tasks.add(new Callable<Void>() {
						@Override
						public Void call() {
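							// long math: (h + 1) * testSize overflows int once (h + 1) * 10^8 exceeds Integer.MAX_VALUE
							for (int z = (int) ((long) h * testSize / numThreads); z < (int) ((long) (h + 1) * testSize / numThreads); z++) {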
								c[z] = op(a[z], b[z]);
							}
							return null;
						}
					});
				}
				
				long time = System.nanoTime();
				
				exec.invokeAll(tasks);
				
				time = System.nanoTime() - time;
				
				int errors = 0;
				final int desiredResult = op(56456810, 74779);
				for (int u = 0; u < c.length; u++) {
					if (c[u] != desiredResult) errors++;
				}
				if (errors > 0) System.out.println("\t" + errors * 100. / testSize + "% errors!");
				double nsPerOp = (double) time / testSize;
				double opsPerSecondPerThread = testSize / (time / 1000000000d) / numThreads;
				System.out.printf("Finished: %d Thread" + (numThreads > 1 ? "s" : "") + " performing %s operations at %.2f ns/op (%dK op/s/thread)\n", numThreads, 10 + "^" +(int) Math.log10(testSize), nsPerOp, (int) (opsPerSecondPerThread / 1000));
			}
			System.out.println();
		}
		System.out.println("All tests finished, shutting down...");
		exec.shutdownNow();
		exec.awaitTermination(3, TimeUnit.SECONDS);
	}
	
	//give your ALU a workout!
	//default: testing small loop optimization and integer arithmetic
	private static final int op(int a, int b) {
		int r = 0;
		for (int i = 1; i <= 10; i++)
			r += ((a + b) * (a / b + b + a) & i % b << a / b + a - 7 * b + (a - b + a * b ^ b & a)) % (a * 7 + b * (b / a + 5 ^ a * b) & b - a) * 8 + (i + 1) / a * a ^ b * a << b / b >> a;
		return r;
	}
}

Yields this on an AMD Phenom II X4 955 (3.4 GHz quad), Java 7:

Warming up… 4 cores detected
Performing tests…

Finished: 1 Thread performing 10^3 operations at 645.33 ns/op (1549K op/s/thread)
Finished: 1 Thread performing 10^4 operations at 452.42 ns/op (2210K op/s/thread)
Finished: 1 Thread performing 10^5 operations at 383.04 ns/op (2610K op/s/thread)
Finished: 1 Thread performing 10^6 operations at 375.21 ns/op (2665K op/s/thread)
Finished: 1 Thread performing 10^7 operations at 376.13 ns/op (2658K op/s/thread)
Finished: 1 Thread performing 10^8 operations at 377.05 ns/op (2652K op/s/thread)

Finished: 2 Threads performing 10^3 operations at 528.98 ns/op (945K op/s/thread)
Finished: 2 Threads performing 10^4 operations at 213.94 ns/op (2337K op/s/thread)
Finished: 2 Threads performing 10^5 operations at 189.71 ns/op (2635K op/s/thread)
Finished: 2 Threads performing 10^6 operations at 189.00 ns/op (2645K op/s/thread)
Finished: 2 Threads performing 10^7 operations at 188.49 ns/op (2652K op/s/thread)
Finished: 2 Threads performing 10^8 operations at 188.90 ns/op (2646K op/s/thread)

Finished: 3 Threads performing 10^3 operations at 579.33 ns/op (575K op/s/thread)
Finished: 3 Threads performing 10^4 operations at 157.18 ns/op (2120K op/s/thread)
Finished: 3 Threads performing 10^5 operations at 128.45 ns/op (2595K op/s/thread)
Finished: 3 Threads performing 10^6 operations at 125.93 ns/op (2647K op/s/thread)
Finished: 3 Threads performing 10^7 operations at 125.51 ns/op (2655K op/s/thread)
Finished: 3 Threads performing 10^8 operations at 125.86 ns/op (2648K op/s/thread)

Finished: 4 Threads performing 10^3 operations at 681.02 ns/op (367K op/s/thread)
Finished: 4 Threads performing 10^4 operations at 133.96 ns/op (1866K op/s/thread)
Finished: 4 Threads performing 10^5 operations at 97.46 ns/op (2565K op/s/thread)
Finished: 4 Threads performing 10^6 operations at 158.06 ns/op (1581K op/s/thread)
Finished: 4 Threads performing 10^7 operations at 106.08 ns/op (2356K op/s/thread)
Finished: 4 Threads performing 10^8 operations at 96.27 ns/op (2596K op/s/thread)

All tests finished, shutting down…
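
To read the numbers: ns/op here is wall time divided by the total number of operations, so it should shrink as threads are added, while op/s/thread staying flat means the scaling is close to linear. A minimal sketch of that arithmetic, using the 10^8 rows from the output above (the class and method names are made up for illustration, not part of the benchmark):

```java
public class SpeedupFromResults {

    // Speedup of an n-thread run relative to the single-thread run.
    static double speedup(double nsPerOpOneThread, double nsPerOpNThreads) {
        return nsPerOpOneThread / nsPerOpNThreads;
    }

    // Parallel efficiency: 1.0 means perfect linear scaling.
    static double efficiency(double speedup, int threads) {
        return speedup / threads;
    }

    public static void main(String[] args) {
        // ns/op for the 10^8 runs, copied from the output above (index 0 = 1 thread).
        double[] nsPerOp = {377.05, 188.90, 125.86, 96.27};
        for (int t = 1; t <= nsPerOp.length; t++) {
            double s = speedup(nsPerOp[0], nsPerOp[t - 1]);
            System.out.printf("%d thread(s): speedup %.2fx, efficiency %.0f%%%n",
                    t, s, 100 * efficiency(s, t));
        }
    }
}
```

For the four-thread, 10^8 run that comes out to roughly 3.9x over one thread. The 10^3 rows go the other way: with so little work per task, the time is mostly thread handoff overhead.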

Spasi · March 17, 2014, 9:53pm

I highly recommend JMH for all kinds of benchmarking, for many important reasons.

BurntPizza · March 17, 2014, 10:13pm

Ah, thanks for that, it looks impressive. I should really pay more attention to OpenJDK projects.

I realize it’s very difficult to make a good microbenchmark in Java due to the unpredictable nature of the JIT, but I wanted to attempt it anyway, taking any results with a grain or five of salt. I also wanted to give anyone here an example snippet for simple multithreaded number crunching, though I don’t claim to have the best methodology. Lately I’ve seen too many posts about “should I multithread blah,” quickly followed by “why is stuff broken?”
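
On the “why is stuff broken?” front, two things tend to bite in this kind of chunked number crunching. Index arithmetic like part * length / parts is done in int and can overflow with a big array and enough threads. And invokeAll() hands back Futures, so an exception thrown inside a task stays hidden unless you call get(). Here is a rough sketch of both points, not a definitive pattern; ChunkedSum, chunkStart and parallelSum are names invented for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {

    // First index of chunk `part` out of `parts`; the long cast keeps
    // part * length from overflowing int.
    static int chunkStart(int part, int parts, int length) {
        return (int) ((long) part * length / parts);
    }

    static long parallelSum(final int[] data, int parts) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(parts);
        try {
            List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
            for (int p = 0; p < parts; p++) {
                final int from = chunkStart(p, parts, data.length);
                final int to = chunkStart(p + 1, parts, data.length);
                tasks.add(new Callable<Long>() {
                    @Override
                    public Long call() {
                        long sum = 0;
                        for (int i = from; i < to; i++) sum += data[i];
                        return sum;
                    }
                });
            }
            long total = 0;
            // invokeAll blocks until every task is done; get() rethrows
            // any failure from a task as an ExecutionException.
            for (Future<Long> f : exec.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[1000];
        Arrays.fill(data, 3);
        System.out.println(parallelSum(data, 4));
    }
}
```

Each task only touches its own index range and returns its own partial result, so no locking is needed; the partial sums are combined on the calling thread.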