Performance - ObrienlabsDev/blog GitHub Wiki


Experiments

Mandelbrot Set

Option 1: CUDA or Metal GPUs

| perf | sec | sec/run | # GPUs | GPU % | Watts | TDP | Chip | Cores | GPU spec |
|------|-----|---------|--------|-------|-------|-----|------|-------|----------|
| 11.7 | 23 | .0092 | 2 | 99 | 904 | 94 | AD-102 | 32768 | dual RTX-4090 Ada (no NVLink (not used 48G)) |
| 5.85 | 46 | .0092 | 1 | 99 | 452 | 94 | AD-102 | 16384 | RTX-4090 Ada 24G |
| 3.44 | 78 | .0312 | 2 | 99 | 388 | 97 | GA-102 | 14336 | dual RTX-A4500 with NVLink (not used) 40G |
| 2.66 | 100 | .02 | 1 | 99 | 304 | 102 | GA-102 | 10752 | RTX-A6000 48G |
| 2.56 | 191 | .0382 | 1 | 99 | 102 | ? | AD-104 | 5120 | RTX-3500 Ada 12G (thermal throttling) |
| 1.72 | 156 | .0312 | 1 | 99 | 194 | 97 | GA-102 | 7168 | RTX-A4500 20G (old) |
| 1.29 | 208 | .0416 | 1 | 99 | 143 | 102 | GA-104 | 6144 | RTX-A4000 16G (old) |
| 1.16 | 231 | .0462 | 1 | 98 | 120 | ? | M4 Max 40 | 5120 | Macbook Pro 16 M4Max 48G |
| 1 | 269 | .0538 | 1 | 99 | 105 | ? | TU-104 | 3072 | RTX-5000 16G |
| 0.78 | 344 | .0688 | 2 | 97 | 120 | ? | M2 Ultra 60 | 7680 | Mac Studio 2 M2Ultra 64G |
| 0.47 | 571 | .1142 | 1 | 79-98 | ? | | M4 Pro 16 | 2048 | Mac Mini M4 Pro 24G |
| 0.39 | 693 | .1386 | 1 | 95 | ? | | M1 Max 32 | 4096 | Macbook Pro 16 M1Max 32G |
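
The workload each GPU thread runs is the per-pixel escape-time iteration. A minimal single-threaded Java sketch of that inner loop follows for reference; it is a hypothetical CPU reference, not the benchmarked CUDA/Metal kernel, and `MAX_ITER` and the escape radius of 2 are assumptions:

```java
// Hypothetical CPU reference for the Mandelbrot escape-time loop that the
// GPU kernels above parallelize across the pixel grid.
public class MandelbrotSketch {
    static final int MAX_ITER = 1000; // iteration cap (assumed value)

    // Iterate z = z^2 + c from z = 0; return the iteration count at which
    // |z| exceeds 2, or MAX_ITER if the point never escapes (in the set).
    static int escapeTime(double cr, double ci) {
        double zr = 0.0, zi = 0.0;
        int iter = 0;
        while (zr * zr + zi * zi <= 4.0 && iter < MAX_ITER) {
            double tmp = zr * zr - zi * zi + cr; // Re(z^2 + c)
            zi = 2.0 * zr * zi + ci;             // Im(z^2 + c)
            zr = tmp;
            iter++;
        }
        return iter;
    }

    public static void main(String[] args) {
        System.out.println(escapeTime(2.0, 2.0)); // far outside: escapes after 1 iteration
        System.out.println(escapeTime(0.0, 0.0)); // origin never escapes: returns MAX_ITER
    }
}
```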

Collatz | Hailstone numbers | 3n+1 problem

Option 3: Java 8 lambda/streams parallelization

see https://github.com/ObrienlabsDev/performance/issues/19

```java
// Requires: java.util.List, java.util.stream.Collectors, java.util.stream.LongStream
// isCollatzMax(long, long) is defined elsewhere in the class.
public void searchCollatzParallel(long oddSearchCurrent, long secondsStart) {
	long batchBits = 5; // adjust this based on the chip architecture
	long searchBits = 32;
	long batches = 1 << batchBits;
	long threadBits = searchBits - batchBits;
	long threads = 1 << threadBits;

	// iterate exactly `batches` partitions covering [1, 2^searchBits)
	for (long part = 0; part < batches; part++) {
		// generate a limited collection for the search space - 32 bits is a good fit
		System.out.println("Searching: " + searchBits + " space, batch " + part + " of "
			+ batches + " with " + threadBits + " bits of " + threads + " threads");

		List<Long> oddNumbers = LongStream
				.range(part * threads + 1L, (part + 1) * threads) // exclusive upper bound
				.filter(x -> x % 2 != 0) // TODO: generate odds directly to avoid this filter
				.boxed()
				.collect(Collectors.toList());

		List<Long> results = oddNumbers
				.parallelStream()
				.filter(num -> isCollatzMax(num.longValue(), secondsStart))
				.collect(Collectors.toList());

		results.stream().sorted().forEach(System.out::println);
	}
	System.out.println("last number: " + (batches * threads - 1));
}
```
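
The parallel stream above filters candidates through `isCollatzMax`, which is not shown on this page. A minimal sketch of what such a predicate could look like follows; it assumes the predicate flags trajectories that set a new peak-value record (the real version presumably also uses `secondsStart` for elapsed-time logging, omitted here), and `globalMax` is a name introduced for this sketch:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the isCollatzMax predicate called by the parallel search:
// returns true when the trajectory of n reaches a new record peak value.
public class CollatzSketch {
    // Highest trajectory peak seen so far, shared across stream worker threads.
    static final AtomicLong globalMax = new AtomicLong(0);

    static boolatzPlaceholder() { return true; } // (unused; see isCollatzMax below)

    static boolean isCollatzMax(long n, long secondsStart) {
        long peak = n;
        long current = n;
        while (current > 1) {
            // 3n+1 can overflow long for very large n; fine for a 2^32 search space.
            current = (current % 2 == 0) ? current / 2 : 3 * current + 1;
            if (current > peak) peak = current;
        }
        // Publish a new global maximum atomically; true only for record-setters.
        long prev = globalMax.get();
        while (peak > prev) {
            if (globalMax.compareAndSet(prev, peak)) return true;
            prev = globalMax.get();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCollatzMax(27, 0)); // 27 peaks at 9232: prints true
        System.out.println(globalMax.get());
    }
}
```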

20241212: Observations

  • Curiously, running in VMs is around 10-25% faster than running natively (edit: this may come down to differences between OpenJDK and a commercial JDK 21)
  • The 13900KS is still faster than the M4 for single-core work
  • The M4 Max has more than double the throughput of the 32-thread 13900K/13900KS
  • The M4 Max 40-core GPU is around half the speed of a comparable NVIDIA RTX-3500 Ada generation mobile card; both have 5120 cores
  • https://github.com/obrienlabs/benchmark/issues/12