Performance
- performance testing via single-threaded, multithreaded, and GPU code - https://github.com/ObrienlabsDev/blog/issues/91
- gpu performance - https://github.com/obrienlabs/benchmark/issues/13
- cpu performance - https://github.com/ObrienlabsDev/performance/tree/main/cpu, CUDA kernel https://github.com/ObrienlabsDev/cuda/blob/main/add_example/kernel_collatz.cu, https://github.com/obrienlabs/benchmark/issues/12 (the Collatz workload itself is sketched just below)
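The workload behind the CPU and GPU benchmarks linked above is the Collatz (3n+1) sequence. As a rough single-threaded illustration only (the real code lives in kernel_collatz.cu and the performance repo; the class and method names here are made up for the sketch), the inner loop looks roughly like this in Java:

```java
// Minimal sketch of the Collatz inner loop the benchmarks exercise.
// Hypothetical: class/method names and plain long arithmetic are assumptions, not the repo's code.
public final class CollatzWorkloadSketch {

    /** Returns the peak value reached while iterating n -> 3n+1 (odd) / n/2 (even) down to 1. */
    static long collatzPeak(long n) {
        long peak = n;
        while (n > 1) {
            n = ((n & 1L) == 1L) ? 3L * n + 1L : n >> 1;
            if (n > peak) {
                peak = n; // track the trajectory maximum (can overflow long for very large starts)
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // 27 is the classic small example: a long trajectory with a peak of 9232
        System.out.println("peak for 27: " + collatzPeak(27L));
    }
}
```

The GPU results for this search across NVIDIA cards and Apple Silicon are in the table below.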
| perf (relative) | sec | sec/run | GPUs | GPU util % | Watts | % TDP | Chip | GPU cores | GPU spec |
|---|---|---|---|---|---|---|---|---|---|
| 11.7 | 23 | .0092 | 2 | 99 | 904 | 94 | AD-102 | 32768 | dual RTX-4090 Ada 48G (no NVLink available) |
| 5.85 | 46 | .0092 | 1 | 99 | 452 | 94 | AD-102 | 16384 | RTX-4090 Ada 24G |
| 3.44 | 78 | .0312 | 2 | 99 | 388 | 97 | GA-102 | 14336 | dual RTX-A4500 40G (NVLink present, not used) |
| 2.66 | 100 | .02 | 1 | 99 | 304 | 102 | GA-102 | 10752 | RTX-A6000 48G |
| 2.56 | 191 | .0382 | 1 | 99 | 102 | ? | AD-104 | 5120 | RTX-3500 Ada 12G (thermal throttling) |
| 1.72 | 156 | .0312 | 1 | 99 | 194 | 97 | GA-102 | 7168 | RTX-A4500 20G old |
| 1.29 | 208 | .0416 | 1 | 99 | 143 | 102 | GA-104 | 6144 | RTX-A4000 16G old |
| 1.16 | 231 | .0462 | 1 | 98 | 120 | ? | M4 Max 40 | 5120 | Macbook Pro 16 M4Max 48G |
| 1 | 269 | .0538 | 1 | 99 | 105 | ? | TU-104 | 3072 | RTX-5000 16G |
| 0.78 | 344 | .0688 | 2 | 97 | 120 | ? | M2 Ultra 60 | 7680 | Mac Studio 2 M2Ultra 64G |
| 0.47 | 571 | .1142 | 1 | 79-98 | ? | ? | M4 Pro 16 | 2048 | Mac Mini M4 Pro 24G |
| 0.39 | 693 | .1386 | 1 | 95 | ? | ? | M1 Max 32 | 4096 | Macbook Pro 16 M1Max 32G |
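Assuming the perf column is simply the run time normalized to the RTX-5000 baseline (269 s = 1.0), most rows can be reproduced directly from the sec column, e.g. 269 / 23 ≈ 11.7 for the dual RTX-4090 row:

```java
// Hypothetical check of the relative-perf column: baseline seconds / device seconds.
// Assumes the RTX-5000 run (269 s) is the 1.0 baseline, as the perf = 1 row suggests.
public final class RelativePerf {
    public static void main(String[] args) {
        double baselineSeconds = 269.0;              // RTX-5000 row
        double[] deviceSeconds = {23, 46, 231, 693}; // dual RTX-4090, RTX-4090, M4 Max, M1 Max
        for (double s : deviceSeconds) {
            System.out.printf("%.0f s -> %.2f relative%n", s, baselineSeconds / s);
        }
    }
}
```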
See https://github.com/ObrienlabsDev/performance/issues/19 for the Java driver below, which splits a 32-bit search space into 32 batches and fans each batch out over a parallel stream:
// requires: import java.util.List; import java.util.stream.Collectors; import java.util.stream.LongStream;
public void searchCollatzParallel(long oddSearchCurrent, long secondsStart) {
    long batchBits = 5; // adjust this based on the chip architecture
    long searchBits = 32;
    long batches = 1L << batchBits;           // 2^5 = 32 batches
    long threadBits = searchBits - batchBits;
    long threads = 1L << threadBits;          // 2^27 candidates per batch
    // exactly 'batches' passes cover the full 2^searchBits space
    for (long part = 0; part < batches; part++) {
        // generate a limited collection for this slice of the 32-bit search space
        System.out.println("Searching: " + searchBits + " space, batch " + part + " of "
            + batches + " with " + threadBits + " bits of " + threads + " threads");
        List<Long> oddNumbers = LongStream
            // upper bound is exclusive, so each batch covers its full 2^threadBits block
            .range(1L + (part * threads), (1 + part) * threads)
            .filter(x -> x % 2 != 0) // TODO: find a way to avoid this filter using range above
            .boxed()
            .collect(Collectors.toList());
        List<Long> results = oddNumbers
            .parallelStream()
            .filter(num -> isCollatzMax(num.longValue(), secondsStart))
            .collect(Collectors.toList());
        results.stream().sorted().forEach(System.out::println);
    }
    System.out.println("last number: " + ((batches * threads) - 1));
}
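The driver above filters on isCollatzMax(num, secondsStart), which is not shown here. A minimal sketch of what such a predicate might do, assuming it reports a candidate whose trajectory peak sets a new global record and that secondsStart is the epoch-second timestamp of the run start (both assumptions, not the repo's actual implementation):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only - not the repo's isCollatzMax implementation.
// Assumptions: returns true when the candidate sets a new global peak-value record,
// and secondsStart is the epoch-seconds timestamp taken at the start of the run.
public final class CollatzMaxSketch {

    private static final AtomicLong globalMax = new AtomicLong(1L);

    static boolean isCollatzMax(long start, long secondsStart) {
        long n = start;
        long pathMax = start;
        while (n > 1) {
            n = ((n & 1L) == 1L) ? 3L * n + 1L : n >> 1; // 3n+1 for odd, halve for even
            if (n > pathMax) {
                pathMax = n; // peak of this trajectory (can overflow long for very large starts)
            }
        }
        final long peak = pathMax; // effectively-final copy for the lambda below
        long previousRecord = globalMax.getAndUpdate(prev -> Math.max(prev, peak));
        if (peak > previousRecord) {
            long elapsed = System.currentTimeMillis() / 1000L - secondsStart;
            System.out.println("new peak " + peak + " from " + start + " after " + elapsed + " s");
            return true;
        }
        return false;
    }
}
```

Because the predicate is called from a parallelStream, any shared record state has to be thread-safe; the AtomicLong here is one way to do that without locking.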
- Curiously, running in VMs is around 10-25% faster than running natively (edit: this may be due to differences between OpenJDK and a commercial JDK 21 build)
- The 13900KS is still faster than the M4 for single-core work
- The M4 Max has more than double the throughput of the 32-thread 13900/13900
- The M4 Max 40-core GPU is around half the speed of a comparable NVIDIA RTX-3500 Ada-generation mobile card; both have 5120 cores
- https://github.com/obrienlabs/benchmark/issues/12