Performance
- performance testing via single-threaded, multithreaded, and GPU code - https://github.com/ObrienlabsDev/blog/issues/91
- gpu performance - https://github.com/obrienlabs/benchmark/issues/13
- cpu performance - https://github.com/ObrienlabsDev/performance/tree/main/cpu, CUDA kernel https://github.com/ObrienlabsDev/cuda/blob/main/add_example/kernel_collatz.cu, https://github.com/obrienlabs/benchmark/issues/12 (the Collatz workload itself is sketched just below)
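The workload behind the CPU and GPU benchmarks linked above is the Collatz (3n+1) sequence. As a rough single-threaded illustration only (the real code lives in kernel_collatz.cu and the performance repo; the class and method names here are made up for the sketch), the inner loop looks roughly like this in Java:

```java
// Minimal sketch of the Collatz inner loop the benchmarks exercise.
// Hypothetical: class/method names and plain long arithmetic are assumptions, not the repo's code.
public final class CollatzWorkloadSketch {

    /** Returns the peak value reached while iterating n -> 3n+1 (odd) / n/2 (even) down to 1. */
    static long collatzPeak(long n) {
        long peak = n;
        while (n > 1) {
            n = ((n & 1L) == 1L) ? 3L * n + 1L : n >> 1;
            if (n > peak) {
                peak = n; // track the trajectory maximum (can overflow long for very large starts)
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // 27 is the classic small example: a long trajectory with a peak of 9232
        System.out.println("peak for 27: " + collatzPeak(27L));
    }
}
```

The GPU results for this search across NVIDIA cards and Apple Silicon are in the table below.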
| perf (relative) | sec | sec/run | GPUs | GPU util % | Watts | % TDP | Chip | GPU cores | GPU spec |
|---|---|---|---|---|---|---|---|---|---|
| 11.7 | 23 | .0092 | 2 | 99 | 904 | 94 | AD-102 | 32768 | dual RTX-4090 Ada 48G (no NVLink available) |
| 5.85 | 46 | .0092 | 1 | 99 | 452 | 94 | AD-102 | 16384 | RTX-4090 Ada 24G |
| 3.44 | 78 | .0312 | 2 | 99 | 388 | 97 | GA-102 | 14336 | dual RTX-A4500 40G (NVLink present, not used) |
| 2.66 | 100 | .02 | 1 | 99 | 304 | 102 | GA-102 | 10752 | RTX-A6000 48G |
| 2.56 | 191 | .0382 | 1 | 99 | 102 | ? | AD-104 | 5120 | RTX-3500 Ada 12G (thermal throttling) |
| 1.72 | 156 | .0312 | 1 | 99 | 194 | 97 | GA-102 | 7168 | RTX-A4500 20G old |
| 1.29 | 208 | .0416 | 1 | 99 | 143 | 102 | GA-104 | 6144 | RTX-A4000 16G old |
| 1.16 | 231 | .0462 | 1 | 98 | 120 | ? | M4 Max 40 | 5120 | Macbook Pro 16 M4Max 48G |
| 1 | 269 | .0538 | 1 | 99 | 105 | ? | TU-104 | 3072 | RTX-5000 16G |
| 0.78 | 344 | .0688 | 2 | 97 | 120 | ? | M2 Ultra 60 | 7680 | Mac Studio 2 M2Ultra 64G |
| 0.47 | 571 | .1142 | 1 | 79-98 | ? | ? | M4 Pro 16 | 2048 | Mac Mini M4 Pro 24G |
| 0.39 | 693 | .1386 | 1 | 95 | ? | ? | M1 Max 32 | 4096 | Macbook Pro 16 M1Max 32G |
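Assuming the perf column is simply the run time normalized to the RTX-5000 baseline (269 s = 1.0), most rows can be reproduced directly from the sec column, e.g. 269 / 23 ≈ 11.7 for the dual RTX-4090 row:

```java
// Hypothetical check of the relative-perf column: baseline seconds / device seconds.
// Assumes the RTX-5000 run (269 s) is the 1.0 baseline, as the perf = 1 row suggests.
public final class RelativePerf {
    public static void main(String[] args) {
        double baselineSeconds = 269.0;              // RTX-5000 row
        double[] deviceSeconds = {23, 46, 231, 693}; // dual RTX-4090, RTX-4090, M4 Max, M1 Max
        for (double s : deviceSeconds) {
            System.out.printf("%.0f s -> %.2f relative%n", s, baselineSeconds / s);
        }
    }
}
```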
See https://github.com/ObrienlabsDev/performance/issues/19 for the Java driver below, which splits a 32-bit search space into 32 batches and fans each batch out over a parallel stream:
// requires: import java.util.List; import java.util.stream.Collectors; import java.util.stream.LongStream;
public void searchCollatzParallel(long oddSearchCurrent, long secondsStart) {
    long batchBits = 5; // adjust this based on the chip architecture
    long searchBits = 32;
    long batches = 1L << batchBits;           // 2^5 = 32 batches
    long threadBits = searchBits - batchBits;
    long threads = 1L << threadBits;          // 2^27 candidates per batch
    // exactly 'batches' passes cover the full 2^searchBits space
    for (long part = 0; part < batches; part++) {
        // generate a limited collection for this slice of the 32-bit search space
        System.out.println("Searching: " + searchBits + " space, batch " + part + " of "
            + batches + " with " + threadBits + " bits of " + threads + " threads");
        List<Long> oddNumbers = LongStream
            // upper bound is exclusive, so each batch covers its full 2^threadBits block
            .range(1L + (part * threads), (1 + part) * threads)
            .filter(x -> x % 2 != 0) // TODO: find a way to avoid this filter using range above
            .boxed()
            .collect(Collectors.toList());
        List<Long> results = oddNumbers
            .parallelStream()
            .filter(num -> isCollatzMax(num.longValue(), secondsStart))
            .collect(Collectors.toList());
        results.stream().sorted().forEach(System.out::println);
    }
    System.out.println("last number: " + ((batches * threads) - 1));
}
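The driver above filters on isCollatzMax(num, secondsStart), which is not shown here. A minimal sketch of what such a predicate might do, assuming it reports a candidate whose trajectory peak sets a new global record and that secondsStart is the epoch-second timestamp of the run start (both assumptions, not the repo's actual implementation):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only - not the repo's isCollatzMax implementation.
// Assumptions: returns true when the candidate sets a new global peak-value record,
// and secondsStart is the epoch-seconds timestamp taken at the start of the run.
public final class CollatzMaxSketch {

    private static final AtomicLong globalMax = new AtomicLong(1L);

    static boolean isCollatzMax(long start, long secondsStart) {
        long n = start;
        long pathMax = start;
        while (n > 1) {
            n = ((n & 1L) == 1L) ? 3L * n + 1L : n >> 1; // 3n+1 for odd, halve for even
            if (n > pathMax) {
                pathMax = n; // peak of this trajectory (can overflow long for very large starts)
            }
        }
        final long peak = pathMax; // effectively-final copy for the lambda below
        long previousRecord = globalMax.getAndUpdate(prev -> Math.max(prev, peak));
        if (peak > previousRecord) {
            long elapsed = System.currentTimeMillis() / 1000L - secondsStart;
            System.out.println("new peak " + peak + " from " + start + " after " + elapsed + " s");
            return true;
        }
        return false;
    }
}
```

Because the predicate is called from a parallelStream, any shared record state has to be thread-safe; the AtomicLong here is one way to do that without locking.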
- Curiously, running in VMs is around 10-25% faster than running natively (edit: this may be due to differences between OpenJDK and a commercial JDK 21 build)
- The 13900KS is still faster than the M4 for single-core work
- The M4 Max has more than double the throughput of the 32-thread 13900/13900
- The M4 Max 40-core GPU is around half the speed of a comparable NVIDIA RTX-3500 Ada-generation mobile card; both have 5120 cores
- https://github.com/obrienlabs/benchmark/issues/12