Testing AMD 1055T versus A10 7850K - wyldckat/wyldckat.github.io GitHub Wiki
Back in 2011 I posted the following on my blog at CFD-Online: Build times for OpenFOAM 2.0.x code with Ubuntu 10.10 with its gcc 4.4.5. This was when I had bought my AMD Phenom 1055T earlier that year and was happily treading along with a neat and low cost 6 core CPU.
Ever since AMD launched their A10-7850K APU back in mid January 2014, I've been eyeing the prices and saving up to a point where I could say "yep, I need a new CPU and more RAM, so it's time to purchase this wonder of technology". That time came earlier this December 2015, because I wanted to help people on the forum and I needed more than the 6GB of RAM I had with my 1055T build.
Only around Christmas (25th of December) did I finally manage to build the new machine and transfer things around, but I also did some tests to see how well each machine performs. Note: I essentially changed only the motherboard, CPU, RAM and PSU, while keeping the remaining barebones of the old machine.
In the CPU Specifications chapter I outline the specs for each machine build and how I expect each one to perform. The section Estimated performances holds my estimates, based on online benchmark values and each machine's specs... which is where you will see why I chose the A10-7850K for my personal home computer.
In the chapter Performance tests are the actual performance results.
Reference pages:
- http://www.amd.com/en-us/products/processors/desktop/phenom-ii
- http://www.cpu-world.com/CPUs/K10/AMD-Phenom%20II%20X6%201055T%20-%20HDT55TFBK6DGR%20%28HDT55TFBGRBOX%29.html
- 6 cores
- L1 cache:
- 6 x 64 KB 2-way set associative instruction caches
- 6 x 64 KB 2-way set associative data caches
- L2 cache: 3MB
- 6 x 512 KB 16-way set associative exclusive caches
- L3 cache: 6MB
- Shared 6 MB 48-way set associative cache
- TDP: 125W
- Lithography: 45nm SOI
- Frequency (not overclocked):
- Max: 2800 MHz (when using all 6 cores)
- Min: 800 MHz
- 2 memory communication channels
- 6 GB RAM DDR2 (3 modules, 1 unpaired) at 800 MHz
- 512 MB shared with integrated GPU in motherboard
Reference pages:
- http://www.amd.com/en-us/products/processors/desktop/a-series-apu
- http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series%20A10-7850K.html
- 4 cores + 8 GPU stream cores (8*64 = 512 Shaders total)
- Note: has AVX and HSA support
- L1 cache:
- 2 x 96 KB 3-way set associative shared instruction caches
- 4 x 16 KB 4-way set associative data caches
- L2 cache: 4 MB
- 2 x 2 MB 16-way set associative shared caches
- TDP: 95W
- Lithography: 28nm SHP
- CPU Frequency (not overclocked):
- Max: 3700 MHz (when using all 4 cores)
- Min: 1700 MHz
- GPU Frequency (not overclocked): 720 MHz
- 2 memory communication channels
- 16 GB RAM DDR3 (2 modules) at 2133 MHz
- 2048 MB shared with the integrated GPU in CPU die
If we compare the two CPU indexes here: http://www.cpubenchmark.net/cpu_list.php - we get the following numbers:
- AMD Phenom 1055T: 5059
- AMD A10-7850K: 5568
- AMD Phenom 1055T:
- 45nm to 28nm implies that the 1055T would have 2800 MHz as an equivalent frequency downscaled to 1742 MHz.
- Which times 6, would be equivalent to roughly 10452 MHz cumulative frequency.
- AMD A10-7850K:
- Has 3700 MHz at 28 nm, therefore the 4 cores should result in a rough equivalent of 14800 MHz cumulative frequency.
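The cumulative frequency estimate above can be reproduced with a few lines of Python. This is only a sketch of the same back-of-the-envelope reasoning: the helper name `cumulative_mhz` is made up for illustration, and scaling the clock linearly by the lithography ratio is the crude assumption used in the list, not a real performance model.

```python
# Rough "cumulative frequency" estimate, following the list above.
# Assumption: the clock is scaled linearly by the lithography ratio.

def cumulative_mhz(clock_mhz, cores, node_nm, reference_nm=28):
    """Clock scaled to the reference node (whole MHz), times the core count."""
    equivalent_mhz = int(clock_mhz * reference_nm / node_nm)
    return equivalent_mhz * cores

phenom_1055t = cumulative_mhz(2800, cores=6, node_nm=45)
a10_7850k = cumulative_mhz(3700, cores=4, node_nm=28)

print(phenom_1055t)                         # 10452
print(a10_7850k)                            # 14800
print(round(a10_7850k / phenom_1055t, 2))   # 1.42
```

The 1.42 ratio is well above the roughly 10% gap between the two cpubenchmark.net indexes, which is the discrepancy discussed next.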
So, what gives? Well, if you take a look at the cache L1 and L2 levels, you'll notice that:
- the 1055T has 6 cache regions, one per core;
- the A10-7850K has 2 or 4 cache regions, depending on the type or use scenario, which equates to having 2 cores sharing the same cache region in most situations.
The pros and cons of such a strategy are usually these:
- Pros:
- Fewer transistors needed when sharing integrated components, such as cache.
- Tasks are essentially scheduled and executed simultaneously whenever possible.
- Cons:
- It is nearly impossible for each core to perform its mathematical operations fully on its own.
- 2 cores at 3700MHz have an equivalent 7400MHz cumulative frequency;
- which, boosted by 30% with the duplicated core components, equates to 9620 MHz...
- OK, if we do the opposite math and scale the 1055T estimate by the roughly 10% gap between the benchmark indexes (5568/5059), we get 10452 * 1.10 = 11497 MHz, which means the 4 cores in the A10-7850K deliver around a 55% boost over the base 2 cores.
- Remember how CUDA has all the buzz nowadays of being the go-to compilation methodology for employing NVidia GPUs for number crunching?
- Well, AMD/ATI cards have better number crunching capabilities than NVidia, so much so that Bitcoin mining made AMD/ATI cards more expensive some time ago, because they were the best bang for your buck when it came to bitcoin mining: Non-specialized hardware comparison - Note: this was before ASICs became the commercial miners.
- For example, the AMD Radeon HD 6990 outputs around 800 to 860 Mhash/s, while the NVidia GTX 680 comparatively outputs around 110 to 120 Mhash/s.
- More details: Why a GPU mines faster than a CPU
- GPU at 720 MHz with 512 shaders results in an equivalent cumulative frequency of 512 * 720 = 368640 MHz
- Comparing with the 1055T, this is roughly 35 times more potential mathematical power, but mathematical only.
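The GPU figure uses the same cumulative frequency shortcut. As a quick check of the arithmetic (variable names are illustrative only, and treating a shader as equivalent to a CPU core per MHz is the same rough assumption made above):

```python
shaders = 8 * 64              # 8 GPU stream cores with 64 shaders each
gpu_clock_mhz = 720
gpu_cumulative_mhz = shaders * gpu_clock_mhz

phenom_cumulative_mhz = 10452  # 1055T estimate from the earlier list

print(gpu_cumulative_mhz)                                     # 368640
print(round(gpu_cumulative_mhz / phenom_cumulative_mhz, 1))   # 35.3
```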
But it's not all a sea of roses:
- Today's GPU cards have GDDR5 RAM at 7000MHz.
- The A10-7850K using the maximum stock clock for DDR3 RAM interface, is only at 2133MHz.
- The A10-7850K can only use this feature if an HSA compiler is used, along with source code that has been updated/adapted to use the HSA features.
- It has got a videocard benchmark index of 1005, versus 84 on the integrated Radeon HD 3200 the motherboard for the 1055T had.
- Which means I can play (3 to 5 year old) games with higher fidelity.
- I can try and build HSA support into OpenFOAM and check how it compares to the GPGPU implementations :)
Building OpenFOAM 3.0.x (commit `195caf7479b2`), with ThirdParty-3.0.x already built:
- AMD Phenom 1055T - Ubuntu 12.04 (GCC 4.6.3):
~/OpenFOAM/OpenFOAM-3.0.x$ time ./Allwmake -j > log.make 2>&1
real 50m48.133s user 236m13.122s sys 9m18.259s
- AMD A10-7850K x4 - Ubuntu 15.10 (GCC 5.2.1):
~/OpenFOAM/OpenFOAM-3.0.x$ time ./Allwmake -j > log.make 2>&1
real 63m52.444s user 217m52.936s sys 6m25.872s
There are a few details at hand here:
- GCC 5.2.1 does more optimization and complaining than GCC 4.6.3.
- Nonetheless, even though the A10-7850K took longer to build in real time, it took less time in the user and sys timings, making it more efficient than the 1055T.
- Both builds are slower than the 36 minutes it took to build OpenFOAM 2.0.x back in 2011, but the base code is now considerably larger and more template based, which requires more processing by the compiler to properly compile and optimize things.
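To compare the two builds beyond the wall-clock time, the `time` output can be converted to seconds. The sketch below is an illustration only: the helper names `to_seconds` and `busy_cores` are hypothetical, and reading (user + sys) / real as "average number of busy cores" is a rough interpretation, not something measured directly.

```python
import re

def to_seconds(stamp):
    """Convert a bash `time` stamp such as '50m48.133s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)

def busy_cores(t):
    """CPU time (user + sys) divided by wall-clock time."""
    cpu = to_seconds(t["user"]) + to_seconds(t["sys"])
    return cpu / to_seconds(t["real"])

# Timings copied from the two builds above
builds = {
    "1055T": {"real": "50m48.133s", "user": "236m13.122s", "sys": "9m18.259s"},
    "A10-7850K": {"real": "63m52.444s", "user": "217m52.936s", "sys": "6m25.872s"},
}

for name, t in builds.items():
    cpu = to_seconds(t["user"]) + to_seconds(t["sys"])
    print(f"{name}: CPU time {cpu:.0f} s, busy cores ~{busy_cores(t):.2f}")
```

With these numbers the 1055T kept about 4.8 of its 6 cores busy on average, and the A10-7850K about 3.5 of its 4, while using less total CPU time.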
The next day I tested building OpenFOAM-dev with only 2 cores on the A10-7850K and I got this:
$ time ./Allwmake -j 2 > log.make 2>&1
real 95m43.559s user 177m2.468s sys 5m9.592s
If we do the math, this gets us 96 / 64 = 1.5, which is essentially the estimated ratio, based on the aforementioned reasoning: 2 cores at 3700MHz have an equivalent 7400MHz cumulative frequency, and if we do the opposite math, 10452 * 1.10 = 11497 MHz, which means the 4 cores in the A10-7850K deliver around a 55% boost over the base 2 cores.
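For reference, here is the same comparison with the unrounded build times, as a small sketch (variable names are illustrative; the 1.10 factor is the approximate gap between the two cpubenchmark.net indexes used earlier):

```python
real_2_cores = 95 * 60 + 43.559   # seconds, build with -j 2
real_4_cores = 63 * 60 + 52.444   # seconds, build with all 4 cores

measured_ratio = real_2_cores / real_4_cores

two_core_mhz = 2 * 3700           # 7400 MHz cumulative
four_core_mhz = 10452 * 1.10      # 1055T estimate scaled by the ~10% index gap
estimated_ratio = four_core_mhz / two_core_mhz

print(round(measured_ratio, 2))   # 1.5
print(round(estimated_ratio, 2))  # 1.55
```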
The case used is shared here: https://github.com/wyldckat/wyldckat.github.io/tree/1055T-vs-7850K
- The case folder: cavity300k
- The log files for each run:
Edit the script `DoALap` to set the desired number of cores and then run it like this:
./DoALap
Summary of results:
- 1055T:
- 1 core: 203 s
- 2 cores: 151 s
- 4 cores: 123 s
- 6 cores: 118 s
- A10-7850K:
- 1 core: 117 s
- 2 cores: 82 s
- 4 cores: 78 s
- Where the 1055T 6 cores left off at 118 s, the A10-7850K picked up with a single core at 117 s. Reasons:
- RAM speed matters: 800 MHz vs 2133 MHz equates to nearly 2.7 times faster memory access.
- Although keep in mind that the unbalanced RAM distribution in the machine with the 1055T CPU might also have influenced the result.
- 5 to 10% performance boost can be expected from the differences in GCC versions.
- The 1055T had a very hard time scaling up with the core count. This can easily have been due to a bottleneck from the RAM interface, either due to the 800 MHz speed or the unbalanced memory module pairing.
- The A10-7850K ran 1.4 times faster with 2 cores than with 1 core.
- But using 4 cores was only marginally better, so much so that it's within a plausible margin of error.
- The A10-7850K with 2 cores had roughly 1.4 times the performance of the 1055T with 6 cores.
- Which ironically is essentially the initial estimate made in the previous section Estimated performances.
- Assuming the two machines were operating at top speed with all cores, the performance ratio between them is 118/78 = 1.51
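The scaling behaviour discussed above is easier to see as speedup and parallel efficiency relative to the single-core run. A minimal sketch using the cavity300k timings from the summary (the `scaling` helper is just an illustrative name):

```python
# Run times in seconds, copied from the summary of results
timings = {
    "1055T": {1: 203, 2: 151, 4: 123, 6: 118},
    "A10-7850K": {1: 117, 2: 82, 4: 78},
}

def scaling(times):
    """Return {cores: (speedup, efficiency)} relative to the single-core run."""
    serial = times[1]
    return {n: (serial / t, serial / t / n) for n, t in times.items()}

for cpu, times in timings.items():
    for n, (speedup, efficiency) in scaling(times).items():
        print(f"{cpu} {n} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Neither CPU gets close to a speedup equal to its core count, which is consistent with the suspicion that memory access, and not the cores, is the limiting factor for this case.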
The case used is shared here: https://github.com/wyldckat/wyldckat.github.io/tree/1055T-vs-7850K
- The case folder: cavity1M
- The log files for each run on the A10-7850K
Summary of results:
- A10-7850K:
- 1 core: 318 s
- 2 cores: 223 s
- 4 cores: 220 s
Also tested with 2x2 and 1x4, instead of 4x1 decompositions and the results were the same with the 4 cores, no improvement... well, actually, it was a bit slower, around 230 s.
- For hard-core CFD mathematics, the x86_64 processing units in the A10-7850K act like Intel's HT technology, where only the real cores really count.
- This means that the 2 extra cores simply work as schedulers for mathematical operations, which is why there is a slight performance increase when using all 4 cores.
- In addition, keep in mind that when using only 2 cores, there should be a performance boost from the turbo technology that these CPUs have. According to the page at cpu-world.com for the A10-7850K, 2 cores will operate at 3900MHz, which equates to 3900/3700 = 1.054, so roughly 5% more performance.
- In other words, the 82s with 2 cores should have been 86s, resulting in a speed up of 86/78 = 1.10 when using the 4 cores versus 2 cores at 3700MHz. Which is within the usual HT technology performance boost that can be observed in many situations.
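The turbo correction can be written out as follows. This is a sketch under the assumption that run time scales inversely with clock frequency, which is an idealisation; the variable names are illustrative.

```python
measured_2_cores_s = 82      # cavity300k, 2 cores (turbo presumably active)
measured_4_cores_s = 78      # cavity300k, 4 cores
turbo_mhz, base_mhz = 3900, 3700

# Assumption: run time scales inversely with clock frequency.
corrected_2_cores_s = measured_2_cores_s * turbo_mhz / base_mhz
speedup_4_vs_2 = corrected_2_cores_s / measured_4_cores_s

print(round(corrected_2_cores_s))   # 86
print(round(speedup_4_vs_2, 2))     # 1.11
```

The unrounded value comes out at 1.11 instead of 1.10 only because the 86 s is not rounded before dividing.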
- The RAM was very likely limiting the capabilities on the 1055T processor, therefore further tests should be conducted, although I don't have the hardware to do it myself.
- Using primarily x86_64 instructions, there are two ratios that are well accounted for between the estimates and the runs:
- A10-7850K will compile roughly 50% faster when using the 4 cores instead of just 2. This is on-par with the estimate made based on the performance indexes from cpubenchmark.net and the estimated cumulative frequency.
- The A10-7850K should theoretically be almost 50% faster, not just roughly 10% faster: 14800/10452 ~= 1.42, which was reflected in the performance ratio between them of 118/78 = 1.51 when running `icoFoam` in parallel.
- The discrepancy can be accounted for with the low performing memory installed in the machine with the 1055T CPU.
- Actually, there is another factor that should be accounted for: the number of memory channels. It's possible that only having 2 memory channels in both CPUs is why they are bottlenecking so much after the 2 core count.
- A test needs to be made where only memory copying is done, in order to assess the speed at which it is able to copy memory from one 1GB array to another 1GB array, since this will give us the ratio of performance that affects these CPUs.
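A very rough version of such a memory copy test could look like the sketch below. It is not the test used for any result on this page: the function name `copy_bandwidth_gb_s` is hypothetical, it runs on a single thread (so it shows what one core can copy, not where several cores saturate the memory channels), and it uses only the Python standard library.

```python
import time

def copy_bandwidth_gb_s(size_bytes=1 << 30, repeats=5):
    """Copy one buffer into another and return the best rate in GB/s."""
    src = bytearray(b"\x01") * size_bytes   # non-zero, so the pages are really written
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                        # bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    return size_bytes / best / 1e9

# 64 MiB demo run; use size_bytes=1 << 30 for the 1 GB arrays mentioned above
print(f"{copy_bandwidth_gb_s(size_bytes=64 << 20):.2f} GB/s")
```

Running it with the default arguments allocates two 1 GB buffers, so it needs a bit more than 2 GB of free RAM. Taking the best of several repeats reduces the influence of the first pass, where the destination pages are touched for the first time.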
I've done some tests with AVX versus the standard FPU. The results are available here: https://github.com/wyldckat/avxtest/wiki/A10-7850K-with-DDR3-at-2133-MHz