There was some recent interest in what type of computer would run Fractal eXtreme’s deep zoom calculations fastest (and cheapest) so I wrote a benchmarking tool that isolates the inner calculations for more accurate measurements and arranged to have it run on a few different processors. Now I know how quickly many processors can execute the core block of high-precision math code. The results probably apply to cryptography as well.
Clock speed matters, hyperthreads are less helpful than I thought, and microarchitectural efficiency matters a lot.
Fractal calculations at floating-point precision (up to ~53 bits) aren’t very interesting anymore because they are already very fast. However when you zoom further into the Mandelbrot set and other fractals the computational load increases, roughly as the cube of the precision (up to over 7,000 bits in Fractal eXtreme). It eventually gets annoyingly slow.
As discussed on a previous post on this topic the core high-precision calculation in 64-bit Fractal eXtreme is this same block repeated again and again, with no loop overhead:
mov rax, [rsi+2*PtrSize]
mul QWORD PTR [rsi+8*PtrSize]
add r11, rax
adc r12, rdx
adc r13, 0
This code loads a 64-bit value, multiplies it by another 64-bit value, and then adds the 128-bit result to a 192-bit accumulator (r13:r12:r11). This building block, with different offsets each time, is repeated hundreds of times, with occasional interruptions to write a register to memory when a diagonal is finished. It takes 272 of these five-instruction sequences to square a 1,024 bit number and get a 1,024 bit result.
My test code has a hundred copies of this pasted into a function which it calls in a loop. It starts calling it on one thread at a time, and then gradually increases the number of threads up to the number of logical threads on the machine. The test takes about one second per logical thread, so typically four to twelve seconds on modern machines. Here are some results from an eight core FX-8150 AMD Bulldozer machine:
CPU information:
VendorID = ‘AuthenticAMD’
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU’s rdtsc speed is 4.228 GHz (peak speed may be higher).Running tests on 8 thread system.
Performance with 1 threads is 0.864 GBlocks/sec
Performance with 2 threads is 1.694 GBlocks/sec
Performance with 3 threads is 2.592 GBlocks/sec
Performance with 4 threads is 3.311 GBlocks/sec
Performance with 5 threads is 4.123 GBlocks/sec
Performance with 6 threads is 4.758 GBlocks/sec
Performance with 7 threads is 5.445 GBlocks/sec
Performance with 8 threads is 6.123 GBlocks/sec
Because the test program cycles through different numbers of threads, and because it can easily be used to run lots of tests, it was easy to tease out the results of the microarchitecture, hyperthreads, Turboboost, etc.
Note that these results are not particularly applicable to anything else. And since 32-bit high-precision math is at least four times slower I didn’t even bother measuring that. This is 64-bit only.
Microarchitecture
This one was easiest. By looking at the maximum single-threaded results on a range of different processors, sometimes monitoring their CPU frequencies with CPU-Z, I was able to figure out the throughput for the block of code, in cycles. For CPUs like Bulldozer it was easy – my test machine was locked to 4.2 GHz and the blocks-per-second results divided in quite neatly. It can process a block every five cycles.
The Core 2 processors managed one block every four cycles.
For my Sandybridge laptop it was more difficult because the rdtsc frequency is 2.2 GHz, but the processor can TurboBoost to 2.6 to 3.2 GHz depending on unpredictable factors. My Sandybridge desktop machine is more predictable and that helped to nail down the results at one block every three cycles.
This is really quite amazing. The latency of the Sandybridge 64×64 multiply instruction alone is three cycles, yet it manages to overlap subsequent blocks so well that it completes an entire new block every three cycles. Remarkable.
Here are the results in table form – the lower the “Clocks per block” the better:
| Processor | Clocks per block |
| Intel Sandybridge | 3.0 |
| AMD Opteron | 3.33 |
| Intel Core 2 | 4.0 |
| AMD Bulldozer | 5.0 |
| Intel Atom | 20.0/11.5 |
And now, some analysis of other factors.
Hyperthreads
They don’t help. Much. In this context. Except on Atom.
Hyperthreads are threads of execution running simultaneously on the same core. They share execution resources so if one of them is making full use of a set of resources then running the same code on the other hyperthread won’t increase throughput, because they will be contending for the same finite resources. In ‘normal’ situations they help fill in time wasted waiting on cache misses, they intermingle integer and FPU code, or they fill in gaps in code with long latencies. However deep-zoom fractal code has no cache misses, no FPU code, and no latency gaps that can’t be filled in by the out-of-order engine running multiple blocks in parallel.
On a dedicated fractal machine running nothing else there would be no advantage to hyperthreads. On a real computer, running occasional other random tasks, they help let the fractal calculations fill in the gaps left by other tasks, thus keeping the computer closer to 100% productivity. They don’t increase peak throughput, but they make it easier to keep a computer fully busy, and thereby avoid wasting ~10-25% of CPU time.
The one exception is the Intel Atom. Because this is an in-order machine it is unable to run adjacent blocks in parallel, so it ends up being instruction latency bound. Therefore, as Chad’s results in the comments section show, using two threads on one Atom core gives almost a 75% speedup. The hyperthreading allows pipeline bubbles to filled in that on other processors are filled in by out-of-order instruction execution. The net result, however, is still the worst per-core performance, which is not surprising given Atom’s design goal of low-power draw. An Atom core can process one block every 20 cycles, or one block every 11.5 cycles if two hyperthreads are running one one core.
Lots of cores
They help. If they are real cores with independent execution resources of the type that you need (Bulldozer cores count) then they should give linear performance gains.
Turboboost
Turboboost is complicated because it is very hard to tell what frequency your CPU will actually run at. You can monitor the frequency with various tools and try to correlate that to performance but it will vary. Laptop CPUs appear to be the worst. My laptop CPU has a nominal frequency of 2.2 GHz but can Turboboost up to 3.3 GHz. More commonly it seems to peak, under heavy load, at around 2.9 GHz at first and then drops to 2.6 GHz. This makes benchmarking and estimating very tricky. The nominal and maximum rates don’t matter – only the fully loaded rate matters.
Decoding CPU ID
Making sense of the results requires translating the CPU ID information obtained by the benchmarking program into real CPU model numbers. I did that with help from the people running the tests and various Internet resources. This is what I have. Let me know if it is wrong:
- 0F_03H, 0F_04H, 0F_06H, 06_0EH: NetBurst (Pentium 4) architecture
- 06_0EH: Intel Core Solo and Intel Core Duo
- 06_0FH: 65nm Intel Core microarchitecture (Core 2?)
- 06_17H and 06_1DH: Enhanced Intel Core microarchitecture (Core 2?)
- 06_1AH, 06_1EH, 06_1FH, 06_2EH: Nehalem
- 06_25H, 06_2CH, 06_2FH: Westmere
- 06_0AH, 06_0DH, 06_2AH: Sandybridge
I don’t have results for all of these CPU types. If you have one, with a 64-bit OS installed, then consider running infprecperf.exe. If you send me the results, perhaps with a screenshot of this part of Computer Properties:
that would be fabulous.
Raw Performance
Given that hyperthreads don’t help much (for this task), that number of cores (not logical threads) matters, that CPU frequency matters (allowing for typical TurboBoost frequency) all you have to do is calculate:
#cores * typical maximum frequency / Clocks per block.
It’s important to note that the two-cores-per-module in Bulldozer CPUs count as separate cores for these purposes, but the two hyperthreads per core in Intel CPUs do not. Also, in the list below note that I will list a 2.2 GHz CPU and then show its frequency as 2.6 GHz. This is the difference between nominal frequencies and observed typical Turboboost frequencies.
Here are some specific examples of CPU performance, from highest to lowest.
- 3.2 GHz desktop Sandybridge i7-3930K : 6 * 3.5 GHz / 3 = 7.0 Gblocks/sec
- 4.2 GHz AMD Bulldozer FX-8150: 8 * 4.2 GHz / 5 = 6.72 Gblocks/sec
- 2.2 GHz laptop Sandybridge i7-2720QM: 4 * 2.6 GHz / 3 = 3.47 Gblocks/sec
- 2.67 GHz 65 nm Intel Core Q6700: 4 * 2.67 GHz / 4 = 2.67 Gblocks/sec
- 2.6 GHz AMD Opteron: 2 * 2.6 GHz / 3.33 = 1.56 Gblocks/sec
- 2.4 GHz laptop Core 2 Duo T7700: 2 * 2.4 GHz / 4 = 1.2 Gblocks/sec
- 1.6 GHz Atom 230: 1 * 1.6 GHz / 11.5 = 0.139 Gblocks/sec
In reality I never saw the AMD Bulldozer hit those numbers – it peaked at about 6.12 Gblocks/sec. This may indicate OS overhead, but it looks more like some modest overhead from the shared components of the two-core modules.
The Atom numbers are the aggregate cycles-per-block-per-core assuming two threads per core.
You get the idea. Number of cores times frequency divided by cycles-per-block. AMD does well on the first two, but falls slightly behind on the third and is unable to claim the top spot.
I used to love that two core T7700 laptop. It was about fifteen times faster at deep-zooms than the 32-bit laptop it replaced. Now it just feels slow. My new laptop is almost three times faster, but I still need more cores.
Thanks for the help from the people at fractalforums.
But what about GPUs?
Inevitably people ask whether a GPU would be even faster. In general the answer appears to be no.
GPUs are extremely fast for float, and perhaps even double-precision fractal calculations. However these levels of precision are already so fast on CPUs that speeding them up is not, I think, particularly interesting.
So, the GPUs then need to compete in the high-precision math field, and they are not particularly well designed for that purpose. High-precision multiplies and add-with-carry are not key features of GPUs. I found a site that did some benchmarks of fractal calculations on a few different GPUs. They found that for 128-bit precision the GPUs were up to almost four times faster than the CPUs. That sounds like a victory, but the CPU code was doing 32-bit math (4x slower than 64-bit math), written using loops in C (at least 4x slower than fully unwound assembly language).
So, it seems safe to assume that the GPU would be four times slower than Fractal eXtreme is on the CPU, and therefore not of interest.
Yes, GPUs are continuing to grow faster, but so are CPUs, so I don’t expect things to change in the near future.
Knights Corner (which has an x86 instruction set and add-with-carry and many many cores) has the potential to be very fast at deep-zoom fractals. My back of the envelope calculations suggest that each Knights Corner core should be about 50% faster than a Sandybridge CPU at deep-zoom fractal calculations. This would make a 50 core Knights Corner quite intriguing, if the price is right.
I noticed you made a thumbnail out of my bench file. Neat. For benchmarking purposes, you could manually set the multiplier in BIOS (if the processor is unlocked) or optionally disable turbo mode altogether, to get a stable clock for measuring purposes. This should apply to both Intel and AMD architectures.
I have a couple more observations I made with regards to the AMD Bulldozer platform. The older Phenom II architecture apparently got more work done per clock than the newer Bulldozer. The bench file I procured, rendered in 4:55 on my 4-core Phenom II @ 3.2Ghz. The same file rendered in 2:41 on my newer 8-core Bulldozer system, with the clocks locked in at 4.2Ghz and a new BIOS update to disable CPU throttling. That’s twice as many cores, 1Ghz faster clock speed, and less than twice the performance. 😐 Because of this, I have a strong speculation that the clock cycles per block is lower on the older Phenom II platform than on the newer Bulldozer. I did not get to test the new benchmark tool on the old Phenom II processor since it was released some time after completion of my new build, and to test it now would require me to do a processor swap, which would be a pain. At stock speeds, I predict the 6-core Phenom II x6 1090T would compete on par with the FX-8150 (they both run at a stock non-turbo speed of 3.6Ghz) due to lower latency of the Phenom II platform.
One more aspect of the Bulldozer architecture, there is a very small performance penalty for running two integer threads together on the same module compared to running on separate modules. This likely accounts for the This likely accounts for the 9% difference in performance between the real world 6.123 and theoretical 6.72 values. For floating point code, however, that performance penalty would be significantly greater due to the shared FPU resources.
Been really enjoying reading about this stuff. After reading the earlier article on out-of-order-execution, I had a thought.
What if, instead of a second ADC, you had a SETC and an ordinary ADD from the setc’d register?
Could the SETC coissue with the remaining ADC, getting you closer to the theoretical 2 cycle MUL throughput those docs claim on Sandy Bridge?
I suppose I should try it out. But if you happen to know offhand why it wouldn’t work, that would be interesting too!
Yep, it’s counter-intuitive. But, according to the Agner Fog docs, ADC is two microops, so it seemed like substituting two similar ‘real’ ops wasn’t a loss, as long as it didn’t hit some kind of decode limit.
Now that I’ve tried it (in 32-bit, since I don’t have MASM) I got the same results (no improvement) Thanks for giving it a shot.
Looking at the second timing chart in your “then and now” performance article, I was hoping the change could, in essence, get the ADC in column 8 to happen in column 7, by having the first of the two broken-up ops co-issue with the ADC in column 8. (Same for 11 and 10, which would now be 10 and 9.) The SETC would coissue with a _different_ block’s ADC.
The Agner doc says SETC from a register uses either port0 or port5 and has a reciprocal throughput of 0.5 (so two per clock.) But, the doc isn’t too clear exactly what’s going on in with ADC (it says it has 2 p015 ops, with ‘x’ under all three.) I was hoping that ADC would leave port 0 or 5 open for a SETC to use. Guess not (or there’s some other reason I’m missing.) Wish the doc were a bit clearer in that regard. (While I’m wishing for things, wish Intel would just document this info!)
Something else that’s fun is when I added an XOR eax, eax above the setc (again, 32 bit,) it didn’t slow down at all. Another fun thing I tried was changing the _first_ adc to a setc and two adds, and the code only slowed down a slight bit.
I haven’t really played too much with Sandy Bridge, and I’ll keep playing with it!
Hey, I can’t reply to it either. So I’ve FINALLY got MASM/x64 working under VS2010 Express. Did some playing, and this is the best I came up with:
mov rax, [rsi+ByteOffset+0*PtrSize]
mul QWORD PTR [rsi+ByteOffset+1*PtrSize]
add r9, rax
adc ecx, 0
add r10, rdx
adc ebx, 0
On average, it seems to run about 5% faster than the original code on my Sandy Bridge CPU– so definitely not earth shattering. It unchains the high add from the low, by collecting the carries from low into another register, to be added immediately prior to the final step of the diagonal. [The adcs are 32-bit to avoid the REX prefix, but that isn’t critical.] I’ve probably done all I can here, and this isn’t that great, or likely portable across microarchitectures, but I thought it was fun and you might want to see it.
This is an Intel E6550: http://ark.intel.com/products/30783/Intel-Core2-Duo-Processor-E6550-(4M-Cache-2_33-GHz-1333-MHz-FSB)
VendorID = ‘GenuineIntel’
Stepping = 0xB, model = 0xF, family = 0x6, type = 0
Signature is 0x6FB (06_0FH)
CPU’s rdtsc speed is 2.333 GHz (peak speed may be higher)
Running tests on 2 thread system.
Performance with 1 threads is 0.577 GBlocks/sec
Performance with 2 threads is 1.153 GBlocks/sec
The results are consistent with your old laptop. It’s nice to see wine does not mess up cpu-bound apps.
I’d really be interested in seeing result on a 64-bit capable Atom. I suspect it will suck, but how much?
Here you go:
Intel(R) Atom(TM) CPU 230 @ 1.60GHz
—————————————-
VendorID = ‘GenuineIntel’
Stepping = 0x2, model = 0xC, family = 0x6, type = 0
Signature is 0x6C2 (06_0CH)
CPU’s rdtsc speed is 1.600 GHz (peak speed may be higher).
Running tests on 2 thread system.
Performance with 1 threads is 0.080 GBlocks/sec
Performance with 2 threads is 0.139 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.080 GBlocks/sec
Performance with 2 threads is 0.138 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.080 GBlocks/sec
Performance with 2 threads is 0.138 GBlocks/sec
And also,
Dual Core AMD Opteron(tm) Processor 285 2.59 GHz
—————————————————–
CPU information:
VendorID = ‘AuthenticAMD’
Stepping = 0x2, model = 0x1, family = 0xF, type = 0
Signature is 0xF12 (0F_01H)
CPU’s rdtsc speed is 2.593 GHz (peak speed may be higher).
Running tests on 2 thread system.
Performance with 1 threads is 0.776 GBlocks/sec
Performance with 2 threads is 1.546 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.777 GBlocks/sec
Performance with 2 threads is 1.546 GBlocks/sec
Running tests on 2 thread system.
Performance with 1 threads is 0.776 GBlocks/sec
Performance with 2 threads is 1.541 GBlocks/sec
Yessir– the Atom is a “230”, codename “Diamondville”, from 2008, which is hyperthreaded. The AMD Opteron is a “285”, codename “Italy” from 2006, with two actual cores. (It’s a champ– I used that computer up until last year.) I believe the AMD predates any kind of dynamic clock rate adjustment technology, but I’m not sure.
Have GPUs surpassed CPUs since this was written?