So I just watched a video by WaybackTECH simply entitled 386DX-33 vs Celeron 420 1.6GHz Faceoff and it got me thinking. A lot.
WaybackTECH has a lot of cool videos about old hardware and his repair guides are pretty fascinating. I have been frustrated at times, however, when he can't be bothered to explain some of his findings. But hey, I can't be bothered to benchmark a 386DX-33 against a Celeron 420 so we'll call this a team effort, shall we?
Because I used to teach computer science in schools, I had to understand what I was talking about. If you're one of those teachers who reads a book the night before and, the next day in class, regurgitates what you read, prepare to be exposed as a fraud because those kids will ask you questions. And yes, it's happened to me.
I certainly didn't understand the results of the above video. In summary, a 1.6GHz CPU with its caches disabled shows performance equivalent to a 33MHz 386. Yes, really. He didn't use any 'special' benchmarks, didn't cripple any of the components or cheat in any way - all he did was run 16-bit code in DOS. There was some discussion in the comments about how this could be possible and I started writing my own reply. But it got long. Really long. So I've taken it to the blog instead.
The short version
The long version
Fetch, Decode, Execute
When the IBM PC was first released (it was called the model 5150 back then - no one called it a PC yet), the Intel 8088 CPU, the RAM and the 8-bit ISA bus (how instructions and data travelled between CPU and RAM) all ran at the same speed: 4.77MHz. This continued when the 80286 came along - later Intel versions of this ran at 8MHz and so did the 16-bit ISA bus. In both cases it would take approximately 5 cycles for a single instruction to be completed. So, just because a CPU runs at 8MHz, it doesn't mean it can complete 8 million instructions per second, no. A 10MHz 286 apparently clocks in at 1.5 million instructions per second. The reason for this is that the process of performing a task from start to finish takes time. Let's take the case of a simple addition being carried out:
|The Fetch, Decode, Execute Cycle [source: BBC Education]|
1. Fetch & decode: the CPU retrieves the next instruction from RAM, stores it in a register and decodes it to work out what operation to perform.
2. Data A: it then needs to find out what data to perform the instruction on. For addition we need two pieces of data, each stored in a different location. Each piece of data has a) an address and b) the data itself. On the 8088, the CPU first had to find out the address before it could retrieve the data - that's two cycles. The 286 separated the address and data buses so that these could be retrieved simultaneously. Once retrieved, the data is also stored in a register.
3. Data B: the second piece of data is retrieved and stored in a register.
4. Perform instruction: addition is performed on the data in question.
5. Store new data: the result of the addition is stored back in RAM.
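The five steps above can be sketched as a toy simulation. This is my own illustration (made-up cycle costs that follow the description above, not any real CPU's datasheet timings) of a CPU adding two numbers held in RAM:

```python
# Toy fetch-decode-execute model of "result = A + B" where A and B live
# in RAM. Each step costs one cycle, as on a system where CPU and RAM
# run at the same speed. Cycle costs are illustrative only.

ram = {0x10: 7, 0x11: 5, 0x12: None}  # A at 0x10, B at 0x11, result at 0x12
cycles = 0

# 1. Fetch & decode the instruction from RAM into a register.
cycles += 1
instruction = ("ADD", 0x10, 0x11, 0x12)

# 2. Data A: read the first operand into a register.
cycles += 1
reg_a = ram[instruction[1]]

# 3. Data B: read the second operand into a register.
cycles += 1
reg_b = ram[instruction[2]]

# 4. Execute: perform the addition inside the CPU.
cycles += 1
reg_result = reg_a + reg_b

# 5. Store: write the result back to RAM.
cycles += 1
ram[instruction[3]] = reg_result

print(cycles, ram[0x12])  # 5 cycles for one addition
```

Five cycles for one addition - which is why an 8MHz CPU manages nowhere near 8 million instructions per second.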
This is a very basic description of what happens, but I wanted to explain why an instruction takes 5 cycles instead of 1. Modern CPUs work in much the same way, but with some extra bells and whistles such as branch prediction, parallel processing and pipelining. The thing about many of these modern features is that they were designed from day one to work with fast cache RAM. Which leads us on to...
Latency

The biggest factor to be aware of is latency. When the CPU and RAM run at the same speed, it takes one cycle for the RAM to respond to a request from the CPU. As CPU speeds increased, so did the speed of the bus, but only to a point. On faster 286 systems, such as those running at 20MHz, some expansion cards would not work correctly because they were never designed to work at such speeds. This led to motherboard manufacturers introducing a 'bus divider'. This allowed the CPU to run at its normal speed, while peripherals ran at a slower rate e.g. 10MHz.
Meanwhile, the RAM remained connected directly to the CPU but, because faster RAM was extremely expensive, it continued to work at slower speeds in most consumer desktops. This introduced a problem: if the RAM was running half as fast as the CPU, for example, the CPU would have to wait twice as long for a request for data to be fulfilled. These 'wait states' caused a performance hit also known as a bottleneck.
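As a back-of-the-envelope sketch of that bottleneck (my own model with invented parameters, not measured figures): if every memory access stalls the CPU with wait states, the effective instruction rate collapses toward the RAM's speed:

```python
def effective_mips(cpu_mhz, ram_mhz, cycles_per_instr=5, mem_accesses=3):
    # Illustrative model: each instruction needs `cycles_per_instr` CPU
    # cycles of work plus wait states whenever it touches RAM. One RAM
    # access takes one RAM cycle, i.e. cpu_mhz / ram_mhz CPU cycles,
    # so each access adds (ratio - 1) wait states.
    ratio = cpu_mhz / ram_mhz
    wait_states = (ratio - 1) * mem_accesses
    total_cycles = cycles_per_instr + wait_states
    return cpu_mhz / total_cycles  # million instructions per second

# CPU and RAM at the same speed: no wait states at all.
print(effective_mips(8, 8))    # 1.6 MIPS from an 8MHz 286-class system

# CPU four times faster than RAM: most cycles are spent waiting.
print(effective_mips(40, 10))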
|DRAM [source: Wikimedia]|
Note: a 386DX with cache is significantly faster than a 386SX without cache for two reasons. 1) The SX had to wait for the RAM every time it requested some data and 2) the SX had a 16-bit external bus, meaning it took two cycles to retrieve 32-bit data or instructions from RAM instead of 1 cycle for the fully 32-bit DX.
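The second half of that note boils down to simple arithmetic (a sketch - real chips had further differences, such as prefetch queues):

```python
def fetch_cycles(data_bits, bus_bits):
    # Fetching `data_bits` over a `bus_bits`-wide bus takes one bus
    # cycle per bus-width chunk (ceiling division).
    return -(-data_bits // bus_bits)

print(fetch_cycles(32, 32))  # 386DX: a 32-bit value arrives in 1 cycle
print(fetch_cycles(32, 16))  # 386SX: the same value takes 2 cycles
```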
Moving on, we now skip forwards to more modern CPUs. Over time the speed gap between CPUs and RAM has widened significantly - RAM manufacturers focused on capacity, CPU manufacturers focused on speed and efficiency. Proportional increases in cache size and cleverer CPUs have bridged this gap somewhat, but it is still clock cycles and wait states that are the biggest factors in defining how quickly a system can perform when cache isn't there to lubricate things. Remove the cache, and you effectively negate the GHz advantage the CPU has over the RAM because it's sitting there twiddling its thumbs for most of the time. A 386 CPU takes maybe 1 or 2 cycles to access memory because it runs at a similar speed to the RAM. That's not much twiddling at all. The difference between CPU speed and RAM speed is the single biggest factor affecting performance, and this difference is pretty big in newer systems. Let's look at a 2GHz Pentium 4:
- The 2GHz CPU has a cycle time of 0.5ns.
- Its RAM runs at 400MHz, so the latency is 2.5ns.
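Spelling that arithmetic out (using the nanosecond figures given above; real DDR memory latencies are more complicated than a single clock period, so treat this as a sketch):

```python
cpu_hz = 2_000_000_000   # 2GHz Pentium 4
ram_hz = 400_000_000     # 400MHz memory bus

cpu_cycle_ns = 1e9 / cpu_hz   # 0.5ns per CPU cycle
ram_cycle_ns = 1e9 / ram_hz   # 2.5ns per RAM cycle

# How many CPU cycles tick by while waiting out one RAM cycle?
stall_cycles = ram_cycle_ns / cpu_cycle_ns
print(cpu_cycle_ns, ram_cycle_ns, stall_cycles)  # 0.5 2.5 5.0
```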
2.5ns is the equivalent of 5 cycles, and that's how long the CPU has to wait between memory accesses, so it is effectively running at 400MHz too. It's only caching that prevents every memory access from taking this many cycles, allowing the full speed of the CPU to be harnessed. But a 400MHz CPU should still kick a 386's butt, right? There's something else...
CISC & RISC
HighTreason alluded to the other factor in his recent NexGen video: CISC and RISC. He declined the opportunity to explain the difference between the two so, again, I'll do a short explanation (although this entire article is turning into a long explanation!).
Every CPU has what's called an 'instruction set'. The original Intel 386, the granddaddy of all modern CPUs, was a CISC or Complex Instruction Set Computing CPU. As most people know, a CPU is made up of millions of transistors. The way these are wired together allows it to perform a range of tasks or instructions. Let's take addition again. To add two 1-bit numbers and carry a remainder you need as many as 18 transistors. Multiply this by 64 to be able to add two 64-bit numbers and that's quite a lot. Now, if you want to perform multiplication you can either use the adder logic to perform addition many times until you get your result, or you can build physical logic that will perform multiplication in one go. The latter approach is more efficient and faster, but it also adds to the complexity and the cost of the CPU. Currently, the x86-64 instruction set stands at over 2,000 instructions, according to Stefan Heule at least. If basic addition of 64-bit numbers uses over 1,000 transistors, you can see how 1,999 more complex instructions are going to increase this amount proportionally, and maybe exponentially in some cases. Then you have to add transistors for cache and registers. More transistors increase the physical footprint of the CPU die, the amount of electrical power required, and the amount of heat produced.
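To illustrate the trade-off in software, here's a sketch (my own toy code, not real circuit design) of both approaches: addition built from chained 1-bit full adders, and multiplication done by looping over that adder - the cheap-in-transistors, slow-in-cycles option that dedicated multiplier hardware avoids:

```python
def full_adder(a, b, carry_in):
    # One-bit full adder - in hardware, a handful of gates' worth of
    # transistors per bit.
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def add_nbit(x, y, bits=64):
    # Ripple-carry adder: one full_adder per bit, chained by the carry,
    # so 64-bit addition replicates the 1-bit logic 64 times.
    result, carry = 0, 0
    for i in range(bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

def multiply_by_adding(x, y):
    # Reuse the adder many times instead of building multiplier logic.
    acc = 0
    for _ in range(y):
        acc = add_nbit(acc, x)
    return acc

print(add_nbit(20, 22))          # 42
print(multiply_by_adding(6, 7))  # 42
```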
RISC (Reduced Instruction Set Computing) addresses some of these issues. This approach to CPU design doesn't reduce the actual number of instructions the CPU is capable of performing, as you might assume. It should technically be called Less Complex Logic Instruction Set Computing, because the physical transistor logic is simplified and therefore it takes less time to complete a given instruction (e.g. 1 cycle instead of 5). The direct result is more space on the chip itself for registers (memory), less frequent writes to main memory, and faster performance overall. Again, this is a basic (hopefully not inaccurate) explanation.
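A sketch of the difference in style, with invented cycle counts (not taken from any real CPU manual): a CISC CPU might offer one dense add-to-memory instruction, while a RISC CPU does the same work as three simple instructions that each finish in a single cycle:

```python
# Hypothetical cost of "mem[dst] = mem[dst] + reg" - purely
# illustrative cycle counts.

# CISC style: one instruction doing a read-modify-write on memory.
cisc_program = [("add_mem_reg", 5)]            # complex logic, multi-cycle

# RISC style: three simple one-cycle instructions; values stay in the
# larger register file between steps.
risc_program = [("load", 1), ("add", 1), ("store", 1)]

def total_cycles(program):
    return sum(cost for _, cost in program)

print(total_cycles(cisc_program))  # 5
print(total_cycles(risc_program))  # 3
```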
Up to and including the original Pentium, Intel CPUs and their clones used a pure CISC approach: dense instructions that made the most of available RAM but also required frequent access to main memory. It became clear that the RISC approach was better from an efficiency point of view, especially in a scenario where RAM is significantly slower than the CPU. Starting with the Pentium Pro, Intel developed a RISC processor core that still accepted x86 CISC instructions by translating them into RISC-like micro-operations. Extra cycles are required for this translation, which is fine when you have lots of spare cycles at your disposal. But in a cache-less system that is only running as fast as the RAM, these extra cycles divide the speed even further. I'd like to be able to provide actual proof of this but I have reached the limits of my technical insight. Maybe someone else can, but I'm going to let the evidence speak for itself. I'm just providing an explanation.
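That translation penalty can be sketched the same way (again, invented numbers of my own): every x86 instruction costs extra cycles to crack into micro-operations, which hides easily when the core has cycles to spare but stacks on top of the RAM stalls when it doesn't:

```python
def instr_time_ns(cpu_hz, decode_cycles, exec_cycles, mem_wait_cycles):
    # Time for one instruction = decode + execute + memory stalls,
    # all in CPU cycles. Cycle counts are illustrative only.
    cycle_ns = 1e9 / cpu_hz
    return (decode_cycles + exec_cycles + mem_wait_cycles) * cycle_ns

# With cache: memory stalls are rare, so decode overhead barely shows.
with_cache = instr_time_ns(2_000_000_000, decode_cycles=2,
                           exec_cycles=1, mem_wait_cycles=0)

# Without cache: every access waits ~5 cycles for RAM, and the decode
# cycles stack on top, dividing effective speed even further.
without_cache = instr_time_ns(2_000_000_000, decode_cycles=2,
                              exec_cycles=1, mem_wait_cycles=5)

print(with_cache, without_cache)  # 1.5 4.0
```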
Feel free to comment below if you'd like to add / clarify / critique. Just be nice about it :)
Further reading

The Gap between Processor and Memory Speeds (semanticscholar.org)
What's new in CPUs since the 80s and how does it affect programmers? (danluu.com)
Memory specs for Pentium processors (philrees.co.uk)
Why can't you have both high instructions per cycle and high clock speed? (superuser.com)
Single-Cycle Performance (washington.edu)
CPU cache (wikipedia.org)
Reduced instruction set computing (wikipedia.org)
Complex instruction set computing (wikipedia.org)
A Zero Wait State Secondary Cache for Intel's Pentium (nxp.com)
System Timing (phatcode.net)
Why does Intel hide internal RISC core in their processors? (stackoverflow.com)