Wednesday, 21 December 2016

Caches, Latency, RISC & CISC: How Does a 386 Beat a 1.6GHz Celeron?

So I just watched a video by WaybackTECH simply entitled 386DX-33 vs Celeron 420 1.6Ghz Faceoff and it got me thinking. A lot.

WaybackTECH has a lot of cool videos about old hardware and his repair guides are pretty fascinating. I have been frustrated at times, however, when he can't be bothered to explain some of his findings. But hey, I can't be bothered to benchmark a 386DX-33 against a Celeron 420 so we'll call this a team effort, shall we?

I used to teach computer science in schools, which means I have to understand what I'm talking about. If you're one of those teachers who reads a book the night before and, the next day in class, regurgitates what you read, prepare to be exposed as a fraud because those kids will ask you questions. And yes, it's happened to me.

I certainly didn't understand the results of the above video. In summary, a 1.6GHz CPU with its caches disabled shows performance equivalent to a 33MHz 386. Yes, really. He didn't use any 'special' benchmarks, didn't cripple any of the components or cheat in any way - all he did was run 16-bit code in DOS. There was some discussion in the comments about how this could be possible and I started writing my own reply. But it got long. Really long. So I've taken it to the blog instead.

The short version

New CPUs are designed to work with cache, whereas old CPUs, like the 386, were not (although cache could dramatically enhance their performance). It doesn't matter how fast a CPU is if it takes many clock cycles for it to complete an instruction. Example: if it takes 5 cycles for a 16MHz 386 to do a specific task (such as addition) then it can do 3.2 million additions per second. Even though the Celeron's clock is running 100x faster, if it takes 500 cycles to do an addition (without caches), then it will also only be able to do 3.2 million additions per second. It's actually a lot more complicated than this, as you can see below, but this is a basic, crude way of explaining it.
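
For anyone who wants to check the sums, here is a minimal Python sketch of that arithmetic. The function name is my own invention and the cycle counts are the illustrative figures from the example, not measurements of either chip.

```python
def operations_per_second(clock_hz: float, cycles_per_operation: float) -> float:
    """Return how many operations finish each second at a given clock rate."""
    return clock_hz / cycles_per_operation


# Illustrative figures from the example above, not measured values.
i386 = operations_per_second(16e6, 5)        # 16MHz 386, 5 cycles per addition
celeron = operations_per_second(1.6e9, 500)  # 1.6GHz Celeron, 500 cycles per addition

print(f"386:     {i386 / 1e6:.1f} million additions per second")
print(f"Celeron: {celeron / 1e6:.1f} million additions per second")
```

Both lines come out the same: a clock that is 100x faster is cancelled out exactly by an operation that takes 100x more cycles.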

The long version

So why the hell does a fast CPU sit around for so long with its caches disabled? First we need to talk about how caches and CPUs work. There are many articles out there explaining this along with the reasons why CPUs are so dependent on them, but many of these sources are overly technical or dull. I don't claim to be any less so - it's a dry subject, but I'll do my own summary here anyway for those who a) like to read plain English b) are interested in the history of computing and c) can't be bothered to read it elsewhere.

Fetch, Decode, Execute
When the first IBM PC was released (it was called the model 5150 back then - no one called it a PC yet), the Intel 8088 CPU, the RAM and the 8-bit ISA bus (how instructions travelled between CPU and RAM) were the same speed: 4.77MHz. This continued when the 80286 came along - later Intel versions of this ran at 8MHz and so did the 16-bit ISA bus. In both cases it would take approximately 5 cycles for a single instruction to be completed. So, just because a CPU runs at 8MHz, it doesn't mean it can complete 8 million instructions per second. A 10MHz 286 apparently clocks in at 1.5 million instructions per second. The reason for this is that the process of performing a task from start to finish takes time. Let's take the case of a simple addition being carried out:

The Fetch, Decode, Execute Cycle
[source: BBC Education]
1. Retrieve Instruction: the CPU has to access the RAM and find out what instruction it's doing next. This instruction is remembered in a 'register' (a very small area of memory).

2. Data A: It then needs to find out what data to perform the instruction on. For addition we need two pieces of data, both of which will be stored in two different locations. Each piece of data has a) an address and b) the data itself. On the 8088, the CPU first had to find out the address before it could retrieve it - that's two cycles. The 286 separated the address and data buses so that these could be retrieved simultaneously. Once retrieved, the data is also stored in a register.

3. Data B: the second piece of data is retrieved and stored in a register.

4. Perform instruction: addition is performed on the data in question.

5. Store new data: the result of the addition is stored back in RAM.
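
The five steps above can be sketched as a toy simulation in Python. This is only an illustration: the instruction format (a tuple of opcode and three addresses), the memory addresses and the assumption that every step costs exactly one cycle are all made up for the example, and real 8088 or 286 timings are more involved.

```python
def run_add(ram: dict, instruction_address: int) -> int:
    """Step through one ADD instruction and return the number of cycles used.

    Toy model: every RAM access or ALU operation costs exactly one cycle.
    """
    registers = {}
    cycles = 0

    # 1. Retrieve instruction: read it from RAM into a register.
    registers["instruction"] = ram[instruction_address]
    cycles += 1
    opcode, addr_a, addr_b, addr_result = registers["instruction"]
    if opcode != "ADD":
        raise ValueError(f"this toy CPU only understands ADD, not {opcode}")

    # 2. Data A: fetch the first operand into a register.
    registers["a"] = ram[addr_a]
    cycles += 1

    # 3. Data B: fetch the second operand into a register.
    registers["b"] = ram[addr_b]
    cycles += 1

    # 4. Perform instruction: add the two registers.
    registers["result"] = registers["a"] + registers["b"]
    cycles += 1

    # 5. Store new data: write the result back to RAM.
    ram[addr_result] = registers["result"]
    cycles += 1

    return cycles


# Hypothetical memory layout: one instruction at address 0, data at 100-102.
ram = {0: ("ADD", 100, 101, 102), 100: 2, 101: 3, 102: 0}
cycles = run_add(ram, 0)
print(f"result = {ram[102]}, cycles = {cycles}")
```

Four of the five steps touch RAM, which is why the speed of memory matters so much in everything that follows.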

This is a very basic description of what happens, but I wanted to explain why an instruction takes 5 cycles instead of 1. Modern CPUs work very much in the same way, but with some extra bells and whistles such as branch prediction, parallel processing and pipelining. The thing about many of the modern features is that they are designed from day one to work with fast cache RAM. Which leads us on to...

Latency
The biggest factor to be aware of is latency. When the CPU and RAM run at the same speed, it takes one cycle for the RAM to respond to a request from the CPU. As CPU speeds increased, so did the speed of the bus but only to a point. On faster 286 systems, such as those running at 20MHz, some expansion cards would not work correctly because they weren't ever designed to work at such speeds. This led to motherboard manufacturers introducing a 'bus divider'. This allowed the CPU to run at its normal speed, while peripherals ran at a slower rate e.g. 10MHz.

Meanwhile, the RAM remained connected directly to the CPU but, because faster RAM was extremely expensive, it continued to work at slower speeds in most consumer desktops. This introduced a problem: if the RAM was running half as fast as the CPU, for example, the CPU would have to wait twice as long for a request for data to be fulfilled. These 'wait states' caused a performance hit also known as a bottleneck.

DRAM [source: Wikimedia]
The speed of RAM and of the CPU is typically indicated in multiples of Hz (cycles per second). Dividing 1 second by this amount gives us the 'clock period', usually quoted in nanoseconds - this is the length of a single cycle, the smallest unit of time in which the component can do anything. The faster a component in MHz, the less time it takes to do something, and the more it can do per second. In the 286 era, PC system memory was typically DRAM, which had a latency of 120 nanoseconds. This works out at about 8MHz. SRAM from the time had a latency of only 10ns (100MHz) but was very expensive, so manufacturers used it to create a smaller area of memory between the CPU and the RAM - this is the cache. A 386DX motherboard with 64KB of cache made a huge difference to the performance of the system because frequently-used instructions and data remained in the cache, meaning that reads from and writes to the main RAM were less frequent. I'm not going to go into any more detail about the caching process itself because it's not important, especially not for a video where it wasn't used!
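
The conversion between MHz and nanoseconds is easy to get wrong by a factor of a thousand, so here is a small Python sketch of it. The function names are mine; the 120ns and 10ns figures are the ones quoted above, and the 33MHz line is just the 386DX-33 from the video as an extra example.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Length of one cycle in nanoseconds for a given frequency in Hz."""
    return 1e9 / frequency_hz


def equivalent_frequency_mhz(latency_ns: float) -> float:
    """Frequency in MHz whose cycle lasts as long as the given latency."""
    return 1e3 / latency_ns


dram_mhz = equivalent_frequency_mhz(120)  # DRAM latency quoted in the text
sram_mhz = equivalent_frequency_mhz(10)   # SRAM latency quoted in the text
cpu_period = clock_period_ns(33e6)        # a 33MHz CPU such as the 386DX-33

print(f"120ns DRAM is roughly {dram_mhz:.1f}MHz")
print(f"10ns SRAM is roughly {sram_mhz:.0f}MHz")
print(f"A 33MHz CPU has a cycle of about {cpu_period:.1f}ns")
```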

Note: a 386DX with cache is significantly faster than a 386SX without cache for two reasons. 1) The SX had to wait for the RAM every time it requested some data and 2) the SX had a 16-bit external bus, meaning it took two cycles to retrieve 32-bit data or instructions from RAM instead of 1 cycle for the fully 32-bit DX.

Moving on, we now skip forwards to more modern CPUs. Over time the speed gap between CPUs and RAM has widened significantly - RAM manufacturers focused on capacity, CPU manufacturers focused on speed and efficiency. Proportional increases in cache size and cleverer CPUs have bridged this gap somewhat, but it is still clock cycles and wait states that are the biggest factors in defining how quickly a system can perform when cache isn't there to lubricate things. Remove the cache, and you effectively negate the GHz advantage the CPU has over the RAM because it is sitting there twiddling its thumbs for most of the time. A 386 CPU takes maybe 1 or 2 cycles to access memory because it runs at a similar speed to the RAM. That's not much twiddling at all. The difference between CPU speed and RAM speed is the single biggest factor affecting performance and this difference is pretty big in newer systems. Let's look at a 2GHz Pentium 4:

- The 2GHz CPU has a cycle time of 0.5ns.
- Its RAM runs at 400MHz, so the latency is 2.5ns.

2.5ns is the equivalent of 5 CPU cycles and that's how long the CPU has to wait for each memory access, so it is effectively running at 400MHz too. It's only caching that prevents every memory access taking this many cycles and allows the full speed of the CPU to be harnessed. But a 400MHz CPU should still kick a 386's butt, right? There's something else...
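
The Pentium 4 figures above can be reproduced with a few lines of Python. This is a crude sketch under the same simplification the text makes: it treats one RAM clock period as the whole cost of a memory access and assumes the CPU does nothing else while it waits. The function name is my own.

```python
def cpu_cycles_per_memory_access(cpu_hz: float, ram_hz: float) -> float:
    """How many CPU cycles pass while the RAM completes one of its own cycles."""
    cpu_period_ns = 1e9 / cpu_hz
    ram_period_ns = 1e9 / ram_hz
    return ram_period_ns / cpu_period_ns


cpu_hz = 2e9    # 2GHz Pentium 4
ram_hz = 400e6  # 400MHz RAM

wait_cycles = cpu_cycles_per_memory_access(cpu_hz, ram_hz)

# If every instruction has to wait on memory, the CPU gets no more done
# per second than a chip clocked at the RAM's speed.
effective_hz = cpu_hz / wait_cycles

print(f"CPU cycles per memory access: {wait_cycles:.0f}")
print(f"Effective speed without cache: {effective_hz / 1e6:.0f}MHz")
```
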
HighTreason alluded to the other factor in his recent NexGen video: CISC and RISC. He declined the opportunity to explain the difference between the two so, again, I'll do a short explanation (although this entire article is turning into a long explanation!).

Every CPU has what's called an 'instruction set'. The original Intel 386, the granddaddy of all modern CPUs, was a CISC or Complex Instruction Set Computing CPU. As most people know, a CPU is made up of millions of transistors. The order in which these are wired together allows it to perform a range of tasks or instructions. Let's take addition again. To add two 1-bit numbers and produce a carry bit you need as many as 18 transistors. Multiply this by 64 to be able to add two 64-bit numbers and that's quite a lot. Now, if you want to perform multiplication you can either use the adder logic to perform addition many times until you get your result or you can build physical logic that will perform multiplication in one go. The latter approach is more efficient and faster, but it also adds to the complexity and the cost of the CPU. Currently, the x86-64 instruction set stands at over 2,000 instructions, according to Stefan Heule at least. If basic addition of 64-bit numbers uses over 1,000 transistors, you can see how 1,999 more complex instructions are going to proportionally increase this amount, and maybe exponentially in some cases. Then you have to add transistors for cache and registers. More transistors increase the physical footprint of the CPU die, the amount of electrical power required, and the amount of heat produced.
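
To make the adder idea a bit more concrete, here is a rough Python sketch of a 1-bit full adder chained 64 times (a 'ripple-carry' adder, which is one simple way of building one and not necessarily what any real x86 chip uses), plus the slow way of multiplying by adding over and over. The function names are mine, and the 18-transistor figure is simply the one quoted above.

```python
from typing import Tuple


def full_adder(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """Add three single bits and return (sum bit, carry-out bit)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_carry_add(x: int, y: int, width: int = 64) -> int:
    """Add two unsigned numbers one bit at a time, passing the carry along."""
    result = 0
    carry = 0
    for bit in range(width):
        sum_bit, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= sum_bit << bit
    # The final carry is dropped, like overflow in a fixed-width register.
    return result


def multiply_by_repeated_addition(x: int, y: int, width: int = 64) -> int:
    """Multiply the slow way: add x to a running total, y times over."""
    product = 0
    for _ in range(y):
        product = ripple_carry_add(product, x, width)
    return product


TRANSISTORS_PER_FULL_ADDER = 18  # figure quoted in the text
adder_transistors = TRANSISTORS_PER_FULL_ADDER * 64

print(ripple_carry_add(2, 3))
print(multiply_by_repeated_addition(6, 7))
print(f"64-bit adder: about {adder_transistors} transistors")
```

Multiplying 6 by 7 this way takes seven full passes through the adder, which is the trade-off described above: dedicated multiplication logic saves all those passes but costs a lot more transistors.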

RISC (Reduced Instruction Set Computing) addresses some of these issues. This approach to CPU design doesn't reduce the actual number of instructions the CPU is capable of performing like you might assume. It should technically be called Less Complex Logic Instruction Set Computing because the physical transistor logic is simplified and therefore it takes less time to complete a given instruction (e.g. 1 cycle instead of 5). The direct result of this is that it makes more space for registers (memory) on the chip itself, less frequent writes to main memory, and faster performance overall. Again, this is a basic (hopefully not inaccurate) explanation.

Up until the Pentium, Intel CPUs and their clones used a pure CISC approach: dense instructions that made the most of available RAM but also required frequent access to main memory. It became clear that the RISC approach was better from an efficiency point of view, especially in a scenario where RAM is significantly slower than the CPU. Intel developed a RISC processor core that still accepted x86 CISC instructions by translating them into RISC instructions. Extra cycles are required for this, which is fine when you have lots of spare cycles at your disposal. But in a cache-less system that is only running as fast as the RAM, these extra cycles divide the speed even further. I'd like to be able to provide actual proof of this but I have reached the limits of my technical insight. Maybe someone else can but I'm going to let the evidence speak for itself. I'm just providing an explanation.

Feel free to comment below if you'd like to add / clarify / critique. Just be nice about it :)

Monday, 14 November 2016

Amstrad Adlib Clone: Restoration

A Brief Introduction to PC Sound

Up until the Pentium II came out (or thereabouts), desktop PCs did not generally come with a built-in sound card. As they were still mostly considered to be 'business machines', audio was not compulsory by any means and businesses were unwilling to pay the additional cost for this unnecessary hardware. I'll save the history of sound cards for another entry and say, for now, that one obviously had to invest in additional equipment to give their PC sound-producing capabilities back then.

Enter The Adlib

The Adlib sound card was one of the first of these, and enabled a PC to produce synthesised sounds from Yamaha's OPL2 FM chip (the kind of sounds you would have heard from an old portable keyboard at school or something). When we upgraded our first PC with a 'multimedia kit' some time in 1993 or 1994, it came with what I think was probably an Adlib clone. I know it wasn't the real deal because it came with a joystick, which required a 'game port' and the Adlib lacked one of these. I sold this card to someone at my school for £10 a couple of years later when I upgraded to a SoundBlaster AWE32, which allowed digitised sound in addition to wavetable (more realistic) music, so what need did I have for the crappy Adlib then?

Well, twenty years later I'm messing around with old computers again and, for some games, an Adlib card is one of the few ways to enable authentic period-correct audio. One of the reasons for this is that it is an 8-bit card, so it will work in any PC as old as the original IBM PC or XT. For a while I had been missing the authentic sounds of the OPL2 I had grown up with so I was keen on acquiring one. Because these cards are becoming increasingly rare, they are sought after and fetch pretty big prices on eBay. I say big prices: over $360 is pretty big for a 30 year old device that cost $245 when it was new on the market. Its successor, the Adlib Gold, is even more rare and has consistently sold for well over $3,000 in the last year.

Anyway, that's more than I'm willing to pay for the sake of nostalgia, especially when there are other FM synthesis options out there such as cards featuring the YMF718 (aka OPL3-SAx). These cost around $10, have an ISA interface (almost essential for DOS gaming - PCI doesn't play nice) and they give you the Adlib sound, plus digitised sound effects. I have one of these in my 386 desktop and it works fine. The downside is that it's 16-bit, and requires the loading of drivers in order to work. This eats up valuable RAM.

But then...

Seller's picture of the card.
A Clone Appears

I'm a member over at the Vogons forums (Very Old Games On New Systems), which started out with an emphasis on emulation, but now has a huge hardware interest as well. They have a long-running thread where people watch eBay for stuff of interest and, if it's not of interest to the person who found it, they share it with everyone else. Well, someone posted that an Amstrad Adlib clone was up for sale and I got curious. Bidding started at 99p so I thought 'what the hell'. What I actually did was chuck a £25 max bid in and thought nothing more of it. A few days later it turned out I'd won the thing! It was only then I noticed what terrible condition it appeared to be in (which is why it went for £25 rather than the usual £50 a clone would normally sell for). The volume control, game port and gold edges appeared to be corroded or at least rusted from where the card had been stored in damp conditions.

People bad-mouth clones but, when you've got a computer with only 2 ISA slots like I do, you need to save space as much as possible. Having the built-in game port is a bonus as far as I'm concerned, and some people are more interested in functionality rather than collectability.

So, a bit of history on the card. Amstrad began manufacturing 'affordable' IBM PC clones in the '80s following the relative success of the CPC. They were a bit late to the game, only releasing their first model in 1986, which is partly why Amstrad were able to manufacture them so cheaply. Over the next decade or so, they released a number of models intended for home users based on the 8088, 286 and 386, culminating in the Mega-PC. Soon after this they withdrew from the PC market (just as it became popular!).

This particular sound card (model 3100-015P-4) was bundled with the PC5286, which was marketed as a gaming PC. It's likely the system's designers considered including the Adlib in their machines, but the lack of a game port would have required either including one on their motherboard design or taking up an additional ISA slot with a controller card of some kind. On balance they must have decided it made economic sense to manufacture their own YM3812-based card, with a game port included.

So it was only when the card arrived that I fully realised what bad condition it was in:

Corrosion around the ISA contacts
Rusted volume dial, plus various discolouration elsewhere

Rust around the game port: at best, unattractive; at worst, indicative of damage

I'd been reading a lot about cleaning up old equipment and had picked up a few tricks, so I got straight to work.

Step one: remove the metal items (the bracket / volume control) and the capacitors:

The silkscreen indicated polarity and gave each component a number, so I wrote down each corresponding rating for each capacitor. The bracket and game port shield came off easily and I submerged them in vinegar. The potentiometer was tricky because I don't have a soldering station with vacuum, only a solder sucker, so I had to spend time doing this bit by bit without scorching the PCB. This is how it looked once I removed it:

Step two: I used vinegar again (no isopropyl alcohol here) and some cotton buds to swab the board, contacts, etc. I left some vinegar on the rusted pot contacts for a few minutes as well. I then rinsed the board and dried it. I used washing up liquid and a scouring pad to clean up the gold contacts and the rusty spots and this technique worked quite well:

So the 'corrosion' wasn't corrosion at all, just deposits on the surface of the board. Here's the board in its rejuvenated state:

And here's my attempt at a composite before / after shot:

Step three: I removed the metal items from the vinegar, rinsed and dried. Most of the actual rust had come off but I used the scouring pad additionally to get off as much of the actual corrosion as I could, as it had darkened the metal.

Step four: resoldered the pot, which was looking much better than before. Looked through my collection of components and, luckily, had replacements for all the caps. This isn't a high-load card or one that will get hot so I wasn't bothered about using solid caps, just reliable ones. Soldered them all back in, then replaced the shield and bracket:

How the back of the card looks following the elbow grease

And a final overview of the card. Looks literally good as new. I had a big grin on my face.

Step five: testing!! The moment of truth. I dug out a 486 motherboard. At first it seemed to clash with the video card as I got no signal but, once I'd moved it to another slot, it worked fine! No popping caps, no smoking, no unpleasant noises from the speakers. It did seem to be picking up quite a bit of noise from the hard drive I/O though (I have seen this mentioned on the web). Fired up Wolfenstein 3D and saw that magic indicator!

At first I had no audio coming out of the speakers whatsoever and I was a bit disappointed (but not surprised). However, I realised that audio had merely been disabled in the options. The reward for my hard work: an Adlib clone for £25, and pure, unadulterated OPL2 audio!!!

And this is the moment that makes all this nostalgia stuff seem worthwhile: I felt like a teenager again. For about half an hour I played Wolf 3D and it was like the first time. It's a difficult feeling to explain if you weren't there at the time :)

Postscript: a few weeks later the board stopped working. I'm currently looking into reverse-engineering the card and making a replacement: it will be a clone of a clone!

Monday, 8 February 2016

YouTube Channel Up & Running

A very quick note to link my YouTube channel which has a grand total of


videos on it. More to follow. And blog entries to back up each video / add information, plus articles on various things.