Inside the CPU
CPU Basics
The way in which CPUs are marketed these days often revolves around extra bells, knobs and whistles. These include, but aren't limited to, SSE instructions, branch prediction, cache size and Hyper-Threading. None of these things is essential for a processor to work, though; they are just bolt-on parts focused on one thing only: making the processor do more things in shorter amounts of time. And almost all processor research and development is focused on that one, simple thing: making it faster.
Here's a weird thing. Your Athlon 64 FX-55 PC with 2GB of RAM cannot do anything new. It is a fundamental law of digital computers - or more specifically, Turing machines - that they can perform any task that can be achieved by breaking down a problem into the manipulation of symbols, given enough time. There is nothing that a 3.6GHz Pentium 4 can do that a BBC Micro, circa 1982, with 32KB of RAM and a 1MHz processor cannot. It could process a Folding@home work unit, simulate global weather patterns and it could even render a Half-Life 2 frame, although it couldn't display it. The problem is you would grow old and die waiting for it to finish.
And here is another weird thing: CPUs are not, fundamentally, calculating machines. Digital computers are for the most part symbolic manipulators that use digital ones and zeroes to represent things in the real world and manipulate them. To prove this, consider this simple example of a program that you may have written at school:
REPEAT
PRINT "Hello"
UNTIL FALSE
See any maths there? No. In fact, calculations involving numbers are performed in a separate part of the processor, known as the ALU (arithmetic logic unit). This is one of the few parts of a CPU that really is essential.
What you need
A functioning CPU actually needs only a handful of separate logical parts, although these parts require several thousand transistors to implement. At the basic level, you first need a set of registers. A register is simply a very small amount of RAM - between one and four bytes will do. Registers are used to temporarily shift the results of operations around and store them. You also need a special register called a Program Counter (PC). When you reset your computer, the Program Counter defaults to a hard-wired value and the system can start executing instructions from the address in memory pointed to by the PC. However, to execute instructions you also need a unit to fetch instructions from the main memory, an instruction decoder and an execution unit.
Instructions
Remember that a binary computer simply manipulates and stores strings of binary values, which we write down as ones and zeroes. So an instruction is simply an arbitrary number that the designers of a CPU's basic language - its machine language or instruction set - have assigned a meaning to. Programmers construct programs consisting simply of these arbitrary binary values with data to manipulate. The data is also a set of binary digits, but it is distinctly different as it can represent anything from text to music. The only intrinsic meaning instructions have is that both you and the instruction set designer - in the case of Intel and AMD's desktop CPUs the instruction set is known as x86 - have agreed that certain instructions (called operators) do certain things to the data they are associated with (called operands).
A part of the CPU called the instruction fetcher gets the next instruction from the main memory - pointed to by the Program Counter - and pops it into the instruction decoder. The decoder then compares the operator with a hard-wired internal list of instructions and decides what it is. This is then passed to the execution unit to process. The instruction might, for example, be to 'add the operand value I'm passing you to the value already held in register A, and then store the result in register B'. This occurs, and the next instruction can then be carried out.
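The fetch, decode and execute steps described above can be sketched in a few lines of Python. This is only a toy: the three opcodes and their numeric values are made up for illustration and have nothing to do with the real x86 instruction set, and the registers are limited to one byte each.

```python
# Toy CPU: a hypothetical three-instruction machine, NOT x86.
LOAD_A, ADD_TO_B, HALT = 0x01, 0x02, 0xFF  # made-up opcode values


def run(memory):
    """Fetch, decode and execute instructions until HALT; return the registers."""
    registers = {"A": 0, "B": 0}
    pc = 0  # Program Counter: starts at a fixed (hard-wired) address
    while True:
        opcode = memory[pc]        # fetch the instruction the PC points to
        operand = memory[pc + 1]   # fetch the data that goes with it
        pc += 2                    # point the PC at the next instruction
        if opcode == LOAD_A:       # decode: compare against the known list
            registers["A"] = operand
        elif opcode == ADD_TO_B:   # execute: A + operand, result stored in B
            registers["B"] = (registers["A"] + operand) & 0xFF
        elif opcode == HALT:
            return registers
        else:
            raise ValueError(f"unknown opcode {opcode:#04x}")


# Instructions and data are both just numbers sitting in memory.
program = [LOAD_A, 5, ADD_TO_B, 3, HALT, 0]
result = run(program)
print(result)
```

Note that the program is nothing more than a list of numbers; the only thing that makes 0x02 mean 'add' is that the decoder and the programmer agree on it.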
It is amazing that, out of these extremely simple instructions, things such as Half-Life 2 emerge. Modern PCs are built on layer upon layer of added complexity, but it all starts with extremely simple manipulation of binary values. The values are simply abstracted and manipulated to produce what you see on your screen or hear through the speakers.
Adding layers
So, you do not, in theory, need very much to produce a binary computer capable of performing any operation. But to get those operations to run at a reasonable pace - and make Doom 3 possible - you need some bolt-on extras. As far as instructions go, this started with the MMX (MultiMedia eXtensions) instruction set enhancements, which were added to the original Pentium in 1997. MMX added an easy way to do common integer 3D game operations such as matrix maths, with specific instructions for the purpose. AMD then came up with its 3DNow! extensions. These were different from MMX in that they were optimised for floating-point rather than integer calculations, making them more of a complementary tool to 3D hardware accelerators.
Intel quickly responded, though, and the idea has now been extended with SSE, SSE2 and SSE3. SSE is a brilliantly convoluted abbreviation within an abbreviation. It stands for Streaming SIMD Extensions, and SIMD in turn stands for Single Instruction, Multiple Data. Confused? Don't be. Single Instruction Multiple Data simply means that the programmer - or the compiler software that's optimised to take advantage of SSE - can pass the CPU a very long instruction 'word', with just the one instruction but lots of data, which is then manipulated all at once by hardwired logic on the processor. Incidentally, a 'word' in this context is simply a string of bytes that are treated as one value. SSE is great for repetitive transformation and encoding tasks, and it often replaces the FPU (floating-point unit) for high-precision maths.
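To make the SIMD idea concrete, here is a rough Python model that counts how many 'instructions' are needed to add two lists of numbers. It assumes four values per instruction, which matches four 32-bit values in a 128-bit SSE register; the function names and the counting scheme are inventions for this sketch, and real SSE code would be written with compiler intrinsics or assembly rather than Python.

```python
def scalar_add(xs, ys):
    """Add pairwise, one 'instruction' per pair of values."""
    results = [x + y for x, y in zip(xs, ys)]
    return results, len(results)


def simd_add(xs, ys, lanes=4):
    """Add pairwise, one 'instruction' per group of `lanes` pairs."""
    results, instructions = [], 0
    for start in range(0, len(xs), lanes):
        chunk_x = xs[start:start + lanes]
        chunk_y = ys[start:start + lanes]
        # One instruction, multiple data: the whole chunk is added at once.
        results.extend(x + y for x, y in zip(chunk_x, chunk_y))
        instructions += 1
    return results, instructions


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.5] * 8
scalar_result, scalar_count = scalar_add(xs, ys)
simd_result, simd_count = simd_add(xs, ys)
print(scalar_count, simd_count)
```

The answers are identical either way; the only difference is how many instructions the CPU has to fetch and decode to get there.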
Integer? Floating-point? Huh?
Ah yes, sorry. All of this stuff sounds incredibly complex until it is explained. Integer calculations (also called operations) are simply those that involve whole numbers. So 1 + 1 is an integer operation, while floating-point operations are those that involve decimal fractions, for instance 1.23 + 1.67, and that's all there is to it.
The reason it is such a big deal in computing terms is that manipulating integer (whole) numbers is far less complex and less time-consuming than manipulating floating-point values. This is why there is now such a huge emphasis on accelerating floating-point operations with things such as the FPU (floating-point unit) and now the SSE units on Athlons and Pentiums.
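One way to see why floating-point is harder work is to look at how the numbers are stored. The sketch below assumes the 32-bit IEEE 754 format (which the article does not name, but which is the common single-precision layout) and uses Python's standard struct module to pull a value apart into its three fields. The helper name float_fields is made up for this example.

```python
import struct


def float_fields(value):
    """Split a 32-bit IEEE 754 float into (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction


# The integer 2 is just a plain binary pattern...
print(format(2, "032b"))
# ...whereas the floating-point 2.0 is stored as three separate fields.
print(float_fields(2.0))
print(float_fields(1.23))
```

An integer add works directly on the bit pattern, while a floating-point add first has to line up the exponents of the two values, then add the fractions, then tidy up the result, which is why it gets dedicated hardware.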
Pipelining and branch prediction
Once again it all comes down to the same goal - making everything faster. But there are two other ways to speed up the CPU. You can increase the rate at which it is clocked - in other words, the rate at which each instruction is decoded, which depends on having fast-switching transistors - or you can increase the number of instructions that are decoded per tick of the clock. Intel and AMD processors have taken different ideological perspectives on this. When it abandoned the Pentium III core in favour of the NetBurst architecture of the Pentium 4, Intel put its eggs in a dodgy basket: NetBurst was optimised for the fastest clock speeds possible and fewer instructions per clock than previously. This means it has fewer transistors than an Athlon 64 - roughly half as many, in fact - which in turn means that the maximum theoretical power dissipation will be reached at a higher clock level than with an Athlon. AMD, on the other hand, does not have the resources for building CPU fabrication plants (also known as fabs) that Intel has, so it has been slower to acquire the technology for producing very fast transistors on a wafer. So, instead, it produced a more transistor-laden CPU with greater parallelism - that is, a greater number of replicated areas of the processor core to enable simultaneous operations - a much larger Level 1 cache (128KB as opposed to 16KB) and around double the number of transistors in total (around 105 million as opposed to 55 million).
There are four basic steps to processing a CPU instruction. The instruction first needs to be fetched from main memory or, more likely, from the on-die CPU cache (which we will come to later). The instruction, which is simply a binary value, then needs to be decoded to see what it is and what needs to be done with it. Next, it has to be executed to produce a result, and finally the result must be stored back to the appropriate location in the memory so the program can continue.
In an old-style processor, one instruction would be allowed to start this process at a time, going through each stage in turn. The result of the instruction would be stored before the next instruction could be fetched from main memory, and only one instruction at a time would be in processing. Modern processors, however, employ a technique known as pipelining, so as soon as one instruction has been moved into the decoding stage, the next is being fetched from memory. So when the first one is moved to the execution stage, the second goes into decoding and a new instruction is pulled from the memory.
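The saving from pipelining is easy to put numbers on. The sketch below uses the four stages named above and assumes, for simplicity, that every stage takes exactly one clock tick and that nothing ever stalls; the function names are made up for this example.

```python
STAGES = ["fetch", "decode", "execute", "store"]


def cycles_unpipelined(n_instructions, n_stages=len(STAGES)):
    """Each instruction goes through every stage before the next may start."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=len(STAGES)):
    """Fill the pipeline once, then one instruction finishes every tick."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def timeline(n_instructions, stages=STAGES):
    """For each clock tick, show which instruction number sits in which stage."""
    rows = []
    for tick in range(cycles_pipelined(n_instructions, len(stages))):
        row = {}
        for position, name in enumerate(stages):
            instruction = tick - position
            if 0 <= instruction < n_instructions:
                row[name] = instruction
        rows.append(row)
    return rows


print(cycles_unpipelined(100), cycles_pipelined(100))
for tick, row in enumerate(timeline(3)):
    print(tick, row)
```

With 100 instructions, the old-style approach needs 400 ticks while the pipelined one needs 103, which is close to a fourfold speed-up without making a single transistor switch any faster.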
Deep pipelines
The problem with pipelining, though, is program branches. Most programs are not a linear set of instructions in memory, and the outcome of one instruction may require the program execution to jump to another place in memory and continue from there. The problem is that if, when this happens, the rest of the pipeline is full of a linear set of instructions that have been pulled from memory in blind sequence, then those instructions are useless, which means the pipeline must be flushed and filled up again. The longer your pipeline, the longer it will take for the new stream of instructions to filter through to the end of the pipeline and start spitting out actual results; until then your program is stalled. If you have a very deep pipeline - as the Pentium 4 does (31 stages for the Prescott core as opposed to ten in the Pentium III) - then the penalty for failing to predict a branch is far more severe than on a CPU with a shallower pipeline, such as an Athlon XP or Athlon 64, which have fewer than half that number of stages.
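Here is a deliberately crude model of that penalty. It assumes one instruction completes per tick once the pipeline is full, and that every mispredicted branch costs a complete refill of the pipeline; real processors are far more subtle than this, and the instruction and misprediction counts are invented purely for illustration. The two pipeline depths are the 31 and ten stages quoted above.

```python
def total_cycles(n_instructions, mispredicted_branches, depth):
    """Clock ticks to run a program on a pipeline `depth` stages long.

    Simplifying assumption: the initial fill and every flush each cost
    (depth - 1) ticks during which no results come out.
    """
    if n_instructions == 0:
        return 0
    refill = depth - 1
    return n_instructions + refill * (1 + mispredicted_branches)


# Hypothetical workload: 1,000 instructions, 20 mispredicted branches.
deep = total_cycles(1000, 20, depth=31)     # Prescott-style deep pipeline
shallow = total_cycles(1000, 20, depth=10)  # Pentium III-style pipeline
print(deep, shallow)
```

Bear in mind that this counts clock ticks, not seconds: the whole point of the deep pipeline is that each tick can be made shorter, so the deep design only loses when the code is branchy enough to eat up that advantage.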
Consequently, the performance of a Pentium 4 relative to an Athlon 64 is highly dependent on the software running on it. Very linear code that does the same thing in a very repetitive sequence for a long time will tend to get the most benefit from a long pipeline. This is why Pentium 4s tend to excel at media encoding. Most software, however, is more 'branchy', and this benefits from the shallower pipeline of the Athlon 64. This is also why a Pentium M (which also has a shallow pipeline) clocked at 2GHz is nearly as fast as a Pentium 4 clocked at 2.8GHz.
Cache for questions
One of the easiest ways, from an engineering perspective, to get increased processor performance is by increasing the amount of local cache memory on the CPU. The principle is simple: when Windows loads up a game or application from the hard drive it pulls that data into the main memory in your motherboard's DIMM sockets. However, as programs are executed, the CPU saves time by preloading chunks of data from the main memory into its cache, and there are various algorithms that the cache management system can use to try to predict which areas of main memory to preload.
The ideal is that whenever the execution unit wants to access programs or data in the main memory, the cache already has it available. The cache can then circumvent the main memory - which is relatively slow - to give the relevant bytes to the execution unit much quicker. You might think that the latest DDR2 modules, clocked at 266MHz (effectively 533MHz), are fast, but the cache on both the latest Athlon 64 and Pentium 4 chips runs at the full speed of the chip - that is, up to 3.6GHz - so the cache is enormously faster, and there is virtually no overhead with actually accessing and arbitrating the transfer either. The Athlon 64, with its directly CPU-connected main memory, is better in this regard than the Pentium 4's old-style front-side bus system - one more reason that Athlon 64s are faster than Pentium 4s in so many tasks - but cache is always faster.
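A small simulation shows why the cache usually does have what the execution unit wants. The sketch below models the simplest kind of cache, a direct-mapped one, in which each block of main memory can live in exactly one cache line. The sizes (64 lines of 16 bytes) and the two access patterns are arbitrary choices for this example and are not the parameters of any real Athlon or Pentium.

```python
def hit_rate(addresses, n_lines=64, line_size=16):
    """Simulate a direct-mapped cache; return the fraction of accesses that hit."""
    lines = [None] * n_lines       # which memory block each cache line holds
    hits = 0
    for address in addresses:
        block = address // line_size  # the chunk of main memory it belongs to
        index = block % n_lines       # the one cache line that block may use
        if lines[index] == block:
            hits += 1                 # already in cache: fast path
        else:
            lines[index] = block      # miss: pull the whole block from memory
    return hits / len(addresses)


# Walking through memory byte by byte, as when reading a buffer in order.
sequential = list(range(4096))
# Jumping around memory in large strides, so neighbours are never reused.
scattered = [(i * 4099) % 1_000_003 for i in range(4096)]

sequential_rate = hit_rate(sequential)
scattered_rate = hit_rate(scattered)
print(sequential_rate, scattered_rate)
```

The sequential walk misses only on the first byte of each 16-byte block and hits on the other 15, because one trip to main memory brings in the neighbours as well. The scattered pattern gets almost no benefit, which is why how a program lays out and walks through its data matters so much.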
Levelling things out
But why does cache come in different levels? The reason is simple. Level 1 cache is the closest to the actual CPU in terms of efficiency of access and minimal overhead: the smaller the cache, the less time is spent on addressing overheads and the quicker you can get to what you want. So temporary results from CPU operations tend to get written out to the Level 1 cache so they can be pushed back in again extremely quickly when they are required. For this reason, you do not want too much Level 1 cache, but you do want a fast repository for larger amounts of data than the Level 1 cache can store. This is why cache is tiered into Levels 1, 2 and, with the Pentium 4 Extreme Edition and Xeons, Level 3. The further down the cascade the cache is, the less performance penalty you get from large amounts of it, and the further you isolate the relatively slow main memory. It is also easier to integrate Level 2 and 3 cache into an existing CPU core design without too much redesign work. It is all about getting an efficient compromise.
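The compromise can be expressed as an average access time: every access pays the Level 1 lookup, the fraction that misses then pays the Level 2 lookup, and only what misses there pays for the trip to main memory. The sketch below works that out; all the timings and miss rates are invented round numbers chosen to make the arithmetic easy to follow, not measurements of any real chip.

```python
def average_access_time(levels, memory_time):
    """Average ticks per memory access.

    `levels` is a list of (lookup_time, miss_rate) pairs, Level 1 first.
    Works backwards from main memory: each level adds its own lookup time
    plus its miss rate multiplied by the cost of going one step further out.
    """
    time = memory_time
    for lookup_time, miss_rate in reversed(levels):
        time = lookup_time + miss_rate * time
    return time


# Hypothetical figures, in clock ticks: NOT real Athlon or Pentium numbers.
MEMORY = 200
L1 = (3, 0.10)   # small and quick, misses 10 per cent of the time
L2 = (15, 0.20)  # bigger and slower, misses 20 per cent of what reaches it

l1_only = average_access_time([L1], MEMORY)
l1_and_l2 = average_access_time([L1, L2], MEMORY)
print(l1_only, l1_and_l2)
```

With these made-up figures, adding the second level cuts the average from about 23 ticks to about 8.5, even though Level 2 is five times slower than Level 1, because it catches most of the accesses that would otherwise go all the way out to main memory.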
Quick, quick, sloooow
If you are very, very old, you may remember that in the days of Windows 3.1, circa 1994, there used to be two types of Intel 486 processor: the 486DX and 486SX. In fact, both types of chips started out life as the 486DX. To maintain the prestige of the 486DX, a certain proportion of them had their maths co-processors - specialised ALUs for the purpose of floating-point calculations - burnt out, and these were sold as the cheaper 486SX. This type of marketing shenanigan is the reason often given for the initial high price of a new processor, which then ends up being sold for perhaps a sixth of its initial price a year down the line.
For the very fastest parts this is true, but there is another, non-marketing factor as well. During the lifetime of a CPU, the yield (the percentage of CPU dies that are defect-free when tested after manufacture) increases. This is down to refinements in the fabbing process, which is continually improved and researched as manufacturers strive towards the next generation. Initially, only a certain number of parts will be clockable to a certain speed: these get packaged up and sold at a premium. Others will only clock to perhaps 80 per cent of the maximum, while others will be completely dead. Throughout the life of the part, that curve improves to the point where most of the parts can be clocked to a reasonable level and more can be pushed higher. Hence there are more of the faster chips available and they are cheaper to produce. The chip manufacturers aren't entirely cynical, and they do have their $2 billion fabs to pay for.
Athlon 64s are faster because they are 64-bit, right?
You probably already know that 64-bit processing means virtually nothing to the current computer world, but in case you do not, let's dispel the many myths about 64-bit processors once and for all. AMD's Athlon 64 and FX processors are certainly damn fast, and in most cases they are faster than the equivalent Pentium 4 part. But that has nothing, zilch, nada, to do with them being 64-bit processors. Nothing whatsoever. Sod all. A 64-bit processor needs a 64-bit operating system and 64-bit-aware applications to make use of its 64-bitness. AMD has actually pulled off one of the best marketing coups in history with the Athlon 64, implying when it was launched - but not stating explicitly - that we would soon all have 64-bit Windows running on our desktops. But, except for a few beta testers of course, none of us actually has this. What we do all have, though, is a 64-bit capable processor that's very well designed, with its integrated memory interface and HyperTransport peripheral connection, and runs bloody fast in 32-bit mode, which is just what it is doing. Having 64 bits has nothing whatsoever to do with it.
So what is the difference between a 64-bit processor and a 32-bit one? The answer depends on your point of view. In the old days it was simple, and came down to the width of the data bus of your CPU, along which data and instructions were passed from main memory to the CPU and back again. Old computers such as the Sinclair ZX Spectrum, BBC Micro and Commodore 64 were 8-bit systems. They passed their chunks of data around on their data buses in blocks of 8 bits. But if you do some simple binary maths, you find that 8 bits - also known as a byte - can only represent numbers up to 255. In case you have not noticed, it is pretty common to want to use numbers bigger than that. Every time you do, an 8-bit system has to start shuffling the numbers around in multiple blocks, doing bits of calculations in the ALU and then reassembling the parts. This is inefficient and makes 8-bit computers very slow.
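The shuffling looks something like this. The sketch below adds two 16-bit numbers while only ever adding 8 bits at a time, carrying the overflow from the low byte into the high byte, which is roughly what an 8-bit CPU has to do with its add-with-carry instructions. The function name is made up for this example.

```python
def add_16bit_on_8bit(x, y):
    """Add two 16-bit numbers using only byte-sized additions and a carry."""
    # Chop each number into a low byte and a high byte.
    x_lo, x_hi = x & 0xFF, (x >> 8) & 0xFF
    y_lo, y_hi = y & 0xFF, (y >> 8) & 0xFF

    # First pass through the ALU: add the low bytes.
    lo = x_lo + y_lo
    carry = lo >> 8  # 1 if the sum went past 255, otherwise 0
    lo &= 0xFF

    # Second pass: add the high bytes plus the carry from the first pass.
    hi = (x_hi + y_hi + carry) & 0xFF

    # Reassemble the two halves into one 16-bit answer.
    return (hi << 8) | lo


print(add_16bit_on_8bit(300, 500))
```

That is two additions, plus the chopping and reassembling, for a sum that a 16-bit CPU does in one go.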
With the likes of the Commodore Amiga, Atari ST and 8086 processor, however, computers moved into the 16-bit era, and their data buses and ALUs were 16 bits wide. Doubling the data bus width enormously increases the highest number you can fiddle with before having to chop it up: you can get up to 65,535 with a 16-bit CPU, and the majority of calculations are suddenly performed much, much faster. Then, with the Intel 386 processor, things went up to 32 bits; it is here you start to hit diminishing returns. With 32 bits you can represent numbers up to 4,294,967,295 - over four billion. This gives you another speed boost over a 16-bit processor, but it is not often that most games or applications need to think about numbers over four billion.
Consequently, a 64-bit processor such as the Athlon 64 - which can directly manipulate numbers up to 18,446,744,073,709,551,615 - does not give anything like the immediate speed increase of going from 16 to 32 bits. In fact - just between us - it makes almost no difference at all. This is because on the odd occasion when you do need a really, really long binary number or value, it is automatically passed off to one of the specialised registers anyway: the SSE registers in a Pentium 4, for instance, are 128 bits wide. So 64-bit processing, unless you are a computing farm doing weather-pattern simulation of the entire globe or a nuclear research facility, is unnecessary.
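The limits quoted for each generation all come from the same piece of binary maths: with n bits, the largest unsigned whole number you can hold is 2 to the power n, minus one. A quick check (the function name is just for this example):

```python
def max_unsigned(bits):
    """Largest whole number that fits in `bits` binary digits."""
    return (1 << bits) - 1


limits = {bits: max_unsigned(bits) for bits in (8, 16, 32, 64)}
for bits, limit in limits.items():
    print(f"{bits:2d}-bit: {limit:,}")
```

Each doubling of the width squares the range, more or less, which is why the jump from 8 to 16 bits was a revelation and the jump from 32 to 64 covers numbers that very little desktop software ever touches.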
Where 64-bit chips do score over 32-bit processors, though, is in the amount of memory they can address, because memory addresses can now be up to 64 bits in length. In other words, where your 32-bit processor can be fitted with a maximum of 4GB of addressable RAM (the maximum number of bytes you can directly represent with a 32-bit number), your 64-bit processor can be fitted with a terabyte of the stuff and see it directly without any messing around with chopping up addresses. This is where 64-bit processors really shine, but it is going to be a while before we need more than 4GB of RAM on the desktop. When we do, the 64-bit Longhorn version of Windows will be ready and waiting to run on our Athlon 64s. Until then, get used to the fact that the 64 in your processor's model name is nothing but posturing. Sorry.
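The memory limits follow from the same arithmetic: every byte needs its own address, so the number of address bits sets the ceiling. The sketch below checks the 4GB figure and, on the assumption that the terabyte figure corresponds to 40 address bits actually wired up on the chip (the article does not say how many are), the terabyte figure too.

```python
GIGABYTE = 1 << 30  # 1,073,741,824 bytes


def addressable_bytes(address_bits):
    """How many distinct byte addresses fit in `address_bits` bits."""
    return 1 << address_bits


limit_32 = addressable_bytes(32) // GIGABYTE
limit_40 = addressable_bytes(40) // GIGABYTE  # assumed width, see above
print(limit_32, "GB")
print(limit_40, "GB, i.e. one terabyte")
```

A full 64-bit address would stretch to 16 billion gigabytes, which is rather more RAM than anyone is going to slot into a motherboard for the foreseeable future.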
Techniques to extend the life of silicon
As you probably know, it is getting harder and more expensive to keep producing faster CPUs, and we are reaching the limits of technology as far as silicon integrated circuits are concerned. The transistor was unveiled in 1948 and engineers have to keep coming up with new tricks to prop up what is now a 56-year-old technology. Low-k ILDs (interlayer dielectrics), Cu interconnects and strained silicon are the latest fabbing buzzwords, but what do they mean?
Strained Silicon
The speed at which an electrical signal can propagate through a processor depends on the switching time of a transistor (which sets its maximum frequency), and the time it takes for the signals to travel between transistors, which is known as the RC (resistance-capacitance) delay.
As you can imagine, though, you need some serious tricks to get a 90nm transistor switching at 3.6GHz (that's 3,600,000,000 times every second). The newest trick, which is essential for 90nm parts, is strained silicon. This is where the silicon of a wafer is literally put under physical strain to stretch the lattice and give the electrons a freer path through the material. This increases the current through the transistors up to the requisite level for high frequency switching.
Low-k ILDs and Cu interconnects
The 90nm transistors on a Pentium 4 wafer are arranged in seven interconnected layers, and these layers need to be kept separate so they only connect together where needed. The layers also need to be close together, but the closer two layers are, the more they tend to interfere with each other by storing charge: this effect is known as capacitance.
It is therefore no use having a fast-switching transistor if your propagation delay is very high. For the fastest chips then, you need low-electrical-resistance interconnects and high-resistance, low-capacitance, layer-separating transistor parts.
In the old days, aluminium was used for interconnects because it is an inert metal that's easy to work with. Its resistance is too high for modern chips, though, so copper (Cu) interconnects are used instead. This is a huge challenge because copper is very reactive and can cross-contaminate other processes in fabbing at the drop of a hat, which in extreme cases means the fab has to be shut down. It is one more factor that adds to the cost and complexity of new processor designs.
Low-k is the complementary side of the RC equation. It simply refers to the fact that the separating layer has a very low electrical capacitance, and this is currently achieved by something called a carbon-doped oxide (CDO) dielectric. The dielectric constant k of CDO is extremely low, which lowers the propagation delay. Focusing on lowering k even further is now one of the biggest challenges to processor design: nothing works better than copper, so better ILDs are the only way to get lower RC.
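For the curious, the two halves of the RC equation can be written out. This is a back-of-the-envelope sketch that treats an interconnect as a simple wire and two neighbouring layers as a parallel-plate capacitor, which is a textbook simplification rather than how chip designers actually model things. The symbols are assumptions for this sketch: rho is the resistivity of the wire metal, l its length, A_wire its cross-sectional area, A_overlap the area over which the two layers face each other, d the thickness of the dielectric between them and epsilon_0 the permittivity of free space.

```latex
\tau = R\,C,
\qquad
R = \frac{\rho\,\ell}{A_{\text{wire}}},
\qquad
C = \frac{k\,\varepsilon_0\,A_{\text{overlap}}}{d}
\quad\Longrightarrow\quad
\tau = \frac{\rho\,\ell}{A_{\text{wire}}} \cdot \frac{k\,\varepsilon_0\,A_{\text{overlap}}}{d}
\;\propto\; \rho\,k
```

For a fixed layout, then, the delay scales with the product of the metal's resistivity and the dielectric constant k: copper attacks the first term, low-k ILDs the second.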
Conclusion
In short, there is a lot more inside that little green square than you would think, but how far the processor goes in the future remains to be seen. It certainly does not look like the clocks are going to go much higher, and it is debatable whether we need them to either. It is also fair to say that it is going to be a good few years before we really see an advantage to owning a 64-bit processor too. But with dual-core processors on the horizon, Moore's law could soon be laid to rest anyway. Look out for a full feature on dual processing next issue.
Author: David Fearon