DCPU-16 Review

I’ve always liked the idea of building complex logic systems out of a simple primitive that is just powerful enough to construct all logic - particularly in videogames. For example, in LittleBigPlanet, you can build a Tic-Tac-Toe AI out of physical elements like pistons and magnetic switches. In Minecraft, you can build an ALU out of a primitive construction element that behaves, essentially, as a NOT gate. And, if games aren’t your thing, you can build CMOS logic out of UNIX pipes with a simple “mosfet” program.

Just a few days ago, Notch, the creator of Minecraft, revealed a new game, 0x10c. Instead of giving players a simple logic element, it will include a full-blown 16-bit CPU that can be programmed. I find this intriguing, because it allows for much more complex development, yet it doesn’t step right into the boring world of “let’s just throw in a Lua scripting engine and call it a day”. You’re still limited by emulation speed and by memory constraints.

However, after checking out the preliminary spec for the CPU, I was a bit disappointed. While the general idea is good, the design suffers from flaws that will make it a lot more painful to use than it should be. Specifically, with 128kB of memory, you’d expect to be able to program it in C or some other high-level language. Unfortunately, the design is rather hostile to a proper C compiler targeting it. The situation reminds me of PIC16, for which I refuse to use C compilers because they all suck (and I only rarely use them for PIC18, for that matter), only this is a nicer 16-bit architecture. Things shouldn’t need to be this way.

These are some of the design issues that I found with the architecture:

Memory Layout

DCPU-16 is, as its name implies, a 16-bit CPU. However, what isn’t so obvious at first glance is that it’s exclusively 16-bit: memory is addressed word-wise, and there are no 8-bit loads or stores.

Why is this a problem? After all, you can still emulate 8-bit loads and stores by using masking and shifting. Ah, but how do you address bytes? To address a byte, you need 17 bits: 16 bits for the word address and one bit for the byte. This doesn’t fit in one 16-bit word. For a C compiler, that means that char * needs to be effectively two words (32 bits) wide. And since any pointer can be cast to and from a char *, that basically means that all pointers have to be 32 bits wide. This complicates pointer arithmetic and generates a lot of boilerplate code every time there’s a pointer access, for little benefit. The read-modify-write masking required for the 8-bit writes is also wasteful.
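
As a rough sketch of what this means in practice, here is what a plain 8-bit store could compile to, assuming the compiler has already reduced the char * to a word address (held in I here) and knows the target is the low byte; the register choice is purely illustrative:

SET B, [I]     ; fetch the whole 16-bit word
AND B, 0xff00  ; clear the low byte
AND A, 0x00ff  ; make sure the new value really is 8 bits
BOR B, A       ; merge the byte into the word
SET [I], B     ; write the word back

Five instructions and an extra memory read, where a byte-addressed machine needs a single store.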

There’s an alternative: make char 16 bits wide, with CHAR_BIT == 16. That’s allowed by the C standard, but I’ll let you guess how much code out there would break with 16-bit bytes.

Registers

Eight general registers are the bare minimum, and they’re on the low side. Just look at x86: compilers have struggled for years with so few registers, and they end up using the stack as extra register space (modern x86 CPUs include a lot of logic to speed up stack accesses for this reason). Calling conventions barely have any breathing room to use registers for argument passing (and several such conventions have been developed as a result). Now, this wouldn’t be so terrible, but another problem makes it worse: DCPU-16’s stack pointer is a special register that supports fewer operations than the regular registers. It only supports push, pop and peek, not offset addressing, which means spilling data to the stack is much more painful: there’s no random access into the stack frame. In practice, that means compilers will set up a copy of the stack pointer in the function prologue to act as a frame pointer, and now you’re left with seven registers.
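
To make that concrete, here is a sketch of the kind of prologue and epilogue a compiler might emit, using J as the frame pointer; the register choice and frame layout are purely illustrative, not any established ABI:

SET PUSH, J        ; save the caller's frame pointer
SET J, SP          ; J now doubles as the frame pointer
SUB SP, 0x2        ; reserve two words of locals below J
SET A, [0x2+J]     ; load the first stack argument (above the return address)
SET [0xffff+J], A  ; spill it to the local at J-1 (the offset wraps mod 0x10000)
SET SP, J          ; epilogue: release the locals
SET J, POP         ; restore the caller's frame pointer
SET PC, POP        ; return

J is tied up for the whole function, and every frame access burns an extra word (and cycle) on the offset.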

Eight 16-bit registers also leave essentially no room for 64-bit operations: a single 64-bit addition already consumes every register.

Operands

Notch’s design is quite orthogonal here: either operand of any instruction can be a register, a memory location, or a constant. Wait, a constant? Yes, the encoding lets you specify a constant as the destination; writes to it are silently ignored. Let’s see what the encoding overhead of that is: constants are only useful as the a operand for the IF?? set of instructions, since every other instruction writes back to operand a. Of the IF?? instructions, IFE and IFN are commutative, so using a constant as a doesn’t gain you anything over specifying it in operand b. That leaves two instructions, IFG and IFB, where this is useful, or 2/16 = 12.5% of the instruction set. What is the overhead? In the operand encoding, half of the values (all of those with the top bit set, 0x20..0x3f) are constants. That’s one bit, or half of the instruction encoding space. Since only two instructions take advantage of this, 50% * (100% - 12.5%) = 43.75% of the possible 16-bit instruction words are useless just because of this. That’s a lot of wasted space!
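
For illustration, this is roughly the one kind of situation where a short literal in the a slot buys you anything (the branch target is a hypothetical label):

IFG 0x10, A          ; run the next instruction only if 0x10 > A, i.e. A < 0x10
SET PC, a_below_0x10

With IFE or IFN you could just as well have put the literal in operand b.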

Signed Arithmetic

If you search the DCPU-16 spec for “signed”, the only hit is “16 bit unsigned words” at the top. That’s your only clue as to the fact that it only supports unsigned arithmetic. Implementing signed arithmetic in a purely unsigned CPU is very annoying - and yet we all know that int is signed in C. This is going to make a lot of typical code much slower and more bloated than it needs to be.

Signed support doesn’t take that much: you need arithmetic right shift, signed multiplication and division, and signed comparisons. Addition and subtraction are, of course, the same as unsigned, thanks to two’s-complement arithmetic.
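
As a sketch of the glue code this forces, here is a signed “branch if A < B” built from unsigned operations: biasing both values by 0x8000 maps signed ordering onto unsigned ordering. This clobbers A and B, and the branch target is a hypothetical label:

XOR A, 0x8000          ; bias both operands...
XOR B, 0x8000          ; ...so that an unsigned compare gives the signed result
IFG B, A               ; B > A unsigned now means B > A signed
SET PC, a_less_than_b  ; taken only when (signed) A < B

Every signed comparison a C compiler emits would need something along these lines.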

Overflow

The idea of using a full 16-bit overflow register (O) is unorthodox, but for a software CPU, I like it: it makes extending computations to higher bit widths nicer, and it works well for things like shifts and rotates. However, the way O actually works in DCPU-16 is flawed: it doesn’t scale past 32-bit widths without extra effort. Consider a 48-bit addition, adding X..Z to A..C (X and A being the low words), the naive way:

ADD A, X  ; add the low words; O = carry out
ADD B, O  ; propagate the carry into B
ADD B, Y  ; add the middle words
ADD C, O  ; propagate the carry into C
ADD C, Z  ; add the high words

Looks reasonable, right? Spot the bug? No? Try these:

A = 0xFFFF
B = 0xFFFF
C = 0x0000
X = 0x0001
Y = 0x0000
Z = 0x0000

Obviously, 0x0000 0000 0001 + 0x0000 ffff ffff == 0x0001 0000 0000. Alas,

ADD A, X  ; A = 0x0000, O = 0x0001
ADD B, O  ; B = 0x0000, O = 0x0001
ADD B, Y  ; B = 0x0000, O = 0x0000 - the overflow in O is lost
ADD C, O  ; C = 0x0000, O = 0x0000
ADD C, Z  ; C = 0x0000, O = 0x0000

To make this work, we have to insert an extra addition:

ADD A, X  ; add the low words
ADD B, O  ; propagate the carry into B...
ADD C, O  ; ...and, if that addition itself overflowed, on into C
ADD B, Y  ; add the middle words
ADD C, O  ; propagate the carry into C
ADD C, Z  ; add the high words

So, how do we add 64 bits properly, adding Y..J to A..X (Y and A being the low words)?

ADD A, Y  ; add the low words
ADD B, O  ; propagate the carry through B...
ADD C, O  ; ...and C...
ADD X, O  ; ...and X
ADD B, Z  ; add the second words
ADD C, O  ; propagate through C...
ADD X, O  ; ...and X
ADD C, I  ; add the third words
ADD X, O  ; propagate into X
ADD X, J  ; add the high words

Thus, code size (and execution time) grows with the square of the number of words being added. Subtraction, of course, has the same issue.

Similarly, for multi-word shifts and rotates, we need to save O to a temporary register after each shifted word, then BOR it in after the next word is shifted, wasting a register. While this only triples the instruction count, it’s still less efficient than it could be.
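
For example, here is a sketch of a 48-bit left shift of C:B:A (C being the high word) by 4 bits, following the save-and-BOR pattern described above and pressing I into service as the temporary:

SHL A, 0x4  ; O = the bits shifted out of the low word
SET I, O    ; save them before the next SHL clobbers O
SHL B, 0x4
BOR B, I    ; fold the saved bits into the middle word
SET I, O    ; save B's carry-out
SHL C, 0x4
BOR C, I    ; fold it into the high word

Seven instructions and a sacrificed register, versus one shift per word if O could be folded in directly.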

The solution is to have instructions that fold O into the operation in one go, much like the “add with carry” and “subtract with carry” of other CPUs. That allows much more compact code that can add, subtract, or shift wider integers at one instruction per word.
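
For instance, with a hypothetical ADC instruction (a = a + b + O, setting O to the new carry) - which is not part of Notch’s spec - the 64-bit addition above collapses to one instruction per word, with no quadratic carry-propagation chains:

ADD A, Y
ADC B, Z
ADC C, I
ADC X, J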

Instruction Timings

This is less important, but worth mentioning: the instruction timings are all over the place. For example, it doesn’t make much sense for ADD to take longer than the bitwise ops (certainly not in the ’80s), and yet MUL takes only as long as ADD. Same with the shifts: why 2 cycles? The only explanation I can come up with is that the CPU internally has only an 8-bit ALU and barrel shifter, but that still doesn’t explain the multiplication cost (it should be higher). Division should be much slower - more like 16 or 32 cycles.

Conclusion

I think integrating a fictional 16-bit CPU into a game is a great idea, but making sure the CPU is amenable to higher-level coding will go a long way towards letting people program more comfortably while still keeping the spirit of an embedded 16-bit system. I’ve written an alternate proposal for the DCPU-16 that attempts to address these concerns. Let me know what you think!