I’ve always liked the idea of building complex logic systems out of a simple primitive that is just powerful enough to construct all logic - particularly in videogames. For example, in LittleBigPlanet, you can build a Tic-Tac-Toe AI out of physical elements like pistons and magnetic switches. In Minecraft, you can build an ALU out of a primitive construction element that behaves, essentially, as a NOT gate. And, if games aren’t your thing, you can build CMOS logic out of UNIX pipes with a simple “mosfet” program.
Just a few days ago, Notch, the creator of Minecraft, revealed a new game, 0x10c. Instead of giving players a simple logic element, it will include a full-blown 16-bit CPU that can be programmed. I find this intriguing, because it allows for much more complex development, yet it doesn’t step right into the boring world of “let’s just throw in a Lua scripting engine and call it a day”. You’re still limited by emulation speed and by memory constraints.
However, after checking out the preliminary spec for the CPU, I was a bit disappointed. While the general idea is good, that particular design suffers from flaws that will make it a lot more painful to use than it should be. Specifically, with 128kB of memory, you’d expect to be able to program it in C or some other high-level language. Unfortunately, the design is rather hostile to having a proper C compiler targeted at it. The situation reminds me of PIC16, for which I refuse to use C compilers because they all suck (and I only rarely use them for PIC18, for that matter), only this is a nicer 16-bit architecture. Things shouldn’t need to be this way.
These are some of the design issues that I found with the architecture:
DCPU-16 is, as its name implies, a 16-bit CPU. However, what isn’t so obvious at first glance is that it’s only a 16-bit CPU. Memory is addressed word-wise and there are no 8-bit loads and stores.
Why is this a problem? After all, you can still emulate 8-bit loads and stores
by using masking and shifting. Ah, but how do you address bytes? To address a
byte, you need 17 bits: 16 bits for the word address and one bit for the byte.
This doesn’t fit in one 16-bit word. For a C compiler, that means that
char * needs to be effectively two words (32 bits) wide. And since any pointer can be
cast to and from a
char *, that basically means that all pointers have to be 32
bits wide. This complicates pointer arithmetic and generates a lot of boilerplate
code every time there’s a pointer access, for little benefit. The
read-modify-write masking required for the 8-bit writes is also wasteful.
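To make that concrete, here is a rough sketch of the kind of sequence a compiler might emit just to store a byte into the high half of a word (register assignments are mine, and a real compiler would additionally need to branch on the byte-select bit of the 32-bit char pointer):

    SET C, [A]        ; load the word containing the byte (A = word address)
    AND C, 0x00ff     ; clear the high byte
    SHL B, 8          ; move the new byte (in B) into position
    BOR C, B          ; merge it in
    SET [A], C        ; write the whole word back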
There’s an alternative: make
char 16 bits and
CHAR_BIT == 16. That is allowed
by the C standard, but I’ll let you take your own guess as to how much code out
there would break with 16-bit bytes.
Eight general registers are the bare minimum, but they’re on the low side. Just look at x86: compilers have been struggling for years with this few registers, and they end up using the stack as extra registers (modern x86 CPUs include a lot of logic to speed up stack accesses for this reason). Calling conventions barely have any breathing room to use registers for argument passing (and several such conventions have been developed as a result). Now, this wouldn’t be so terrible, but there’s another problem that makes it worse: DCPU-16’s stack pointer is a special register that supports fewer operations than regular registers. It can only do push and pop operations, not offset addressing, which means that spilling data to the stack is much harder because there is no random access to it. In practice, that means that compilers will set up a copy of the stack pointer in the function prologue to act as a frame pointer, and now you’re left with seven registers.
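As a sketch of what that looks like (assuming a made-up calling convention that uses J as the frame pointer; the register choice and frame size are arbitrary):

    SET PUSH, J       ; save the caller's J
    SUB SP, 4         ; reserve four words of locals
    SET J, SP         ; J now doubles as the frame pointer
    ; ... body: locals and spill slots are addressed as [0x0+J] .. [0x3+J] ...
    SET SP, J
    ADD SP, 4         ; drop the locals
    SET J, POP        ; restore the caller's J
    SET PC, POP       ; return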
Eight 16-bit registers are also not enough to perform 64-bit operations entirely in registers.
Notch’s design is very orthogonal in this sense, where either instruction operand
can be in memory, or a register, or a constant, for any instruction. Wait, a
constant? Yes, the encoding lets you specify a constant as a destination. The
write-back is then silently ignored, but let’s see what the encoding overhead of it is: constants
are only valid as the
a operand for the
IF?? set of instructions, as every
other instruction writes back to operand
a. Of the IF?? instructions, IFE and IFN are commutative, so using a constant as
a doesn’t gain you anything over specifying it in operand b. That leaves two
instructions, IFG and IFB, where this is useful, or 2/16 = 12.5% of the
instruction set. What is
the overhead? In the operand encoding, half of the values (all of those with the
top bit set, 0x20..0x3f) are constants. That’s one bit, or half of the instruction
encoding space. Since only two instructions take advantage of this,
50% * (100%-12.5%) = 43.75% of the possible 16-bit instructions are useless just
because of this fact. That’s a lot of wasted space!
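For completeness, this is the one pattern where a short constant in the a slot actually buys anything (a sketch; the label is made up):

    IFG 0x10, A           ; executes the next instruction only if A < 0x10
        SET PC, too_small ; IFG A, 0x10 would test A > 0x10 instead
    ; IFE/IFN gain nothing here: IFE 0x10, A and IFE A, 0x10 are the same test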
If you search the DCPU-16 spec for “signed”, the only hit is “16 bit unsigned
words” at the top. That’s your only clue as to the fact that it only supports
unsigned arithmetic. Implementing signed arithmetic in a purely unsigned CPU is
very annoying - and yet we all know that
int is signed in C. This is going to
make a lot of typical code much slower and more bloated than it needs to be.
Signed support doesn’t take that much: you need arithmetic right shift, signed multiplication and division, and signed comparisons. Addition and subtraction are, of course, the same as unsigned, thanks to two’s-complement arithmetic.
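To give an idea of the workarounds involved, here is a signed comparison done by flipping the sign bits (the usual bias trick, clobbering both inputs) and an arithmetic right shift by one done by re-inserting the sign bit; the register use and label are arbitrary:

    ; signed "A < B": flip the sign bits, then compare unsigned
    XOR A, 0x8000
    XOR B, 0x8000
    IFG B, A              ; unsigned B > A now means signed B > A
        SET PC, a_is_less

    ; arithmetic shift right of A by 1: keep a copy of the sign bit
    SET C, A
    AND C, 0x8000         ; isolate the sign bit
    SHR A, 1              ; logical shift
    BOR A, C              ; put the sign bit back into the top position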
The idea of using a full 16-bit overflow register (
O) is unorthodox, but for a
software CPU, I like it: it makes extending computation to higher bit widths
nicer, and works well for things like shifts and rotates. However, the way O
works in DCPU-16 is flawed: it doesn’t scale past 32-bit widths without extra
effort. Consider a 48-bit addition, adding X..Z to A..C (low words in A and X),
the naive way:
    ADD A, X
    ADD B, O
    ADD B, Y
    ADD C, O
    ADD C, Z
Looks reasonable, right? Spot the bug? No? Try these:
    A = 0xFFFF  B = 0xFFFF  C = 0x0000
    X = 0x0001  Y = 0x0000  Z = 0x0000
0x0000 0000 0001 + 0x0000 ffff ffff == 0x0001 0000 0000. Alas,
    ADD A, X ; A = 0x0000, O = 0x0001
    ADD B, O ; B = 0x0000, O = 0x0001
    ADD B, Y ; B = 0x0000, O = 0x0000 - the overflow in O is lost
    ADD C, O ; C = 0x0000, O = 0x0000
    ADD C, Z ; C = 0x0000, O = 0x0000
To make this work, we have to insert an extra addition:
    ADD A, X
    ADD B, O
    ADD C, O
    ADD B, Y
    ADD C, O
    ADD C, Z
So, how do we add 64 bits properly?
    ; adding the 64-bit value in Y,Z,I,J to the one in A,B,C,X (low words first)
    ADD A, Y
    ADD B, O
    ADD C, O
    ADD X, O
    ADD B, Z
    ADD C, O
    ADD X, O
    ADD C, I
    ADD X, O
    ADD X, J
Thus, code size (and execution time) is proportional to the number of words to add, squared. Subtraction, of course, has the same issue.
Similarly, for multi-word shifts and rotates, we need to save
O to a temporary
register after each shifted word, then
BOR it in after the next word is shifted,
wasting a register. While this only triples the instruction count, it’s still
less efficient than it could be.
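For example, shifting a 32-bit value held in B:A (B high) left by four ends up looking something like this, burning X as the temporary:

    SHL A, 4          ; O now holds the bits shifted out of the low word
    SET X, O          ; save them before the next shift clobbers O
    SHL B, 4
    BOR B, X          ; merge the saved bits into the high word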
The solution is to have instructions that add in
O in one go, much like “add
with carry” and “subtract with carry” in other CPUs. This allows for much more
compact code that can add/sub/shift wider ints at one instruction per word.
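With a hypothetical add-with-carry instruction, spelled ADX here (not part of Notch’s spec, just an illustration of the kind of instruction I mean), the 64-bit addition from above collapses to one instruction per word:

    ADD A, Y          ; low words
    ADX B, Z          ; B = B + Z + O
    ADX C, I
    ADX X, J          ; high words; O ends up holding the final carry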
This is less important, but worth mentioning: the instruction timings are all
over the place. For example, it doesn’t really make sense for
ADD to take longer
than binary ops (certainly not in the 80s), and yet for
MUL to take only as long as
ADD. Same with shifts: why 2 cycles? The only explanation for this that I can
come up with is that internally the CPU only has an 8-bit ALU and barrel shifter,
but that still doesn’t explain the multiplication cost (it should be higher).
Division should be much slower — more like 16 or 32 cycles.
I think integrating a fictional 16-bit CPU into a game is a great idea, but making sure the CPU is amenable to higher-level coding will go a long way towards allowing people to code more comfortably while still keeping the spirit of an embedded 16-bit system. I’ve written an alternate proposal for the DCPU-16 that attempts to address these concerns. Let me know what you think!