Introduction ------------ This is a sketch of an alternate proposal for DCPU-16. While it's definitely not the ideal CPU, I do believe it is markedly better than the old proposal. This design should still allow for compact code and a decent C compiler. It also maintains some aspects of the original design. There's some room for expansion/rework (quite a few parts). Anyone should feel free to hack on or improve upon or take ideas from the design as they see fit. This is just a possible concept that I came up with. General design -------------- Load-store architecture (RISC style), 16 general registers. Rationale: Load-store means explicit loads and stores (duh), but 16 registers means you have reasonable working space for local variables and a decent calling convention. Endianness ---------- Little endian, because Notch said so (and unlike the original DCPU-16 design, which is word-based and doesn't care about endianness at the ISA level, this one does). Unfortunately it's going to take some creativity to make a byte-addressed 16-bit machine work with the 0x10c storyline, which depends on an endian-screwup swapping the *words* in a 64-bit number, not the *bytes*... I guess he could make the CPU big-endian and the ABI wordwise little-endian, i.e. mixed-endian (this would certainly be a great excuse for the sleep cell SNAFU). Memory ------ 128kB, divided into two regions: low 64k and high 64k. The low 64k can be used for instructions and data, but the high 64k can only be used for instructions and rarely addressed as word-wise data. Rationale: This keeps the original amount of memory, while allowing for bytewise pointers. It is most likely that large programs will benefit from more code space than more data space. Allowing byte-wise loads and stores means char arrays are possible without fiddling with shifts, which helps save space. This is still a Von Neumann architecture, as code is allowed in the low half, so self-modifying code is still entirely possible. Code/function pointers are stored as address>>1 (and thus have a different representation to data pointers, but this is entirely allowed by the C standard - and anyone doing self-modifying code knows what he's doing well enough to be able to shift things himself). There are extended instructions to read/write memory by code pointers for the purpose of loading code (but they could be used for large data tables too). Registers --------- 16 generally addressable registers: - R0 - R13: general purpose registers - R14: SP (supports extra addressing modes) - R15: PC - One implicit 16-bit carry register (C) PC points to the memory word (e.g. PC=0x8000 points to memory location 0x10000); however, regular pointers point to a byte. Rationale: The wide carry register (O in the original design) works well for higher width shifts and also serves as the high bits for mul/div, so that can stay, although div works differently here than in the original design. It is not a general purpose reg but rather a separate register, since many instructions implicitly write it and I'd rather not doom a general register to being clobbered repeatedly. Instruction format ------------------ Having one rigid instruction format is too limiting. However, with just a few simple formats we can cover things much more nicely while still keeping the decoder simple. Instructions are 16 bits, sometimes followed by a 16-bit immediate value. They are always 16-bit aligned in memory. Form A (arithmetic): 0oooooaaaabbbbbb (optionally followed by a 16-bit immediate constant) Form B (load/store): 10ooaaaammmmmmmm (optionally followed by a 16-bit immediate constant) Form C (branch): 110odddddddddddd Form D (misc): 111oooooxxxxyyyy (optionally followed by a 16-bit immediate constant) Note how the opcode is entirely determined by the top 8 bits, which means you can implement the decoder as a 256-entry lookup table or switch. Operands: a is a register r0-r15 b has the following modes: 0x00 - 0x0f: register r0-r15 0x10 - 0x1f: 0x10: immediate constant (next 16 bits) 0x11: 0xffff (-1) 0x12: 0x13: 0x14: 0x15-0x1f: 1<<5 through 1<<15 (power of two constant) 0x20 - 0x3f: constant 0x00-0x1f (not sign extended) Rationale: Note that this covers 1<<4 and lower. Subtracting small values can be done with sub. m has the following modes: 0x00 - 0x0e: [r0..r14] 0x0f: [r14 - 2]! // predecrement by 2, i.e. push 0x10 - 0x1e: [r0..r14 + immediate offset] 0x1f: [immediate address] 0x20 - 0x2e: [r0..r14], immediate offset // postupdate 0x2f: [r14], 2 // postincrement by 2, i.e. pop 0x30 - 0xff: [r14 + 0x00..0xcf] // stack frame relative How to tell whether there is an immediate constant following the instruction: Form A: b == 0x10 Form B: m >= 0x10 && m <= 0x2e Form D: o == 1 Opcodes ------- Notation: x:y means x concatenated with y (x is the high bits) Form A opcodes: 00: add c:a = a + b 01: addc c:a = a + b + c 02: sub c:a = a - b 03: subc c:a = a - b + c 04: rsb c:a = b - a 05: rsbc c:a = b - a + c 06: shl // interpret b&0x1f as a signed int, shift right if negative if (!(b & 0x10)) c:a = a << (b & 0xf); else a:c = a << (b & 0xf); 07: shlc // same, but shift in carry if (!(b & 0x10)) c:a = (a << b) | c; else a:c = (a << (b & 0xf)) | (c << 16); 08: sar a = a >> (b & 0xf); // arithmetic shift 09: mov a = b 0a: and a &= b 0b: bcl a &= ~b 0c: or a |= b 0d: xor a ^= b 0e: mul c:a = a * b // unsigned 0f: muls c:a = a * b // signed 10: div a = c:a / b // unsigned 11: divs a = c:a / b // signed 12: mod a %= b // unsigned 13: mods a %= b // signed 14: 15: 16: ifeq skip if !(a == b) 17: ifne skip if !(a != b) 18: ifgt skip if !(a > b) // signed 19: ifle skip if !(a <= b) // signed 1a: iflt skip if !(a < b) // signed 1b: ifge skip if !(a >= b) // signed 1c: ifhi skip if !(a > b) // unsigned 1d: ifls skip if !(a <= b) // unsigned 1e: iflo skip if !(a < b) // unsigned 1f: ifhs skip if !(a >= b) // unsigned Note that if* needs to minimally decode the next instruction to determine whether it has a 16-bit operand to be skipped too (see the previous section). Form B opcodes: 0: lw a = [m] // word 1: stw [m] = a // word 2: lb a = 0:[m] // byte 3: stb [m] = a // byte (ignores high bits) Unaligned accesses are, of course, invalid ;) Form C opcodes: 0: jmp pc += d // d is sign-extended from 12 bits to 16 1: jsr r14 -= 2; [r14] = pc + 1; pc += d // same as above Form D opcodes: 00: jsr rx r14 -= 2; [r14] = pc + 1; pc = rx // indirect subroutine jump to rx 01: jsr imm r14 -= 2; [r14] = pc + 1; pc = imm // long subroutine jump to imm 02: lpw rx = [ry] // load program word (128kB word-wise addressing) 03: stpw [ry] = rx // store program word (128kB word-wise addressing) 04..1f Recipes ------- clear c (e.g. before 16/16=16 division) shl r0, 0 // or add r0, 0, or sub... write c (e.g. before 32/16=16 division). Note: clears register too. shl rX, 16 read c (e.g. after multiplication) subc rX, rX // clears c or shlc rX, 16 // exchanges c and rX shift right shl rX, -bits // note: assembler should use one-word form as (-bits)&0x1f not xor rX, 0xffff // fits in one word, b=0x11 long branch mov pc, destination // can be immediate or register pop (ret with rX=pc) lw rX, [r14], 2 // i.e. pop rX, uses m=0xf push stw rX, [r14-2]!, 2 // i.e. push rX, uses m=0x1f Instruction cycle times ----------------------- Example instruction cycle times: Base instruction time: 1 cycle if 16-bit immediate: +1 cycle mul: +1 cycle div: +31 cycles mod: +15 cycles jumps (anything dest=pc, jsr, jmp, ret): +1 cycle loads/stores (incl. push, pop): +1 cycle conditional instructions: if condition is false (skip), +1 cycle For example, mov pc, 0x1234 takes 3 cycles, but jmp 0x1234 (when 0x1234 is within relative addressing range) takes only 2. jsr 0x1234 takes 4 cycles when out of relative range (Form D 1), but 3 when in range (Form C 1). ret takes 3 cycles. And something like div pc, 0x1234 (which is insane in and of itself) would take 34 cycles. License ------- I hereby place this document in the public domain. -- Hector Martin "marcan"