plaidCTF 2014 - paris (re300)

For PlaidCTF2014, Eindbazen and fail0verflow joined forces as 0xffa, the Final Fail Alliance.
Don't miss out on other write-ups at Eindbazen's site!
paris
Reversing (300 pts)
-------------------
This binary was found on some of our Windows machines. It's got The
Plague written all over it. What secrets are contained inside?

We are greeted by a Windows executable. Since I hate Windows and I can’t be arsed to pull up a Windows VM and debugger, I decided to solve this one statically. Time to load it into IDA.

After some boring boilerplate in the main function, there’s a call to this main loop:

.text:00402066 mainloop        proc near
.text:00402066                 xor     ecx, ecx
.text:00402068
.text:00402068 loc_402068:
.text:00402068                 push    offset seh_handler
.text:0040206D                 push    large dword ptr fs:0
.text:00402074                 mov     large fs:0, esp
.text:0040207B                 mov     eax, 0
.text:00402080                 mov     [eax], eax
.text:00402082                 sub     edi, 1111h
.text:00402088                 pop     large dword ptr fs:0
.text:0040208F                 add     esp, 4
.text:00402092                 cmp     done, 1
.text:00402099                 jz      nullsub_1
.text:0040209F                 jmp     short loc_402068
.text:0040209F mainloop        endp
.text:0040209F

I’ve had the unfortunate luck of bashing my head against pthreads compabitility issues in Wine twice before (once for winepulse and twice for gstreamer), so I knew that fs:0 is supposed to be a pointer to the Thread Information Block. The first field is the SEH frame. Thus, the code does this:

  • Sets up a new exception handler pointing to the seh_handler function
  • Dereferences a NULL pointer
  • Subtracts 0x1111 from edi
  • Removes the exception handler
  • Loops while done is zero

The subtraction immediately jumps out as a WTF. We’ll see what it does later on.

The seh_handler simply iterates once through a table of function pointers every time it is called: on the first exception it runs the first entry, then the second entry, etc. There are 20 entries or states.

Now, in my mind, this started to stink of being a VM of some kind, but there’s something odd. Normally, in a VM, you’ll fetch the opcode, then index into a table of instruction handlers. Here, the table is always traversed in order. Let’s continue and see where this takes us.

What do the 20 functions do? First they call one of three functions: init_onebyte, init_twobytes, init_threebytes.

.text:0040247C state_0         proc far
.text:0040247C                 call    init_onebyte
.text:00402481                 sub     eax, 0
.text:00402484                 setz    al
.text:00402487                 jmp     near ptr state_end
.text:00402487 state_0         endp

After looking up the arguments passed to SEH handlers (turns out the code wants to use the ContextRecord structure), this is what we get (note that some of the names are spoilers and I only added them later).

.text:00401C91 init_common      proc near
.text:00401C91 
.text:00401C91
.text:00401C91 crap1             = dword ptr  4
.text:00401C91 crap2             = dword ptr  8
.text:00401C91 ExceptionRecord   = dword ptr  0Ch
.text:00401C91 EstablisherFrame  = dword ptr  10h
.text:00401C91 ContextRecord     = dword ptr  14h
.text:00401C91 DispatcherContext = dword ptr  18h
.text:00401C91
.text:00401C91                 mov     esi, [esp+ContextRecord]
.text:00401C95                 mov     edi, offset ContextRecordBuffer
.text:00401C9A                 mov     ecx, 34h
.text:00401C9F                 rep movsd
.text:00401CA1                 mov     esi, offset ram
.text:00401CA6                 mov     edi, offset ram_save
.text:00401CAB                 mov     ecx, 100h
.text:00401CB0                 rep movsd
.text:00401CB2                 mov     esi, offset stack
.text:00401CB7                 mov     edi, offset stack_save
.text:00401CBC                 mov     ecx, 80h
.text:00401CC1                 rep movsd
.text:00401CC3                 mov     al, eq_flag
.text:00401CC8                 mov     eq_flag_save, al
.text:00401CCD                 mov     al, done
.text:00401CD2                 mov     done_save, al
.text:00401CD7                 mov     ax, program_counter
.text:00401CDD                 mov     program_counter_save, ax
.text:00401CE3                 mov     ax, stack_pointer
.text:00401CE9                 mov     stack_pointer_save, ax
.text:00401CEF                 retn
.text:00401CEF init_common      endp

.text:00401CF0 init_onebyte    proc near
.text:00401CF0
.text:00401CF0                 call    init_common
.text:00401CF5                 movzx   eax, program_counter
.text:00401CFC                 mov     ebx, eax
.text:00401CFE                 inc     ebx
.text:00401CFF                 mov     program_counter, bx
.text:00401D06                 add     eax, offset rom
.text:00401D0B                 movzx   eax, byte ptr [eax]
.text:00401D0E                 mov     ebx, eax
.text:00401D10                 shr     eax, 3
.text:00401D13                 and     ebx, 7
.text:00401D16                 shl     ebx, 1
.text:00401D18                 retn
.text:00401D18 init_onebyte    endp

.text:00401D19 init_twobytes   proc near
.text:00401D19
.text:00401D19                 call    init_common
.text:00401D1E                 movzx   ebx, program_counter
.text:00401D25                 mov     eax, ebx
.text:00401D27                 add     eax, 2
.text:00401D2A                 mov     program_counter, ax
.text:00401D30                 add     ebx, offset rom
.text:00401D36                 mov     ah, [ebx]
.text:00401D38                 mov     al, [ebx+1]
.text:00401D3B                 mov     ebx, eax
.text:00401D3D                 mov     ecx, eax
.text:00401D3F                 shr     eax, 6
.text:00401D42                 shr     ebx, 3
.text:00401D45                 and     ebx, 7
.text:00401D48                 shl     ebx, 1
.text:00401D4A                 and     ecx, 7
.text:00401D4D                 shl     ecx, 1
.text:00401D4F                 retn
.text:00401D4F init_twobytes   endp

.text:00401D50 init_threebytes proc near
.text:00401D50
.text:00401D50                 call    init_common
.text:00401D55                 movzx   ebx, program_counter
.text:00401D5C                 mov     eax, ebx
.text:00401D5E                 add     eax, 3
.text:00401D61                 mov     program_counter, ax
.text:00401D67                 add     ebx, offset rom
.text:00401D6D                 mov     ah, [ebx]
.text:00401D6F                 shl     eax, 8
.text:00401D72                 mov     al, [ebx+1]
.text:00401D75                 mov     ah, [ebx+2]
.text:00401D78                 mov     ebx, eax
.text:00401D7A                 mov     ecx, eax
.text:00401D7C                 shr     eax, 13h
.text:00401D7F                 shr     ebx, 10h
.text:00401D82                 and     ebx, 7
.text:00401D85                 shl     ebx, 1
.text:00401D87                 and     ecx, 0FFFFh
.text:00401D8D                 retn
.text:00401D8D init_threebytes endp

init_<n>bytes sure seem like they’re parsing an instruction encoding, but running every possible instruction in a loop doesn’t make any sense. Further, init_common seems to be saving the state of the world to some temporary buffers.

Let’s look at state_end:

.text:00401FCA state_end       proc far
.text:00401FCA
.text:00401FCA
.text:00401FCA arg_4           = dword ptr  0Ch
.text:00401FCA
.text:00401FCA                 mov     edx, [esp+arg_4]
.text:00401FCE                 cmp     al, 1
.text:00401FD0                 jz      loc_402467
.text:00401FD6                 mov     esi, offset ContextRecordBuffer
.text:00401FDB                 mov     edi, edx
.text:00401FDD                 mov     ecx, 34h
.text:00401FE2                 rep movsd
.text:00401FE4                 mov     esi, offset ram_save
.text:00401FE9                 mov     edi, offset ram
.text:00401FEE                 mov     ecx, 100h
.text:00401FF3                 rep movsd
.text:00401FF5                 mov     esi, offset stack_save
.text:00401FFA                 mov     edi, offset stack
.text:00401FFF                 mov     ecx, 80h
.text:00402004                 rep movsd
.text:00402006                 mov     al, eq_flag_save
.text:0040200B                 mov     eq_flag, al
.text:00402010                 mov     al, done_save
.text:00402015                 mov     done, al
.text:0040201A                 mov     ax, program_counter_save
.text:00402020                 mov     program_counter, ax
.text:00402026                 mov     ax, stack_pointer_save
.text:0040202C                 mov     stack_pointer, ax
.text:00402032                 jmp     loc_402467
.text:00402032 state_end       endp

Looks like it optionally restores all the data that was saved by init_common, but leaves it alone if al is 1.

After staring at a couple of other state functions, it clicked: this is a VM indeed, but it’s a rollback-oriented VM. The VM decodes and runs every possible instruction that might be executed - and then checks the opcode after the fact. If the opcode was not the correct one for this instruction, then the state of the VM is rolled back to the state before the execution. Evil.

So what are we dealing with? This is a 16-bit VM with 8 registers that map to 4 32-bit x86 registers in the code outside the SEH. Instructions are one to three bytes long. It has a Harvard architecture (ROM and RAM are separate), and there’s also a dedicated data (not function call!) stack and an EQ flag. The registers map as follows:

r0   edi[15:0]
r1   edi[31:16]
r2   esi[15:0]
r3   esi[31:16]
r4   ebx[15:0]
r5   ebx[31:16]
r6   edx[15:0]
r7   edx[31:16]

Since the VM’s registers are the x86 registers of the faulted code, and edi is r1:r0, r1:r0 are decremented by 0x00001111 every machine cycle (not every instruction!) and the number of machine cycles between instructions depends on their distance in the handler table. Thankfully this only matters at one point in the VM’s code.

After reading through all the instruction handlers, this is the instruction set that I came up with:

___b0___ ___b1___ ___b2___  ID  OP            WTF
00000xxx                     0  NOP
10000000 01sssddd            1  MOV Rd, Rs
10000000 10sssddd            2  STW [Rd], Rs  Store Word (little-endian)
10000000 11sssddd            3  LWX Rd, [Rs]  Load Word Exchanged (big-endian!)
10011ddd llllllll llllllll   4  MOV Rd, #lit  Note that lit is little-endian
00100110 00sssddd            5  ADD Rd, Rs
00100110 01sssddd            6  SUB Rd, Rs    Rd -= Rs
00100110 10sssddd            7  XOR Rd, Rs
00100110 11sssddd            8  AND Rd, Rs
01001ddd                     9  SRB Rd        Rd >>= 8
10101ddd                    10  NOT Rd        Rd = ~Rd
00010ddd                    11  INC Rd
00001111 11aaabbb           12  CMP Ra, Rb    EQ = Ra == Rb
11111xxx llllllll llllllll  13  JMP #lit      Absolute
11101xxx llllllll llllllll  14  JEQ #lit      If EQ == 1
00111sss                    15  PSH Rs        Dedicated stack
11000ddd                    16  POP Rd        Dedicated stack
11011ddd                    17  SWP Rd        Rd = (Rd>>8) | (Rd<<8)  (Byte swap)
01010sss                    18  WTF Rs        ram[0x200..0x3ff] ^= rom[0x6c + Rs]
10100xxx                    19  HLT           End program

And this is the memory map of the VM:

ROM
0x00..0x6b program
0x6c..0x16b p-box

RAM
0x000..0x0ff password input
0x100..0x1ff expected stack contents
0x200..0x3ff s-box

I wrote a quick and dirty disassembler for the instruction set. This is the disassembled code that runs on the VM:

start:
    NOP
    NOP
    NOP
    MOV r2, 0x3133 ; NULL terminator after mangling
    MOV r3, 0x0 ; password index
    MOV r4, 0xff00 ; masks
    MOV r5, 0xff

pw_loop:
    ; load two bytes of the password
    ; This is where that ugly edi -= 0x1111 in mainloop matters.
    ; There are 18 instructions between the slots for LWX and reg MOV,
    ; so 18*0x1111 = 0x13332 = 0x3332 (for r0)
    ; r7 = ~(swap16(swap16(ram[r3]) - 0x3332))
    LWX r0, [r3]
    ; !!!! r0 -= 0x3332
    MOV r7, r0
    SWP r7
    NOT r7
    CMP r2, r7
    JEQ read_done  ; end of string

    ; XOR both (mangled) bytes together
    MOV r6, r7
    AND r6, r4 ;0xff00
    AND r7, r5 ;0x00ff
    SRB r6
    XOR r7, r6

    ; r7 = sbox[r7] (16b)
    MOV r6, 0x200
    ADD r7, r7
    ADD r6, r7
    LWX r7, [r6]
    SWP r7

    ; r7 ^= peek()
    ; push(r7)
    POP r6
    XOR r7, r6
    PSH r6
    PSH r7

    ; sbox[0..0x200] ^= pbox[r3]
    WTF r3

    ; inc r3 and loop
    INC r3
    JMP pw_loop

read_done:
    XOR r7, r7
    MOV r2, 0x100
    MOV r6, 0xaf21

checkloop:
    LWX r5, [r2]
    SWP r5
    INC r2
    INC r2
    POP r3
    CMP r6, r5
    JEQ check_last
    CMP r3, r5
    JEQ checkloop
    MOV r3, 0x0
    MOV r2, 0x0
    HLT

check_last:
    MOV r5, 0x5a4d
    CMP r5, r3
    JEQ success
    MOV r3, 0x0
    MOV r2, 0x0
    HLT

success:
    MOV r3, 0xdead
    MOV r2, 0xbeef
    HLT

The program essentially mangles the passphrase and then compares it with a constant. Running the process backwards recovers the original expected password. I didn’t bother reversing the first part of the mangling process, so instead I just brute force all possible 256 character choices and pick whichever one matches the expected output. This is the code that both implements the disassembler and recovers the flag:

rom = [int(x,16) for x in """
00 00 00 9A 33 31 9B 00  00 9C 00 FF 9D FF 00 80
D8 80 47 DF AF 0F D7 EF  37 00 80 7E 26 E6 26 EF
4E 26 B7 9E 00 02 26 3F  26 3E 80 F7 DF C6 26 B7
3E 3F 53 13 FF 0F 00 26  BF 9A 00 01 9E 21 AF 80
D5 DD 12 12 C3 0F F5 EF  56 00 0F DD EF 3F 00 9B
00 00 9A 00 00 A7 9D 4D  5A 0F EB EF 65 00 9B 00
00 9A 00 00 A7 9B AD DE  9A EF BE A7""".split()]

pbox = [int(x,16) for x in """A9 2D F2 6D
2E 34 AA 55 7A C3 94 CC  A2 11 D8 B9 A5 F0 E2 8C
54 CB 5D 18 D8 79 5F 3A  15 9E DA EA FC 77 2B 91
4F 21 29 26 1F 60 8F C4  BE 63 87 D8 81 1E 3F 76
E8 61 EB 94 F6 FA 74 47  FB 52 BA 53 7C 59 6F 51
3E C8 EE 2F 3A 69 80 1A  CF 74 60 CD 0F C9 72 C7
F9 45 AD 91 45 95 45 14  CF F5 57 6F 39 5A D8 3C
DF 96 F0 CE 90 BE 29 8E  FE 67 D7 7B 8D 4F 22 D9
7A 76 47 98 50 4A F7 47  4C 92 63 44 98 D9 34 2D
F8 C2 95 CA D4 BC 89 C6  98 64 16 BC AD E2 0E FD
D0 58 6D 75 C9 10 D6 5B  0F 2A BB CF 32 3D B4 4A
FF 36 B5 D2 27 4A 91 B8  A6 0C 33 3A 35 F2 66 39
7F 7A FB 4B 35 41 1E C2  50 E1 4F D5 60 B4 1E 7D
E4 35 DC FC 3B A9 78 F5  66 AD A0 5E 93 17 DB 99
59 61 86 2F 6F 63 F8 F6  EF FB 94 47 9B 17 D8 5D
08 26 40 E9 1C 73 F5 1A  4D B4 85 02 E9 CF CF 14
65 CA 74 E7 F9 9D B6 1A  C1 A7 F2 94
""".split()]

ram = [int(x,16) for x in """
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

0B 2E 02 6D 92 74 0C 87  B9 93 B3 ED 2C 31 07 71
10 7D 07 20 C6 E7 1B 3A  D8 BA 17 94 6B FA 6C BE
1D 62 3B 4D AD 47 7A 7A  9D 3E A2 53 2F F2 A9 D1
74 F5 73 81 BC 11 15 AE  79 61 21 AF 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

F2 AE 8B 4F 49 C3 E5 A3  A3 1A 2D A5 90 7A 8E 08
E5 0F 41 61 52 D7 40 C0  78 A9 5D 2B 8B C4 33 C2
42 CB 22 81 77 3E 09 EF  CB 6C CF B0 F4 C4 75 10
C1 93 AE D9 1D EC E8 11  16 DC 2D 3B 25 6C B8 C5
C3 C1 1C A9 18 85 E1 AB  1D D0 0C A0 B9 43 97 E3
85 B5 DB 20 48 FC 1E CD  79 5B BB BD 02 E3 32 E4
63 C0 1D 95 CA 48 63 9D  87 22 AA E9 7E 26 26 0A
9B 37 50 54 CC 61 13 7E  5A F8 BF F7 45 45 82 AB
1C 6F AB E8 27 48 07 25  71 AB 42 29 73 C3 55 E4
54 F6 0B D6 17 39 B5 AE  E8 5C 85 05 E5 2C DD 3C
3B DA 09 7D C4 42 86 12  8B 2B 72 99 F8 86 9D C3
F1 65 85 2E 05 53 6F 48  ED 13 9C AB 68 44 57 03
6A A6 D7 9E B9 E6 16 A3  27 F6 CC 5A C6 CB EA 06
05 CA 4F 95 90 93 72 42  CE A3 08 1A 47 2A B8 D0
76 29 86 24 62 E5 32 D0  7D B2 0D C7 AD A7 EE 98
26 B9 DD 9D 57 A6 8B E9  13 3E 42 0E 3D AF E4 19

2B 7D 17 90 C4 52 9E 5B  D4 96 5F 6D DC E7 08 99
2A 0B 54 E5 A4 E0 A1 7C  81 56 F2 02 76 46 4C B6
DA E3 F6 4A C0 07 1B 6B  16 8A 0F 6A 0D 18 F6 8C
CC B3 C2 6A BD E9 3D 0E  F6 15 7C 4C 57 61 F2 A9
08 5B 06 72 97 9B FB F6  B0 7D 89 69 AF 3D FD 93
BD A0 C5 24 D1 E3 20 87  E9 BC 56 8D 2D 7A B3 66
5C E9 DA C9 AE 4B B0 53  15 8F F2 F1 99 A3 3B 28
CB 5B 9C 31 EB 7B C2 F1  5B 8C BF DC C5 66 B9 B2
A6 DC 26 52 39 30 64 55  9B 4B 00 E1 41 E0 B1 02
55 DE C9 EA 27 2D 45 D9  27 D2 17 3E 88 B4 3E BD
B0 E4 25 68 65 9B AB 0D  FB A3 2C DC CF 58 98 58
EB EA 71 05 E1 60 95 56  F0 E6 34 3B 7D 28 65 45
70 02 37 AD 2A 70 B0 46  EF 9C F8 C0 56 2D 49 3A
C9 19 F7 B1 46 28 EF 64  01 07 58 BE EC E7 B4 8D
D6 B1 AC 9E F4 12 9E BB  7A 33 39 93 82 38 82 34
8C C3 00 88 8E 12 9C C3  4D 62 2F DC 7C 5A AA A5
""".split()]

INSTR1 = {
    0x00: "NOP",
    0x09: "SRB r%(b)d",
    0x15: "NOT r%(b)d",
    0x02: "INC r%(b)d",
    0x07: "PSH r%(b)d",
    0x18: "POP r%(b)d",
    0x1b: "SWP r%(b)d",
    0x0a: "WTF r%(b)d",
    0x14: "HLT",
}

INSTR2 = {
    0x201: "MOV r%(c)d, r%(b)d",
    0x202: "STW [r%(c)d], r%(b)d",
    0x203: "LWX r%(c)d, [r%(b)d]",
    0x098: "ADD r%(c)d, r%(b)d",
    0x099: "SUB r%(c)d, r%(b)d",
    0x09a: "XOR r%(c)d, r%(b)d",
    0x09b: "AND r%(c)d, r%(b)d",
    0x03f: "CMP r%(b)d, r%(c)d",
}

INSTR3 = {
    0x13: "MOV r%(b)d, #0x%(c)x",
    0x1f: "JMP loc_%(c)x",
    0x1d: "JEQ loc_%(c)x",
}

ops = []

def disasm(rom):
    labels = set()
    i = 0
    while i < len(rom):
        a = rom[i] >> 3
        if a in (i>>5 for i in INSTR2):
            a = (rom[i] << 2) | (rom[i+1] >> 6)
            b = (rom[i+1] >> 3) & 7
            c = rom[i+1] & 7
            ops.append((i, INSTR2[a] % {"b":b, "c":c}))
            i += 2
        elif a in INSTR3:
            a = rom[i] >> 3
            b = rom[i] & 7
            c = rom[i+1] | (rom[i+2] << 8)
            ops.append((i, INSTR3[a] % {"b":b, "c":c}))
            if (a in (0x1f, 0x1d)):
                labels.add(c)
            i += 3
        elif a in INSTR1:
            a = rom[i] >> 3
            b = rom[i] & 7
            ops.append((i, INSTR1[a] % {"b":b}))
            i += 1
        else:
            raise Exception("Undefined minor opcode 0x%02x at 0x%x" % (a, i))
        #print ops[-1]

    print "start:"
    for addr, op in ops:
        if addr in labels:
            print "loc_%x:" % addr
        print "    " + op

def load16(b):
    l = []
    for i in xrange(0, len(b), 2):
        l.append(b[i] | (b[i+1] << 8))
    return l

# Disassemble the rom...
disasm(rom)

# Work backwards
estack = []
sboxes = []

# Load expected data into estack
for i in xrange(0x100, 0x200, 2):
    w = ram[i] | (ram[i+1] << 8)
    if w == 0xaf21:
        break
    estack.append(w)

# Generate as many sboxes as needed, performing the XOR mangling
for i in xrange(len(estack)):
    sboxes.append(load16(ram[0x200:0x400]))
    for j in xrange(0x200, 0x400):
        ram[j] ^= pbox[i]

# Last element in the expected stack
# (this is the element that is already in the stack at program start)
estack.append(0x5a4d)

# Work backwards to obtain flag
lb = 0 # Previous (next) output byte (0 = NULL terminator)
out = ""
while len(estack) > 1:
    # Get expected word from stack
    x = estack.pop(0)
    # XOR with the previous word
    x ^= estack[0]
    # Pop the last sbox
    sbox = sboxes.pop()
    # Undo the sbox (this will raise an exception if fail)
    x = sbox.index(x)
    # Brute force the final mangling instead of inverting the function...
    for i in xrange(256):
        # Big endian load together w/ last byte
        v = (i << 8) | lb
        # EDI mangle (r0 -= 0x3332)
        v = (v - 0x3332) & 0xffff
        # Swap bytes again
        v = (v >> 8) | ((v << 8) & 0xff00)
        # Invert
        v = (~v) & 0xffff
        # XOR bytes together
        v = (v >> 8) ^ (v & 0xff)
        # Hope we can find a match...
        if v == x:
            lb = i
            out += chr(i)
            break

# Reverse flag
print out[::-1]

Result: V1rTu4L_M4ch1n3s_4r3_Aw3s0m3!