1.22.1 Small number of cases

#include <stdio.h> // include standard I/O header

void f (int a) // define function f taking int a
{
    switch (a) // switch on value of a
    {
        case 0: printf ("zero\n"); break; // if a==0, print "zero" and break
        case 1: printf ("one\n"); break; // if a==1, print "one" and break
        case 2: printf ("two\n"); break; // if a==2, print "two" and break
        default: printf ("something unknown\n"); break; // otherwise print "something unknown" and break
    }
}

int main() // program entry point
{
    f (2); // test with value 2
}

x86: Non-optimizing MSVC

tv64 = -4                                ; temporary variable offset
_a$ = 8                                  ; parameter a offset
_f PROC
    push    ebp                          ; save EBP
    mov     ebp, esp                     ; set up stack frame
    push    ecx                          ; allocate space for temporary

    mov     eax, DWORD PTR _a$[ebp]      ; load a into EAX
    mov     DWORD PTR tv64[ebp], eax     ; store copy in temporary tv64

    cmp     DWORD PTR tv64[ebp], 0       ; compare temporary with 0
    je      SHORT $LN4@f                 ; if equal, jump to zero case

    cmp     DWORD PTR tv64[ebp], 1       ; compare with 1
    je      SHORT $LN3@f                 ; if equal, jump to one case

    cmp     DWORD PTR tv64[ebp], 2       ; compare with 2
    je      SHORT $LN2@f                 ; if equal, jump to two case

    jmp     SHORT $LN1@f                 ; otherwise jump to default

$LN4@f:                                  ; zero case
    push    OFFSET $SG739                ; push address of "zero\n"
    call    _printf                      ; call printf
    add     esp, 4                       ; clean up stack
    jmp     SHORT $LN7@f                 ; jump to exit

$LN3@f:                                  ; one case
    push    OFFSET $SG741                ; push address of "one\n"
    call    _printf                      ; call printf
    add     esp, 4                       ; clean up stack
    jmp     SHORT $LN7@f                 ; jump to exit

$LN2@f:                                  ; two case
    push    OFFSET $SG743                ; push address of "two\n"
    call    _printf                      ; call printf
    add     esp, 4                       ; clean up stack
    jmp     SHORT $LN7@f                 ; jump to exit

$LN1@f:                                  ; default case
    push    OFFSET $SG745                ; push address of "something unknown\n"
    call    _printf                      ; call printf
    add     esp, 4                       ; clean up stack

$LN7@f:                                  ; function exit
    mov     esp, ebp                     ; restore ESP
    pop     ebp                          ; restore EBP
    ret     0                            ; return
_f ENDP

A switch() with this few cases is in fact equivalent to the following construction:

void f (int a)
{
    if (a==0)
        printf ("zero\n");
    else if (a==1)
        printf ("one\n");
    else if (a==2)
        printf ("two\n");
    else
        printf ("something unknown\n");
}

When a switch() has only a few cases, it is impossible to tell from the compiled code whether the source contained a real switch() or just a chain of if() statements.

This means that switch() is merely syntactic sugar for a large number of nested if() statements.
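As a minimal sketch of that equivalence, the two forms can be written side by side and compared. The helper names name_switch(), name_if() and same_result() are hypothetical, and the functions return the message instead of printing it so that the results can be checked:

```c
#include <string.h>

/* Hypothetical helper: the switch() form, returning the message. */
static const char *name_switch(int a)
{
    switch (a) {
    case 0:  return "zero";
    case 1:  return "one";
    case 2:  return "two";
    default: return "something unknown";
    }
}

/* Hypothetical helper: the same logic as a chain of if() statements. */
static const char *name_if(int a)
{
    if (a == 0)      return "zero";
    else if (a == 1) return "one";
    else if (a == 2) return "two";
    else             return "something unknown";
}

/* Returns 1 if both forms produce the same message for the given input. */
static int same_result(int a)
{
    return strcmp(name_switch(a), name_if(a)) == 0;
}
```

Both functions describe the same behavior, which is why a compiler is free to emit the same comparisons and jumps for either of them.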

There is nothing particularly new in the generated code, except that the compiler copies the input variable a into a temporary local variable, tv64. Compiling this with GCC 4.4.1 gives almost the same result, even with maximum optimization (-O3).

Optimizing MSVC

Now let us enable optimization in MSVC using /Ox:

cl 1.c /Fa1.asm /Ox

_a$ = 8                                  ; parameter a offset
_f PROC
    mov     eax, DWORD PTR _a$[esp-4]    ; load a into EAX
    sub     eax, 0                       ; subtract 0 (sets flags for comparison with 0)
    je      SHORT $LN4@f                 ; if result zero (a==0), jump to zero case

    sub     eax, 1                       ; subtract 1
    je      SHORT $LN3@f                 ; if result zero (a==1), jump to one case

    sub     eax, 1                       ; subtract 1 again
    je      SHORT $LN2@f                 ; if result zero (a==2), jump to two case

    mov     DWORD PTR _a$[esp-4], OFFSET $SG791 ; overwrite argument a with address of "something unknown\n"
    jmp     _printf                      ; jump to printf

$LN2@f:                                  ; two case
    mov     DWORD PTR _a$[esp-4], OFFSET $SG789 ; overwrite argument a with address of "two\n"
    jmp     _printf                      ; jump to printf

$LN3@f:                                  ; one case
    mov     DWORD PTR _a$[esp-4], OFFSET $SG787 ; overwrite argument a with address of "one\n"
    jmp     _printf                      ; jump to printf

$LN4@f:                                  ; zero case
    mov     DWORD PTR _a$[esp-4], OFFSET $SG785 ; overwrite argument a with address of "zero\n"
    jmp     _printf                      ; jump to printf
_f ENDP

Two things happened here.

First:

The value of a is placed in the EAX register, and then 0 is subtracted from it:

sub eax, 0

This looks strange, but the only goal is to test whether the value is zero: subtracting 0 leaves EAX unchanged and sets the flags.

If the result is zero, the ZF flag is set
The JE instruction (Jump if Equal, a synonym of JZ, Jump if Zero) is then taken
Control passes directly to label $LN4@f
And "zero" is printed

If the jump is not taken:

1 is subtracted from EAX
If the result is still not zero, 1 is subtracted again
As soon as a result becomes zero, the corresponding jump is taken

If none of the jumps is taken, "something unknown" is printed.
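The chain of subtractions can be modelled in C. This is only a sketch of the logic, not the compiler's output; classify_by_sub() is a hypothetical name, and unsigned arithmetic is used so that the variable wraps around the way a 32-bit register does:

```c
/* Model of the SUB/JE chain.
   Returns 0, 1 or 2 for the matching case and -1 for the default case. */
static int classify_by_sub(int a)
{
    unsigned eax = (unsigned)a;   /* mov eax, a */

    eax -= 0;                     /* sub eax, 0: value unchanged, flags set */
    if (eax == 0) return 0;       /* je: ZF is set, so a == 0 */

    eax -= 1;                     /* sub eax, 1 */
    if (eax == 0) return 1;       /* je: a == 1 */

    eax -= 1;                     /* sub eax, 1 */
    if (eax == 0) return 2;       /* je: a == 2 */

    return -1;                    /* no jump taken: default case */
}
```

Each subtraction reuses the previous result, so the second test effectively compares the original value with 1 and the third one with 2.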

Second:

The address of the string is written over the argument a on the stack, and control is then passed to printf() with JMP instead of CALL.

Why?

The caller of function f() did:

CALL f
This pushed the return address (RA) onto the stack

While f() is executing, the stack layout is:

ESP → return address
ESP+4 → variable a

When calling printf():

We need exactly the same stack layout
But the difference is that the first argument should be the string address

This is what the code did:

Replaced the value of a with the string address
Jumped directly with JMP to printf()

printf() prints the string and then executes RET, popping the return address from the stack and returning directly to the caller of f() without returning to f() itself. Thus f() is completely bypassed at the end. This technique is somewhat similar to the idea of longjmp(), and of course it is all done for speed.

We can summarize this a bit: if the last thing in a function is a call to another function with no code after it, the compiler can:

Modify the arguments
Jump with JMP
And let RET happen from the other function
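A function shaped like the following sketch is a typical candidate for this transformation. f_tail() is a hypothetical variant of f() that returns the result of printf(), so the call is the very last action it performs. Whether a given compiler really emits a JMP here depends on the compiler, its options and the calling convention:

```c
#include <stdio.h>

/* Hypothetical variant of f(): the call to printf() is the last thing done. */
static int f_tail(int a)
{
    const char *msg;

    if (a == 0)      msg = "zero\n";
    else if (a == 1) msg = "one\n";
    else if (a == 2) msg = "two\n";
    else             msg = "something unknown\n";

    /* Nothing runs after this call, so an optimizing compiler may (but is
       not required to) replace CALL + RET with a single JMP. */
    return printf("%s", msg);
}
```

printf() returns the number of characters written, so f_tail(2) returns 4 when the output succeeds.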

x32dbg (EX2)

Note: I could not get this example to work perfectly because the compiler is newer than the one in the book, which made a slight difference. I will try to explain the example as much as possible.

We run the same example in x32dbg after compiling and start running it in the debugger.

EAX initially holds 2, the input value of the function.

0 is subtracted from EAX. EAX of course still holds 2, and the ZF flag is 0 because the result is not zero.

The first SUB EAX, 1 leaves 1 in EAX, so ZF is still 0. The second SUB EAX, 1 finally brings EAX to 0, and the ZF flag is set because the result is zero.

At this point the current argument of the function is 2, and that 2 is still on the stack.

The pointer to the string is written over it, and then the jump occurs. We are now at the first instruction of printf() in MSVCR100.DLL.

After that, printf() takes the string as its only argument and prints it. This is the last instruction of printf().

The string "two" is now printed in the console window.

The return then goes directly from inside printf() to main(), because the RA on the stack points not to a location in f() but to main().

ARM: Optimizing Keil 6/2013 (ARM mode)

.text:0000014C f1:
    CMP     R0, #0                       ; compare input with 0
    ADREQ   R0, aZero                    ; if equal, load address of "zero\n" into R0
    BEQ     loc_170                      ; if equal, jump to printf call

    CMP     R0, #1                       ; compare with 1
    ADREQ   R0, aOne                     ; if equal, load address of "one\n"
    BEQ     loc_170                      ; if equal, jump to printf

    CMP     R0, #2                       ; compare with 2
    ADRNE   R0, aSomethingUnkno          ; if not equal, load address of "something unknown\n"
    ADREQ   R0, aTwo                     ; if equal, load address of "two\n"

loc_170:
    B       __2printf                    ; unconditional jump to printf

Again, looking at this code we cannot determine whether it was a switch() or just several if statements.

Here we see conditional instructions such as ADREQ (the EQ suffix means "Equal"), which executes only if R0 = 0 and then loads the address of the string "zero\n" into R0.

The following BEQ instruction transfers control to loc_170 under the same condition, that is, if R0 was 0.

The question is: will BEQ work correctly since ADREQ just changed the value of R0 before it?

Yes, it will work because BEQ checks the flags set by the CMP instruction, and ADREQ does not change any flags at all.

The rest of the instructions are familiar to us. There is only one call to printf() at the end, and we have explained this trick before.

In the end, there are 3 paths leading to printf().

The last CMP R0, #2 instruction checks whether a = 2. If it is not, ADRNE loads a pointer to the string "something unknown\n" into R0: a has already been compared with 0 and 1, so at this point it is certain to be none of the three values.

If R0 = 2, the pointer to the string "two\n" will be loaded into R0 via ADREQ.
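The reason BEQ still works after ADREQ has overwritten R0 can be illustrated with a toy model in which the register and the Z flag are separate pieces of state. Everything here is hypothetical: struct cpu, the helper names, and the small constants that stand in for the string addresses.

```c
/* Stand-ins for the addresses of the four strings (hypothetical values). */
enum { A_ZERO = 100, A_ONE, A_TWO, A_UNKNOWN };

/* Toy CPU state: R0 and the Z flag are stored separately. */
struct cpu { long r0; int z; };

/* CMP is the only operation in this model that writes the flag. */
static void cmp(struct cpu *c, long imm)    { c->z = (c->r0 == imm); }
/* ADREQ/ADRNE change R0 depending on the flag, but never touch the flag. */
static void adreq(struct cpu *c, long addr) { if (c->z)  c->r0 = addr; }
static void adrne(struct cpu *c, long addr) { if (!c->z) c->r0 = addr; }

/* Returns the value R0 holds when control reaches the jump to printf(). */
static long f1_model(long a)
{
    struct cpu c = { a, 0 };

    cmp(&c, 0); adreq(&c, A_ZERO);
    if (c.z) return c.r0;          /* BEQ loc_170: tests the flag, not R0 */

    cmp(&c, 1); adreq(&c, A_ONE);
    if (c.z) return c.r0;          /* BEQ loc_170 */

    cmp(&c, 2);
    adrne(&c, A_UNKNOWN);          /* taken when a != 2 */
    adreq(&c, A_TWO);              /* taken when a == 2 */
    return c.r0;                   /* falls through to loc_170 */
}
```

If the branch re-examined R0 instead of the stored flag, the first BEQ would see the string address rather than 0 and would not be taken.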

ARM: Optimizing Keil 6/2013 (Thumb mode)

.text:000000D4 f1:
    PUSH    {R4, LR}                     ; save R4 and LR

    CMP     R0, #0                       ; compare input with 0
    BEQ     zero_case                    ; if equal, jump to zero case

    CMP     R0, #1                       ; compare with 1
    BEQ     one_case                     ; if equal, jump to one case

    CMP     R0, #2                       ; compare with 2
    BEQ     two_case                     ; if equal, jump to two case

    ADR     R0, aSomethingUnkno          ; load address of "something unknown\n"
    B       default_case                 ; jump to printf call

zero_case:
    ADR     R0, aZero                    ; load address of "zero\n"
    B       default_case                 ; jump to printf call

one_case:
    ADR     R0, aOne                     ; load address of "one\n"
    B       default_case                 ; jump to printf call

two_case:
    ADR     R0, aTwo                     ; load address of "two\n"

default_case:
    BL      __2printf                    ; call printf
    POP     {R4, PC}                     ; restore R4 and return

As mentioned before, it is not possible to add conditional predicates to most instructions in Thumb mode, so the Thumb code here is similar to CISC-style x86 code and is very easy to understand.

ARM64: Non-optimizing GCC (Linaro) 4.9

.LC12: .string "zero"
.LC13: .string "one"
.LC14: .string "two"
.LC15: .string "something unknown"

f12:
    stp     x29, x30, [sp, -32]!         ; save frame pointer and link register
    add     x29, sp, 0                   ; set frame pointer

    str     w0, [x29, 28]                ; store input argument on stack
    ldr     w0, [x29, 28]                ; load it back into W0

    cmp     w0, #1                       ; compare with 1
    beq     .L34                         ; if equal, jump to one case

    cmp     w0, #2                       ; compare with 2
    beq     .L35                         ; if equal, jump to two case

    cmp     w0, wzr                      ; compare with zero (WZR is always zero)
    bne     .L38                         ; if not zero, jump to default

    adrp    x0, .LC12                    ; load page address of "zero"
    add     x0, x0, :lo12:.LC12          ; add low 12 bits offset
    bl      puts                         ; call puts
    b       .L32                         ; jump to exit

.L34:                                    ; one case
    adrp    x0, .LC13                    ; load page address of "one"
    add     x0, x0, :lo12:.LC13
    bl      puts
    b       .L32

.L35:                                    ; two case
    adrp    x0, .LC14                    ; load page address of "two"
    add     x0, x0, :lo12:.LC14
    bl      puts
    b       .L32

.L38:                                    ; default case
    adrp    x0, .LC15                    ; load page address of "something unknown"
    add     x0, x0, :lo12:.LC15
    bl      puts
    nop                                  ; no operation

.L32:                                    ; function exit
    ldp     x29, x30, [sp], 32           ; restore FP/LR and deallocate
    ret                                  ; return

The input value type is int, so the W0 register is used instead of the full X0 register.

String pointers are passed to puts() using the ADRP/ADD pair of instructions.
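How the ADRP/ADD pair rebuilds a full address can be sketched with plain integer arithmetic: ADRP yields the address of the 4 KB page that contains the symbol, and the :lo12: part supplies the low 12 bits. The function names and the sample address used in the usage note are hypothetical.

```c
#include <stdint.h>

/* The part ADRP provides: the symbol address with the low 12 bits cleared. */
static uint64_t adrp_part(uint64_t sym) { return sym & ~(uint64_t)0xFFF; }

/* The part ":lo12:" provides: the low 12 bits of the symbol address. */
static uint64_t lo12_part(uint64_t sym) { return sym & 0xFFF; }

/* ADRP followed by ADD: the two parts add up to the original address. */
static uint64_t adrp_add(uint64_t sym)
{
    return adrp_part(sym) + lo12_part(sym);
}
```

For a made-up address 0x412345, the page part is 0x412000 and the low part is 0x345.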

ARM64: Optimizing GCC (Linaro) 4.9

f12:
    cmp     w0, #1                       ; compare input with 1
    beq     .L31                         ; if equal, jump to one case

    cmp     w0, #2                       ; compare with 2
    beq     .L32                         ; if equal, jump to two case

    cbz     w0, .L35                     ; compare and branch if zero (jump to zero case)

    ; default case
    adrp    x0, .LC15                    ; load page address of "something unknown"
    add     x0, x0, :lo12:.LC15
    b       puts                         ; unconditional jump to puts

.L35:                                    ; zero case
    adrp    x0, .LC12                    ; load page address of "zero"
    add     x0, x0, :lo12:.LC12
    b       puts                         ; jump to puts

.L32:                                    ; two case
    adrp    x0, .LC14                    ; load page address of "two"
    add     x0, x0, :lo12:.LC14
    b       puts                         ; jump to puts

.L31:                                    ; one case
    adrp    x0, .LC13                    ; load page address of "one"
    add     x0, x0, :lo12:.LC13
    b       puts                         ; jump to puts

This code is better optimized. The CBZ (Compare and Branch on Zero) instruction jumps if W0 is zero.

There is also a direct jump to puts() instead of calling it, as explained before.

MIPS

f:
    lui     $gp, (__gnu_local_gp >> 16)  ; load upper 16 bits of global pointer

    ; is it 1?
    li      $v0, 1                       ; load immediate 1 into $v0
    beq     $a0, $v0, loc_60             ; if input == 1, jump to one case
    la      $gp, (__gnu_local_gp & 0xFFFF) ; load lower bits (branch delay slot)

    ; is it 2?
    li      $v0, 2                       ; load immediate 2
    beq     $a0, $v0, loc_4C             ; if input == 2, jump to two case
    or      $at, $zero                   ; NOP (branch delay slot)

    ; jump if not equal to 0
    bnez    $a0, loc_38                  ; if input != 0, jump to default
    or      $at, $zero                   ; NOP (branch delay slot)

    ; zero case
    lui     $a0, ($LC0 >> 16)            ; load upper bits of "zero" address
    lw      $t9, (puts & 0xFFFF)($gp)    ; load puts address from global pointer
    or      $at, $zero                   ; NOP (load delay slot)
    jr      $t9                          ; jump to puts
    la      $a0, ($LC0 & 0xFFFF)         ; load lower bits of "zero" (branch delay slot)

loc_38:                                  ; default case
    lui     $a0, ($LC3 >> 16)            ; load upper bits of "something unknown"
    lw      $t9, (puts & 0xFFFF)($gp)    ; load puts address
    or      $at, $zero                   ; NOP (load delay slot)
    jr      $t9                          ; jump to puts
    la      $a0, ($LC3 & 0xFFFF)         ; load lower bits (branch delay slot)

loc_4C:                                  ; two case
    lui     $a0, ($LC2 >> 16)            ; load upper bits of "two"
    lw      $t9, (puts & 0xFFFF)($gp)    ; load puts address
    or      $at, $zero                   ; NOP (load delay slot)
    jr      $t9                          ; jump to puts
    la      $a0, ($LC2 & 0xFFFF)         ; load lower bits (branch delay slot)

loc_60:                                  ; one case
    lui     $a0, ($LC1 >> 16)            ; load upper bits of "one"
    lw      $t9, (puts & 0xFFFF)($gp)    ; load puts address
    or      $at, $zero                   ; NOP (load delay slot)
    jr      $t9                          ; jump to puts
    la      $a0, ($LC1 & 0xFFFF)         ; load lower bits (branch delay slot)

This function always ends with a call to puts(), so we see a direct jump to puts() (JR means Jump Register) instead of using "jump and link".

We also see NOP instructions after the LW instructions. This is the load delay slot, another kind of delay slot in MIPS.

The instruction that follows LW may execute while LW is still loading the value from memory, so that instruction must not use the result of LW.

Modern MIPS processors have a feature to stall if the next instruction uses the result of LW, so this issue is considered obsolete. But GCC still adds NOP instructions to support older MIPS processors.

1.22.2 A lot of cases

If a switch() statement contains many cases, it is not very convenient for the compiler to emit long code with a lot of JE/JNE instructions.

#include <stdio.h>

void f (int a)
{
    switch (a)
    {
        case 0: printf ("zero\n"); break;
        case 1: printf ("one\n"); break;
        case 2: printf ("two\n"); break;
        case 3: printf ("three\n"); break;
        case 4: printf ("four\n"); break;
        default: printf ("something unknown\n"); break;
    }
}

int main()
{
    f (2); // test
}

x86: Non-optimizing MSVC

tv64 = -4                                ; temporary variable
_a$ = 8                                  ; parameter a offset
_f PROC
    push    ebp
    mov     ebp, esp
    push    ecx

    mov     eax, DWORD PTR _a$[ebp]      ; load a
    mov     DWORD PTR tv64[ebp], eax     ; store in temporary

    cmp     DWORD PTR tv64[ebp], 4       ; compare with maximal case value (4)
    ja      SHORT $LN1@f                 ; if above (unsigned), jump to default

    mov     ecx, DWORD PTR tv64[ebp]     ; load temporary into ECX
    jmp     DWORD PTR $LN11@f[ecx*4]     ; indirect jump using jump table

$LN6@f:                                  ; case 0
    push    OFFSET $SG739                ; "zero"
    call    _printf
    add     esp, 4
    jmp     SHORT $LN9@f

$LN5@f:                                  ; case 1
    push    OFFSET $SG741                ; "one"
    call    _printf
    add     esp, 4
    jmp     SHORT $LN9@f

$LN4@f:                                  ; case 2
    push    OFFSET $SG743                ; "two"
    call    _printf
    add     esp, 4
    jmp     SHORT $LN9@f

$LN3@f:                                  ; case 3
    push    OFFSET $SG745                ; "three"
    call    _printf
    add     esp, 4
    jmp     SHORT $LN9@f

$LN2@f:                                  ; case 4
    push    OFFSET $SG747                ; "four"
    call    _printf
    add     esp, 4
    jmp     SHORT $LN9@f

$LN1@f:                                  ; default case
    push    OFFSET $SG749                ; "something unknown"
    call    _printf
    add     esp, 4

$LN9@f:                                  ; function exit
    mov     esp, ebp
    pop     ebp
    ret     0

    npad    2                            ; align next label

$LN11@f:
    DD      $LN6@f                       ; table entry for case 0
    DD      $LN5@f                       ; case 1
    DD      $LN4@f                       ; case 2
    DD      $LN3@f                       ; case 3
    DD      $LN2@f                       ; case 4
_f ENDP

What we see here is a set of printf() calls with different arguments. Each of them has not only an address in the memory of the process but also an internal symbolic label assigned by the compiler. All these labels are also listed in the internal table $LN11@f.

At the beginning of the function, if a is greater than 4, control flow passes to label $LN1@f, where printf() is called with the argument "something unknown". The comparison is unsigned (JA means Jump if Above), so a negative a, seen as a very large unsigned number, ends up at the default label as well.
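The range check in front of the table can be modelled in C as follows. JA is an unsigned comparison, so one test rejects both values above 4 and negative values, which look like very large unsigned numbers. The function name goes_to_default() is hypothetical.

```c
/* Model of "cmp tv64, 4 / ja default": the comparison is unsigned.
   Returns 1 when control would go to the default label. */
static int goes_to_default(int a)
{
    return (unsigned)a > 4u;
}
```

With this single check, goes_to_default(-1) and goes_to_default(5) both return 1, while the values 0 to 4 return 0.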

But if the value of a is less than or equal to 4, it is multiplied by 4 and added to the address of table $LN11@f. This forms an address inside the table that points exactly to the element we need.

For example, let us say a equals 2.

2 * 4 = 8 (each table element is an address in a 32-bit process, so each element is 4 bytes in size). The address of table $LN11@f + 8 is the table element that stores the label named $LN4@f. The JMP instruction fetches the address $LN4@f from the table and jumps there.

This table is sometimes called a jumptable or branch table.

Then the appropriate printf() is called with the argument "two".

Literally, the instruction:

jmp DWORD PTR $LN11@f[ecx*4]

means: jump to the DWORD stored at address $LN11@f + ecx * 4.
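In C, the closest analogue of such a table is an array of function pointers indexed by the case value. The sketch below is an illustration rather than what the compiler generates from switch(); the handler names and dispatch() are hypothetical, and the handlers return their message so that the result can be checked.

```c
static const char *case0(void) { return "zero"; }
static const char *case1(void) { return "one"; }
static const char *case2(void) { return "two"; }
static const char *case3(void) { return "three"; }
static const char *case4(void) { return "four"; }
static const char *dflt(void)  { return "something unknown"; }

typedef const char *(*handler)(void);

/* Plays the role of $LN11@f: one pointer per case value. */
static handler const jump_table[] = { case0, case1, case2, case3, case4 };

static const char *dispatch(int a)
{
    if ((unsigned)a > 4u)        /* cmp ..., 4 / ja default */
        return dflt();
    /* Indexing scales a by the pointer size, like [ecx*4] on 32-bit x86. */
    return jump_table[a]();
}
```

Here dispatch(2) goes through the third table entry and returns "two", while any value outside 0..4 reaches dflt().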

npad is an assembly-language macro that aligns the next label on a 4-byte (or 16-byte) boundary. This suits the processor well, because it can fetch 32-bit values from memory through the memory bus, cache memory, etc., more efficiently when they are aligned.

x32dbg

We run the same example in x32dbg after compiling and start running it in the debugger.

The input value of the function (2) is loaded into EAX.

The input value is then checked: is it greater than 4? It is not, so the jump to the default case is not taken.

The jumptable can then be viewed by choosing Follow in Dump → Constant.

The jumptable now appears in the data window. It consists of 5 32-bit values.

ECX is now 2, so the third element of the table (index 2) will be used.

After the jump we are at 0x907218, and the code that prints "two" will now execute.

Non-optimizing GCC

Let us see what GCC 4.4.1 produces:

public f
f proc near
var_18 = dword ptr -18h
arg_0  = dword ptr  8

    push    ebp
    mov     ebp, esp
    sub     esp, 18h

    cmp     [ebp+arg_0], 4               ; compare input with 4
    ja      short loc_8048444            ; if above (unsigned), jump to default

    mov     eax, [ebp+arg_0]             ; load input
    shl     eax, 2                       ; multiply by 4 (shift left by 2)
    mov     eax, ds:off_804855C[eax]     ; load address from table
    jmp     eax                          ; jump to loaded address

loc_80483FE:                             ; case 0
    mov     [esp+18h+var_18], offset aZero ; "zero"
    call    _puts
    jmp     short locret_8048450

loc_804840C:                             ; case 1
    mov     [esp+18h+var_18], offset aOne ; "one"
    call    _puts
    jmp     short locret_8048450

loc_804841A:                             ; case 2
    mov     [esp+18h+var_18], offset aTwo ; "two"
    call    _puts
    jmp     short locret_8048450

loc_8048428:                             ; case 3
    mov     [esp+18h+var_18], offset aThree ; "three"
    call    _puts
    jmp     short locret_8048450

loc_8048436:                             ; case 4
    mov     [esp+18h+var_18], offset aFour ; "four"
    call    _puts
    jmp     short locret_8048450

loc_8048444:                             ; default case
    mov     [esp+18h+var_18], offset aSomethingUnkno ; "something unknown"
    call    _puts

locret_8048450:
    leave
    retn
f endp

off_804855C dd offset loc_80483FE        ; jump table
            dd offset loc_804840C
            dd offset loc_804841A
            dd offset loc_8048428
            dd offset loc_8048436

This is almost the same, with one small difference: the argument arg_0 is multiplied by 4 by shifting it left by 2 bits. The address of the label is then taken from the array off_804855C and stored in EAX, and JMP EAX performs the actual jump.
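The SHL/MOV pair can be modelled in C as a byte offset into an array of 32-bit values. This is a sketch under the assumption of a 32-bit target; the table contents are the label addresses from the listing above, and load_entry() is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* The five label addresses stored at off_804855C in the listing. */
static const uint32_t off_804855C[5] = {
    0x80483FE, 0x804840C, 0x804841A, 0x8048428, 0x8048436
};

/* Model of "shl eax, 2 / mov eax, ds:off_804855C[eax]". */
static uint32_t load_entry(const uint32_t *table, unsigned index)
{
    unsigned byte_off = index << 2;   /* shl eax, 2: index * 4 bytes */
    uint32_t target;

    /* Read 4 bytes at table base + byte offset. */
    memcpy(&target, (const unsigned char *)table + byte_off, sizeof target);
    return target;                    /* the value JMP EAX would jump to */
}
```

For index 2 the byte offset is 8 and the loaded value is 0x804841A, the address of the code for case 2.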

ARM: Optimizing Keil 6/2013 (ARM mode)

.text:00000174 f2
    CMP     R0, #5                       ; compare input with 5 (max case + 1)
    ADDCC   PC, PC, R0,LSL#2             ; if less than 5, add (input * 4) to PC
    B       default_case                 ; otherwise jump to default

loc_180:                                 ; case 0
    B       zero_case

loc_184:                                 ; case 1
    B       one_case

loc_188:                                 ; case 2
    B       two_case

loc_18C:                                 ; case 3
    B       three_case

loc_190:                                 ; case 4
    B       four_case

zero_case:
    ADR     R0, aZero                    ; load "zero"
    B       loc_1B8

one_case:
    ADR     R0, aOne                     ; load "one"
    B       loc_1B8

two_case:
    ADR     R0, aTwo                     ; load "two"
    B       loc_1B8

three_case:
    ADR     R0, aThree                   ; load "three"
    B       loc_1B8

four_case:
    ADR     R0, aFour                    ; load "four"

loc_1B8:
    B       __2printf                    ; jump to printf (no return to f2)

default_case:
    ADR     R0, aSomethingUnkno          ; load "something unknown"
    B       loc_1B8                      ; jump to printf

This code exploits the fact that all instructions in ARM mode are fixed size (4 bytes).

Recall that the maximum value of a is 4 and any larger value will cause the string "something unknown\n" to be printed.

The first instruction CMP R0, #5 compares the input value of a with 5.

The next instruction, ADDCC PC, PC, R0,LSL#2, executes only if R0 < 5 (CC = Carry Clear, which after CMP means "unsigned lower").

Thus, if ADDCC did not execute (i.e., R0 ≥ 5 case), a jump to label default_case occurs.

But if R0 < 5 and ADDCC executes, the following happens: the value of R0 is multiplied by 4. LSL#2 at the end of the instruction means "shift left by 2 bits", and, as we will see later in the "Shifts" section, shifting left by 2 bits is equivalent to multiplying by 4.

Then R0 * 4 is added to the current value in PC, thus jumping to one of the B (Branch) instructions below.

At the moment ADDCC executes, the value in PC is 8 bytes ahead (0x180) of the address at which the ADDCC instruction itself is located (0x178), or, in other words, two instructions ahead.

This is how the pipeline in ARM processors works: while ADDCC is being executed, the processor is already fetching the instruction two steps ahead, so that is where PC points. This has to be memorized.

If a = 0, nothing is added to PC, and the current PC value (8 bytes ahead) is written back into PC, so a jump to label loc_180 occurs, 8 bytes past the ADDCC instruction.

If a = 1, the new PC is 0x178 + 8 + a*4 = 0x178 + 8 + 4 = 0x184, which is the address of label loc_184.

With each increment of a, the resulting PC value increases by 4. And 4 is the length of an instruction in ARM mode, and also the length of each B instruction, of which there are 5 in a row.

Each of these five B instructions transfers control forward to what is programmed in the switch().

Loading the appropriate string pointer happens there, etc.
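The address arithmetic described above fits in one small function. It is a sketch of the calculation only: addcc_target() is a hypothetical name, and the addresses used in the usage note come from the listing.

```c
#include <stdint.h>

/* Target of "ADDCC PC, PC, R0,LSL#2" located at insn_addr, for R0 < 5. */
static uint32_t addcc_target(uint32_t insn_addr, uint32_t r0)
{
    uint32_t pc = insn_addr + 8;   /* in ARM mode PC reads two instructions ahead */
    return pc + (r0 << 2);         /* R0 * 4: one 4-byte B instruction per case */
}
```

With the instruction at 0x178, the inputs 0, 1 and 4 give 0x180, 0x184 and 0x190, which are the labels loc_180, loc_184 and loc_190.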

ARM: Optimizing Keil 6/2013 (Thumb mode)

.text:000000F6 f2
    PUSH    {R4,LR}
    MOVS    R3, R0
    BL      __ARM_common_switch8_thumb   ; call helper function
    DCB     5                            ; number of cases (excluding default)
    DCB     4, 6, 8, 0xA, 0xC, 0x10      ; five case offsets; the sixth entry appears to serve the default case

zero_case:
    ADR     R0, aZero
    B       loc_118

one_case:
    ADR     R0, aOne
    B       loc_118

two_case:
    ADR     R0, aTwo
    B       loc_118

three_case:
    ADR     R0, aThree
    B       loc_118

four_case:
    ADR     R0, aFour

loc_118:
    BL      __2printf
    POP     {R4,PC}

default_case:
    ADR     R0, aSomethingUnkno
    B       loc_118

In Thumb and Thumb-2 modes one cannot be sure that all instructions have the same size.

One could even say that in these modes instructions have variable length, as in x86.

Therefore a special table is added. It holds the number of cases (not counting the default case) and, for each case, an offset to the label to which control must pass.

There is a special function here to handle the table and transfer control, named __ARM_common_switch8_thumb. It starts with BX PC, whose purpose is to switch the processor to ARM mode.

Then the function responsible for handling the table is executed. It is too advanced to explain here now, so let us leave it.

Interestingly, the function uses the LR register as a pointer to the table.

Indeed, after calling this function, LR contains the address after the instruction BL __ARM_common_switch8_thumb, which is where the table starts.

It is also noteworthy that the code is generated as a separate reusable function, meaning the compiler does not emit the same code for each switch().

IDA successfully understood it as a service function and table, and added comments to the labels like: jumptable 000000FA case 0.
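The table lookup itself can be sketched in C. This is a guess at the general idea, not the actual code of __ARM_common_switch8_thumb: it assumes that an in-range index selects its own entry and that any other index selects the entry stored right after the last case. The names switch_table and switch8_entry() are hypothetical, and how the helper turns the selected entry into a target address is left out.

```c
#include <stdint.h>

/* Table bytes as they appear after the BL instruction in the listing:
   the case count first, then the offset entries. */
static const uint8_t switch_table[] = { 5, 4, 6, 8, 0xA, 0xC, 0x10 };

/* Assumed behaviour: in-range indexes pick their own entry, anything
   else picks the entry that follows the last case. */
static uint8_t switch8_entry(const uint8_t *t, uint32_t index)
{
    uint8_t count = t[0];

    if (index >= count)
        index = count;
    return t[1 + index];
}
```

Under these assumptions, index 0 selects the entry 4, index 4 selects 0xC, and any index of 5 or more selects 0x10.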

MIPS

f:
    lui     $gp, (__gnu_local_gp >> 16)

    sltiu   $v0, $a0, 5                  ; set $v0 to 1 if input < 5 (unsigned)
    bnez    $v0, loc_24                  ; if true, jump to table handling
    la      $gp, (__gnu_local_gp & 0xFFFF) ; branch delay slot

    ; input >= 5: default case
    lui     $a0, ($LC5 >> 16)            ; load "something unknown"
    lw      $t9, (puts & 0xFFFF)($gp)
    or      $at, $zero                   ; NOP
    jr      $t9                          ; jump to puts
    la      $a0, ($LC5 & 0xFFFF)         ; branch delay slot

loc_24:
    la      $v0, off_120                 ; load address of jump table

    sll     $a0, 2                       ; multiply input by 4
    addu    $a0, $v0, $a0                ; add to table base

    lw      $v0, 0($a0)                  ; load target address from table
    or      $at, $zero                   ; NOP

    jr      $v0                          ; jump to target
    or      $at, $zero                   ; branch delay slot, NOP

sub_44:                                  ; case 3
    lui     $a0, ($LC3 >> 16)            ; "three"
    lw      $t9, (puts & 0xFFFF)($gp)
    or      $at, $zero
    jr      $t9
    la      $a0, ($LC3 & 0xFFFF)

sub_58:                                  ; case 4
    lui     $a0, ($LC4 >> 16)            ; "four"
    lw      $t9, (puts & 0xFFFF)($gp)
    or      $at, $zero
    jr      $t9
    la      $a0, ($LC4 & 0xFFFF)

sub_6C:                                  ; case 0
    lui     $a0, ($LC0 >> 16)            ; "zero"
    lw      $t9, (puts & 0xFFFF)($gp)
    or      $at, $zero
    jr      $t9
    la      $a0, ($LC0 & 0xFFFF)

sub_80:                                  ; case 1
    lui     $a0, ($LC1 >> 16)            ; "one"
    lw      $t9, (puts & 0xFFFF)($gp)
    or      $at, $zero
    jr      $t9
    la      $a0, ($LC1 & 0xFFFF)

sub_94:                                  ; case 2
    lui     $a0, ($LC2 >> 16)            ; "two"
    lw      $t9, (puts & 0xFFFF)($gp)
    or      $at, $zero
    jr      $t9
    la      $a0, ($LC2 & 0xFFFF)

off_120:
    .word   sub_6C                       ; case 0
    .word   sub_80                       ; case 1
    .word   sub_94                       ; case 2
    .word   sub_44                       ; case 3
    .word   sub_58                       ; case 4

The new instruction for us here is SLTIU ("Set on Less Than Immediate Unsigned"). It is the same as SLTU ("Set on Less Than Unsigned"), but the "I" means "immediate", i.e., a number is written directly in the instruction.

BNEZ means "Branch if Not Equal to Zero". The code is very close to other ISAs.

SLL ("Shift Word Left Logical") multiplies by 4. The MIPS here is a 32-bit CPU, so all addresses in the jumptable are 32 bits wide.
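The SLTIU/BNEZ/SLL sequence can be modelled in C like this. It is a simplified sketch: sltiu() ignores how the real instruction encodes its immediate, and table_offset() is a hypothetical helper that returns the byte offset into off_120, or -1 when the default path is taken.

```c
#include <stdint.h>

/* Simplified model of SLTIU: 1 if rs < imm when both are treated as unsigned. */
static uint32_t sltiu(uint32_t rs, uint32_t imm)
{
    return rs < imm;
}

/* Byte offset into the jump table, or -1 for the default case. */
static int32_t table_offset(int32_t a)
{
    if (!sltiu((uint32_t)a, 5))             /* bnez not taken: a is not in 0..4 */
        return -1;
    return (int32_t)((uint32_t)a << 2);     /* sll $a0, 2: 4 bytes per entry */
}
```

Because the comparison is unsigned, a negative input fails the check and takes the default path just like an input of 5 or more.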

Conclusion

The general structure of switch():

MOV     REG, input
CMP     REG, 4                       ; highest case index (5 cases: 0..4)
JA      default
SHL     REG, 2                       ; shift for multiplication by 4 (x64 uses 3 bits)
MOV     REG, jump_table[REG]
JMP     REG

case1:
    ; do something
    JMP     exit

case2:
    ; do something
    JMP     exit

case3:
    ; do something
    JMP     exit

case4:
    ; do something
    JMP     exit

case5:
    ; do something
    JMP     exit

default:
    ...

exit:
    ....

jump_table dd case1
           dd case2
           dd case3
           dd case4
           dd case5

The jump to the address in the jump table can also be done using the instruction:

JMP jump_table[REG*4] or JMP jump_table[REG*8] in x64.

And the jumptable is just an array of pointers.

0xV3n0m
