1.21 Loops

x86

The author began this section by saying that there is a special instruction called LOOP in the x86 instruction set.

This instruction decrements the ECX register by one and, if the result is not zero, transfers control to the label specified in the LOOP operand.

Generally, this instruction is not very convenient, and modern compilers do not generate it automatically.

Therefore, if you see this instruction in code, it is highly likely that the code is hand-written Assembly.
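To make the behaviour of LOOP concrete, here is a minimal C model of a `MOV ECX, n` / body / `LOOP body` sequence. It assumes the documented order of operations (ECX is decremented first, then compared with zero); the function name loop_body_runs is hypothetical and exists only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: models
 *     MOV  ECX, n
 * body:
 *     ...
 *     LOOP body
 * and returns how many times the body would execute. */
uint64_t loop_body_runs(uint32_t ecx)
{
    uint64_t runs = 0;
    do {
        runs++;             /* the loop body */
        ecx--;              /* LOOP: decrement ECX first ... */
    } while (ecx != 0);     /* ... then jump back if ECX is not zero */
    return runs;
}
```

Under this model, starting with ECX = 10 runs the body 10 times. Starting with ECX = 0 does not skip the loop: the decrement wraps around to 0xFFFFFFFF and the body runs 2^32 times, which is one of the reasons the instruction is awkward to use.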

In C/C++, loops are usually implemented using:

for()
while()
do/while()

Let us start with for().

This statement specifies:

Loop initialization (setting the initial value of the counter)
Loop condition (is the counter greater or less than a certain limit?)
What happens at each iteration (increment or decrement)
And of course, the loop body

The general form:

for (initialization; condition; at_each_iteration) // general form of for loop
{
    loop_body; // the body of the loop
}

The generated code also consists of four parts.

Let us start with a simple example:

#include <stdio.h> // include standard I/O header file

void printing_function(int i) // define function that takes int parameter i
{
    printf ("f(%d)\n", i); // print "f(value_of_i)" followed by newline
}

int main() // program entry point
{
    int i; // declare integer variable i
    for (i=2; i<10; i++) // loop: start i=2, continue while i<10, increment i each iteration
        printing_function(i); // call printing_function with current i
    return 0; // return 0 to indicate successful termination
}

Result (MSVC 2010):

_i$ = -4                                 ; offset of local variable i on stack
_main PROC
    push    ebp                          ; save base pointer (standard prologue)
    mov     ebp, esp                     ; set up stack frame
    push    ecx                          ; allocate 4 bytes for local variable i

    mov     DWORD PTR _i$[ebp], 2        ; initialization: set i = 2
    jmp     SHORT $LN3@main              ; jump to condition check

$LN2@main:
    mov     eax, DWORD PTR _i$[ebp]      ; load current i into EAX
    add     eax, 1                       ; increment i
    mov     DWORD PTR _i$[ebp], eax      ; store incremented value back to i

$LN3@main:
    cmp     DWORD PTR _i$[ebp], 10       ; compare i with 10
    jge     SHORT $LN1@main              ; if i >= 10, exit loop

    mov     ecx, DWORD PTR _i$[ebp]      ; load i into ECX (argument for call)
    push    ecx                          ; push argument onto stack
    call    _printing_function           ; call printing_function
    add     esp, 4                       ; clean up stack (remove argument)

    jmp     SHORT $LN2@main              ; jump back to increment part

$LN1@main:                               ; loop exit point
    xor     eax, eax                     ; set return value to 0
    mov     esp, ebp                     ; restore ESP
    pop     ebp                          ; restore EBP (standard epilogue)
    ret     0                            ; return
_main ENDP

There is nothing particularly strange; I have marked the important parts in the code.
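The control flow of the non-optimized MSVC listing can be mirrored in C with goto statements, which makes the four parts (initialization, increment, condition, body) easy to tell apart. This is only a sketch: the function for_as_gotos and its out/cap parameters are hypothetical, and storing i into an array stands in for the call to printing_function().

```c
#include <stddef.h>

/* Hypothetical helper: records every value of i seen by the "loop body"
 * in out[] (at most cap entries) and returns the number of iterations. */
size_t for_as_gotos(int *out, size_t cap)
{
    size_t n = 0;
    int i;

    i = 2;                  /* initialization: mov DWORD PTR _i$[ebp], 2 */
    goto check;             /* jmp SHORT $LN3@main */
increment:                  /* $LN2@main */
    i++;
check:                      /* $LN3@main */
    if (i >= 10)            /* cmp / jge SHORT $LN1@main */
        goto done;
    if (n < cap)            /* loop body (stands in for the call) */
        out[n] = i;
    n++;
    goto increment;         /* jmp SHORT $LN2@main */
done:                       /* $LN1@main */
    return n;
}
```

Note that the increment block sits before the condition check in memory, yet it is skipped on the first pass because of the initial jump to check.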

GCC 4.4.1 produces almost the same code, with only a small difference:

main proc near
var_20 = dword ptr -20h
var_4  = dword ptr -4

    push    ebp                          ; save EBP
    mov     ebp, esp                     ; set up stack frame
    and     esp, 0FFFFFFF0h              ; align stack to 16-byte boundary
    sub     esp, 20h                     ; allocate local space

    mov     [esp+20h+var_4], 2           ; initialization: i = 2
    jmp     short loc_8048476            ; jump to condition check

loc_8048465:
    mov     eax, [esp+20h+var_4]         ; load i
    mov     [esp+20h+var_20], eax        ; pass i as argument
    call    printing_function            ; call function
    add     [esp+20h+var_4], 1           ; i++

loc_8048476:
    cmp     [esp+20h+var_4], 9           ; compare i with 9
    jle     short loc_8048465            ; if i <= 9, continue loop

    mov     eax, 0                       ; return value 0
    leave                                ; restore stack frame
    retn                                 ; return
main endp

Now let us see what MSVC produces when we enable optimization (/Ox):

_main PROC
    push    esi                          ; save ESI (callee-saved)
    mov     esi, 2                       ; i = 2 (using register)

$LL3@main:
    push    esi                          ; push i as argument
    call    _printing_function           ; call function
    inc     esi                          ; i++
    add     esp, 4                       ; clean up stack

    cmp     esi, 10                      ; compare i with 10
    jl      SHORT $LL3@main              ; if i < 10, continue loop

    xor     eax, eax                     ; return value 0
    pop     esi                          ; restore ESI
    ret     0                            ; return
_main ENDP

What happened here is slightly different: no stack space was allocated for the variable i, and the ESI register was used exclusively for it. This is possible in small functions with few local variables.

Another very important point: the printing_function() function must not change the value of ESI. The compiler is confident of this. If the compiler decided to use ESI inside printing_function(), it would have to save its value at the beginning and restore it at the end, as we saw with PUSH ESI / POP ESI.

Let us try GCC 4.4.1 with maximum optimization (-O3):

main proc near
var_10 = dword ptr -10h

    push    ebp                          ; save EBP
    mov     ebp, esp                     ; set up frame
    and     esp, 0FFFFFFF0h              ; align stack
    sub     esp, 10h                     ; allocate space

    mov     [esp+10h+var_10], 2          ; pass 2
    call    printing_function            ; call with 2
    mov     [esp+10h+var_10], 3          ; pass 3
    call    printing_function            ; call with 3
    mov     [esp+10h+var_10], 4          ; pass 4
    call    printing_function            ; call with 4
    mov     [esp+10h+var_10], 5          ; pass 5
    call    printing_function            ; call with 5
    mov     [esp+10h+var_10], 6          ; pass 6
    call    printing_function            ; call with 6
    mov     [esp+10h+var_10], 7          ; pass 7
    call    printing_function            ; call with 7
    mov     [esp+10h+var_10], 8          ; pass 8
    call    printing_function            ; call with 8
    mov     [esp+10h+var_10], 9          ; pass 9
    call    printing_function            ; call with 9

    xor     eax, eax                     ; return 0
    leave                                ; restore frame
    retn                                 ; return
main endp

Here the situation changed: the loop was completely unrolled (loop unrolling). This has the advantage of saving execution time by removing loop control instructions, but at the cost of larger code size.

Large fully unrolled loops are not recommended nowadays because large functions require more instruction cache space.
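The trade-off can be seen in source form as well. The sketch below is not what GCC emitted above (that loop was unrolled completely); it shows partial unrolling by a factor of 4, a common middle ground that removes most of the loop-control overhead without growing the code as much. The function names sum_rolled and sum_unrolled4 are hypothetical.

```c
#include <stddef.h>

/* Straightforward loop: one compare-and-branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Partially unrolled by hand: one compare-and-branch per 4 elements,
 * followed by a short tail loop for the 0..3 leftover elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)      /* tail: elements that did not fill a group of 4 */
        s += a[i];
    return s;
}
```

Both functions return the same result for any input; only the number of loop-control instructions executed and the code size differ.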

Let us increase the upper limit of i to 100 and try again.

GCC produces:

public main
main proc near
var_20 = dword ptr -20h

    push    ebp
    mov     ebp, esp
    and     esp, 0FFFFFFF0h
    push    ebx                          ; save EBX

    mov     ebx, 2                       ; i = 2
    sub     esp, 1Ch

    nop                                  ; alignment

loc_80484D0:
    mov     [esp+20h+var_20], ebx        ; pass i as argument
    add     ebx, 1                       ; i++
    call    printing_function            ; call function

    cmp     ebx, 64h                     ; i == 100?
    jnz     short loc_80484D0            ; continue loop if not

    add     esp, 1Ch
    xor     eax, eax                     ; return 0
    pop     ebx                          ; restore EBX
    mov     esp, ebp
    pop     ebp
    retn
main endp

This is very similar to what MSVC 2010 produces with optimization, except the EBX register is used as the counter i. GCC is confident that the register will not be changed inside printing_function(); otherwise it would save and restore it at the function entry/exit, as happened here in main().

x86: x32dbg

The author also used OllyDbg, but we will run this example in x32dbg as well.

We start by compiling the C code with the command:

cl /Ox /Ob0 /W3 /Fe:loop.exe test.c

We load the resulting EXE into x32dbg and go to main:

x32dbg showing main with detected loop and arrow on the right

At this point it is clear that x32dbg has detected the loop and shows the arrow on the right.

We start pressing F8 (Step Over) and notice that ESI increases by one each time:

x32dbg stepping through loop, ESI increasing

The last value the loop takes is 9, so after the increment the JL instruction is not executed and the function finishes:

x32dbg at loop exit after last iteration

x86: tracer

The author said that manually tracing inside a debugger is not very convenient, so we will try using tracer.

First, we open the example in IDA and find the address of the PUSH ESI instruction (which passes the only argument to printing_function()):

IDA showing PUSH ESI instruction at address 0x401026

The address in this case is 0x401026.

We then run tracer (link to the tool used in the book: https://yurichev.com/tracer-en.html):

PS D:\Project> ./tracer.exe -l:loop.exe bpx=loop.exe!0x00401026

The bpx option simply sets a breakpoint at that address; each time it is hit, tracer prints the register state.

In the tracer.log file we see:

Warning: no tracer.cfg file.
PID=18080|New process loop.exe
Warning: unknown (to us) INT3 breakpoint at ntdll.dll!LdrInitShimEngineDynamic+0x6e2 (0x77201ba2)
(0) loop.exe!0x401026
EAX=0x0022ce88 EBX=0x004c7000 ECX=0x00000000 EDX=0xc3de76bb
ESI=0x00000002 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=PF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000003 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF PF AF SF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000004 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF PF AF SF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000005 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF AF SF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000006 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF PF AF SF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000007 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF AF SF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000008 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF AF SF IF
(0) loop.exe!0x401026
EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415
ESI=0x00000009 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4
EIP=0x00211026
FLAGS=CF PF AF SF IF
PID=18080|Process loop.exe exited. ExitCode=0 (0x0)

We can see that the value of the ESI register changes from 2 to 9.

Moreover, tracer can collect register values for all addresses inside the function; this is called a trace.

It then generates an IDA .idc script that adds comments.

Since we know in IDA that the address of the main() function is 0x00401020, we run:

PS D:\Project> .\tracer.exe -l:loop.exe bpf=loop.exe!0x00401020,trace:cc

BPF means: set a breakpoint on the entire function.

As a result, two scripts are produced:

loops_2.exe.idc
loops_2.exe_clear.idc

We load loops_2.exe.idc into IDA and see:

IDA with tracer-generated comments showing ESI values from 2 to 9 and after increment 3 to 10

We notice that ESI can be from 2 to 9 at the start of the loop body, but after the increment it becomes from 3 to 0xA (10).

We also see that main() finishes with EAX = 0. Tracer also produces a file loops_2.exe.txt.

This file contains information about the number of times each instruction was executed and register values:

0x401020 (BASE+0x1020), e=       1 [PUSH ESI] ESI=0xbc7d70
0x401021 (BASE+0x1021), e=       1 [MOV ESI, 2]
0x401026 (BASE+0x1026), e=       8 [PUSH ESI] ESI=2..9
0x401027 (BASE+0x1027), e=       8 [CALL 0B41000h] tracing nested maximum level (1) reached, skipping this CALL 0B41000h=0xb41000
0x40102c (BASE+0x102c), e=       8 [INC ESI] ESI=2..9
0x40102d (BASE+0x102d), e=       8 [ADD ESP, 4] ESP=0x79f9b8
0x401030 (BASE+0x1030), e=       8 [CMP ESI, 0Ah] ESI=3..0xa
0x401033 (BASE+0x1033), e=       8 [JL 0B41026h] SF=false,true OF=false
0x401035 (BASE+0x1035), e=       1 [POP ESI]
0x401036 (BASE+0x1036), e=       1 [XOR EAX, EAX]
0x401038 (BASE+0x1038), e=       1 [RETN] EAX=0

Here we can use grep, etc.

ARM

Keil 6/2013: ARM mode (no optimization)

main
    STMFD   SP!, {R4,LR}                 ; save R4 and link register
    MOV     R4, #2                       ; i = 2
    B       loc_368                      ; jump to condition check

loc_35C                                      ; CODE XREF: main+1C
    MOV     R0, R4                       ; prepare argument (i)
    BL      printing_function            ; call printing_function
    ADD     R4, R4, #1                   ; i++

loc_368                                      ; CODE XREF: main+8
    CMP     R4, #0xA                     ; compare i with 10
    BLT     loc_35C                      ; if i < 10, continue loop
    MOV     R0, #0                       ; return value 0
    LDMFD   SP!, {R4,PC}                 ; restore R4 and return

Here the loop counter i is stored in R4. The MOV R4, #2 instruction only performs the initialization of i.

The instructions MOV R0, R4 and BL printing_function form the loop body: the first prepares the argument for printing_function(), the second calls it.

The ADD R4, R4, #1 instruction increments i at each iteration.

The CMP R4, #0xA instruction compares i with 10 (0xA).

The following BLT (Branch Less Than) instruction jumps if i is less than 10.

Otherwise, 0 is written to R0 (return value) and execution finishes.

Optimizing Keil 6/2013: Thumb mode

_main
    PUSH    {R4,LR}                      ; save R4 and LR
    MOVS    R4, #2                       ; i = 2

loc_132                                      ; CODE XREF: _main+E
    MOVS    R0, R4                       ; prepare argument
    BL      printing_function            ; call function
    ADDS    R4, R4, #1                   ; i++
    CMP     R4, #0xA                     ; compare i with 10
    BLT     loc_132                      ; if i < 10, continue
    MOVS    R0, #0                       ; return value 0
    POP     {R4,PC}                      ; restore R4 and return

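Practically the same as before. The only structural difference is that the initial jump to the condition check is gone, so the condition is tested only after the loop body.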

Optimizing Xcode 4.6.3 (LLVM): Thumb-2 mode

_main
    PUSH    {R4,R7,LR}
    MOVW    R4, #0x1124                  ; address of format string "%d\n"
    MOVS    R1, #2
    MOVT.W  R4, #0
    ADD     R7, SP, #4
    ADD     R4, PC                       ; R4 = effective address of string

    MOV     R0, R4
    BLX     _printf                      ; print 2
    MOV     R0, R4
    MOVS    R1, #3
    BLX     _printf                      ; print 3
    MOV     R0, R4
    MOVS    R1, #4
    BLX     _printf                      ; print 4
    MOV     R0, R4
    MOVS    R1, #5
    BLX     _printf                      ; print 5
    MOV     R0, R4
    MOVS    R1, #6
    BLX     _printf                      ; print 6
    MOV     R0, R4
    MOVS    R1, #7
    BLX     _printf                      ; print 7
    MOV     R0, R4
    MOVS    R1, #8
    BLX     _printf                      ; print 8
    MOV     R0, R4
    MOVS    R1, #9
    BLX     _printf                      ; print 9

    MOVS    R0, #0
    POP     {R4,R7,PC}

In fact, this was the content of the printing_function() in my case:

void printing_function(int i)
{
    printf ("%d\n", i);
}

So LLVM not only unrolled the loop, but also inlined the simple printing_function(), placing its body 8 times instead of calling it. This is possible when the function is very simple and called a small number of times (as here).

ARM64: Optimizing GCC 4.9.1

printing_function:
    mov     w1, w0                       ; move argument to W1
    adrp    x0, .LC0                     ; load page address of string
    add     x0, x0, :lo12:.LC0            ; add low 12 bits offset
    b       printf                       ; jump to printf

main:
    stp     x29, x30, [sp, -32]!          ; save frame pointer and link register
    add     x29, sp, 0                   ; set frame pointer
    str     x19, [sp,16]                 ; save X19 (callee-saved)
    mov     w19, 2                       ; i = 2 (using callee-saved X19)

.L3:
    mov     w0, w19                      ; prepare argument
    add     w19, w19, 1                  ; i++
    bl      printing_function            ; call function
    cmp     w19, 10                      ; compare i with 10
    bne     .L3                          ; if i != 10, continue loop

    mov     w0, 0                        ; return value 0
    ldr     x19, [sp,16]                 ; restore X19
    ldp     x29, x30, [sp], 32           ; restore FP/LR and deallocate
    ret                                  ; return

.LC0:
    .string "f(%d)\n"

ARM64: Non-optimizing GCC 4.9.1

.LC0:
    .string "f(%d)\n"

printing_function:
    stp     x29, x30, [sp, -32]!          ; save FP and LR
    add     x29, sp, 0                   ; set frame pointer
    str     w0, [x29,28]                 ; store argument on stack
    adrp    x0, .LC0                     ; load string page
    add     x0, x0, :lo12:.LC0            ; add offset
    ldr     w1, [x29,28]                 ; load argument
    bl      printf                       ; call printf
    ldp     x29, x30, [sp], 32           ; restore and deallocate
    ret                                  ; return

main:
    stp     x29, x30, [sp, -32]!          ; save FP and LR
    add     x29, sp, 0                   ; set frame pointer
    mov     w0, 2                        ; load 2
    str     w0, [x29,28]                 ; store i = 2
    b       .L3                          ; jump to condition check

.L4:
    ldr     w0, [x29,28]                 ; load i
    bl      printing_function            ; call function
    ldr     w0, [x29,28]                 ; load i
    add     w0, w0, 1                    ; i++
    str     w0, [x29,28]                 ; store new i

.L3:
    ldr     w0, [x29,28]                 ; load i
    cmp     w0, 9                        ; compare i with 9
    ble     .L4                          ; if i <= 9, continue loop
    mov     w0, 0                        ; return 0
    ldp     x29, x30, [sp], 32           ; restore and deallocate
    ret                                  ; return

MIPS

Listing 1.176: Non-optimizing GCC 4.4.5 (IDA)

main:
    addiu   $sp, -0x28                   ; allocate stack frame (40 bytes)
    sw      $ra, 0x28-4($sp)             ; save return address
    sw      $fp, 0x28-8($sp)             ; save frame pointer
    move    $fp, $sp                     ; set frame pointer

    li      $v0, 2                       ; load immediate 2 into $v0
    sw      $v0, 0x28-0x10($fp)          ; store i = 2

    b       loc_9C                       ; jump to condition check
    or      $at, $zero                   ; NOP (delay slot)

loc_80:
    lw      $a0, 0x28-0x10($fp)          ; load i into $a0 (argument)
    jal     printing_function            ; call printing_function
    or      $at, $zero                   ; NOP (delay slot)

    lw      $v0, 0x28-0x10($fp)          ; load i
    addiu   $v0, 1                       ; i++
    sw      $v0, 0x28-0x10($fp)          ; store new i

loc_9C:
    lw      $v0, 0x28-0x10($fp)          ; load i
    slti    $v0, 0xA                     ; set $v0 to 1 if i < 10
    bnez    $v0, loc_80                  ; if true, continue loop
    or      $at, $zero                   ; NOP (delay slot)

    move    $v0, $zero                   ; return value 0
    move    $sp, $fp                     ; restore stack pointer
    lw      $ra, 0x28-4($sp)             ; restore return address
    lw      $fp, 0x28-8($sp)             ; restore frame pointer
    addiu   $sp, 0x28                    ; deallocate stack frame
    jr      $ra                          ; return
    or      $at, $zero                   ; NOP (delay slot)

The only instruction that is new here is B. It is a pseudo-instruction: the assembler encodes it as BEQ $zero, $zero, label, a branch that is always taken.

Final note

In the generated code we notice that after initializing i, the loop body is not executed immediately: the condition is checked first, and only then may the loop body execute. This is correct, because if the condition is not met from the beginning, the loop body should not execute at all.

For example:

for (i=0; i<total_entries_to_process; i++)
    loop_body;

Therefore the condition is checked before execution. However, with optimization the compiler may rearrange the order of condition check and loop body if it is sure that this case cannot occur (as in our simple example, and with compilers like Keil, Xcode (LLVM), MSVC with optimization).

1.21.2 Memory blocks copying routine

In real-world practice, memory copy routines often copy 4 or 8 bytes per iteration and may use SIMD, vectorization, and other advanced techniques.

To keep things simple, this example is the simplest possible form:

#include <stdio.h> // include standard I/O header

void my_memcpy (unsigned char* dst, unsigned char* src, size_t cnt) // define byte-by-byte memory copy function
{
    size_t i; // declare counter i
    for (i=0; i<cnt; i++) // loop from 0 to cnt-1
        dst[i]=src[i]; // copy byte from source to destination
}

Listing 1.177: GCC 4.9 x64 optimized for size (-Os)

my_memcpy:
    ; RDI = destination address
    ; RSI = source address
    ; RDX = block size
    xor     eax, eax                     ; initialize counter i to 0 (EAX = 0)

.L2:
    cmp     rax, rdx                     ; have all bytes been copied? if yes, exit
    je      .L5                          ; jump to exit if done

    mov     cl, BYTE PTR [rsi+rax]       ; load byte from source + i into CL
    mov     BYTE PTR [rdi+rax], cl       ; store byte into destination + i

    inc     rax                          ; i++
    jmp     .L2                          ; repeat loop

.L5:
    ret                                  ; return

Listing 1.178: GCC 4.9 ARM64 optimized for size (-Os)

my_memcpy:
    ; X0 = destination address
    ; X1 = source address
    ; X2 = block size
    mov     x3, 0                        ; initialize counter i to 0

.L2:
    cmp     x3, x2                       ; are we done copying?
    beq     .L5                          ; if yes, exit

    ldrb    w4, [x1,x3]                  ; load byte from source + i into W4
    strb    w4, [x0,x3]                  ; store byte into destination + i

    add     x3, x3, 1                    ; i++
    b       .L2                          ; repeat loop

.L5:
    ret                                  ; return

Listing 1.179: Optimizing Keil 6/2013 (Thumb mode)

my_memcpy PROC
    ; R0 = destination address
    ; R1 = source address
    ; R2 = block size
    PUSH    {r4,lr}                      ; save R4 and LR

    MOVS    r3,#0                        ; initialize counter i to 0

    B       |L0.12|                      ; jump to condition check (at end of loop)

|L0.6|
    LDRB    r4,[r1,r3]                   ; load byte from source + i into R4
    STRB    r4,[r0,r3]                   ; store byte into destination + i
    ADDS    r3,r3,#1                     ; i++

|L0.12|
    CMP     r3,r2                        ; i < size?
    BCC     |L0.6|                       ; if yes, continue loop

    POP     {r4,pc}                      ; restore R4 and return
    ENDP

ARM – ARM mode

Keil in ARM mode fully exploits conditional suffixes.

my_memcpy PROC
    ; R0 = destination address
    ; R1 = source address
    ; R2 = block size
    MOV     r3,#0                        ; initialize counter i to 0

|L0.4|
    CMP     r3,r2                        ; have all bytes been copied?

    LDRBCC  r12,[r1,r3]                  ; load byte from source + i (conditional: only if i < size)
    STRBCC  r12,[r0,r3]                  ; store byte into destination + i (conditional)
    ADDCC   r3,r3,#1                     ; i++ (conditional)

    BCC     |L0.4|                       ; jump back to loop start if i < size

    BX      lr                           ; return
    ENDP

Thus there is only one jump instead of two.

MIPS

Listing 1.181: GCC 4.4.5 optimized for size (-Os) (IDA)

my_memcpy:
    b       loc_14                       ; jump to condition check
    move    $v0, $zero                   ; initialize counter i to 0 (delay slot)

loc_8:                                   ; CODE XREF: my_memcpy+1C
    lbu     $v1, 0($t0)                  ; load unsigned byte from source address in $t0 into $v1
    addiu   $v0, 1                       ; i++
    sb      $v1, 0($a3)                  ; store byte into destination address in $a3

loc_14:                                  ; CODE XREF: my_memcpy
    sltu    $v1, $v0, $a2                ; set $v1 to 1 if i < cnt
    addu    $t0, $a1, $v0                ; compute source + i address
    bnez    $v1, loc_8                   ; if i < cnt, continue loop
    addu    $a3, $a0, $v0                ; compute destination + i address (delay slot)

    jr      $ra                          ; return
    or      $at, $zero                   ; NOP (delay slot)

Here are some instructions:

LBU: Load Byte Unsigned (zero-extends the rest of the bits)
LB: Load Byte with sign extension
SB: Store Byte (lowest 8 bits)

Like ARM, all MIPS registers are 32-bit, so even when working with a single byte, a full 32-bit register must be used.
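The difference between LBU and LB corresponds to what C does when a byte is widened to 32 bits through an unsigned or a signed 8-bit type. The following sketch illustrates this; the helper names load_byte_unsigned, load_byte_signed and store_byte are hypothetical and only mimic the effect of the three instructions on a 32-bit register.

```c
#include <stdint.h>

/* Like LBU: the byte is zero-extended, so the upper 24 bits become 0. */
uint32_t load_byte_unsigned(const void *p)
{
    return *(const uint8_t *)p;
}

/* Like LB: the byte is sign-extended, so bit 7 is copied into the
 * upper 24 bits. */
int32_t load_byte_signed(const void *p)
{
    return *(const int8_t *)p;
}

/* Like SB: only the lowest 8 bits of the register are written. */
void store_byte(void *p, uint32_t reg)
{
    *(uint8_t *)p = (uint8_t)reg;
}
```

For a byte holding 0xFF, the first helper yields 255 (0x000000FF) and the second yields -1 (0xFFFFFFFF). Since my_memcpy() works with unsigned char, the compiler picked LBU, although for a plain copy either load would store the same low 8 bits.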

Vectorization

An optimized GCC version can do much more with this example, which will be explained later.
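As a rough idea of what "copying more than one byte per iteration" means, here is a hand-written sketch that moves 8 bytes at a time and finishes with a byte loop. It is not GCC's vectorized output, and the function name my_memcpy_by_words is hypothetical. It assumes the two buffers do not overlap, and it uses fixed-size memcpy() calls for the 8-byte moves to stay clear of alignment and aliasing problems; optimizing compilers usually turn such calls into a single load and store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of my_memcpy(): 8 bytes per iteration, then a tail.
 * Assumes dst and src do not overlap. */
void my_memcpy_by_words(unsigned char *dst, const unsigned char *src, size_t cnt)
{
    size_t i = 0;

    /* main loop: one 64-bit chunk per iteration */
    for (; i + sizeof(uint64_t) <= cnt; i += sizeof(uint64_t)) {
        uint64_t chunk;
        memcpy(&chunk, src + i, sizeof chunk);   /* 8-byte load  */
        memcpy(dst + i, &chunk, sizeof chunk);   /* 8-byte store */
    }

    /* tail: the remaining 0..7 bytes, copied one by one */
    for (; i < cnt; i++)
        dst[i] = src[i];
}
```

For a 19-byte block this performs two 8-byte iterations followed by three single-byte copies, instead of 19 iterations of the simple loop.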

1.21.3 Condition check

The author explained that it is very important to remember that in the for() construct, the condition is not checked at the end, but at the beginning before the loop body executes. However, in many cases it is easier for the compiler to check the condition at the end after the loop body, sometimes adding an extra check at the beginning.

An example of this:

#include <stdio.h>

void f(int start, int finish)
{
    for (; start<finish; start++)
        printf ("%d\n", start);
}

GCC 5.4.0 x64 – Optimized

f:
    cmp     edi, esi                     ; condition check (1)
    jge     .L9                          ; if start >= finish, exit

    push    rbp
    push    rbx
    mov     ebp, esi                     ; save finish
    mov     ebx, edi                     ; save start
    sub     rsp, 8

.L5:
    mov     edx, ebx                     ; prepare argument (current value)
    xor     eax, eax
    mov     esi, OFFSET FLAT:.LC0        ; "%d\n"
    mov     edi, 1
    add     ebx, 1                       ; increment current value
    call    __printf_chk                 ; print

    cmp     ebp, ebx                     ; condition check (2)
    jne     .L5                          ; continue if current != finish

    add     rsp, 8
    pop     rbx
    pop     rbp

.L9:
    rep ret                              ; return

Here we see two condition checks.

Hex-Rays (at least version 2.2.0) decompiles it as:

void __cdecl f(unsigned int start, unsigned int finish)
{
    unsigned int v2; // ebx@2
    __int64 v3;      // rdx@3

    if ( (signed int)start < (signed int)finish )
    {
        v2 = start;
        do
        {
            v3 = v2++;
            _printf_chk(1LL, "%d\n", v3);
        }
        while ( finish != v2 );
    }
}

In this case, we can confidently replace the do/while() with for() and also remove the first condition check.
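To check that claim, the two shapes can be put side by side in C. In this sketch the names f_for and f_guarded_do_while are hypothetical, and writing into out[] (with capacity cap) stands in for the printf() call so that the results can be compared.

```c
#include <stddef.h>

/* The loop as written in the source. */
size_t f_for(int start, int finish, int *out, size_t cap)
{
    size_t n = 0;
    for (; start < finish; start++) {
        if (n < cap)
            out[n] = start;
        n++;
    }
    return n;
}

/* The shape produced by the compiler and shown by the decompiler. */
size_t f_guarded_do_while(int start, int finish, int *out, size_t cap)
{
    size_t n = 0;
    if (start < finish) {               /* condition check (1) */
        do {
            if (n < cap)
                out[n] = start;
            n++;
            start++;
        } while (finish != start);      /* condition check (2) */
    }
    return n;
}
```

The second check can use != instead of < because the guard has already established start < finish and the counter grows by exactly 1, so it must hit finish before it can pass it. Without the guard, a call with start >= finish would run the body at least once, which is why the compiler keeps the first check.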

1.21.4 Conclusion

General skeleton for a loop from 2 to 9 inclusive:

Listing 1.182: x86

    mov     [counter], 2                 ; initialization
    jmp     check                        ; jump to condition check

body:
    ; loop body
    ; do whatever you want here
    ; use the counter variable from the stack
    add     [counter], 1             ; increment

check:
    cmp     [counter], 9             ; compare with 9
    jle     body                     ; if <= 9, continue loop

The increment can be done with 3 instructions in non-optimized code:

Listing 1.183: x86

    MOV     [counter], 2                 ; initialization
    JMP     check

body:
    ; loop body
    ; do whatever you want here
    ; use the counter variable from the stack

    MOV     REG, [counter]           ; increment
    INC     REG
    MOV     [counter], REG

check:
    CMP     [counter], 9
    JLE     body

If the loop body is small, we can dedicate a full register to the counter:

Listing 1.184: x86

    MOV     EBX, 2                       ; initialization
    JMP     check

body:
    ; loop body
    ; do whatever you want here
    ; use the counter in EBX, but do not modify it!

    INC     EBX                      ; increment

check:
    CMP     EBX, 9
    JLE     body

Sometimes the compiler changes the order of loop parts:

Listing 1.185: x86

    MOV     [counter], 2                 ; initialization
    JMP     label_check

label_increment:
    ADD     [counter], 1             ; increment

label_check:
    CMP     [counter], 10
    JGE     exit

    ; loop body
    ; do whatever you want here
    ; use the counter variable from the stack

    JMP     label_increment

exit:

Usually the condition is checked before the loop body, but the compiler may reverse it and place it after the loop body. This happens when the compiler is sure the condition is true on the first iteration, meaning the loop body must execute at least once.

Listing 1.186: x86

    MOV     REG, 2                       ; initialization

body:
    ; loop body
    ; do whatever you want here
    ; use the counter in REG, but do not modify it!

    INC     REG                      ; increment
    CMP     REG, 10
    JL      body

Using the LOOP instruction

This is very rare, and compilers do not use it.

If you see it, the code is most likely hand-written by a human, not a compiler.

Listing 1.187: x86

    ; counting down from 10 to 1
    MOV     ECX, 10

body:
    ; loop body
    ; do whatever you want here
    ; use the counter in ECX, but do not modify it!

    LOOP    body

ARM

In this example, register R4 is dedicated to the counter:

Listing 1.188: ARM

    MOV     R4, #2                       ; initialization
    B       check

body:
    ; loop body
    ; do whatever you want here
    ; use the counter in R4, but do not modify it!

    ADD     R4, R4, #1               ; increment

check:
    CMP     R4, #10
    BLT     body

0xV3n0m
