1.21 Loops
x86
The author began this section by saying that there is a special instruction called LOOP in the x86 instruction set.
This instruction checks the value in the ECX register; if it is not zero, it decrements ECX by one and transfers control to the label specified in the LOOP operand.
Generally, this instruction is not very convenient, and modern compilers do not generate it automatically.
Therefore, if you see this instruction in code, it is highly likely that the code is hand-written Assembly.
In C/C++, loops are usually implemented using:
for()while()do/while()
Let us start with for().
This statement specifies:
- Loop initialization (setting the initial value of the counter)
- Loop condition (is the counter greater or less than a certain limit?)
- What happens at each iteration (increment or decrement)
- And of course, the loop body
The general form:
for (initialization; condition; at_each_iteration) // general form of for loop{ loop_body; // the body of the loop};The generated code also consists of four parts.
Let us start with a simple example:
#include <stdio.h> // include standard I/O header file
void printing_function(int i) // define function that takes int parameter i{ printf ("f(%d)\n", i); // print "f(value_of_i)" followed by newline};
int main() // program entry point{ int i; // declare integer variable i for (i=2; i<10; i++) // loop: start i=2, continue while i<10, increment i each iteration printing_function(i); // call printing_function with current i return 0; // return 0 to indicate successful termination};Result (MSVC 2010):
_i$ = -4 ; offset of local variable i on stack_main PROC push ebp ; save base pointer (standard prologue) mov ebp, esp ; set up stack frame push ecx ; allocate 4 bytes for local variable i
mov DWORD PTR _i$[ebp], 2 ; initialization: set i = 2 jmp SHORT $LN3@main ; jump to condition check
$LN2@main: mov eax, DWORD PTR _i$[ebp] ; load current i into EAX add eax, 1 ; increment i mov DWORD PTR _i$[ebp], eax ; store incremented value back to i
$LN3@main: cmp DWORD PTR _i$[ebp], 10 ; compare i with 10 jge SHORT $LN1@main ; if i >= 10, exit loop
mov ecx, DWORD PTR _i$[ebp] ; load i into ECX (argument for call) push ecx ; push argument onto stack call _printing_function ; call printing_function add esp, 4 ; clean up stack (remove argument)
jmp SHORT $LN2@main ; jump back to increment part
$LN1@main: ; loop exit point xor eax, eax ; set return value to 0 mov esp, ebp ; restore ESP pop ebp ; restore EBP (standard epilogue) ret 0 ; return_main ENDPThere is nothing particularly strange; I have marked the important parts in the code.
GCC 4.4.1 produces almost the same code, with only a small difference:
main proc nearvar_20 = dword ptr -20hvar_4 = dword ptr -4
push ebp ; save EBP mov ebp, esp ; set up stack frame and esp, 0FFFFFFF0h ; align stack to 16-byte boundary sub esp, 20h ; allocate local space
mov [esp+20h+var_4], 2 ; initialization: i = 2 jmp short loc_8048476 ; jump to condition check
loc_8048465: mov eax, [esp+20h+var_4] ; load i mov [esp+20h+var_20], eax ; pass i as argument call printing_function ; call function add [esp+20h+var_4], 1 ; i++
loc_8048476: cmp [esp+20h+var_4], 9 ; compare i with 9 jle short loc_8048465 ; if i <= 9, continue loop
mov eax, 0 ; return value 0 leave ; restore stack frame retn ; returnmain endp_main PROC push esi ; save ESI (callee-saved) mov esi, 2 ; i = 2 (using register)
$LL3@main: push esi ; push i as argument call _printing_function ; call function inc esi ; i++ add esp, 4 ; clean up stack
cmp esi, 10 ; compare i with 10 jl SHORT $LL3@main ; if i < 10, continue loop
xor eax, eax ; return value 0 pop esi ; restore ESI ret 0 ; return_main ENDPNow let us see what happens when we enable optimization (/Ox):
What happened here is slightly different: no stack space was allocated for the variable i, and the ESI register was used exclusively for it. This is possible in small functions with few local variables.
Another very important point: the printing_function() function must not change the value of ESI. The compiler is confident of this. If the compiler decided to use ESI inside printing_function(), it would have to save its value at the beginning and restore it at the end, as we saw with PUSH ESI / POP ESI.
Let us try GCC 4.4.1 with maximum optimization (-O3):
main proc nearvar_10 = dword ptr -10h
push ebp ; save EBP mov ebp, esp ; set up frame and esp, 0FFFFFFF0h ; align stack sub esp, 10h ; allocate space
mov [esp+10h+var_10], 2 ; pass 2 call printing_function ; call with 2 mov [esp+10h+var_10], 3 ; pass 3 call printing_function ; call with 3 mov [esp+10h+var_10], 4 ; pass 4 call printing_function ; call with 4 mov [esp+10h+var_10], 5 ; pass 5 call printing_function ; call with 5 mov [esp+10h+var_10], 6 ; pass 6 call printing_function ; call with 6 mov [esp+10h+var_10], 7 ; pass 7 call printing_function ; call with 7 mov [esp+10h+var_10], 8 ; pass 8 call printing_function ; call with 8 mov [esp+10h+var_10], 9 ; pass 9 call printing_function ; call with 9
xor eax, eax ; return 0 leave ; restore frame retn ; returnmain endpHere the situation changed: the loop was completely unrolled (loop unrolling). This has the advantage of saving execution time by removing loop control instructions, but at the cost of larger code size.
Large fully unrolled loops are not recommended nowadays because large functions require more instruction cache space.
Let us increase the upper limit of i to 100 and try again.
GCC produces:
public mainmain proc nearvar_20 = dword ptr -20h
push ebp mov ebp, esp and esp, 0FFFFFFF0h push ebx ; save EBX
mov ebx, 2 ; i = 2 sub esp, 1Ch
nop ; alignment
loc_80484D0: mov [esp+20h+var_20], ebx ; pass i as argument add ebx, 1 ; i++ call printing_function ; call function
cmp ebx, 64h ; i == 100? jnz short loc_80484D0 ; continue loop if not
add esp, 1Ch xor eax, eax ; return 0 pop ebx ; restore EBX mov esp, ebp pop ebp retnmain endp
This is very similar to what MSVC 2010 produces with optimization, except the EBX register is used as the counter i. GCC is confident that the register will not be changed inside printing_function(); otherwise it would save and restore it at the function entry/exit, as happened here in main().
x86: x32dbg
The author also used OllyDbg, but we will run this example in x32dbg as well.
We start by compiling the C code with the command:
cl /Ox /Ob0 /W3 /Fe:loop.exe test.c
We load the resulting EXE into x32dbg and go to main:
At this point it is clear that x32dbg has detected the loop and shows the arrow on the right.
We start pressing F8 (Step Over) and notice that ESI increases by one each time:
The last value the loop takes is 9, so after the increment the JL instruction is not executed and the function finishes:
x86: tracer
The author said that manually tracing inside a debugger is not very convenient, so we will try using tracer.
First, we open the example in IDA and find the address of the PUSH ESI instruction (which passes the only argument to printing_function()):
The address in this case is 0x401026.
We then run tracer (link to the tool used in the book: https://yurichev.com/tracer-en.html):
PS D:\Project> ./tracer.exe -l:loop.exe bpx=loop.exe!0x00401026
The -bpx option sets a breakpoint only at that address, and tracer then prints the register state.
In the tracer.log file we see:
Warning: no tracer.cfg file.PID=18080|New process loop.exeWarning: unknown (to us) INT3 breakpoint at ntdll.dll!LdrInitShimEngineDynamic+0x6e2 (0x77201ba2)(0) loop.exe!0x401026EAX=0x0022ce88 EBX=0x004c7000 ECX=0x00000000 EDX=0xc3de76bbESI=0x00000002 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=PF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000003 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF PF AF SF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000004 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF PF AF SF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000005 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF AF SF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000006 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF PF AF SF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000007 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF AF SF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000008 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF AF SF IF(0) loop.exe!0x401026EAX=0x00000005 EBX=0x004c7000 ECX=0x0020fe30 EDX=0x007b9415ESI=0x00000009 EDI=0x007ba678 EBP=0x0020ff1c ESP=0x0020fed4EIP=0x00211026FLAGS=CF PF AF SF IFPID=18080|Process loop.exe exited. ExitCode=0 (0x0)
We can see that the value of the ESI register changes from 2 to 9.
Moreover, tracer can collect register values for all addresses inside the function—this is called a trace.
It then generates an IDA .idc script that adds comments.
Since we know in IDA that the address of the main() function is 0x00401020, we run:
PS D:\Project> .\tracer.exe -l:loop.exe bpf=loop.exe!0x00401020,trace:ccBPF means: set a breakpoint on the entire function.
As a result, two scripts are produced:
loops_2.exe.idcloops_2.exe_clear.idc
We load loops_2.exe.idc into IDA and see:
We notice that ESI can be from 2 to 9 at the start of the loop body, but after the increment it becomes from 3 to 0xA (10).
We also see that main() finishes with EAX = 0. Tracer also produces a file loops_2.exe.txt.
This file contains information about the number of times each instruction was executed and register values:
0x401020 (BASE+0x1020), e= 1 [PUSH ESI] ESI=0xbc7d700x401021 (BASE+0x1021), e= 1 [MOV ESI, 2]0x401026 (BASE+0x1026), e= 8 [PUSH ESI] ESI=2..90x401027 (BASE+0x1027), e= 8 [CALL 0B41000h] tracing nested maximum level (1) reached, skipping this CALL 0B41000h=0xb410000x40102c (BASE+0x102c), e= 8 [INC ESI] ESI=2..90x40102d (BASE+0x102d), e= 8 [ADD ESP, 4] ESP=0x79f9b80x401030 (BASE+0x1030), e= 8 [CMP ESI, 0Ah] ESI=3..0xa0x401033 (BASE+0x1033), e= 8 [JL 0B41026h] SF=false,true OF=false0x401035 (BASE+0x1035), e= 1 [POP ESI]0x401036 (BASE+0x1036), e= 1 [XOR EAX, EAX]0x401038 (BASE+0x1038), e= 1 [RETN] EAX=0Here we can use grep, etc.
ARM
Keil 6/2013: ARM mode (no optimization)
main STMFD SP!, {R4,LR} ; save R4 and link register MOV R4, #2 ; i = 2 B loc_368 ; jump to condition check
loc_35C ; CODE XREF: main+1C MOV R0, R4 ; prepare argument (i) BL printing_function ; call printing_function ADD R4, R4, #1 ; i++
loc_368 ; CODE XREF: main+8 CMP R4, #0xA ; compare i with 10 BLT loc_35C ; if i < 10, continue loop MOV R0, #0 ; return value 0 LDMFD SP!, {R4,PC} ; restore R4 and return
Here the loop counter i is stored in R4. The MOV R4, #2 instruction only performs the initialization of i.
The instructions MOV R0, R4 and BL printing_function form the loop body: the first prepares the argument for printing_function(), the second calls it.
The ADD R4, R4, #1 instruction increments i at each iteration.
The CMP R4, #0xA instruction compares i with 10 (0xA).
The following BLT (Branch Less Than) instruction jumps if i is less than 10.
Otherwise, 0 is written to R0 (return value) and execution finishes.
Optimizing Keil 6/2013: Thumb mode
_main PUSH {R4,LR} ; save R4 and LR MOVS R4, #2 ; i = 2
loc_132 ; CODE XREF: _main+E MOVS R0, R4 ; prepare argument BL printing_function ; call function ADDS R4, R4, #1 ; i++ CMP R4, #0xA ; compare i with 10 BLT loc_132 ; if i < 10, continue MOVS R0, #0 ; return value 0 POP {R4,PC} ; restore R4 and returnSame as before, no difference.
Optimizing Xcode 4.6.3 (LLVM): Thumb-2 mode
_main PUSH {R4,R7,LR} MOVW R4, #0x1124 ; address of format string "%d\n" MOVS R1, #2 MOVT.W R4, #0 ADD R7, SP, #4 ADD R4, PC ; R4 = effective address of string
MOV R0, R4 BLX _printf ; print 2 MOV R0, R4 MOVS R1, #3 BLX _printf ; print 3 MOV R0, R4 MOVS R1, #4 BLX _printf ; print 4 MOV R0, R4 MOVS R1, #5 BLX _printf ; print 5 MOV R0, R4 MOVS R1, #6 BLX _printf ; print 6 MOV R0, R4 MOVS R1, #7 BLX _printf ; print 7 MOV R0, R4 MOVS R1, #8 BLX _printf ; print 8 MOV R0, R4 MOVS R1, #9 BLX _printf ; print 9
MOVS R0, #0 POP {R4,R7,PC}
In fact, this was the content of the printing_function() in my case:
void printing_function(int i){ printf ("%d\n", i);};
So LLVM not only unrolled the loop, but also inlined the simple printing_function(), placing its body 8 times instead of calling it. This is possible when the function is very simple and called a small number of times (as here).
ARM64: Optimizing GCC 4.9.1
printing_function: mov w1, w0 ; move argument to W1 adrp x0, .LC0 ; load page address of string add x0, x0, :lo12:.LC0 ; add low 12 bits offset b printf ; jump to printf
main: stp x29, x30, [sp, -32]! ; save frame pointer and link register add x29, sp, 0 ; set frame pointer str x19, [sp,16] ; save X19 (callee-saved) mov w19, 2 ; i = 2 (using callee-saved X19)
.L3: mov w0, w19 ; prepare argument add w19, w19, 1 ; i++ bl printing_function ; call function cmp w19, 10 ; compare i with 10 bne .L3 ; if i != 10, continue loop
mov w0, 0 ; return value 0 ldr x19, [sp,16] ; restore X19 ldp x29, x30, [sp], 32 ; restore FP/LR and deallocate ret ; return
.LC0: .string "f(%d)\n"ARM64: Non-optimizing GCC 4.9.1
.LC0: .string "f(%d)\n"
printing_function: stp x29, x30, [sp, -32]! ; save FP and LR add x29, sp, 0 ; set frame pointer str w0, [x29,28] ; store argument on stack adrp x0, .LC0 ; load string page add x0, x0, :lo12:.LC0 ; add offset ldr w1, [x29,28] ; load argument bl printf ; call printf ldp x29, x30, [sp], 32 ; restore and deallocate ret ; return
main: stp x29, x30, [sp, -32]! ; save FP and LR add x29, sp, 0 ; set frame pointer mov w0, 2 ; load 2 str w0, [x29,28] ; store i = 2 b .L3 ; jump to condition check
.L4: ldr w0, [x29,28] ; load i bl printing_function ; call function ldr w0, [x29,28] ; load i add w0, w0, 1 ; i++ str w0, [x29,28] ; store new i
.L3: ldr w0, [x29,28] ; load i cmp w0, 9 ; compare i with 9 ble .L4 ; if i <= 9, continue loop mov w0, 0 ; return 0 ldp x29, x30, [sp], 32 ; restore and deallocate ret ; returnMIPS
Listing 1.176: Non-optimizing GCC 4.4.5 (IDA)
main: addiu $sp, -0x28 ; allocate stack frame (40 bytes) sw $ra, 0x28-4($sp) ; save return address sw $fp, 0x28-8($sp) ; save frame pointer move $fp, $sp ; set frame pointer
li $v0, 2 ; load immediate 2 into $v0 sw $v0, 0x28-0x10($fp) ; store i = 2
b loc_9C ; jump to condition check or $at, $zero ; NOP (delay slot)
loc_80: lw $a0, 0x28-0x10($fp) ; load i into $a0 (argument) jal printing_function ; call printing_function or $at, $zero ; NOP (delay slot)
lw $v0, 0x28-0x10($fp) ; load i addiu $v0, 1 ; i++ sw $v0, 0x28-0x10($fp) ; store new i
loc_9C: lw $v0, 0x28-0x10($fp) ; load i slti $v0, 0xA ; set $v0 to 1 if i < 10 bnez $v0, loc_80 ; if true, continue loop or $at, $zero ; NOP (delay slot)
move $v0, $zero ; return value 0 move $sp, $fp ; restore stack pointer lw $ra, 0x28-4($sp) ; restore return address lw $fp, 0x28-8($sp) ; restore frame pointer addiu $sp, 0x28 ; deallocate stack frame jr $ra ; return or $at, $zero ; NOP (delay slot)
The only difference here is the new b instruction, which is generally a pseudo-instruction for BEQ.
Final note
In the generated code we notice that after initializing i, the loop body is not executed immediately—the condition is checked first, and only then the loop body may execute. This is correct, because if the condition is not met from the beginning, the loop body should not execute at all.
For example:
for (i=0; i<total_entries_to_process; i++) loop_body;Therefore the condition is checked before execution. However, with optimization the compiler may rearrange the order of condition check and loop body if it is sure that this case cannot occur (as in our simple example, and with compilers like Keil, Xcode (LLVM), MSVC with optimization).
1.21.2 Memory blocks copying routine
In real-world practice, memory copy routines often copy 4 or 8 bytes per iteration and may use SIMD, vectorization, and other advanced techniques.
To keep things simple, this example is the simplest possible form:
#include <stdio.h> // include standard I/O header
void my_memcpy (unsigned char* dst, unsigned char* src, size_t cnt) // define byte-by-byte memory copy function{ size_t i; // declare counter i for (i=0; i<cnt; i++) // loop from 0 to cnt-1 dst[i]=src[i]; // copy byte from source to destination};my_memcpy: ; Listing 1.177: GCC 4.9 x64 optimized for size (-Os) ; RDI = destination address ; RSI = source address ; RDX = block size xor eax, eax ; initialize counter i to 0 (EAX = 0)
.L2: cmp rax, rdx ; have all bytes been copied? if yes, exit je .L5 ; jump to exit if done
mov cl, BYTE PTR [rsi+rax] ; load byte from source + i into CL mov BYTE PTR [rdi+rax], cl ; store byte into destination + i
inc rax ; i++ jmp .L2 ; repeat loop
.L5: ret ; returnmy_memcpy: ; Listing 1.178: GCC 4.9 ARM64 optimized for size (-Os) ; X0 = destination address ; X1 = source address ; X2 = block size mov x3, 0 ; initialize counter i to 0
.L2: cmp x3, x2 ; are we done copying? beq .L5 ; if yes, exit
ldrb w4, [x1,x3] ; load byte from source + i into W4 strb w4, [x0,x3] ; store byte into destination + i
add x3, x3, 1 ; i++ b .L2 ; repeat loop
.L5: ret ; returnmy_memcpy PROC ; Listing 1.179: Optimizing Keil 6/2013 (Thumb mode) ; R0 = destination address ; R1 = source address ; R2 = block size PUSH {r4,lr} ; save R4 and LR
MOVS r3,#0 ; initialize counter i to 0
B |L0.12| ; jump to condition check (at end of loop)
|L0.6| LDRB r4,[r1,r3] ; load byte from source + i into R4 STRB r4,[r0,r3] ; store byte into destination + i ADDS r3,r3,#1 ; i++
|L0.12| CMP r3,r2 ; i < size? BCC |L0.6| ; if yes, continue loop
POP {r4,pc} ; restore R4 and returnENDPARM – ARM mode
Keil in ARM mode fully exploits conditional suffixes.
my_memcpy PROC ; R0 = destination address ; R1 = source address ; R2 = block size MOV r3,#0 ; initialize counter i to 0
|L0.4| CMP r3,r2 ; have all bytes been copied?
LDRBCC r12,[r1,r3] ; load byte from source + i (conditional: only if i < size) STRBCC r12,[r0,r3] ; store byte into destination + i (conditional) ADDCC r3,r3,#1 ; i++ (conditional)
BCC |L0.4| ; jump back to loop start if i < size
BX lr ; returnENDPThus there is only one jump instead of two.
MIPS
my_memcpy: ; Listing 1.181: GCC 4.4.5 optimized for size (-Os) (IDA) b loc_14 ; jump to condition check
move $v0, $zero ; initialize counter i to 0 (delay slot)
loc_8: ; CODE XREF: my_memcpy+1C lbu $v1, 0($t0) ; load unsigned byte from source address in $t0 into $v1
addiu $v0, 1 ; i++
sb $v1, 0($a3) ; store byte into destination address in $a3
loc_14: ; CODE XREF: my_memcpy sltu $v1, $v0, $a2 ; set $v1 to 1 if i < cnt
addu $t0, $a1, $v0 ; compute source + i address
bnez $v1, loc_8 ; if i < cnt, continue loop
addu $a3, $a0, $v0 ; compute destination + i address (delay slot)
jr $ra ; return or $at, $zero ; NOP (delay slot)Here are some instructions:
- LBU: Load Byte Unsigned (zero-extends the rest of the bits)
- LB: Load Byte with sign extension
- SB: Store Byte (lowest 8 bits)
Like ARM, all MIPS registers are 32-bit, so even when working with a single byte, a full 32-bit register must be used.
Vectorization
An optimized GCC version can do much more with this example, which will be explained later.
1.21.3 Condition check
The author explained that it is very important to remember that in the for() construct, the condition is not checked at the end, but at the beginning before the loop body executes. However, in many cases it is easier for the compiler to check the condition at the end after the loop body, sometimes adding an extra check at the beginning.
An example of this:
#include <stdio.h>
void f(int start, int finish){ for (; start<finish; start++) printf ("%d\n", start);};GCC 5.4.0 x64 – Optimized
f: cmp edi, esi ; condition check (1) jge .L9 ; if start >= finish, exit
push rbp push rbx mov ebp, esi ; save finish mov ebx, edi ; save start sub rsp, 8
.L5: mov edx, ebx ; prepare argument (current value) xor eax, eax mov esi, OFFSET FLAT:.LC0 ; "%d\n" mov edi, 1 add ebx, 1 ; increment current value call __printf_chk ; print
cmp ebp, ebx ; condition check (2) jne .L5 ; continue if current != finish
add rsp, 8 pop rbx pop rbp
.L9: rep ret ; returnHere we see two condition checks.
Hex-Rays (at least version 2.2.0) decompiles it as:
void __cdecl f(unsigned int start, unsigned int finish){ unsigned int v2; // ebx@2 __int64 v3; // rdx@3
if ( (signed int)start < (signed int)finish ) { v2 = start; do { v3 = v2++; _printf_chk(1LL, "%d\n", v3); } while ( finish != v2 ); }}
In this case, we can confidently replace the do/while() with for() and also remove the first condition check.
1.22.4 Conclusion
General skeleton for a loop from 2 to 9 inclusive:
Listing 1.182: x86
mov [counter], 2 ; initializationjmp check ; jump to condition check
body: ; loop body ; do whatever you want here ; use the counter variable from the stack add [counter], 1 ; increment
check: cmp [counter], 9 ; compare with 9 jle body ; if <= 9, continue loopThe increment can be done with 3 instructions in non-optimized code:
Listing 1.183: x86
MOV [counter], 2 ; initializationJMP check
body: ; loop body ; do whatever you want here ; use the counter variable from the stack
MOV REG, [counter] ; increment INC REG MOV [counter], REG
check: CMP [counter], 9 JLE bodyIf the loop body is small, we can dedicate a full register to the counter:
Listing 1.184: x86
MOV EBX, 2 ; initializationJMP check
body: ; loop body ; do whatever you want here ; use the counter in EBX, but do not modify it!
INC EBX ; increment
check: CMP EBX, 9 JLE bodySometimes the compiler changes the order of loop parts:
Listing 1.185: x86
MOV [counter], 2 ; initializationJMP label_check
label_increment: ADD [counter], 1 ; increment
label_check: CMP [counter], 10 JGE exit
; loop body ; do whatever you want here ; use the counter variable from the stack
JMP label_increment
exit:Usually the condition is checked before the loop body, but the compiler may reverse it and place it after the loop body. This happens when the compiler is sure the condition is true on the first iteration, meaning the loop body must execute at least once.
Listing 1.186: x86
MOV REG, 2 ; initialization
body: ; loop body ; do whatever you want here ; use the counter in REG, but do not modify it!
INC REG ; increment CMP REG, 10 JL bodyUsing the LOOP instruction
This is very rare, and compilers do not use it.
If you see it, the code is most likely hand-written by a human, not a compiler.
Listing 1.187: x86
; counting down from 10 to 1MOV ECX, 10
body: ; loop body ; do whatever you want here ; use the counter in ECX, but do not modify it!
LOOP bodyARM
In this example, register R4 is dedicated to the counter:
Listing 1.188: ARM
MOV R4, 2 ; initializationB check
body: ; loop body ; do whatever you want here ; use the counter in R4, but do not modify it!
ADD R4, R4, #1 ; increment
check: CMP R4, #10 BLT bodyIf this article helped you, please share it with others!
Some information may be outdated





