0xV3n0m

1.13 Worth noting: global vs. local variables

As we saw earlier, the OS zeroes global variables automatically before the program starts. Local variables get no such treatment: they hold whatever happened to be on the stack.

A common pitfall: you have a global variable that you forgot to initialize, and your program depends on it being zero from the start. Later you modify the program and move this variable inside a function, making it local. It is then no longer zeroed at initialization, and this can cause annoying bugs.
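A minimal sketch of this pitfall. The names `call_count`, `count_call` and `sum_first` are made up for illustration and are not from the book. A variable with static storage duration starts at zero without an initializer, while a local has to be initialized explicitly:

```c
#include <stddef.h>

/* Static storage duration: guaranteed to start at 0 without an initializer. */
static int call_count;

/* Relies on call_count starting at zero. */
int count_call(void) {
    call_count += 1;
    return call_count;
}

/* A local has no such guarantee, so it gets an explicit initializer. */
int sum_first(const int *values, size_t n) {
    int total = 0; /* without "= 0" the starting value would be indeterminate */
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}
```

If `call_count` were moved inside `count_call()` as a plain local, the explicit `= 0` would become mandatory, exactly as with `total`.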

1.14 Accessing passed arguments

The author begins by noting that we already know the caller passes arguments to the function it calls through the stack. But how does the called function (the callee) access them?

A simple example:

#include <stdio.h>

int f (int a, int b, int c) { // defines function f, which takes three integers and returns an integer
    return a * b + c; // returns a multiplied by b, plus c
}

int main() { // program entry point
    printf("%d\n", f(1, 2, 3)); // prints the result of f(1, 2, 3) followed by a newline
    return 0; // indicates successful execution
}

1.14.1 x86

MSVC

Here is what we get after compiling with MSVC 2010 Express:

_TEXT SEGMENT ; starts the code (text) segment
_a$ = 8        ; size = 4 ; offset of parameter a (EBP+8)
_b$ = 12       ; size = 4 ; offset of parameter b (EBP+12)
_c$ = 16       ; size = 4 ; offset of parameter c (EBP+16)

_f PROC ; start of function f
    push ebp ; saves the caller's base pointer
    mov ebp, esp ; sets EBP to the current ESP, establishing the stack frame
    mov eax, DWORD PTR _a$[ebp] ; loads a (at [EBP+8]) into EAX
    imul eax, DWORD PTR _b$[ebp] ; multiplies EAX by b (at [EBP+12]); the result stays in EAX
    add eax, DWORD PTR _c$[ebp] ; adds c (at [EBP+16]) to EAX
    pop ebp ; restores the caller's base pointer
    ret 0 ; returns; 0 means the callee removes no argument bytes (the caller cleans up)
_f ENDP ; end of f

_main PROC ; start of main
    push ebp ; saves the previous base pointer
    mov ebp, esp ; establishes the stack frame
    push 3              ; third argument
    push 2              ; second argument
    push 1              ; first argument
    call _f ; calls f
    add esp, 12 ; the caller removes the 3 arguments (3 * 4 bytes)
    push eax ; pushes the return value of f (in EAX) as an argument for printf
    push OFFSET $SG2463 ; '%d', 0aH, 00H ; pushes the address of the format string
    call _printf ; calls printf
    add esp, 8 ; removes the 2 arguments (2 * 4 bytes)

    ; return 0
    xor eax, eax ; sets EAX to 0 (a shorter encoding than mov eax, 0)
    pop ebp ; restores EBP
    ret 0 ; returns from main
_main ENDP ; end of main

What we see here is that main() pushes three numbers onto the stack and calls f(int, int, int).

Inside f(), the arguments are accessed through macros such as _a$ = 8, in the same way as local variables but with positive offsets (the address is formed by adding to EBP).

In other words, we reach outside f()'s own stack frame, into the part the caller filled in, by adding the _a$ macro to the value in EBP.

The value of a is then loaded into EAX. After IMUL executes, EAX holds the product of its previous value and the contents of _b.

Then ADD adds the value of _c to EAX.

The value in EAX does not need to be moved: it is already where a return value belongs.

On return to the caller, main() takes the value in EAX and passes it as an argument to printf().
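To make the offsets concrete, here is a toy model in C. It is only a sketch: the `ToyStack` type and the `push`, `toy_f` and `toy_call_f` helpers are hypothetical, the stack is an array of 4-byte words, and `0xDEADBEEF` stands in for a real return address.

```c
#include <stdint.h>

/* Toy model of a 32-bit stack: one array slot per 4-byte word.
   Higher indices are higher addresses, so "push" decrements the index. */
enum { STACK_WORDS = 16 };

typedef struct {
    uint32_t mem[STACK_WORDS];
    int esp; /* index of the current top-of-stack word */
    int ebp; /* index of the saved-EBP slot of the current frame */
} ToyStack;

static void push(ToyStack *s, uint32_t value) {
    s->esp -= 1;
    s->mem[s->esp] = value;
}

/* Models f(): reads its arguments at EBP+8, EBP+12 and EBP+16. */
static uint32_t toy_f(ToyStack *s) {
    push(s, (uint32_t)s->ebp);               /* push ebp           */
    s->ebp = s->esp;                         /* mov  ebp, esp      */
    uint32_t eax = s->mem[s->ebp + 8 / 4];   /* mov  eax, [ebp+8]  */
    eax *= s->mem[s->ebp + 12 / 4];          /* imul eax, [ebp+12] */
    eax += s->mem[s->ebp + 16 / 4];          /* add  eax, [ebp+16] */
    s->ebp = (int)s->mem[s->esp];            /* pop  ebp           */
    s->esp += 1;
    return eax;
}

/* Models the caller: pushes c, b, a, then a fake return address. */
uint32_t toy_call_f(uint32_t a, uint32_t b, uint32_t c) {
    ToyStack s = { {0}, STACK_WORDS, STACK_WORDS };
    push(&s, c);
    push(&s, b);
    push(&s, a);
    push(&s, 0xDEADBEEFu); /* stands in for the return address pushed by CALL */
    uint32_t result = toy_f(&s);
    s.esp += 1;            /* RET pops the return address */
    s.esp += 3;            /* add esp, 12: the caller removes the arguments */
    return result;
}
```

Calling `toy_call_f(1, 2, 3)` follows the same path as the listing: the saved EBP sits at offset 0, the return address at offset 4, and the three arguments at offsets 8, 12 and 16.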

MSVC + OllyDbg

As usual, we will do this in x32dbg rather than OllyDbg.

We start by compiling the C file with this command:

cl /Zi /Od test.c

Here /Zi generates debug information and /Od disables optimization.

After that, we run the program in x32dbg and set a breakpoint at main.

The first element in the stack frame is the saved old value of EBP, the second is the RA (return address), the third is the first function argument, then the second, then the third.

To confirm my understanding of the assembly code, I ran another experiment: I wanted the program to print 13 instead. The expression is a*b+c, so to get 13 I had to change b to 10, which gives 1*10+3 = 13. I patched the instruction imul eax, dword ptr ss:[ebp+0x0C] into imul eax, eax, A (A is hexadecimal for 10), and it did produce the result I wanted.

GCC

Let’s compile the same in GCC 4.4.1 and see the results in IDA:

;-------------------------
; Function: f
;-------------------------
public f ; declares f as public
f proc near ; start of f

arg_0 = dword ptr 8       ; 1st argument, at [EBP+8]
arg_4 = dword ptr 0Ch     ; 2nd argument, at [EBP+0Ch]
arg_8 = dword ptr 10h     ; 3rd argument, at [EBP+10h]

    push ebp ; saves the caller's base pointer
    mov  ebp, esp ; sets up the stack frame

    mov  eax, [ebp+arg_0]     ; loads the 1st argument into EAX
    imul eax, [ebp+arg_4]     ; multiplies EAX by the 2nd argument
    add  eax, [ebp+arg_8]     ; adds the 3rd argument to EAX

    pop  ebp ; restores the caller's base pointer
    retn ; returns from the function

f endp ; end of f

;-------------------------
; Function: main
;-------------------------
public main ; declares main as public
main proc near ; start of main

var_10 = dword ptr -10h ; used below as [esp+10h+var_10], i.e. [ESP]
var_C  = dword ptr -0Ch ; used below as [esp+10h+var_C], i.e. [ESP+4]
var_8  = dword ptr -8 ; used below as [esp+10h+var_8], i.e. [ESP+8]

    push ebp ; saves the previous base pointer
    mov  ebp, esp ; sets up the stack frame
    and  esp, 0FFFFFFF0h      ; aligns ESP down to a 16-byte boundary
    sub  esp, 10h             ; allocates 16 bytes

    mov [esp+10h+var_8], 3    ; 3rd argument of f, stored at [ESP+8]
    mov [esp+10h+var_C], 2    ; 2nd argument of f, stored at [ESP+4]
    mov [esp+10h+var_10], 1   ; 1st argument of f, stored at [ESP]
    call f ; calls f

    mov  edx, offset aD       ; "%d\n" ; address of the format string
    mov  [esp+10h+var_C], eax ; result of f(), stored at [ESP+4] for printf
    mov  [esp+10h+var_10], edx ; format string address, stored at [ESP] for printf
    call _printf ; calls printf

    mov  eax, 0 ; return value 0
    leave ; tears down the stack frame (mov esp, ebp; pop ebp)
    retn ; returns from main

main endp ; end of main

The result is almost the same as the previous one, with a few small differences that we discussed earlier.

The stack pointer is not set back after each of the two calls (f and printf), because the LEAVE instruction just before the final RETN takes care of that at the end.
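A small sketch of why LEAVE is enough here, written as plain arithmetic on a pretend ESP value. The function names and the sample address in the usage note are assumptions for illustration only. Whatever `and esp, 0FFFFFFF0h` and `sub esp, 10h` did to ESP, LEAVE reloads it from EBP:

```c
#include <stdint.h>

/* and esp, 0FFFFFFF0h: clears the low 4 bits, rounding down to a multiple of 16. */
uint32_t align_down_16(uint32_t esp) {
    return esp & 0xFFFFFFF0u;
}

/* Models main()'s prologue followed by LEAVE; returns ESP after LEAVE completes. */
uint32_t esp_after_leave(uint32_t esp_at_entry) {
    uint32_t esp = esp_at_entry - 4; /* push ebp             */
    uint32_t ebp = esp;              /* mov  ebp, esp        */
    esp = align_down_16(esp);        /* and  esp, 0FFFFFFF0h */
    esp -= 0x10;                     /* sub  esp, 10h        */
    esp = ebp;                       /* leave: mov esp, ebp  */
    esp += 4;                        /* leave: pop ebp       */
    return esp;
}
```

For example, `esp_after_leave(0x0012FF7C)` gives back `0x0012FF7C`, so RETN finds the return address exactly where CALL left it.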

1.14.2 x64

The story is a bit different in x86-64.

The first 4 arguments (on Windows) or the first 6 (on System V systems) are passed in registers, so the callee reads them from registers and not from the stack.

Listing 1.93: Optimizing MSVC 2012 x64

$SG2997 DB '%d', 0aH, 00H ; format string "%d\n" with a null terminator

main PROC ; start of main
    sub rsp, 40 ; allocates 40 bytes on the stack
    mov edx, 2 ; EDX = 2 (2nd argument)
    lea r8d, QWORD PTR [rdx+1]   ; R8D = 3 (3rd argument), computed as RDX+1
    lea ecx, QWORD PTR [rdx-1]   ; ECX = 1 (1st argument), computed as RDX-1
    call f ; calls f
    lea rcx, OFFSET FLAT:$SG2997 ; '%d' ; format string address (1st argument of printf)
    mov edx, eax ; result of f (2nd argument of printf)
    call printf ; calls printf
    xor eax, eax ; return value 0
    add rsp, 40 ; releases the 40 bytes
    ret 0 ; returns
main ENDP ; end of main

f PROC ; start of f
    ; ECX - 1st argument
    ; EDX - 2nd argument
    ; R8D - 3rd argument
    imul ecx, edx ; ECX = a * b
    lea eax, DWORD PTR [r8+rcx] ; EAX = R8 + RCX, i.e. c + a*b
    ret 0 ; returns
f ENDP ; end of f

Let's explain this in a simpler way.

64-bit Windows uses a new calling convention, the Microsoft x64 calling convention.

It says that the first 4 arguments are passed in registers.

So a function f(a, b, c, d) taking 32-bit ints receives them like this:

a in ECX
b in EDX
c in R8D
d in R9D

And this is what we see in the assembly code above.

The LEA instruction is used here for addition; the compiler apparently considered it faster than ADD.

LEA is also used in main() to prepare the first and third arguments of f(). The compiler probably decided that this would be faster than loading the values with MOV.

Now let's see the non-optimizing version from MSVC:

Listing 1.94: MSVC 2012 x64

f proc near ; start of f
    ; shadow space:
    arg_0  = dword ptr 8 ; slot for the 1st argument, at [RSP+8]
    arg_8  = dword ptr 10h ; slot for the 2nd argument, at [RSP+10h]
    arg_10 = dword ptr 18h ; slot for the 3rd argument, at [RSP+18h]

    ; ECX - 1st argument
    ; EDX - 2nd argument
    ; R8D - 3rd argument

    mov [rsp+arg_10], r8d ; stores the 3rd argument (R8D) at [RSP+18h]
    mov [rsp+arg_8],  edx ; stores the 2nd argument (EDX) at [RSP+10h]
    mov [rsp+arg_0],  ecx ; stores the 1st argument (ECX) at [RSP+8]

    mov eax, [rsp+arg_0] ; loads the 1st argument into EAX
    imul eax, [rsp+arg_8] ; multiplies EAX by the 2nd argument
    add eax, [rsp+arg_10] ; adds the 3rd argument to EAX
    retn ; returns
f endp ; end of f

main proc near ; start of main
    sub rsp, 28h ; allocates 40 (28h) bytes on the stack
    mov r8d, 3   ; 3rd argument
    mov edx, 2   ; 2nd argument
    mov ecx, 1   ; 1st argument
    call f ; calls f
    mov edx, eax ; result of f becomes the 2nd argument of printf
    lea rcx, $SG2931 ; "%d\n" ; format string address (1st argument of printf)
    call printf ; calls printf

    ; return 0
    xor eax, eax ; return value 0
    add rsp, 28h ; releases the 40 bytes
    retn ; returns
main endp ; end of main

This looks a bit puzzling, because the three arguments that arrived in registers are stored on the stack for some reason.

This area is called the "shadow space":

shadow space = 4 slots of 8 bytes each that the caller always reserves on the stack before any call in Windows x64

Why? For two reasons:

Every function on Windows x64 may assume that there is space on the stack ready to receive its first 4 arguments, even if it takes no arguments at all, and this makes things easier for the debugger and the ABI.
The function can copy the arguments from the registers to the stack if it needs to treat them as ordinary variables.

And this is what we saw here:

mov [rsp+arg_10], r8d   ; copies the 3rd argument (R8D) to the shadow space
mov [rsp+arg_8], edx    ; copies the 2nd argument (EDX) to the shadow space
mov [rsp+arg_0], ecx    ; copies the 1st argument (ECX) to the shadow space
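The shadow space also helps explain why both versions of main() subtract 40 (28h) bytes from RSP. The sketch below is a simplified model, not what the compiler actually runs: it assumes that RSP must be a multiple of 16 at each CALL and that the return address has just taken 8 bytes on entry. The function name `win64_stack_bytes` is hypothetical.

```c
#include <stdint.h>

enum {
    SHADOW_SLOTS = 4,          /* one slot per register argument (RCX, RDX, R8, R9) */
    SLOT_BYTES = 8,
    RETURN_ADDRESS_BYTES = 8
};

/* Bytes a function that calls others subtracts from RSP in this model:
   shadow space for its callees plus its own locals, padded for alignment. */
uint32_t win64_stack_bytes(uint32_t local_bytes) {
    uint32_t total = SHADOW_SLOTS * SLOT_BYTES + local_bytes;
    /* On entry RSP is 8 past a 16-byte boundary (the return address was just
       pushed), so total + 8 has to be a multiple of 16. */
    uint32_t remainder = (total + RETURN_ADDRESS_BYTES) % 16;
    if (remainder != 0) {
        total += 16 - remainder;
    }
    return total;
}
```

With no locals this gives 32 bytes of shadow space plus 8 bytes of padding, which matches `sub rsp, 40` in the listings.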

GCC

Optimizing GCC generates more or less understandable code:

f: ; label for function f
    ; EDI - 1st argument
    ; ESI - 2nd argument
    ; EDX - 3rd argument
    imul    esi, edi ; ESI = b * a
    lea     eax, [rdx + rsi] ; EAX = RDX + RSI, i.e. c + a*b
    ret ; returns

main: ; label for main
    sub     rsp, 8 ; allocates 8 bytes on the stack
    mov     edx, 3 ; 3rd argument
    mov     esi, 2 ; 2nd argument
    mov     edi, 1 ; 1st argument
    call    f ; calls f

    mov     edi, OFFSET FLAT:.LC0   ; "%d\n" ; format string address (1st argument of printf)
    mov     esi, eax ; result of f (2nd argument of printf)
    xor     eax, eax                ; number of vector registers passed (none)
    call    printf ; calls printf

    xor     eax, eax ; return value 0
    add     rsp, 8 ; releases the 8 bytes
    ret ; returns

Non-optimizing GCC

f: ; label for f
    ; EDI - 1st argument
    ; ESI - 2nd argument
    ; EDX - 3rd argument
    push rbp ; saves the caller's RBP
    mov rbp, rsp ; sets up the stack frame
    mov DWORD PTR [rbp-4], edi ; stores the 1st argument at [RBP-4]
    mov DWORD PTR [rbp-8], esi ; stores the 2nd argument at [RBP-8]
    mov DWORD PTR [rbp-12], edx ; stores the 3rd argument at [RBP-12]
    mov eax, DWORD PTR [rbp-4] ; loads the 1st argument into EAX
    imul eax, DWORD PTR [rbp-8] ; multiplies EAX by the 2nd argument
    add eax, DWORD PTR [rbp-12] ; adds the 3rd argument to EAX
    leave ; tears down the stack frame
    ret ; returns

main: ; label for main
    push rbp ; saves the caller's RBP
    mov rbp, rsp ; sets up the stack frame
    mov edx, 3 ; 3rd argument
    mov esi, 2 ; 2nd argument
    mov edi, 1 ; 1st argument
    call f ; calls f
    mov edx, eax ; copies the result of f to EDX
    mov eax, OFFSET FLAT:.LC0 ; "%d\n" ; address of the format string
    mov esi, edx ; result of f (2nd argument of printf)
    mov rdi, rax ; format string address (1st argument of printf), copied from RAX
    mov eax, 0 ; number of vector registers passed (none)
    call printf ; calls printf
    mov eax, 0 ; return value 0
    leave ; tears down the stack frame
    ret ; returns

System V *NIX has no "shadow space", but the callee (the function being called) may still save its arguments somewhere if it runs short of registers.

GCC: uint64_t instead of int

Our example works with 32-bit int, which is why the code uses the 32-bit register parts (the ones prefixed with E-).

We can change it slightly to work with 64-bit values:

#include <stdio.h> // for printf
#include <stdint.h> // for uint64_t

uint64_t f (uint64_t a, uint64_t b, uint64_t c) // f now takes and returns uint64_t
{
    return a*b+c; // returns a*b + c
}

int main() // program entry point
{
    printf ("%lld\n", f(0x1122334455667788, // prints the result of f
                        0x1111111122222222,
                        0x3333333344444444));
    return 0; // indicates successful execution
}

Listing 1.97: Optimizing GCC 4.4.6 x64

f proc near ; start of f
    imul rsi, rdi ; RSI = b * a
    lea rax, [rdx+rsi] ; RAX = RDX + RSI, i.e. c + a*b
    retn ; returns
f endp ; end of f

main proc near ; start of main
    sub rsp, 8 ; allocates 8 bytes
    mov rdx, 3333333344444444h ; 3rd argument
    mov rsi, 1111111122222222h ; 2nd argument
    mov rdi, 1122334455667788h ; 1st argument
    call f ; calls f
    mov edi, offset format ; "%lld\n" ; format string address (1st argument of printf)
    mov rsi, rax ; result of f (2nd argument of printf)
    xor eax, eax ; number of vector registers passed (none)
    call _printf ; calls printf
    xor eax, eax ; return value 0
    add rsp, 8 ; releases the 8 bytes
    retn ; returns
main endp ; end of main

The code is the same; the only difference is that this time the full registers (the ones prefixed with R-) are used.

1.14.3 ARM

Non-optimizing Keil 6/2013 (ARM mode)

.text:000000A4    00 30 A0 E1    MOV     R3, R0 ; copies R0 (1st argument) to R3
.text:000000A8    93 21 20 E0    MLA     R0, R3, R1, R2 ; R0 = R3 * R1 + R2
.text:000000AC    1E FF 2F E1    BX      LR ; returns to the address in LR, switching mode if needed

.text:000000B0 main ; start of main
.text:000000B0    10 40 2D E9    STMFD   SP!, {R4, LR} ; saves R4 and LR on the stack
.text:000000B4    03 20 A0 E3    MOV     R2, #3 ; 3rd argument
.text:000000B8    02 10 A0 E3    MOV     R1, #2 ; 2nd argument
.text:000000BC    01 00 A0 E3    MOV     R0, #1 ; 1st argument
.text:000000C0    F7 FF FF EB    BL      f ; calls f
.text:000000C4    00 40 A0 E1    MOV     R4, R0 ; copies the result from R0 to R4
.text:000000C8    04 10 A0 E1    MOV     R1, R4 ; result of f (2nd argument of printf)
.text:000000CC    5A 0F 8F E2    ADR     R0, aD_0        ; "%d\n" ; format string address (1st argument of printf)
.text:000000D0    E3 18 00 EB    BL      __2printf ; calls printf
.text:000000D4    00 00 A0 E3    MOV     R0, #0 ; return value 0
.text:000000D8    10 80 BD E8    LDMFD   SP!, {R4, PC} ; restores R4 and loads PC from the stack (return)

The main() function simply calls two other functions and passes three values to the first one, f().

As we said before, in ARM the first 4 values are usually passed in the first 4 registers (R0-R3).

And f(), as we can see, takes its arguments from the first 3 registers (R0-R2).

The MLA (Multiply Accumulate) instruction multiplies its first two operands (R3 and R1), adds the third (R2), and places the result in register R0:

R0 = R3 * R1 + R2

Multiplication and addition at once (fused multiply-add) is a very useful operation. By the way, x86 had no such instruction until the FMA instructions appeared in SIMD.
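As a rough C model of what MLA computes: the helper names `mla` and `f_arm` are made up for illustration, and the 32-bit wraparound is modelled with unsigned arithmetic.

```c
#include <stdint.h>

/* Models MLA Rd, Rm, Rs, Rn: Rd = Rm * Rs + Rn, keeping the low 32 bits. */
uint32_t mla(uint32_t rm, uint32_t rs, uint32_t rn) {
    return rm * rs + rn; /* unsigned arithmetic wraps modulo 2^32 */
}

/* f() from the listing: MOV R3, R0 followed by MLA R0, R3, R1, R2. */
int32_t f_arm(int32_t a, int32_t b, int32_t c) {
    uint32_t r3 = (uint32_t)a;                       /* MOV R3, R0         */
    uint32_t r0 = mla(r3, (uint32_t)b, (uint32_t)c); /* MLA R0, R3, R1, R2 */
    return (int32_t)r0;
}
```

With the arguments from main(), `f_arm(1, 2, 3)` computes 1 * 2 + 3.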

The very first instruction, MOV R3, R0, looks redundant (a single MLA would have been enough).

The compiler did not optimize it away because this is a non-optimizing compilation.

The BX instruction returns control to the address stored in LR and, if necessary, switches the processor mode from Thumb to ARM or vice versa.

This can be necessary because, as we can see, f() does not know whether it will be called from ARM code or from Thumb code.

So if it is called from Thumb code, BX not only returns control to the caller but also switches the mode back to Thumb.

And if it is called from ARM code, the mode stays unchanged.

Optimizing Keil 6/2013 (ARM mode)

Here is f() compiled by Keil in full optimization mode (-O3):

.text:00000098 f ; start of function f
.text:00000098    91 20 20 E0    MLA     R0, R1, R0, R2 ; R0 = R1 * R0 + R2
.text:0000009C    1E FF 2F E1    BX      LR ; returns to the address in LR, switching mode if needed

The MOV instruction was optimized out (or reduced), and MLA now uses all the input registers and places the result directly in R0, which is exactly where the caller will read it from.

Optimizing Keil 6/2013 (Thumb mode)

.text:0000005E    48 43          MULS    R0, R1 ; R0 = R0 * R1, updating the flags
.text:00000060    80 18          ADDS    R0, R0, R2 ; R0 = R0 + R2, updating the flags
.text:00000062    70 47          BX      LR ; returns to the address in LR

The MLA instruction is not available in Thumb mode, so the compiler generates code that performs the two operations (multiplication and addition) separately.

The first instruction, MULS, multiplies R0 by R1 and leaves the result in R0.

The second instruction, ADDS, adds R2 to that result and leaves the sum in R0.

ARM64

Optimizing GCC (Linaro) 4.9

Everything here is simple.

MADD is simply an instruction that does a multiplication and an addition in one go (similar to the MLA we saw earlier). All three arguments are passed in the 32-bit parts of the X-registers, since the argument types are 32-bit ints.

The result is returned in W0.

f: ; start of function f
    madd w0, w0, w1, w2 ; W0 = W0 * W1 + W2
    ret ; returns

main: ; start of main
    ; save FP and LR to stack frame
    stp x29, x30, [sp, -16]! ; stores X29 (FP) and X30 (LR) on the stack, decrementing SP by 16

    mov w2, 3 ; 3rd argument
    mov w1, 2 ; 2nd argument
    add x29, sp, 0 ; sets X29 (the frame pointer) to the current SP
    mov w0, 1 ; 1st argument

    bl f ; calls f
    mov w1, w0 ; result of f (2nd argument of printf)

    adrp x0, .LC7 ; loads the page address of .LC7 into X0
    add  x0, x0, :lo12:.LC7 ; adds the low 12 bits to form the full address of the format string
    bl printf ; calls printf

    ; return 0
    mov w0, 0 ; return value 0

    ; restore FP & LR
    ldp x29, x30, [sp], 16 ; loads X29 and X30 from the stack, incrementing SP by 16
    ret ; returns

.LC7: ; label for the string
    .string "%d\n" ; the format string

Now let's extend all data types to 64-bit uint64_t and try again:

f: ; start of f
    madd x0, x0, x1, x2 ; X0 = X0 * X1 + X2, now on the full 64-bit registers
    ret ; returns

main: ; start of main
    mov  x1, 13396 ; loads the low 16 bits (0x3454) of the 64-bit value passed to printf
    adrp x0, .LC8 ; loads the page address of .LC8
    stp  x29, x30, [sp, -16]! ; saves FP and LR

    movk x1, 0x27d0, lsl 16 ; inserts bits 16-31 of the value
    add  x0, x0, :lo12:.LC8 ; adds the low 12 bits to form the address of the format string
    movk x1, 0x122,  lsl 32 ; inserts bits 32-47 of the value
    add  x29, sp, 0 ; sets the frame pointer
    movk x1, 0x58be, lsl 48 ; inserts bits 48-63 of the value

    bl printf ; calls printf

    mov  w0, 0 ; return value 0
    ldp  x29, x30, [sp], 16 ; restores FP and LR
    ret ; returns

.LC8: ; label for the string
    .string "%lld\n" ; the format string
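One detail in this listing is easy to miss: main() never calls f(). Working it out by hand, the constant that MOV/MOVK assemble in X1 equals f(0x1122334455667788, 0x1111111122222222, 0x3333333344444444) modulo 2^64, which suggests the compiler evaluated the call at compile time. A small sketch to check this follows; the helpers `f64`, `movk` and `x1_from_listing` are hypothetical names.

```c
#include <stdint.h>

/* The 64-bit version of f(); unsigned arithmetic wraps modulo 2^64. */
uint64_t f64(uint64_t a, uint64_t b, uint64_t c) {
    return a * b + c;
}

/* Models MOVK Xd, #imm16, LSL #shift: replaces one 16-bit field, keeps the rest. */
uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift) {
    uint64_t field_mask = (uint64_t)0xFFFF << shift;
    return (reg & ~field_mask) | ((uint64_t)imm16 << shift);
}

/* Rebuilds the constant that main() loads into X1 in the listing. */
uint64_t x1_from_listing(void) {
    uint64_t x1 = 13396;          /* mov  x1, 13396 (0x3454) */
    x1 = movk(x1, 0x27d0, 16);    /* movk x1, 0x27d0, lsl 16 */
    x1 = movk(x1, 0x0122, 32);    /* movk x1, 0x122,  lsl 32 */
    x1 = movk(x1, 0x58be, 48);    /* movk x1, 0x58be, lsl 48 */
    return x1;
}
```

Both `x1_from_listing()` and `f64()` with the three constants from main() come out as 0x58BE012227D03454.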

Non-optimizing GCC (Linaro) 4.9

f: ; start of f
    sub sp, sp, #16 ; allocates 16 bytes on the stack

    str w0, [sp,12] ; stores the 1st argument at [SP+12]
    str w1, [sp,8] ; stores the 2nd argument at [SP+8]
    str w2, [sp,4] ; stores the 3rd argument at [SP+4]

    ldr w1, [sp,12] ; loads the 1st argument into W1
    ldr w0, [sp,8] ; loads the 2nd argument into W0
    mul w1, w1, w0 ; W1 = W1 * W0

    ldr w0, [sp,4] ; loads the 3rd argument into W0
    add w0, w1, w0 ; W0 = W1 + W0

    add sp, sp, 16 ; releases the 16 bytes
    ret ; returns

The code saves its input arguments on the local stack, in case someone (or something) in the function needs to use the W0...W2 registers.

This prevents the original arguments from being overwritten, since they may be needed again later.

This area is called the Register Save Area (Procedure Call Standard for the ARM 64-bit Architecture (AArch64), 2013).

The callee, however, is not obliged to save them.

This is somewhat similar to the "shadow space" from section 1.14.2.

Why did optimizing GCC 4.9 drop all of this argument-saving code?

Because it did some additional optimization work and concluded that the arguments will not be needed again and that the registers W0...W2 will not be used for anything else.

We also see a MUL/ADD instruction pair here instead of a single MADD.

1.14.4 MIPS

Optimizing GCC 4.4.5

.text:00000000 f: ; start of function f
    ; $a0 = a
    ; $a1 = b
    ; $a2 = c

    mult  $a1, $a0      ; multiplies $a1 by $a0; the 64-bit result goes to HI:LO
    mflo  $v0           ; copies the low 32 bits of the result (LO) into $v0
    jr    $ra           ; returns to the address in $ra
    addu  $v0, $a2, $v0 ; branch delay slot: $v0 = c + $v0

    ; result returned in $v0

.text:00000010 main: ; start of main

var_10 = -0x10 ; local variable var_10
var_4  = -4 ; local variable var_4

    lui   $gp, (__gnu_local_gp >> 16) ; loads the upper 16 bits of __gnu_local_gp into $gp
    addiu $sp, -0x20 ; allocates 0x20 bytes on the stack
    la    $gp, (__gnu_local_gp & 0xFFFF) ; fills in the lower part of __gnu_local_gp

    sw    $ra, 0x20+var_4($sp) ; saves $ra at [SP + 0x1C]
    sw    $gp, 0x20+var_10($sp) ; saves $gp at [SP + 0x10]

    ; set c
    li    $a2, 3 ; 3rd argument

    ; set a
    li    $a0, 1 ; 1st argument

    jal   f             ; calls f()
    li    $a1, 2        ; branch delay slot: set b (2nd argument)

    ; result now in $v0

    lw    $gp, 0x20+var_10($sp) ; restores $gp from [SP + 0x10]
    lui   $a0, ($LC0 >> 16) ; loads the upper bits of the $LC0 address into $a0
    lw    $t9, (printf & 0xFFFF)($gp) ; loads the address of printf into $t9
    la    $a0, ($LC0 & 0xFFFF) ; fills in the lower part of the $LC0 address

    jalr  $t9 ; calls printf through the address in $t9
    move  $a1, $v0      ; branch delay slot: passes the result of f to printf

    lw    $ra, 0x20+var_4($sp) ; restores $ra from [SP + 0x1C]
    move  $v0, $zero ; return value 0
    jr    $ra ; returns
    addiu $sp, 0x20     ; branch delay slot: releases the stack space

The first four function arguments are passed in four registers, the ones prefixed with A-.

There are two special registers in the MIPS architecture, HI and LO, which receive the 64-bit result of the multiplication when MULT executes.

These registers can only be accessed with the MFLO and MFHI instructions.

Here MFLO takes the low part of the multiplication result and places it in $v0.

The high part of the multiplication result (in HI) is discarded.

This is fine, because we are working with 32-bit int.
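A minimal sketch of the HI/LO split in C, assuming a hypothetical `HiLo` struct that stands in for the two registers:

```c
#include <stdint.h>

typedef struct {
    uint32_t hi; /* upper 32 bits of the product */
    uint32_t lo; /* lower 32 bits of the product */
} HiLo;

/* Models MULT rs, rt: signed 32x32 -> 64-bit product, split across HI and LO. */
HiLo mult(int32_t rs, int32_t rt) {
    int64_t product = (int64_t)rs * (int64_t)rt;
    uint64_t bits = (uint64_t)product;
    HiLo r;
    r.hi = (uint32_t)(bits >> 32);
    r.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    return r;
}

/* Models MFLO / MFHI: copy LO or HI into a general-purpose register. */
uint32_t mflo(HiLo r) { return r.lo; }
uint32_t mfhi(HiLo r) { return r.hi; }
```

For the arguments used in main(), `mflo(mult(2, 1))` is 2 and HI is zero, so nothing is lost by dropping HI. With `mult(65536, 65536)` the whole product lands in HI and LO is 0, which is the 32-bit truncation a C int multiplication would typically produce.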

Then the ADDU instruction ("Add Unsigned") adds the value of the third argument to the result.

There are two addition instructions in MIPS: ADD and ADDU.

Despite the name, the difference is not about signed versus unsigned values. The difference is that ADD can raise an exception when an overflow occurs, which is sometimes useful, while ADDU never raises one.

Since C/C++ does not raise exceptions on overflow, the compiler uses ADDU.

The 32-bit result is left in $v0.
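To illustrate the difference, here is a sketch with two hypothetical helpers: `addu` models the silent wraparound of ADDU, and `add_would_trap` reports the cases in which ADD would raise its overflow exception.

```c
#include <stdint.h>
#include <stdbool.h>

/* Models ADDU: 32-bit addition that silently wraps around. */
uint32_t addu(uint32_t a, uint32_t b) {
    return a + b;
}

/* True when ADD would raise an overflow exception,
   i.e. when the signed 32-bit sum does not fit in 32 bits. */
bool add_would_trap(int32_t a, int32_t b) {
    if (b > 0) {
        return a > INT32_MAX - b;
    }
    return a < INT32_MIN - b;
}
```

Both instructions produce the same bits when no overflow happens; they differ only in whether the overflow case is reported.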

In main() there is a new instruction: JAL ("Jump and Link").

The difference between it and JALR:

JAL encodes its target in the instruction itself.
JALR jumps to the address held in a register.

Since f() and main() are in the same object file, the location of f() is fixed and known when the code is built, so JAL can be used.
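In C terms, the distinction is roughly the one between a direct call and a call through a function pointer. This is only an analogy, and a compiler is free to pick different instructions; `call_direct`, `call_indirect` and the `ternary_fn` typedef are names made up for this sketch.

```c
/* The function from the article. */
int f(int a, int b, int c) {
    return a * b + c;
}

typedef int (*ternary_fn)(int, int, int);

/* Direct call: the target is known when the code is built (JAL-style). */
int call_direct(void) {
    return f(1, 2, 3);
}

/* Indirect call: the target address arrives in a variable at run time (JALR-style). */
int call_indirect(ternary_fn target) {
    return target(1, 2, 3);
}
```

This mirrors the listing, where f() is reached with JAL while printf() is reached with JALR after its address has been loaded into $t9.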

CH1.13 - Global vs. Local Variables & Accessing Passed Arguments

https://v3nn00m.github.io/posts/re4b/chapter113_to114/

Author

0xV3n0m

Published at

2025-12-05

License

0xV3n0m's Personal Blog License

Some information may be outdated
