ARM
ARM: 3 integer arguments
The traditional ARM calling convention for passing arguments works as follows:
The first 4 arguments are passed in registers R0 to R3, and the rest via the stack. This resembles the fastcall and Win64 conventions, which are explained a little later.
ARM 32-bit
Keil 6/2013 (without Optimization) – ARM mode
Listing 1.53: Non-optimizing Keil 6/2013 (ARM mode)
.text:00000000 main
.text:00000000 10 40 2D E9 STMFD SP!, {R4,LR} ; save R4 and LR (return address) on stack
.text:00000004 03 30 A0 E3 MOV R3, #3 ; move value 3 into R3 (4th argument of printf())
.text:00000008 02 20 A0 E3 MOV R2, #2 ; move value 2 into R2 (3rd argument of printf())
.text:0000000C 01 10 A0 E3 MOV R1, #1 ; move value 1 into R1 (2nd argument of printf())
.text:00000010 08 00 8F E2 ADR R0, aADBDCD ; load address of format string into R0
; "a=%d; b=%d; c=%d"
.text:00000014 06 00 00 EB BL __2printf ; branch with link to printf (call printf)
.text:00000018 00 00 A0 E3 MOV R0, #0 ; move 0 into R0 (return value)
.text:0000001C 10 80 BD E8 LDMFD SP!, {R4,PC} ; restore R4 and PC from stack (return)
So the first 4 arguments were passed in registers R0 to R3, in this order:
- The pointer to the printf() format string in R0
- The value 1 in R1
- The value 2 in R2
- The value 3 in R3
The instruction at address 0x18 writes 0 to R0, which corresponds to the return 0 statement in C.
So far there is nothing unusual.
Optimizing Keil 6/2013 generates the same code.
Keil 6/2013 — with Optimization (Thumb mode)
Listing 1.54: Optimizing Keil 6/2013 (Thumb mode)
.text:00000000 main
.text:00000000 10 B5 PUSH {R4,LR} ; save R4 and LR on stack
.text:00000002 03 23 MOVS R3, #3 ; move value 3 into R3 (4th argument of printf())
.text:00000004 02 22 MOVS R2, #2 ; move value 2 into R2 (3rd argument of printf())
.text:00000006 01 21 MOVS R1, #1 ; move value 1 into R1 (2nd argument of printf())
.text:00000008 02 A0 ADR R0, aADBDCD ; load address of format string into R0
; "a=%d; b=%d; c=%d"
.text:0000000A 00 F0 0D F8 BL __2printf ; branch with link to printf (call printf)
.text:0000000E 00 20 MOVS R0, #0 ; move 0 into R0 (return value)
.text:00000010 10 BD POP {R4,PC} ; restore R4 and PC from stack (return)
There is no big difference between this code and the one without optimization in ARM mode.
Keil 6/2013 — Optimization (ARM mode) + we removed return
Let's modify the example and remove return 0:
#include <stdio.h>
void main()
{
printf("a=%d; b=%d; c=%d", 1, 2, 3);
};
The result will be a little bit strange:
Listing 1.55: Optimizing Keil 6/2013 (ARM mode)
.text:00000014 main
.text:00000014 03 30 A0 E3 MOV R3, #3 ; move value 3 into R3 (4th argument of printf())
.text:00000018 02 20 A0 E3 MOV R2, #2 ; move value 2 into R2 (3rd argument of printf())
.text:0000001C 01 10 A0 E3 MOV R1, #1 ; move value 1 into R1 (2nd argument of printf())
.text:00000020 1E 0E 8F E2 ADR R0, aADBDCD ; load address of format string into R0
; "a=%d; b=%d; c=%d\n"
.text:00000024 CB 18 00 EA B __2printf ; unconditional branch (jump) to printf
; no prologue/epilogue - tail call optimization
Let's look at this more closely.
This is the optimized (-O3) version for ARM mode, and this time the last instruction is B instead of the familiar BL.
Another difference is that there is no function prologue or epilogue at all (the instructions that save and restore the values of R4 and LR).
The B instruction simply jumps to another address without modifying LR, much like JMP in x86.
Why does this work?
Because the code is effectively equivalent to the previous one, for two reasons:
- Neither the stack nor SP (the stack pointer) is modified.
- The call to printf() is the last thing the function does; nothing is executed after it.
When printf() finishes, it returns to the address stored in LR.
LR still holds the address of the place from which main was called, so there is no need to save it.
And LR never has to be modified, because the function makes no calls other than printf().
After printf() there is nothing left to do.
That is why this shortcut is possible.
This optimization is often applied to functions whose last statement is a call to another function.
ARM64
GCC (Linaro) 4.9 without Optimization
Listing 1.56: Non-optimizing GCC (Linaro) 4.9
.LC1:
.string "a=%d; b=%d; c=%d"
f2:
; save FP and LR in stack frame:
stp x29, x30, [sp, -16]!
; set stack frame (FP=SP):
add x29, sp, 0
adrp x0, .LC1
add x0, x0, :lo12:.LC1
mov w1, 1
mov w2, 2
mov w3, 3
bl printf
mov w0, 0
; restore FP and LR
ldp x29, x30, [sp], 16
ret
The first instruction STP (Store Pair) stores the frame pointer FP (X29) and the link register LR (X30) onto the stack.
The next instruction ADD X29, SP, 0 creates the stack frame — it simply copies the current value of SP into X29.
Then we see the familiar ADRP / ADD pair that builds a pointer to the string.
The token lo12 means “lower 12 bits”, i.e., the linker will write the lower 12 bits of the address of .LC1 into the opcode of the ADD instruction.
The values 1, 2, 3 are 32-bit integers, so they are loaded into the lower 32-bit parts of the registers (the W- registers).
GCC (Linaro) 4.9 with optimization enabled generates exactly the same code.
ARM: 8 integer arguments
We’ll reuse the same example from before, but now with 8 arguments (actually 9 including the format string):
#include <stdio.h>
int main()
{
printf("a=%d; b=%d; c=%d; d=%d; e=%d; f=%d; g=%d; h=%d\n",
1, 2, 3, 4, 5, 6, 7, 8);
return 0;
};
Optimizing Keil 6/2013: ARM mode
.text:00000028 main
.text:00000028 var_18 = -0x18
.text:00000028 var_14 = -0x14
.text:00000028 var_4 = -4
.text:00000028 04 E0 2D E5 STR LR, [SP,#var_4]!
.text:0000002C 14 D0 4D E2 SUB SP, SP, #0x14
.text:00000030 08 30 A0 E3 MOV R3, #8
.text:00000034 07 20 A0 E3 MOV R2, #7
.text:00000038 06 10 A0 E3 MOV R1, #6
.text:0000003C 05 00 A0 E3 MOV R0, #5
.text:00000040 04 C0 8D E2 ADD R12, SP, #0x18+var_14
.text:00000044 0F 00 8C E8 STMIA R12, {R0-R3}
.text:00000048 04 00 A0 E3 MOV R0, #4
.text:0000004C 00 00 8D E5 STR R0, [SP,#0x18+var_18]
.text:00000050 03 30 A0 E3 MOV R3, #3
.text:00000054 02 20 A0 E3 MOV R2, #2
.text:00000058 01 10 A0 E3 MOV R1, #1
.text:0000005C 6E 0F 8F E2 ADR R0, aADBDCDDDEDFDGD ; "a=%d; b=%d; c=%d; ..."
.text:00000060 BC 18 00 EB BL __2printf
.text:00000064 14 D0 8D E2 ADD SP, SP, #0x14
.text:00000068 04 F0 9D E4 LDR PC, [SP+4+var_4],#4
We can break this code into several parts:
Function prologue
First instruction:
STR LR, [SP,#var_4]!
It saves LR on the stack, because the BL instruction that calls printf() will overwrite that register.
The exclamation mark at the end marks a pre-indexed store with writeback:
- First SP is decremented by 4
- Then the value in LR is stored at the address now pointed to by SP
This is analogous to PUSH in x86.
Next instruction:
SUB SP, SP, #0x14
Decrements SP to allocate 0x14 (20) bytes on the stack.
We need to pass five 32-bit values via the stack before calling printf(), and each takes 4 bytes, so 5 × 4 = 20 bytes exactly.
The first four arguments are passed in registers.
Passing arguments 5, 6, 7, 8 via the stack
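The values 5, 6, 7 and 8 are first placed in R0, R1, R2 and R3 respectively (the four MOV instructions at 0x30 to 0x3C).
Then:
ADD R12, SP, #0x18+var_14
This puts into R12 the address where these four values should be stored on the stack.
var_14 is an IDA macro equal to -0x14.
So effectively 0x18 - 0x14 = 4, and SP + 4 is placed into R12.
Then:
STMIA R12, {R0-R3}
Writes the contents of R0–R3 to memory starting at the address in R12.
“Increment After” means that the store address is increased by 4 after each register is written. R12 itself is left unchanged, because there is no exclamation mark (writeback) after it.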
Passing argument 4 via the stack
The value 4 is placed in R0, then:
STR R0, [SP,#0x18+var_18]
writes it to a location on the stack.
var_18 is -0x18, so the offset is 0x18 - 0x18 = 0.
So the value 4 is written at the address currently pointed to by SP.
Passing 1, 2, 3 via registers
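Just before the call to printf(), the values 1, 2 and 3 are loaded into R1, R2 and R3, and the address of the format string into R0.
These are the first four arguments of printf(): the format string and the integers a, b and c.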
Calling printf()
BL __2printf
Function epilogue
ADD SP, SP, #0x14
Restores SP to its value before we allocated stack space.
The values that were on the stack remain there but will be overwritten by future calls.
Then:
LDR PC, [SP+4+var_4],#4
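Loads the previously saved LR value into PC, i.e., returns from the function.
Here the offset (#4) is written after the closing bracket, so this is post-indexed addressing:
- First PC receives the value pointed to by SP (4+var_4 = 4 + (-4) = 0)
- Then SP is incremented by 4
IDA writes the operand this way to show that the var_4 slot on the stack holds the saved LR.
This instruction is roughly similar to a POP PC, if x86 had one (x86 cannot pop into its instruction pointer; RET is the closest equivalent).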
Optimizing Keil 6/2013: Thumb mode
.text:0000001C printf_main2
.text:0000001C var_18 = -0x18
.text:0000001C var_14 = -0x14
.text:0000001C var_8 = -8
.text:0000001C 00 B5 PUSH {LR}
.text:0000001E 08 23 MOVS R3, #8
.text:00000020 85 B0 SUB SP, SP, #0x14
.text:00000022 04 93 STR R3, [SP,#0x18+var_8]
.text:00000024 07 22 MOVS R2, #7
.text:00000026 06 21 MOVS R1, #6
.text:00000028 05 20 MOVS R0, #5
.text:0000002A 01 AB ADD R3, SP, #0x18+var_14
.text:0000002C 07 C3 STMIA R3!, {R0-R2}
.text:0000002E 04 20 MOVS R0, #4
.text:00000030 00 90 STR R0, [SP,#0x18+var_18]
.text:00000032 03 23 MOVS R3, #3
.text:00000034 02 22 MOVS R2, #2
.text:00000036 01 21 MOVS R1, #1
.text:00000038 A0 A0 ADR R0, aADBDCDDDEDFDGD ; "a=%d; b=%d; c=%d; d=%d; e=%d; f=%d; g=%"...
.text:0000003A 06 F0 D9 F8 BL __2printf
.text:0000003E loc_3E
.text:0000003E 05 B0 ADD SP, SP, #0x14
.text:00000040 00 BD POP {PC}
The output is almost identical to the previous example.
The only real difference is that this is Thumb code, and the values are placed on the stack in a slightly different order:
- Value 8 is stored first
- Then 5, 6 and 7
- And value 4 is stored last
Optimizing Xcode 4.6.3 (LLVM): ARM mode
__text:0000290C _printf_main2
__text:0000290C var_1C = -0x1C
__text:0000290C var_C = -0xC
__text:0000290C 80 40 2D E9 STMFD SP!, {R7,LR}
__text:00002910 0D 70 A0 E1 MOV R7, SP
__text:00002914 14 D0 4D E2 SUB SP, SP, #0x14
__text:00002918 70 05 01 E3 MOV R0, #0x1570
__text:0000291C 07 C0 A0 E3 MOV R12, #7
__text:00002920 00 00 40 E3 MOVT R0, #0
__text:00002924 04 20 A0 E3 MOV R2, #4
__text:00002928 00 00 8F E0 ADD R0, PC, R0
__text:0000292C 06 30 A0 E3 MOV R3, #6
__text:00002930 05 10 A0 E3 MOV R1, #5
__text:00002934 00 20 8D E5 STR R2, [SP,#0x1C+var_1C]
__text:00002938 0A 10 8D E9 STMFA SP, {R1,R3,R12}
__text:0000293C 08 90 A0 E3 MOV R9, #8
__text:00002940 01 10 A0 E3 MOV R1, #1
__text:00002944 02 20 A0 E3 MOV R2, #2
__text:00002948 03 30 A0 E3 MOV R3, #3
__text:0000294C 10 90 8D E5 STR R9, [SP,#0x1C+var_C]
__text:00002950 A4 05 00 EB BL _printf
__text:00002954 07 D0 A0 E1 MOV SP, R7
__text:00002958 80 80 BD E8 LDMFD SP!, {R7,PC}
This is essentially the same as what we have seen before, except for the STMFA (Store Multiple Full Ascending) instruction, which is a synonym for STMIB (Store Multiple Increment Before).
This instruction increments the address before each store, so the first register is written to SP + 4. SP itself is not modified, since there is no writeback.
Another noticeable thing is that the instructions appear to be ordered somewhat randomly.
For example, register R0 is touched in three different places (addresses 0x2918, 0x2920, and 0x2928), even though it could have been done in one place.
Nevertheless, an optimizing compiler may have good reasons for scheduling instructions this way to achieve higher execution efficiency.
Modern processors try to execute nearby instructions in parallel when possible.
For instance, instructions like MOVT R0, #0 and ADD R0, PC, R0 cannot run together because both modify R0.
But MOVT R0, #0 and MOV R2, #4 can execute simultaneously because there is no conflict.
The compiler presumably tries to generate code in this style.
Optimizing Xcode 4.6.3 (LLVM): Thumb-2 mode
__text:00002BA0 _printf_main2
__text:00002BA0 var_1C = -0x1C
__text:00002BA0 var_18 = -0x18
__text:00002BA0 var_C = -0xC
__text:00002BA0 80 B5 PUSH {R7,LR}
__text:00002BA2 6F 46 MOV R7, SP
__text:00002BA4 85 B0 SUB SP, SP, #0x14
__text:00002BA6 41 F2 D8 20 MOVW R0, #0x12D8
__text:00002BAA 4F F0 07 0C MOV.W R12, #7
__text:00002BAE C0 F2 00 00 MOVT.W R0, #0
__text:00002BB2 04 22 MOVS R2, #4
__text:00002BB4 78 44 ADD R0, PC ; char *
__text:00002BB6 06 23 MOVS R3, #6
__text:00002BB8 05 21 MOVS R1, #5
__text:00002BBA 0D F1 04 0E ADD.W LR, SP, #0x1C+var_18
__text:00002BBE 00 92 STR R2, [SP,#0x1C+var_1C]
__text:00002BC0 4F F0 08 09 MOV.W R9, #8
__text:00002BC4 8E E8 0A 10 STMIA.W LR, {R1,R3,R12}
__text:00002BC8 01 21 MOVS R1, #1
__text:00002BCA 02 22 MOVS R2, #2
__text:00002BCC 03 23 MOVS R3, #3
__text:00002BCE CD F8 10 90 STR.W R9, [SP,#0x1C+var_C]
__text:00002BD2 01 F0 0A EA BLX _printf
__text:00002BD6 05 B0 ADD SP, SP, #0x14
__text:00002BD8 80 BD POP {R7,PC}
Exactly the same idea as the previous example — the only difference is the use of Thumb/Thumb-2 instructions instead of full ARM instructions.
ARM64: Non-optimizing GCC (Linaro) 4.9
.LC2:
.string "a=%d; b=%d; c=%d; d=%d; e=%d; f=%d; g=%d; h=%d\n"
f3:
; allocate more space on stack:
sub sp, sp, #32
; save FP and LR in stack frame:
stp x29, x30, [sp,16]
; set frame pointer (FP=SP+16):
add x29, sp, 16
adrp x0, .LC2
add x0, x0, :lo12:.LC2
mov w1, 8 ; 9th argument (value 8)
str w1, [sp] ; store 9th argument on the stack
mov w1, 1
mov w2, 2
mov w3, 3
mov w4, 4
mov w5, 5
mov w6, 6
mov w7, 7
bl printf
sub sp, x29, #16
; restore FP and LR
ldp x29, x30, [sp,16]
add sp, sp, 32
ret
In ARM64 (AArch64), the first eight arguments are passed in X- or W- registers [Procedure Call Standard for the ARM 64-bit Architecture (AArch64), 2013].
The format string pointer is 64-bit, so it goes in X0.
All remaining values are 32-bit ints, so they are placed in the lower 32-bit halves of the registers (the W- registers).
The ninth argument (the value 8) is passed via the stack.
Indeed, you cannot pass an arbitrarily large number of arguments in registers because the number of registers is finite.
Optimized GCC (Linaro) 4.9 produces exactly the same code.
MIPS
3 integer arguments
Optimizing GCC 4.4.5
The main difference from the “Hello, world!” example is that printf() is called here instead of puts(), and 3 additional arguments are passed in registers $5…$7 (or $A1…$A3). This is why these registers carry the A- prefix: they are used for passing function arguments.
Listing 1.58: Optimizing GCC 4.4.5 (assembly output)
$LC0:
.ascii "a=%d; b=%d; c=%d\000"
main:
; function prologue:
lui $28,%hi(__gnu_local_gp) ; Load upper immediate: load high part of __gnu_local_gp address into $28
addiu $sp,$sp,-32 ; Add immediate unsigned: allocate 32 bytes on stack by subtracting from stack pointer
addiu $28,$28,%lo(__gnu_local_gp) ; Add immediate unsigned: add low part of __gnu_local_gp to $28 to complete address
sw $31,28($sp) ; Store word: save return address ($31) at stack offset 28
; load address of printf():
lw $25,%call16(printf)($28) ; Load word: load printf address (16-bit offset) from $28 into $25
; load address of the text string and set 1st argument of printf():
lui $4,%hi($LC0) ; Load upper immediate: load high part of text string address into $4 (1st arg)
addiu $4,$4,%lo($LC0) ; Add immediate unsigned: add low part to complete text string address in $4
; set 2nd argument of printf():
li $5,1 # 0x1 ; Load immediate: set 2nd arg to 1 in $5
; set 3rd argument of printf():
li $6,2 # 0x2 ; Load immediate: set 3rd arg to 2 in $6
; call printf():
jalr $25 ; Jump and link register: jump to printf address in $25, save return in $31
; set 4th argument of printf() (branch delay slot):
li $7,3 # 0x3 ; Load immediate: set 4th arg to 3 in $7 (executes in delay slot)
; function epilogue:
lw $31,28($sp) ; Load word: restore return address from stack offset 28 into $31
; set return value to 0:
move $2,$0 ; Move: set return value ($2) to 0
; return
j $31 ; Jump: jump to address in $31 (return)
addiu $sp,$sp,32 ; Add immediate unsigned: deallocate stack space (delay slot)
Listing 1.59: Optimizing GCC 4.4.5 (IDA)
.text:00000000 main:
.text:00000000
.text:00000000 var_10 = -0x10
.text:00000000 var_4 = -4
.text:00000000
; function prologue:
.text:00000000 lui $gp, (__gnu_local_gp >> 16) ; Load upper immediate: load high 16 bits of __gnu_local_gp into $gp
.text:00000004 addiu $sp, -0x20 ; Add immediate unsigned: allocate 32 bytes on stack
.text:00000008 la $gp, (__gnu_local_gp & 0xFFFF) ; Load address (pseudo): complete __gnu_local_gp address in $gp
.text:0000000C sw $ra, 0x20+var_4($sp) ; Store word: save return address ($ra) on stack
.text:00000010 sw $gp, 0x20+var_10($sp) ; Store word: save $gp on stack
; load address of printf():
.text:00000014 lw $t9, (printf & 0xFFFF)($gp) ; Load word: load printf address into $t9
; load address of the text string and set 1st argument of printf():
.text:00000018 la $a0, $LC0 # "a=%d; b=%d; c=%d" ; Load address (pseudo): set 1st arg to text string address
; set 2nd argument of printf():
.text:00000020 li $a1, 1 ; Load immediate: set 2nd arg to 1
; set 3rd argument of printf():
.text:00000024 li $a2, 2 ; Load immediate: set 3rd arg to 2
; call printf():
.text:00000028 jalr $t9 ; Jump and link register: call printf, save return in $ra
; set 4th argument of printf() (branch delay slot):
.text:0000002C li $a3, 3 ; Load immediate: set 4th arg to 3 (delay slot)
; function epilogue:
.text:00000030 lw $ra, 0x20+var_4($sp) ; Load word: restore return address
; set return value to 0:
.text:00000034 move $v0, $zero ; Move: set return value to 0
; return
.text:00000038 jr $ra ; Jump register: return
.text:0000003C addiu $sp, 0x20 ; Add immediate unsigned: deallocate stack (delay slot)
Here IDA has merged the LUI and ADDIU instruction pair into a single LA pseudo-instruction. This is why there is no instruction at address 0x1C: the LA pseudo-instruction occupies 8 bytes.
GCC 4.4.5 Non-optimizing
The non-optimizing GCC is more verbose:
Listing 1.60: Non-optimizing GCC 4.4.5 (assembly output)
$LC0:
.ascii "a=%d; b=%d; c=%d\000"
main:
; function prologue:
addiu $sp,$sp,-32 ; Add immediate unsigned: allocate 32 bytes on stack
sw $31,28($sp) ; Store word: save return address ($31)
sw $fp,24($sp) ; Store word: save frame pointer ($fp)
move $fp,$sp ; Move: set frame pointer to current stack pointer
lui $28,%hi(__gnu_local_gp) ; Load upper immediate: load high part of __gnu_local_gp
addiu $28,$28,%lo(__gnu_local_gp) ; Add immediate unsigned: complete __gnu_local_gp address
; load address of the text string:
lui $2,%hi($LC0) ; Load upper immediate: load high part of text string address into $2
addiu $2,$2,%lo($LC0) ; Add immediate unsigned: complete text string address in $2
; set 1st argument of printf():
move $4,$2 ; Move: set 1st arg ($4) to text string address
; set 2nd argument of printf():
li $5,1 # 0x1 ; Load immediate: set 2nd arg to 1
; set 3rd argument of printf():
li $6,2 # 0x2 ; Load immediate: set 3rd arg to 2
; set 4th argument of printf():
li $7,3 # 0x3 ; Load immediate: set 4th arg to 3
; get address of printf():
lw $2,%call16(printf)($28) ; Load word: load printf address into $2
nop ; No operation: delay slot filler
; call printf():
move $25,$2 ; Move: set $25 to printf address
jalr $25 ; Jump and link register: call printf
nop ; No operation: delay slot filler
; function epilogue:
lw $28,16($fp) ; Load word: restore $28 from frame pointer offset
; set return value to 0:
move $2,$0 ; Move: set return value to 0
move $sp,$fp ; Move: restore stack pointer from frame pointer
lw $31,28($sp) ; Load word: restore return address
lw $fp,24($sp) ; Load word: restore frame pointer
addiu $sp,$sp,32 ; Add immediate unsigned: deallocate stack
; return
j $31 ; Jump: return
nop ; No operation: delay slot filler
Listing 1.61: Non-optimizing GCC 4.4.5 (IDA)
.text:00000000 main:
.text:00000000
.text:00000000 var_10 = -0x10
.text:00000000 var_8 = -8
.text:00000000 var_4 = -4
.text:00000000
; function prologue:
.text:00000000 addiu $sp, -0x20 ; Add immediate unsigned: allocate 32 bytes on stack
.text:00000004 sw $ra, 0x20+var_4($sp) ; Store word: save return address
.text:00000008 sw $fp, 0x20+var_8($sp) ; Store word: save frame pointer
.text:0000000C move $fp, $sp ; Move: set frame pointer to stack pointer
.text:00000010 la $gp, __gnu_local_gp ; Load address (pseudo): set $gp to __gnu_local_gp
.text:00000018 sw $gp, 0x20+var_10($sp) ; Store word: save $gp on stack
; load address of the text string:
.text:0000001C la $v0, aADBDCD # "a=%d; b=%d; c=%d" ; Load address (pseudo): load text string address into $v0
; set 1st argument of printf():
.text:00000024 move $a0, $v0 ; Move: set 1st arg to text string
; set 2nd argument of printf():
.text:00000028 li $a1, 1 ; Load immediate: set 2nd arg to 1
; set 3rd argument of printf():
.text:0000002C li $a2, 2 ; Load immediate: set 3rd arg to 2
; set 4th argument of printf():
.text:00000030 li $a3, 3 ; Load immediate: set 4th arg to 3
; get address of printf():
.text:00000034 lw $v0, (printf & 0xFFFF)($gp) ; Load word: load printf address into $v0
.text:00000038 or $at, $zero ; Or: no operation (NOP)
; call printf():
.text:0000003C move $t9, $v0 ; Move: set $t9 to printf address
.text:00000040 jalr $t9 ; Jump and link register: call printf
.text:00000044 or $at, $zero ; NOP ; Or: no operation (delay slot)
; function epilogue:
.text:00000048 lw $gp, 0x20+var_10($fp) ; Load word: restore $gp
; set return value to 0:
.text:0000004C move $v0, $zero ; Move: set return value to 0
.text:00000050 move $sp, $fp ; Move: restore stack pointer
.text:00000054 lw $ra, 0x20+var_4($sp) ; Load word: restore return address
.text:00000058 lw $fp, 0x20+var_8($sp) ; Load word: restore frame pointer
.text:0000005C addiu $sp, 0x20 ; Add immediate unsigned: deallocate stack
; return
.text:00000060 jr $ra ; Jump register: return
.text:00000064 or $at, $zero ; NOP ; Or: no operation (delay slot)
8 integer arguments
Let's reuse the example with 9 arguments from the previous section:
#include <stdio.h>
int main()
{
printf("a=%d; b=%d; c=%d; d=%d; e=%d; f=%d; g=%d; h=%d\n", 1, 2, 3, 4, 5, 6, 7, 8);
return 0;
};
Optimizing GCC 4.4.5
Only the first 4 arguments are passed in registers $A0 … $A3; the rest are passed via the stack.
This is the O32 calling convention (the most common one in the MIPS world).
Other calling conventions, or hand-written assembly code, may use the registers for other purposes.
SW is short for “Store Word” (from register to memory).
MIPS has no instruction that stores a constant directly into memory, so an instruction pair (LI / SW) is used instead.
Listing 1.62: Optimizing GCC 4.4.5 (assembly output)
$LC0:
.ascii "a=%d; b=%d; c=%d; d=%d; e=%d; f=%d; g=%d; h=%d\012\000"
main:
; function prologue:
lui $28,%hi(__gnu_local_gp) ; Load upper immediate: load high part of __gnu_local_gp
addiu $sp,$sp,-56 ; Add immediate unsigned: allocate 56 bytes on stack
addiu $28,$28,%lo(__gnu_local_gp) ; Add immediate unsigned: complete __gnu_local_gp address
sw $31,52($sp) ; Store word: save return address at offset 52
; pass 5th argument in stack:
li $2,4 # 0x4 ; Load immediate: load 4 into $2
sw $2,16($sp) ; Store word: pass 5th arg (4) on stack at offset 16
; pass 6th argument in stack:
li $2,5 # 0x5 ; Load immediate: load 5 into $2
sw $2,20($sp) ; Store word: pass 6th arg (5) on stack at offset 20
; pass 7th argument in stack:
li $2,6 # 0x6 ; Load immediate: load 6 into $2
sw $2,24($sp) ; Store word: pass 7th arg (6) on stack at offset 24
; pass 8th argument in stack:
li $2,7 # 0x7 ; Load immediate: load 7 into $2
lw $25,%call16(printf)($28) ; Load word: load printf address into $25
sw $2,28($sp) ; Store word: pass 8th arg (7) on stack at offset 28
; pass 1st argument in $a0:
lui $4,%hi($LC0) ; Load upper immediate: load high part of text string into $4 (1st arg)
; pass 9th argument in stack:
li $2,8 # 0x8 ; Load immediate: load 8 into $2
sw $2,32($sp) ; Store word: pass 9th arg (8) on stack at offset 32
addiu $4,$4,%lo($LC0) ; Add immediate unsigned: complete text string address in $4
; pass 2nd argument in $a1:
li $5,1 ; Load immediate: set 2nd arg to 1
; pass 3rd argument in $a2:
li $6,2 ; Load immediate: set 3rd arg to 2
; call printf():
jalr $25 ; Jump and link register: call printf
; pass 4th argument in $a3 (branch delay slot):
li $7,3 ; Load immediate: set 4th arg to 3 (delay slot)
; function epilogue:
lw $31,52($sp) ; Load word: restore return address
; return value = 0:
move $2,$0 ; Move: set return value to 0
; return:
j $31 ; Jump: return
addiu $sp,$sp,56 ; Add immediate unsigned: deallocate stack (delay slot)
Listing 1.63: Optimizing GCC 4.4.5 (IDA)
.text:00000000 main:
.text:00000000
.text:00000000 var_28 = -0x28
.text:00000000 var_24 = -0x24
.text:00000000 var_20 = -0x20
.text:00000000 var_1C = -0x1C
.text:00000000 var_18 = -0x18
.text:00000000 var_10 = -0x10
.text:00000000 var_4 = -4
; function prologue:
.text:00000000 lui $gp, (__gnu_local_gp >> 16) ; Load upper immediate: high 16 bits of __gnu_local_gp
.text:00000004 addiu $sp, -0x38 ; Add immediate unsigned: allocate 56 bytes on stack
.text:00000008 la $gp, (__gnu_local_gp & 0xFFFF) ; Load address (pseudo): complete __gnu_local_gp
.text:0000000C sw $ra, 0x38+var_4($sp) ; Store word: save return address
.text:00000010 sw $gp, 0x38+var_10($sp) ; Store word: save $gp
; pass 5th argument:
.text:00000014 li $v0, 4 ; Load immediate: 4 into $v0 for 5th arg
.text:00000018 sw $v0, 0x38+var_28($sp) ; Store word: pass 5th arg on stack
; pass 6th:
.text:0000001C li $v0, 5 ; Load immediate: 5 into $v0 for 6th arg
.text:00000020 sw $v0, 0x38+var_24($sp) ; Store word: pass 6th arg on stack
; pass 7th:
.text:00000024 li $v0, 6 ; Load immediate: 6 into $v0 for 7th arg
.text:00000028 sw $v0, 0x38+var_20($sp) ; Store word: pass 7th arg on stack
; pass 8th:
.text:0000002C li $v0, 7 ; Load immediate: 7 into $v0 for 8th arg
.text:00000030 lw $t9, (printf & 0xFFFF)($gp) ; Load word: load printf address
.text:00000034 sw $v0, 0x38+var_1C($sp) ; Store word: pass 8th arg on stack
; prepare $a0:
.text:00000038 lui $a0, ($LC0 >> 16) ; Load upper immediate: high part of text string for 1st arg
.text:0000003C li $v0, 8 ; Load immediate: 8 into $v0 for 9th arg
.text:00000040 sw $v0, 0x38+var_18($sp) ; Store word: pass 9th arg on stack
.text:00000044 la $a0, ($LC0 & 0xFFFF) ; Load address (pseudo): complete text string address
; $a1, $a2:
.text:00000048 li $a1, 1 ; Load immediate: 2nd arg to 1
.text:0000004C li $a2, 2 ; Load immediate: 3rd arg to 2
; call printf():
.text:00000050 jalr $t9 ; Jump and link register: call printf
.text:00000054 li $a3, 3 ; Load immediate: 4th arg to 3 (delay slot)
; function epilogue:
.text:00000058 lw $ra, 0x38+var_4($sp) ; Load word: restore return address
.text:0000005C move $v0, $zero ; Move: return value to 0
; return:
.text:00000060 jr $ra ; Jump register: return
.text:00000064 addiu $sp, 0x38 ; Add immediate unsigned: deallocate stack (delay slot)
1.11.4 Conclusion
Skeleton of a function call on different architectures
Listing 1.66: x86
PUSH 3rd argument ; Push 3rd argument onto stack
PUSH 2nd argument ; Push 2nd argument onto stack
PUSH 1st argument ; Push 1st argument onto stack
CALL function ; Call the function
; modify stack pointer (if needed) ; Adjust stack pointer if necessary after call
Listing 1.67: x64 (MSVC)
MOV RCX, 1st argument ; Move 1st argument into RCX
MOV RDX, 2nd argument ; Move 2nd argument into RDX
MOV R8, 3rd argument ; Move 3rd argument into R8
MOV R9, 4th argument ; Move 4th argument into R9
...
PUSH 5th, 6th argument, etc. (if needed) ; Push additional arguments onto stack if needed
CALL function ; Call the function
; modify stack pointer (if needed) ; Adjust stack pointer if necessary
Listing 1.68: x64 (GCC)
MOV RDI, 1st argument ; Move 1st argument into RDI
MOV RSI, 2nd argument ; Move 2nd argument into RSI
MOV RDX, 3rd argument ; Move 3rd argument into RDX
MOV RCX, 4th argument ; Move 4th argument into RCX
MOV R8, 5th argument ; Move 5th argument into R8
MOV R9, 6th argument ; Move 6th argument into R9
...
PUSH 7th, 8th argument, etc. (if needed) ; Push additional arguments onto stack if needed
CALL function ; Call the function
; modify stack pointer (if needed) ; Adjust stack pointer if necessary
Listing 1.69: ARM (32-bit)
MOV R0, 1st argument ; Move 1st argument into R0
MOV R1, 2nd argument ; Move 2nd argument into R1
MOV R2, 3rd argument ; Move 3rd argument into R2
MOV R3, 4th argument ; Move 4th argument into R3
; pass 5th, 6th argument, etc. in stack (if needed) ; Pass additional args on stack if needed
BL function ; Branch with link: call function
; modify stack pointer (if needed) ; Adjust stack pointer if necessary
Listing 1.70: ARM64
MOV X0, 1st argument ; Move 1st argument into X0
MOV X1, 2nd argument ; Move 2nd argument into X1
MOV X2, 3rd argument ; Move 3rd argument into X2
MOV X3, 4th argument ; Move 4th argument into X3
MOV X4, 5th argument ; Move 5th argument into X4
MOV X5, 6th argument ; Move 6th argument into X5
MOV X6, 7th argument ; Move 7th argument into X6
MOV X7, 8th argument ; Move 8th argument into X7
; pass 9th, 10th argument, etc. in stack (if needed) ; Pass additional args on stack if needed
BL function ; Branch with link: call function
; modify stack pointer (if needed) ; Adjust stack pointer if necessary
Listing 1.71: MIPS (O32 calling convention)
LI $4, 1st argument ; AKA $A0 ; Load immediate: set 1st arg in $4 ($A0)
LI $5, 2nd argument ; AKA $A1 ; Load immediate: set 2nd arg in $5 ($A1)
LI $6, 3rd argument ; AKA $A2 ; Load immediate: set 3rd arg in $6 ($A2)
LI $7, 4th argument ; AKA $A3 ; Load immediate: set 4th arg in $7 ($A3)
; pass 5th, 6th, ... arguments in stack (if needed) ; Pass additional args on stack if needed
LW temp_reg, address_of_function ; Load word: load function address into temp_reg
JALR temp_reg ; Jump and link register: call function
1.11.5 By the way
The differences between argument passing in x86, x64, fastcall, ARM and MIPS illustrate an important fact:
- The processor (CPU) knows nothing about calling conventions.
- Register assignments like
- $A0 … $A3 in MIPS
- RCX/RDX/... in x64
are just agreements between the code that makes the call and the code that receives it, i.e., conventions followed by compilers.
- In hand-written assembly you can pass arguments in any order, through:
- any register you choose
- or even through global variables (!!)
The processor does not care how the arguments were passed; it just executes the instructions.