Reverse Engineering for Beginners (CH1.5 Hello, world!) {Part_2}
ARM

The author notes that several compilers were used for the ARM experiments:
- Keil 6/2013, popular in the embedded systems field.
- Apple Xcode 4.6.3 IDE with the LLVM-GCC 4.2 compiler.
- GCC 4.9 (Linaro) (for ARM64).
All examples in the book use 32-bit ARM code (including Thumb and Thumb-2 modes) unless stated otherwise.
64-bit ARM is referred to as ARM64.
Non-optimizing Keil 6/2013 (ARM mode)
Let’s start compiling our example on Keil:
The armcc compiler outputs Assembly in Intel Syntax, but with ARM-specific macros. What matters more to us is to see the instructions exactly as they are, so let’s look at the result in IDA:
In this example, we can easily see that each instruction is 4 bytes in size. This is because we compiled it in ARM mode, not Thumb.
The first instruction is STMFD, which stores R4 and LR on the stack.
STMFD stands for "Store Multiple Full Descending":
1. The SP (Stack Pointer) is decreased by enough to make room for the values.
2. The values in R4 and LR are written to the stack.
It is similar to the `PUSH` instruction in x86, except that a single instruction can push several registers at once.
Be careful: for simplicity the armcc listing shows PUSH {r4,lr}, but this is not accurate, because the PUSH instruction is only available in Thumb mode.
That is why IDA shows the real mnemonic, STMFD.
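The two steps above can be modelled in a few lines of C. This is only an illustrative sketch: the function names `stmfd` and `ldmfd` are made up for this example, and the "stack" is an ordinary array of 32-bit words rather than real memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of "STMFD SP!, {...}" on a full-descending stack:
 * SP is decreased first and ends up pointing at the last stored value.
 * regs[] is given in ascending register order, so the lowest-numbered
 * register lands at the lowest address. Returns the updated SP. */
static uint32_t *stmfd(uint32_t *sp, const uint32_t *regs, size_t count)
{
    sp -= count;                        /* step 1: make room */
    for (size_t i = 0; i < count; i++)  /* step 2: write the values */
        sp[i] = regs[i];
    return sp;
}

/* Model of "LDMFD SP!, {...}": read the values back, then increase SP. */
static uint32_t *ldmfd(uint32_t *sp, uint32_t *regs, size_t count)
{
    for (size_t i = 0; i < count; i++)
        regs[i] = sp[i];
    return sp + count;
}
```

Pushing two values (standing in for R4 and LR) moves SP down by two words, and popping them restores both the values and the original SP.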
The second instruction is ADR. It adds a value to (or subtracts it from) the PC (Program Counter) to get the address of the string `"hello, world"`.
This is called Position-Independent Code (code that is not tied to a fixed address).
Such code can be executed anywhere in memory because it relies only on the difference (offset) between the code’s location and the data’s location.
The offset itself is fixed when the code is built; at run time the CPU adds it to the current value of PC, which yields the actual address of our string.
The next instruction is BL (Branch with Link), which calls the `printf()` function. Here’s how it works:
- It saves the address of the instruction following the BL (0xC) in the LR (Link Register).
- Then it transfers control to printf() by writing its address into the PC.
When printf() finishes, it returns by jumping to the address stored in LR.
The difference here is that ARM processors (which are RISC) store the return address in LR, while x86 processors (CISC) place it on the stack.
The author will explain this in more detail in another section.
By the way, BL cannot encode a full 32-bit address or offset, because the instruction has only 24 bits available for it.
Since every ARM-mode instruction is 4 bytes (32 bits) long, instructions always sit at addresses divisible by 4, so the two lowest bits of any target are always zero and need not be encoded. That gives an effective 26-bit offset.
This is enough to reach about ±32 MB around the current PC.
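The offset arithmetic can be checked with a small C sketch. The function names are hypothetical. One detail not spelled out in the text is assumed here: in ARM mode, PC reads as the address of the current instruction plus 8, so the target is computed relative to that value.

```c
#include <stdint.h>

/* Extract the signed byte offset from an ARM-mode BL instruction word. */
static int32_t arm_bl_offset(uint32_t insn)
{
    int32_t imm24 = (int32_t)(insn & 0x00FFFFFFu);  /* low 24 bits */
    imm24 = (imm24 ^ 0x00800000) - 0x00800000;      /* sign-extend 24 -> 32 */
    return imm24 * 4;               /* restore the two implied zero bits */
}

/* Compute the branch target of a BL located at insn_addr. */
static uint32_t arm_bl_target(uint32_t insn_addr, uint32_t insn)
{
    /* In ARM mode the PC reads as the instruction address plus 8. */
    return insn_addr + 8u + (uint32_t)arm_bl_offset(insn);
}
```

The largest positive offset is 0x7FFFFF * 4 = 0x1FFFFFC bytes and the most negative is -0x2000000 bytes, which is where the figure of about ±32 MB comes from.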
Afterwards, MOV R0, #0 writes the value 0 into register R0.
This is because the main function returns 0 at the end, and this return value is stored in R0.
The final instruction is LDMFD. It loads values from the stack into R4 and PC and then increments SP, so it works like a POP.
Note:
- The first instruction, STMFD, saved the pair R4 and LR on the stack.
- LDMFD restores the pair R4 and PC.
This makes sense: on entry, LR holds the address that main() must return to, and it has to be saved because main() reuses LR when it calls printf(). At the end, the saved value is loaded straight into PC, which returns control to the caller of main().
That caller is usually the C runtime (CRT) or the OS.
That’s why there’s no need for a separate BX LR at the end of the function.
Finally, DCB is a directive in assembly used to define an array of bytes or strings, similar to DB in x86 assembly.
Non-optimizing Keil 6/2013 (Thumb mode)
Let’s compile the same example, but this time in Thumb mode:
The result in IDA looked like this:
We can easily notice that the opcodes are all 2 bytes (16 bits) in size, which means this code is indeed in Thumb mode.
Keep in mind, though, that the BL here consists of two 16-bit instructions, because a single 16-bit opcode does not have room for the offset to printf().
Here’s what happens:
- The first 16-bit instruction carries the upper 11 bits of the offset.
- The second 16-bit instruction carries the lower 11 bits.
Since every Thumb instruction is 2 bytes long, no instruction can sit at an odd address, so the lowest bit of the offset is always zero and is not encoded.
In the end, the BL in Thumb mode can encode an offset of around:
current_PC ± approximately 4 MB.
As for the other instructions in the example:
- PUSH and POP work just like the STMFD and LDMFD described earlier, except that SP is not mentioned explicitly.
- ADR works the same as in the previous example.
- MOVS R0, #0 writes 0 to R0 so that the function returns 0.
Optimizing Xcode 4.6.3 (LLVM) (ARM mode)
The author here used Xcode 4.6.3, but this time enabled Optimization (maximum level) using the switch:
Without optimization, unnecessary code is generated, so the author opted for the version with the fewest possible instructions.
Listing 1.27: Optimizing Xcode 4.6.3 (LLVM) (ARM mode)
The STMFD and LDMFD instructions are already familiar to us.
As for MOV, here it writes the value 0x1686 into R0, which is an offset pointing to the location of the string `"Hello world!"`.
The R7 register (as mentioned in the iOS ABI Function Call Guide - 2010) is used as the Frame Pointer (which will be explained later).
MOVT R0, #0 (MOVe Top) writes zero into the upper 16 bits of the register.
Why is it there?
In ARM mode, this form of MOV can only carry a 16-bit immediate, which goes into the lower 16 bits of the register.
To set the upper 16 bits, a second instruction, MOVT, is needed.
In this case, however, MOVT is redundant, because the preceding MOV already cleared the upper half.
This is probably a minor shortcoming of the compiler.
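A minimal C model of the two instructions makes the redundancy easy to see. The function names are hypothetical, and the model covers only the 16-bit-immediate form of MOV discussed here.

```c
#include <stdint.h>

/* MOV with a 16-bit immediate: the lower half receives the value,
 * the upper half is cleared. */
static uint32_t mov_imm16(uint16_t imm)
{
    return (uint32_t)imm;
}

/* MOVT: replace only the upper 16 bits, keep the lower 16 bits. */
static uint32_t movt(uint32_t reg, uint16_t imm)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm << 16);
}
```

With these definitions, `movt(mov_imm16(0x1686), 0)` is still 0x1686, so the MOVT changes nothing. The pair only matters when the upper half is non-zero, for example `movt(mov_imm16(0x5678), 0x1234)` gives 0x12345678.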
Then an ADD instruction adds the value of PC to R0, producing the full address of the string.
This is the same position-independent code idea that we discussed earlier.
The BL _puts instruction calls puts() instead of printf().
LLVM replaced printf() with puts(). This is safe because printf() with a single string argument and no format specifiers such as %d or %s prints the string as is, which is what puts() does as well (puts() appends the newline itself, so the string passed to it does not need one).
If the string contained a %, the two calls would no longer be equivalent.
Why make the replacement?
Because puts() is faster: it simply writes the string without scanning it for format specifiers.
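The condition for the swap can be sketched as a small C predicate. This is a deliberately simplified assumption about what a compiler checks, not LLVM's actual logic, and the function name is made up: the format string must contain no `%` and must end with a newline, because puts() adds a newline of its own.

```c
#include <stddef.h>
#include <string.h>

/* Simplified check: could printf(fmt) be replaced by puts() of the
 * same text without its trailing newline? Returns 1 if yes, 0 if no. */
static int printf_can_become_puts(const char *fmt)
{
    size_t len = strlen(fmt);

    if (strchr(fmt, '%') != NULL)   /* any '%' needs the printf machinery */
        return 0;
    return len > 0 && fmt[len - 1] == '\n';
}
```

Under this sketch, "Hello world!\n" qualifies, while "%d\n" does not (it contains a format specifier) and "Hello" does not (puts() would add a newline that printf() would not print).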
At the end, a MOV writes zero into R0 as the return value.
Optimizing Xcode 4.6.3 (LLVM) (Thumb-2 mode)
By default, Xcode 4.6.3 generates code for Thumb-2 in this manner:
In Thumb mode, the BL and BLX instructions are encoded as pairs of 16-bit instructions.
Thumb-2 extends these pairs into genuine 32-bit instructions.
They are easy to spot, because 32-bit Thumb-2 instructions always begin with 0xFx or 0xEx.
In the IDA listing, however, the opcode bytes appear swapped, because ARM stores instructions in little-endian order:
- In ARM or ARM64 mode: Byte order is 4-3-2-1.
- In Thumb mode: 2-1.
- In a pair of 16-bit instructions in Thumb-2: 2-1-4-3.
With that in mind, we can see that MOVW, MOVT.W, and BLX all begin with 0xFx.
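The byte orders listed above can be reproduced with a short C sketch that rebuilds an instruction from the raw bytes as they sit in memory. The function names are hypothetical. The prefix test relies on the ARM rule that a halfword whose top five bits are 11101, 11110, or 11111 starts a 32-bit Thumb-2 instruction, which matches the 0xEx/0xFx pattern.

```c
#include <stdint.h>

/* ARM mode: four bytes, least significant first (order 4-3-2-1). */
static uint32_t arm_word(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Thumb mode: two bytes, least significant first (order 2-1). */
static uint16_t thumb_halfword(const uint8_t b[2])
{
    return (uint16_t)((uint32_t)b[0] | ((uint32_t)b[1] << 8));
}

/* Thumb-2: two little-endian halfwords, first halfword on top
 * (order 2-1-4-3). */
static uint32_t thumb2_word(const uint8_t b[4])
{
    return ((uint32_t)thumb_halfword(b) << 16) | thumb_halfword(b + 2);
}

/* Does this halfword start a 32-bit Thumb-2 instruction? */
static int is_thumb2_32bit_prefix(uint16_t hw)
{
    return (hw >> 11) >= 0x1D;   /* top five bits 11101, 11110 or 11111 */
}
```

For the bytes 12 34 56 78 in memory, `arm_word` gives 0x78563412 and `thumb2_word` gives 0x34127856, which shows why the same bytes read differently depending on the mode.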
For example, MOVW places a 16-bit value in the lower half of the R0 register and clears the upper bits.
Similarly, MOVT.W R0, #0 works the same as the MOVT in the previous example, but it is the Thumb-2 version.
Another difference is that BLX is used here instead of BL.
BLX not only transfers execution to puts() and saves the return address in LR, it also switches the processor mode from Thumb or Thumb-2 to ARM (or vice versa).
It is needed here because the code it jumps to is written in ARM mode.
That code is simply a jump to the location in the import section where the real address of puts() is stored.
You might ask: “Why don’t we just call puts() directly instead of this roundabout way?”
The answer is that this method saves space.
Almost every program uses dynamic libraries (like DLLs in Windows, .so in Linux, or .dylib in macOS).
These libraries contain ready-made functions like puts().
In the executable file (like EXE, ELF, or Mach-O), there’s an import section, which contains the names of the functions or variables that the program imports from other libraries, along with the name of the library they come from.
The OS loader reads this list and retrieves the real addresses of these functions from memory.
In our case, __imp__puts is a 32-bit variable that stores the address of the puts() function in memory.
Then, the LDR instruction reads this address and puts it in PC so execution transfers to the function.
That’s why instead of writing the address of puts() every time, we write it once in a designated place.
Also, it’s not possible to place the full 32-bit value into a register in a single instruction without touching memory.
So the optimal solution is to create a small function in ARM mode whose job is to jump to the original function in the library.
This is called a Thunk Function.
The code in Thumb mode calls this small function.
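In C terms, the arrangement looks roughly like the sketch below. It is only an analogy: `imp_puts` stands in for the `__imp__puts` slot and is initialized statically here, whereas in a real program the OS loader fills it in, and `puts_thunk` is a made-up name for the small jump function.

```c
#include <stdio.h>

/* Stand-in for the import-section slot: one pointer-sized variable
 * holding the real address of puts(). */
static int (*imp_puts)(const char *) = puts;

/* The thunk: its only job is to jump through the slot. Every call site
 * in the program calls this instead of embedding the full address. */
static int puts_thunk(const char *s)
{
    return imp_puts(s);
}
```

Calling `puts_thunk("Hello world!")` prints the string just as a direct call to puts() would. The address of puts() is stored in exactly one place, however many call sites there are.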
By the way, in the previous example (which was in ARM mode), control is transferred by a regular BL because the mode doesn’t change (that’s why there is no "X" in the name).
About Thunk Functions:
Many people get confused about Thunk functions because the name is strange.
But the idea is simple: a Thunk is an adapter or a wrapper between two different modes or systems.
That’s why sometimes they’re also called Wrappers (around another function).
Examples:
- According to researcher P.Z. Ingerman, who invented the Thunk in 1961, a thunk is "a piece of code that calculates a certain value and leaves the address of the result in a known place."
- Later, Microsoft and IBM also used the term when they made 16-bit and 32-bit systems work together (like WOW – Windows on Windows).
- The LAPACK linear algebra library is written in FORTRAN. C/C++ developers want to use it, but rewriting all of its code is impractical, so they write short C functions (thunks) that call the original FORTRAN routines.
ARM64
GCC
Let’s compile this example using GCC 4.8.1 on the ARM64 architecture:
In the .rodata section (which contains the string data):
In ARM64, there’s neither Thumb nor Thumb-2 mode, meaning everything is in ARM mode only, and all instructions are 32-bit in size.
The number of registers has doubled. The registers are 64-bit and their names begin with X, while their lower 32-bit halves are accessed under names beginning with W.
The STP (Store Pair) instruction stores two registers at once, here X29 and X30. The pair could be stored anywhere in memory, but the address given here is SP (Stack Pointer), so the pair goes onto the stack.
Since the registers are 64-bit, each of them takes 8 bytes, so together they occupy 16 bytes.
The exclamation mark ! after the operand means that 16 is subtracted from SP first, and only then are the values written to the stack. This is called pre-index. (The opposite is post-index, where SP is adjusted after the access.)
In x86 terms, the first instruction is essentially the same as a pair of pushes: PUSH X29 followed by PUSH X30.
X29 is the Frame Pointer (FP), and X30 is the Link Register (LR). These are stored at the beginning of the function (prologue) and restored at the end (epilogue).
The second instruction mov x29, sp copies the address of the Stack Pointer into X29 to prepare the stack frame for the function.
Next, we have the ADRP and ADD instructions. Both are used to place the address of the string "Hello!" into register X0 (because the first parameter of the function is passed in X0).
There’s no single ARM instruction that can place a large number or a full address into a register (because the instruction size is limited to 4 bytes).
Thus, it is done in two steps:
- ADRP places the address of the page (4KB Page) containing the string.
- ADD adds the remaining part of the address.
The resulting address:
This is indeed the location of the string "Hello!" that we found in .rodata.
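The two-step address computation can be sketched in C as follows. The function names and the numeric values in the note below are hypothetical, chosen only to show the arithmetic: ADRP works on 4 KB pages relative to the page containing the current instruction, and ADD supplies the low 12 bits.

```c
#include <stdint.h>

/* ADRP: take the 4 KB page containing pc and move page_delta pages
 * away from it. */
static uint64_t adrp(uint64_t pc, int64_t page_delta)
{
    return (pc & ~(uint64_t)0xFFF) + (uint64_t)(page_delta * 4096);
}

/* ADD: add the offset of the data inside its page (low 12 bits). */
static uint64_t add_lo12(uint64_t page, uint32_t lo12)
{
    return page + (lo12 & 0xFFFu);
}
```

For instance, with code at the made-up address 0x400594 and a string 0x648 bytes into the same page, `adrp(0x400594, 0)` yields 0x400000 and `add_lo12(0x400000, 0x648)` yields 0x400648.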
After that, BL puts@plt is used to call the puts() function (as we saw earlier).
Then MOV W0, #0 places zero into register W0, which is the lower 32 bits of register X0.
The function returns its result in X0, and since main() returns an int (which is 32 bits), it’s sufficient to fill only the lower part, W0.
To confirm, let’s modify the example to make it return a uint64_t (64 bits):
#include <stdio.h>
#include <stdint.h>

uint64_t main()
{
    printf("Hello!\n");
    return 0;
}
You will get the same result, except that the instruction now becomes MOV X0, #0.
This means that when the function returns 64 bits, the value must be written to X0, not W0.
Next, LDP X29, X30, [SP], #16 reads the values we previously stored (X29 and X30) back from the stack. There is no exclamation mark this time, which means the values are loaded first and SP is increased by 16 afterwards (this is called post-index).
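The difference between the two addressing forms can be modelled in C. This is a sketch with made-up function names, treating the stack as an array of 64-bit slots, so moving by 16 bytes means moving by two elements.

```c
#include <stdint.h>

/* Pre-index, like "STP a, b, [SP, #-16]!":
 * adjust SP first, then store at the new SP. */
static uint64_t *stp_pre_index(uint64_t *sp, uint64_t a, uint64_t b)
{
    sp -= 2;        /* 16 bytes = two 8-byte slots */
    sp[0] = a;
    sp[1] = b;
    return sp;
}

/* Post-index, like "LDP a, b, [SP], #16":
 * load from the current SP, then adjust SP. */
static uint64_t *ldp_post_index(uint64_t *sp, uint64_t *a, uint64_t *b)
{
    *a = sp[0];
    *b = sp[1];
    return sp + 2;
}
```

A store with pre-index followed by a load with post-index returns both registers and SP to their original values, which is exactly what the prologue and epilogue pair does.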
Finally, RET is a new instruction in ARM64. It does the same job as BX LR, but carries an extra hint bit that tells the processor this is a return from a function rather than an ordinary jump, so the CPU can handle it more efficiently.
Because the function is so simple, GCC produces exactly the same code with optimization enabled.
MIPS

There is a very important concept in the MIPS architecture called the “Global Pointer” (GP).
As we know, every instruction in MIPS is 32 bits in size, and that means it’s impossible to include a full 32-bit address inside a single instruction.
That’s why we have to use two instructions to load the full address (just like what GCC did in the previous example when loading the address of a text string).
But we can load data from an address within the range from register - 32768 up to register + 32767 using only one instruction.
That’s because 16 bits of signed offset can be encoded inside a single instruction.
So, we can dedicate a special register for this purpose, and also allocate a 64KiB space that contains the most frequently used data.
This register is called the “Global Pointer”, and it points exactly to the middle of that 64KiB space.
That space usually contains global variables and addresses of ready-made functions like printf().
That’s because the GCC developers decided that getting the address of any function should take only one instruction instead of two.
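The reach of a single GP-relative load can be checked with a short C sketch. The function names and the example region address are hypothetical. The only facts used are the ones above: a signed 16-bit offset, and GP pointing at the middle of the 64 KiB area.

```c
#include <stdint.h>

/* GP points to the middle of the 64 KiB region. */
static uint32_t gp_for_region(uint32_t region_start)
{
    return region_start + 0x8000u;
}

/* Can addr be reached from gp with one signed 16-bit offset? */
static int gp_reachable(uint32_t gp, uint32_t addr)
{
    int64_t delta = (int64_t)addr - (int64_t)gp;
    return delta >= -32768 && delta <= 32767;
}
```

With a region starting at the made-up address 0x10000000, GP is 0x10008000, and every address from 0x10000000 through 0x1000FFFF is reachable with one instruction, while 0x10010000 is just out of range.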
In ELF files (which are the executable file format in Linux), this 64KiB space is divided into two sections:
- .sbss (“small BSS”) for uninitialized data
- .sdata (“small data”) for initialized data
This means the programmer can choose which data should be accessed quickly and place it either in .sdata or .sbss as needed.
Some old programmers might remember the MS-DOS system, which divided the entire memory into 64KiB blocks — it’s basically the same idea here.
This concept isn’t exclusive to MIPS; at least the PowerPC architecture also used this same technique.
Optimizing GCC
Now let’s look at this example that explains the idea called the “Global Pointer” in the MIPS architecture.
Listing 1.32: Optimizing GCC 4.4.5 (assembly output)
Non-optimizing GCC
The Non-optimizing GCC version tends to be more verbose in its assembly output.
Listing 1.34: Non-optimizing GCC 4.4.5 (assembly output)
The code that produces these listings is evidently not a priority for GCC users, which is probably why it still has a few cosmetic glitches that nobody has fixed.
Here, we can see that the FP register is used as the stack frame pointer.
We can also see three NOPs (do-nothing instructions); the second and third follow branch instructions.
GCC possibly always inserts a NOP after each branch instruction (because of branch delay slots) and then removes them when optimization is enabled. Here, without optimization, they were left in place.
Listing 1.35: Non-optimizing GCC 4.4.5 (IDA)
The interesting part here is that IDA recognized that the two instructions LUI and ADDIU together actually perform a Load Address operation, so it combined them into a single pseudo-instruction called LA.
This isn’t a real instruction in MIPS, but a pseudo-instruction representing both combined.
The LA instruction occupies 8 bytes, since it is actually composed of two real instructions.
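How a 32-bit address is split between the two real instructions can be sketched in C. The function names are hypothetical. One detail assumed here, which the text does not spell out, is that ADDIU sign-extends its 16-bit immediate, so the upper half has to be rounded up whenever the lower half has its top bit set.

```c
#include <stdint.h>

/* Upper half for LUI, pre-adjusted for the sign-extended lower half. */
static uint16_t la_hi(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* Lower half for ADDIU, as the signed value the CPU will actually add. */
static int32_t la_lo(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xFFFFu);
    return (lo ^ 0x8000) - 0x8000;      /* sign-extend 16 -> 32 bits */
}

/* What the LUI + ADDIU pair computes. */
static uint32_t lui_addiu(uint16_t hi, int32_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)lo;
}
```

For the made-up address 0x0040A123, the lower half is negative when sign-extended, so LUI loads 0x41 rather than 0x40, and the pair still adds up to 0x0040A123.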
Also, IDA doesn’t display NOPs as NOP, so they appear as OR $AT, $ZERO.
That means it’s performing an OR between register $AT and zero — so the result remains unchanged, effectively doing nothing.
Like several other architectures, MIPS doesn’t have a real NOP instruction, so it uses an OR operation like this as a replacement.
The Role of the Stack Frame in This Example
The text string’s address is passed through a register — so why do we even need a local stack?
The reason is that the values of RA and GP must be stored somewhere before calling printf() (or puts() in this case), because the call might modify them.
So, we use the stack to save them.
If this function were a leaf function (meaning it doesn’t call any other functions), we could skip the entire prologue and epilogue altogether.
Optimizing GCC: load it into GDB
Listing 1.36: sample GDB session
In this example, we run the code produced by optimizing GCC inside the GDB debugger and follow its execution step by step.
We start compiling with:
This compiles the file hw.c with the highest optimization level (-O3).
Then we run the debugger:
And we set a breakpoint at the beginning of main:
After the program starts, we look at the assembler code (the code generated by the compiler).
disas (or disassemble) prints the machine code generated for the function main.
This part of the code:
This is what calls the printf() (or puts() depending on the program) function.
The register a0 holds the address of the string "hello, world".
This is confirmed when we type:
Here, GDB displays the text stored at the address held in the a0 register.
1.5.5 Summary:
The main difference between x86/ARM and x64/ARM64 code is that the pointer to the string is now 64 bits wide instead of 32.
Modern processors are 64-bit because memory has become cheaper and applications demand more of it.
As a result, computers can now hold far more memory than 32-bit pointers are able to address.
Therefore, all pointers are now 64-bit.
