Post

Reverse Engineering for Beginners : More about strings

Reverse Engineering for Beginners : More about strings

1.23.1 strlen()

strings

Let us talk about loops again. Very often, the strlen() function is implemented using a while() statement.

If you do not know what strlen() does, it simply walks character by character until it finds a zero (NULL terminator), stops, and returns the number of characters before it.

Here is an example:

Imagine the string as a row of characters in memory:

Text

h  e  l  l  o  !  \0

  

The last character is always a NULL character:

C

\0 = 0

  

Back to our topic.

This is what happens in the standard MSVC libraries:

C

int my_strlen(const char * str)
{
    const char *eos = str;
    while( *eos++ ) ;
    return( eos - str -1 );
}

int main()
{
    // test
    return my_strlen("hello!");
};

  

x86: Non-optimizing MSVC

Let us compile:

Assembly

_eos$ = -4                               ; local variable eos
_str$ = 8                                ; parameter str
_strlen PROC
    push    ebp
    mov     ebp, esp
    push    ecx                          ; allocate space for eos

    mov     eax, DWORD PTR _str$[ebp]    ; load pointer to string into EAX
    mov     DWORD PTR _eos$[ebp], eax     ; store in eos

$LN2@strlen_:
    mov     ecx, DWORD PTR _eos$[ebp]    ; ECX = eos

    ; load 8-bit byte from address in ECX and sign-extend to 32-bit into EDX
    movsx   edx, BYTE PTR [ecx]

    mov     eax, DWORD PTR _eos$[ebp]    ; EAX = eos
    add     eax, 1                       ; increment EAX
    mov     DWORD PTR _eos$[ebp], eax    ; store back to eos

    test    edx, edx                     ; is EDX zero?
    je      SHORT $LN1@strlen_           ; yes, exit loop

    jmp     SHORT $LN2@strlen_           ; continue loop

$LN1@strlen_:
    ; calculate difference between pointers
    mov     eax, DWORD PTR _eos$[ebp]
    sub     eax, DWORD PTR _str$[ebp]
    sub     eax, 1                       ; subtract 1 and return result

    mov     esp, ebp
    pop     ebp
    ret     0
_strlen ENDP

  

We encountered two new instructions here: MOVSX and TEST.

The first β€” MOVSX β€” takes a byte from a memory address and stores the value in a 32-bit register.

MOVSX means MOV with Sign-Extend.

MOVSX sets the rest of the bits, from bit 8 to bit 31, to 1 if the loaded byte was negative, or to 0 if positive. This is the reason.

By default, the char type is signed in MSVC and GCC.

If we have two values, one of type char and the other of type int (also signed), and the first contains -2 (represented as 0xFE), and we simply copy this byte into int, it will become 0x000000FE, which from the perspective of signed int equals 254, not -2. In signed int, -2 is represented as 0xFFFFFFFE.

So if we need to transfer 0xFE from a variable of type char to a variable of type int, we need to identify its sign and extend it.

This is what MOVSX does. It is hard to say whether the compiler needed to place the char variable in EDX; it could have taken just the 8-bit part of the register (like DL). It is obvious that the register allocator in the compiler works this way.

Then we see TEST EDX, EDX. You can read more about the TEST

Here this instruction simply checks whether the value in EDX equals zero or not.

Non-optimizing GCC

Let us try GCC 4.4.1:

Assembly

public strlen
strlen proc near

eos   = dword ptr -4
arg_0 = dword ptr  8

    push    ebp
    mov     ebp, esp
    sub     esp, 10h

    mov     eax, [ebp+arg_0]
    mov     [ebp+eos], eax

loc_80483F0:
    mov     eax, [ebp+eos]
    movzx   eax, byte ptr [eax]          ; zero-extend byte to 32-bit
    test    al, al
    setnz   al                           ; set AL to 1 if not zero
    add     [ebp+eos], 1
    test    al, al
    jnz     short loc_80483F0

    mov     edx, [ebp+eos]
    mov     eax, [ebp+arg_0]
    mov     ecx, edx
    sub     ecx, eax
    mov     eax, ecx
    sub     eax, 1

    leave
    retn
strlen endp

  

The result is almost the same as in MSVC, but here we see MOVZX instead of MOVSX.

MOVZX means MOV with Zero-Extend.

This instruction copies an 8-bit or 16-bit value into a 32-bit register and sets the rest of the bits to 0.

In fact, this instruction is useful only to replace a pair of instructions like this:

Assembly

xor eax, eax
mov al, [...]

  

On the other hand, it is obvious that the compiler could generate code like this:

Assembly

mov  al, byte ptr [eax]
test al, al

  

This is almost the same thing, but the upper bits in the EAX register would contain random noise.

The author said this is a flaw in the compiler β€” it is unable to generate clearer code. Honestly, the compiler is not obligated to generate code understandable to humans at all.

The new instruction for us next is SETNZ.

Here, if AL is not zero, test al, al sets the ZF flag to 0, but SETNZ, if ZF==0 (NZ means not zero), sets AL = 1.

In simple terms, if AL is not zero, jump to loc_80483F0.

The compiler generated somewhat redundant code, but let us not forget that optimizations are turned off.

Optimizing MSVC

Now let us compile all this in MSVC 2012 with optimizations turned on (/Ox):

Assembly

_str$ = 8
_strlen PROC
    mov     edx, DWORD PTR _str$[esp-4]  ; EDX = pointer to string
    mov     eax, edx                     ; EAX = EDX (copy pointer)

$LL2@strlen:
    mov     cl, BYTE PTR [eax]           ; CL = current byte
    inc     eax                          ; increment pointer
    test    cl, cl                       ; is current byte zero?
    jne     SHORT $LL2@strlen            ; no, continue loop

    sub     eax, edx                     ; calculate difference between pointers
    dec     eax                          ; decrement result
    ret     0
_strlen ENDP

  

Now everything is simpler.

No need to say that the compiler can use registers with such efficiency only in small functions with few local variables.

INC/DEC are increment/decrement instructions, meaning in other words: add or subtract 1 from a variable.

Optimizing MSVC + OllyDbg

First, we compile the C code, then open the EXE file in x32dbg and go to main.

Here we see that x32dbg detected the loop and for convenience wrapped its instructions in brackets.

If we right-click on EAX, we can choose "Follow in Dump", and the memory window will scroll to the correct place.

x32dbg dump showing 'hello!' string with NULL terminator

Here we can see the "hello!" string in memory. There is at least one zero byte after it, followed by random garbage. If x32dbg sees a register containing a valid address pointing to a string, it displays it as a string.

Let us press F8 (step over) a few times until we reach the beginning of the loop body:

x32dbg at loop start, EAX points to second character

We see that EAX contains the address of the second character in the string.

Then we continue pressing F8 a few more times until we exit the loop:

x32dbg after loop, EAX points after NULL terminator

Here we see that EAX now contains the address of the zero byte after the string plus 1 (because INC EAX was executed regardless of whether we exit the loop or not). At the same time, EDX did not change, so it still points to the beginning of the string.

The difference between these addresses is calculated now.

Here the SUB instruction has just been executed:

x32dbg after loop, EAX points after NULL terminator

The difference between the pointers is now in the EAX register β€” 7. Indeed, the length of the "hello!" string is 6, but with the zero byte it is 7. But strlen() must return the number of non-zero characters in the string.

So a decrement occurs, and then the function returns.

Optimizing GCC

Let us see GCC 4.4.1 with optimizations turned on (-O3):

Assembly

public strlen
strlen proc near

arg_0 = dword ptr  8

    push    ebp
    mov     ebp, esp
    mov     ecx, [ebp+arg_0]             ; ECX = input pointer (str)
    mov     eax, ecx                     ; EAX = ECX

loc_8048418:
    movzx   edx, byte ptr [eax]          ; zero-extend current byte to EDX
    add     eax, 1                       ; increment pointer
    test    dl, dl                       ; is current byte zero?
    jnz     short loc_8048418            ; no, continue loop

    not     ecx                          ; invert all bits in ECX
    add     eax, ecx                     ; EAX = EAX + ECX
    pop     ebp
    retn
strlen endp

  

Here GCC is similar to MSVC, except for the presence of MOVZX.

But here MOVZX could be replaced with:

Assembly

mov dl, byte ptr [eax]

  

Perhaps this is simpler for GCC's code generator to remember that the entire 32-bit EDX register is reserved for the char variable, thus ensuring the upper bits have no "noise" at any point.

Next we see a new instruction β€” NOT. This instruction inverts all bits in the operand.

You could say it is equivalent to the instruction:

Assembly

XOR ECX, 0ffffffffh

  

The NOT instruction and the following ADD calculate the difference between the pointers and subtract 1, but in a different way.

Initially, ECX containing the str pointer is inverted and 1 is subtracted from it.

In other words, at the very end of the function right after the loop body, these operations are executed:

C

ecx=str;
eax=eos;
ecx=(-ecx)-1;
eax=eax+ecx;
return eax;

  

... which is functionally equivalent to:

C

ecx=str;
eax=eos;
eax=eax-ecx;
eax=eax-1;
return eax;

  

The author asked here why GCC decided this would be better?

His answer was that it is hard to guess. But perhaps both are equivalent in terms of efficiency.

ARM: 32-bit ARM

Non-optimizing Xcode 4.6.3 (LLVM) (ARM mode)

Assembly

_strlen
eos = -8
str = -4

    SUB     SP, SP, #8                   ; allocate 8 bytes for locals
    STR     R0, [SP,#8+str]              ; store input pointer in str
    LDR     R0, [SP,#8+str]
    STR     R0, [SP,#8+eos]              ; copy to eos

loc_2CB8:                                    ; loop body start
    LDR     R0, [SP,#8+eos]              ; load eos
    ADD     R1, R0, #1                   ; increment
    STR     R1, [SP,#8+eos]              ; store back
    LDRSB   R0, [R0]                     ; load signed byte and sign-extend
    CMP     R0, #0
    BEQ     loc_2CD4                     ; exit if zero
    B       loc_2CB8                     ; continue loop

loc_2CD4:
    LDR     R0, [SP,#8+eos]              ; load eos
    LDR     R1, [SP,#8+str]              ; load str
    SUB     R0, R0, R1                   ; R0 = eos - str
    SUB     R0, R0, #1                   ; R0 -= 1
    ADD     SP, SP, #8                   ; deallocate
    BX      LR                           ; return

  

Non-optimizing LLVM generates a lot of redundant code, but here we can see how the function works with local variables on the stack.

There are only two local variables in our function: eos and str. In this listing from IDA, we manually renamed var_8 and var_4 to eos and str.

The first instructions simply save the input value in both str and eos.

The loop body starts at label loc_2CB8.

The first 3 instructions in the loop body (LDR, ADD, STR) load the value of eos into R0. Then the value is incremented and stored back in eos on the stack.

The next instruction LDRSB R0, [R0] ("Load Register Signed Byte") loads a byte from memory at the address in R0 and sign-extends it to 32-bit. This is similar to the MOVSX instruction in x86.

The compiler treats this byte as signed because the char type is signed according to the C standard. This was written about before regarding x86.

It must be said that it is impossible to use an 8-bit or 16-bit part of a 32-bit register in ARM separately from the whole register, as is done in x86.

It is obvious that this is because x86 has a huge history of backward compatibility with its ancestors down to 16-bit 8086 and even 8-bit 8080, but ARM was created from scratch as a 32-bit RISC processor.

Therefore, to process individual bytes in ARM, 32-bit registers must still be used.

So LDRSB loads bytes from the string into R0 one by one.

The following CMP and BEQ instructions check whether the loaded byte equals 0 or not.

If not 0, control returns to the beginning of the loop body. If 0, the loop ends.

At the end of the function, the difference between eos and str is calculated, 1 is subtracted from it, and the resulting value is returned via R0.

Note: no registers are saved in this function.

This is because in the ARM calling convention, registers R0-R3 are called "scratch registers", designed for passing arguments, and we are not required to restore their values when the function ends, because the calling function will not use them again. Therefore, we can use them for anything we want.

No other registers were used here, so nothing is saved on the stack.

Therefore, control can return to the calling function with a simple jump (BX) to the address in the LR register.

Optimizing Xcode 4.6.3 (LLVM) (Thumb mode)

Assembly

_strlen
    MOV     R1, R0                       ; R1 = input pointer (eos)

loc_2DF6:
    LDRB.W  R2, [R1],#1                  ; load byte and increment R1
    CMP     R2, #0
    BNE     loc_2DF6                     ; continue if not zero
    MVNS    R0, R0                       ; invert bits (like NOT)
    ADD     R0, R1                       ; R0 = R1 + R0
    BX      LR                           ; return

  

As optimizing LLVM concluded, eos and str do not need stack space and can always be in registers.

Before the loop body starts, str is always in R0, and eos in R1.

The instruction LDRB.W R2, [R1],#1 loads a byte from memory at the address in R1 into R2, and sign-extends it to a 32-bit value, but not only that.

The #1 at the end of the instruction means "Post-indexed addressing", meaning 1 will be added to R1 after the byte is loaded.

Then we see CMP and BNE in the loop body, these instructions continue the loop until a 0 is found in the string.

MVNS (inverts all bits, like NOT in x86) and the ADD instruction calculate eos βˆ’ str βˆ’ 1.

In fact, these instructions calculate R0 = str + eos, which is functionally equivalent to what was in the source code

It is obvious that LLVM, like GCC, concluded that this code can be shorter (or faster).

Optimizing Keil 6/2013 (ARM mode)

Assembly

_strlen
    MOV     R1, R0                       ; R1 = input pointer (eos)

loc_2C8:
    LDRB    R2, [R1],#1                  ; load byte and increment R1
    CMP     R2, #0
    SUBEQ   R0, R1, R0                   ; if zero, R0 = R1 - R0
    SUBEQ   R0, R0, #1                   ; R0 -= 1
    BNE     loc_2C8                      ; continue if not zero
    BX      LR                           ; return

  

Almost the same as what we saw before, with the difference that the expression str βˆ’ eos βˆ’ 1 can be calculated not at the end of the function, but inside the loop body itself.

The suffix -EQ as we remember means that the instruction executes only if the operands in the previous CMP were equal.

Thus, if R0 contains 0, the two SUBEQ instructions execute and the result remains in register R0.

ARM64

Optimizing GCC (Linaro) 4.9

Assembly

my_strlen:
    mov     x1, x0                       ; X1 = input pointer (eos, cursor)

.L58:
    ldrb    w2, [x1],1                   ; load byte and increment X1

    cbnz    w2, .L58                     ; continue if not zero

    sub     x0, x1, x0                   ; calculate difference
    sub     w0, w0, #1                   ; decrement lower 32 bits
    ret

  

This algorithm we saw before: find zero byte, calculate difference between pointers, subtract 1 from result. Some comments added by the book author.

It is worth noting that our example is slightly wrong:

my_strlen() returns a 32-bit int, but it should return size_t or another 64-bit type.

The reason is that theoretically strlen() can be called on huge memory blocks larger than 4GB, so it must be able to return a 64-bit value on 64-bit platforms.

Due to my mistake, the last SUB instruction operates on the lower 32-bit part of the register, while the previous one operates on the entire 64-bit register (calculating the difference between pointers).

This is my mistake, and it is better to leave it as is as an example of what the code looks like in this case.

Non-optimizing GCC (Linaro) 4.9

Assembly

my_strlen:
    sub     sp, sp, #32                  ; allocate stack space

    str     x0, [sp,8]                   ; store input (str)
    ldr     x0, [sp,8]

    str     x0, [sp,24]                  ; copy to eos
    nop

.L62:
    ldr     x0, [sp,24]                  ; load eos
    add     x1, x0, 1                    ; increment
    str     x1, [sp,24]                  ; store back

    ldrb    w0, [x0]                     ; load byte

    cmp     w0, wzr                      ; compare with zero
    bne     .L62                         ; continue if not zero

    ldr     x1, [sp,24]                  ; load eos
    ldr     x0, [sp,8]                   ; load str
    sub     x0, x1, x0                   ; calculate difference
    sub     w0, w0, #1                   ; decrement lower 32 bits

    add     sp, sp, 32                   ; deallocate
    ret

  

This is longer. Variables here are moved back and forth between memory (local stack) a lot.

Same mistake here: the decrement operation is performed on the lower 32-bit part of the register.

MIPS

Assembly

my_strlen:
    move    $v1, $a0                     ; eos = str

loc_4:
    lb      $a1, 0($v1)                  ; load byte into $a1
    or      $at, $zero                   ; NOP (delay slot)

    bnez    $a1, loc_4                   ; continue if not zero

    addiu   $v1, 1                       ; increment eos (delay slot)

    nor     $v0, $zero, $a0              ; invert str bits

    jr      $ra                          ; return
    addu    $v0, $v1, $v0                ; result = eos + (-str-1) = eos - str - 1 (delay slot)

  

MIPS has no NOT instruction, but it has NOR which is OR + NOT.

This operation is used a lot in digital electronics.

For example, the Apollo Guidance Computer used in the Apollo program was built using only 5600 NOR gates.

But the NOR element is not very popular in computer programming.

Therefore, the NOT operation is implemented here as: NOR DST, $ZERO, SRC

(Inverting the bits of a signed number is the same as changing its sign and subtracting 1 from the result.)

So what NOT does here is take the value of str and turn it into βˆ’str βˆ’ 1.

The following addition operation prepares the result.

This post is licensed under CC BY 4.0 by the author.

Trending Tags