0xV3n0m


1.23.1 strlen()

Let us talk about loops again. Very often, the strlen() function is implemented using a while() statement.

If you do not know what strlen() does: it walks the string character by character until it finds a zero byte (the null terminator), then stops and returns the number of characters before it.

Here is an example:

Imagine the string as a row of characters in memory:


h  e  l  l  o  !  \0

The last character is always the null character:


\0 = 0

Back to our topic.

This is what happens in the standard MSVC libraries:


int my_strlen(const char *str)
{
    const char *eos = str;
    while( *eos++ ) ;
    return( eos - str - 1 );
}

int main()
{
    // test
    return my_strlen("hello!");
}

x86: Non-optimizing MSVC

Let us compile:


_eos$ = -4                               ; local variable eos
_str$ = 8                                ; parameter str
_strlen PROC
    push    ebp
    mov     ebp, esp
    push    ecx                          ; allocate space for eos
    mov     eax, DWORD PTR _str$[ebp]    ; load pointer to string into EAX
    mov     DWORD PTR _eos$[ebp], eax    ; store in eos
$LN2@strlen_:
    mov     ecx, DWORD PTR _eos$[ebp]    ; ECX = eos

    ; load 8-bit byte from address in ECX and sign-extend to 32-bit into EDX
    movsx   edx, BYTE PTR [ecx]
    mov     eax, DWORD PTR _eos$[ebp]    ; EAX = eos
    add     eax, 1                       ; increment EAX
    mov     DWORD PTR _eos$[ebp], eax    ; store back to eos
    test    edx, edx                     ; is EDX zero?
    je      SHORT $LN1@strlen_           ; yes, exit loop
    jmp     SHORT $LN2@strlen_           ; continue loop
$LN1@strlen_:

    ; calculate difference between pointers
    mov     eax, DWORD PTR _eos$[ebp]
    sub     eax, DWORD PTR _str$[ebp]
    sub     eax, 1                       ; subtract 1 and return result
    mov     esp, ebp
    pop     ebp
    ret     0
_strlen ENDP

We encountered two new instructions here: MOVSX and TEST.

The first — MOVSX — takes a byte from a memory address and stores the value in a 32-bit register.

MOVSX means MOV with Sign-Extend.

MOVSX sets the remaining bits, from bit 8 to bit 31, to 1 if the loaded byte is negative, or to 0 if it is positive. Here is why.

By default, the char type is signed in MSVC and GCC.

If we have two values, one of type char and the other of type int (also signed), and the first contains -2 (represented as 0xFE), and we simply copy this byte into int, it will become 0x000000FE, which from the perspective of signed int equals 254, not -2. In signed int, -2 is represented as 0xFFFFFFFE.

So if we need to transfer 0xFE from a variable of type char to a variable of type int, we need to identify its sign and extend it.

This is what MOVSX does. It is hard to say whether the compiler really needed to keep the char variable in EDX; it could have used just an 8-bit part of the register (DL, for example). Apparently, this is simply how the compiler's register allocator works.

Then we see TEST EDX, EDX. Here this instruction simply checks whether the value in EDX is zero.

Non-optimizing GCC

Let us try GCC 4.4.1:


public strlen
strlen proc near

eos   = dword ptr -4
arg_0 = dword ptr  8

    push    ebp
    mov     ebp, esp
    sub     esp, 10h

    mov     eax, [ebp+arg_0]
    mov     [ebp+eos], eax

loc_80483F0:
    mov     eax, [ebp+eos]
    movzx   eax, byte ptr [eax]          ; zero-extend byte to 32-bit
    test    al, al
    setnz   al                           ; set AL to 1 if not zero
    add     [ebp+eos], 1
    test    al, al
    jnz     short loc_80483F0
    mov     edx, [ebp+eos]
    mov     eax, [ebp+arg_0]
    mov     ecx, edx
    sub     ecx, eax
    mov     eax, ecx
    sub     eax, 1

    leave
    retn
strlen endp

The result is almost the same as in MSVC, but here we see MOVZX instead of MOVSX.

MOVZX means MOV with Zero-Extend.

This instruction copies an 8-bit or 16-bit value into a 32-bit register and sets the rest of the bits to 0.

In fact, this instruction is convenient mainly because it replaces an instruction pair like this:


xor eax, eax
mov al, [...]

On the other hand, it is obvious that the compiler could generate code like this:


mov  al, byte ptr [eax]
test al, al

This is almost the same thing, but the upper bits in the EAX register would contain random noise.

The book's author treats this as a shortcoming of the compiler: it cannot produce clearer code. Strictly speaking, though, the compiler is under no obligation to emit code that humans can understand.

The next new instruction for us is SETNZ.

Here, if AL is not zero, TEST AL, AL clears the ZF flag, and SETNZ sets AL to 1 when ZF==0 (NZ stands for "not zero"). If AL is zero, ZF is set and SETNZ sets AL to 0.

In simple terms, if AL is not zero, jump to loc_80483F0.

The compiler generated somewhat redundant code, but let us not forget that optimizations are turned off.

Optimizing MSVC

Now let us compile all this in MSVC 2012 with optimizations turned on (/Ox):


_str$ = 8
_strlen PROC
    mov     edx, DWORD PTR _str$[esp-4]  ; EDX = pointer to string
    mov     eax, edx                     ; EAX = EDX (copy pointer)
$LL2@strlen:
    mov     cl, BYTE PTR [eax]           ; CL = current byte
    inc     eax                          ; increment pointer
    test    cl, cl                       ; is current byte zero?
    jne     SHORT $LL2@strlen            ; no, continue loop
    sub     eax, edx                     ; calculate difference between pointers
    dec     eax                          ; decrement result
    ret     0
_strlen ENDP

Now everything is simpler.

Needless to say, the compiler can use registers this efficiently only in small functions with few local variables.

INC and DEC are the increment and decrement instructions; in other words, they add 1 to or subtract 1 from a variable.

Optimizing MSVC + x32dbg

First, we compile the C code, then open the EXE file in x32dbg and go to main.

Here we see that x32dbg detected the loop and for convenience wrapped its instructions in brackets.

If we right-click on EAX, we can choose "Follow in Dump", and the memory window will scroll to the correct place.

Figure: x32dbg dump showing the "hello!" string with its null terminator

Here we can see the "hello!" string in memory. There is at least one zero byte after it, followed by random garbage. If x32dbg sees a register containing a valid address pointing to a string, it displays it as a string.

Let us press F8 (step over) a few times until we reach the beginning of the loop body:

Figure: x32dbg at the loop start, EAX points to the second character

We see that EAX contains the address of the second character in the string.

Then we continue pressing F8 a few more times until we exit the loop:

Figure: x32dbg after the loop, EAX points just past the null terminator

Here we see that EAX now contains the address of the zero byte after the string plus 1 (because INC EAX was executed regardless of whether we exit the loop or not). At the same time, EDX did not change, so it still points to the beginning of the string.

The difference between these two addresses is calculated next.

Here the SUB instruction has just been executed:

The difference between the pointers, 7, is now in the EAX register. The "hello!" string is 6 characters long, or 7 counting the zero byte, but strlen() must return the number of non-zero characters in the string.

So a decrement occurs, and then the function returns.

Optimizing GCC

Let us see GCC 4.4.1 with optimizations turned on (-O3):


public strlen
strlen proc near

arg_0 = dword ptr  8

    push    ebp
    mov     ebp, esp
    mov     ecx, [ebp+arg_0]             ; ECX = input pointer (str)
    mov     eax, ecx                     ; EAX = ECX

loc_8048418:
    movzx   edx, byte ptr [eax]          ; zero-extend current byte to EDX
    add     eax, 1                       ; increment pointer
    test    dl, dl                       ; is current byte zero?
    jnz     short loc_8048418            ; no, continue loop
    not     ecx                          ; invert all bits in ECX
    add     eax, ecx                     ; EAX = EAX + ECX
    pop     ebp
    retn
strlen endp

Here GCC is similar to MSVC, except for the presence of MOVZX.

But here MOVZX could be replaced with:


mov dl, byte ptr [eax]

Perhaps it is simpler for GCC's code generator to remember that the whole 32-bit EDX register is allocated to the char variable, which guarantees that the upper bits never contain any "noise".

Next we see a new instruction — NOT. This instruction inverts all bits in the operand.

You could say it is equivalent to the instruction:


XOR ECX, 0ffffffffh

The NOT instruction and the following ADD calculate the difference between the pointers and subtract 1, but in a different way.

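First, NOT inverts ECX, which holds the str pointer. In two's complement arithmetic, inverting all bits gives the same result as negating the value and subtracting 1, so ECX becomes −str − 1.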

In other words, at the very end of the function right after the loop body, these operations are executed:


ecx=str;
eax=eos;
ecx=(-ecx)-1;
eax=eax+ecx;
return eax;

... which is functionally equivalent to:


ecx=str;
eax=eos;
eax=eax-ecx;
eax=eax-1;
return eax;

Why did GCC decide this would be better? The book's author says it is hard to guess, and that the two variants are probably equivalent in terms of efficiency.

ARM: 32-bit ARM

Non-optimizing Xcode 4.6.3 (LLVM) (ARM mode)


_strlen

eos = -8
str = -4

    SUB     SP, SP, #8                   ; allocate 8 bytes for locals
    STR     R0, [SP,#8+str]              ; store input pointer in str
    LDR     R0, [SP,#8+str]
    STR     R0, [SP,#8+eos]              ; copy to eos

loc_2CB8:                                ; loop body start
    LDR     R0, [SP,#8+eos]              ; load eos
    ADD     R1, R0, #1                   ; increment
    STR     R1, [SP,#8+eos]              ; store back
    LDRSB   R0, [R0]                     ; load signed byte and sign-extend
    CMP     R0, #0
    BEQ     loc_2CD4                     ; exit if zero
    B       loc_2CB8                     ; continue loop
loc_2CD4:
    LDR     R0, [SP,#8+eos]              ; load eos
    LDR     R1, [SP,#8+str]              ; load str
    SUB     R0, R0, R1                   ; R0 = eos - str
    SUB     R0, R0, #1                   ; R0 -= 1
    ADD     SP, SP, #8                   ; deallocate
    BX      LR                           ; return

Non-optimizing LLVM generates a lot of redundant code, but here we can see how the function works with local variables on the stack.

There are only two local variables in our function: eos and str. In this listing from IDA, we manually renamed var_8 and var_4 to eos and str.

The first instructions simply save the input value in both str and eos.

The loop body starts at label loc_2CB8.

The first 3 instructions in the loop body (LDR, ADD, STR) load the value of eos into R0. Then the value is incremented and stored back in eos on the stack.

The next instruction LDRSB R0, [R0] ("Load Register Signed Byte") loads a byte from memory at the address in R0 and sign-extends it to 32-bit. This is similar to the MOVSX instruction in x86.

The compiler treats this byte as signed because the char type is signed for this compiler and target. The C standard itself leaves the signedness of plain char up to the implementation. This was written about before regarding x86.

It must be said that it is impossible to use an 8-bit or 16-bit part of a 32-bit register in ARM separately from the whole register, as is done in x86.

Apparently, this is because x86 has a long history of backward compatibility with its ancestors, down to the 16-bit 8086 and even the 8-bit 8080, whereas ARM was created from scratch as a 32-bit RISC processor.

Therefore, to process individual bytes in ARM, 32-bit registers must still be used.

So LDRSB loads bytes from the string into R0 one by one.

The following CMP and BEQ instructions check whether the loaded byte equals 0 or not.

If not 0, control returns to the beginning of the loop body. If 0, the loop ends.

At the end of the function, the difference between eos and str is calculated, 1 is subtracted from it, and the resulting value is returned via R0.

Note: no registers are saved in this function.

This is because in the ARM calling convention, registers R0-R3 are called "scratch registers", designed for passing arguments, and we are not required to restore their values when the function ends, because the calling function will not use them again. Therefore, we can use them for anything we want.

No other registers were used here, so nothing is saved on the stack.

Therefore, control can return to the calling function with a simple jump (BX) to the address in the LR register.

Optimizing Xcode 4.6.3 (LLVM) (Thumb mode)


_strlen
    MOV     R1, R0                       ; R1 = input pointer (eos)
loc_2DF6:
    LDRB.W  R2, [R1],#1                  ; load byte and increment R1
    CMP     R2, #0
    BNE     loc_2DF6                     ; continue if not zero
    MVNS    R0, R0                       ; invert bits (like NOT)
    ADD     R0, R1                       ; R0 = R1 + R0
    BX      LR                           ; return

Optimizing LLVM concludes that eos and str do not need stack space and can always be kept in registers.

Before the loop body starts, str is always in R0, and eos in R1.

The instruction LDRB.W R2, [R1],#1 loads a byte from memory at the address in R1 into R2 and zero-extends it to a 32-bit value (LDRB is the unsigned counterpart of LDRSB), but that is not all it does.

The #1 at the end of the instruction means "Post-indexed addressing", meaning 1 will be added to R1 after the byte is loaded.

Then we see CMP and BNE in the loop body. These instructions continue the loop until a 0 is found in the string.

MVNS (inverts all bits, like NOT in x86) and the ADD instruction calculate eos − str − 1.

In fact, these instructions calculate R0 = ~str + eos (where ~ is bitwise NOT), which is functionally equivalent to what was in the source code.

Apparently LLVM, like GCC, concluded that this code can be shorter (or faster).

Optimizing Keil 6/2013 (ARM mode)


_strlen
    MOV     R1, R0                       ; R1 = input pointer (eos)
loc_2C8:
    LDRB    R2, [R1],#1                  ; load byte and increment R1
    CMP     R2, #0
    SUBEQ   R0, R1, R0                   ; if zero, R0 = R1 - R0
    SUBEQ   R0, R0, #1                   ; R0 -= 1
    BNE     loc_2C8                      ; continue if not zero
    BX      LR                           ; return

This is almost the same as what we saw before, except that the expression eos − str − 1 is calculated not at the end of the function but inside the loop body itself.

The -EQ suffix, as we remember, means that the instruction executes only if the operands of the previous CMP were equal.

Thus, if R2 contains 0, both SUBEQ instructions execute and the result is left in register R0.

ARM64

Optimizing GCC (Linaro) 4.9


my_strlen:
    mov     x1, x0                       ; X1 = input pointer (eos, cursor)
.L58:
    ldrb    w2, [x1],1                   ; load byte and increment X1
    cbnz    w2, .L58                     ; continue if not zero

    sub     x0, x1, x0                   ; calculate difference
    sub     w0, w0, #1                   ; decrement lower 32 bits
    ret

The algorithm is the same as before: find the zero byte, calculate the difference between the pointers, and subtract 1 from the result. Some of the comments were added by the book's author.

It is worth noting that our example is slightly wrong:

my_strlen() returns a 32-bit int, but it should return size_t or another 64-bit type.

The reason is that theoretically strlen() can be called on huge memory blocks larger than 4GB, so it must be able to return a 64-bit value on 64-bit platforms.

Because of this mistake, the last SUB instruction operates on the lower 32-bit part of the register, while the one before it operates on the whole 64-bit register (it calculates the difference between the pointers).

The book's author acknowledges the mistake as his own and leaves it in as an example of what the code looks like in this case.

Non-optimizing GCC (Linaro) 4.9


my_strlen:
    sub     sp, sp, #32                  ; allocate stack space
    str     x0, [sp,8]                   ; store input (str)
    ldr     x0, [sp,8]

    str     x0, [sp,24]                  ; copy to eos
    nop
.L62:
    ldr     x0, [sp,24]                  ; load eos
    add     x1, x0, 1                    ; increment
    str     x1, [sp,24]                  ; store back
    ldrb    w0, [x0]                     ; load byte

    cmp     w0, wzr                      ; compare with zero
    bne     .L62                         ; continue if not zero

    ldr     x1, [sp,24]                  ; load eos
    ldr     x0, [sp,8]                   ; load str
    sub     x0, x1, x0                   ; calculate difference
    sub     w0, w0, #1                   ; decrement lower 32 bits

    add     sp, sp, 32                   ; deallocate
    ret

This version is longer. The variables are moved back and forth between registers and memory (the local stack) a lot.

Same mistake here: the decrement operation is performed on the lower 32-bit part of the register.

MIPS


my_strlen:
    move    $v1, $a0                     ; eos = str
loc_4:
    lb      $a1, 0($v1)                  ; load byte into $a1
    or      $at, $zero                   ; NOP (delay slot)
    bnez    $a1, loc_4                   ; continue if not zero

    addiu   $v1, 1                       ; increment eos (delay slot)

    nor     $v0, $zero, $a0              ; invert str bits

    jr      $ra                          ; return
    addu    $v0, $v1, $v0                ; result = eos + (-str-1) = eos - str - 1 (delay slot)

MIPS has no NOT instruction, but it has NOR which is OR + NOT.

This operation is used a lot in digital electronics.

For example, the Apollo Guidance Computer used in the Apollo program was built using only 5600 NOR gates.

But the NOR element is not very popular in computer programming.

Therefore, the NOT operation is implemented here as: NOR DST, $ZERO, SRC

(Inverting the bits of a signed number is the same as changing its sign and subtracting 1 from the result.)

So what NOT does here is take the value of str and turn it into −str − 1.

The following addition operation prepares the result.


CH1.21 - More About Strings

https://v3nn00m.github.io/posts/re4b/chapter1_23/

Author

0xV3n0m

Published at

2026-01-30

License

0xV3n0m's Personal Blog License
