mobile wallpaper 1mobile wallpaper 2mobile wallpaper 3
2885 words
14 minutes
CH1.21 - More About Strings
2026-01-30

1.23.1 strlen()

strings

Let us talk about loops again. Very often, the strlen() function is implemented using a while() statement.

If you do not know what strlen() does, it simply walks character by character until it finds a zero (NULL terminator), stops, and returns the number of characters before it.

Here is an example:

Imagine the string as a row of characters in memory:


h  e  l  l  o  !  \0

  

The last character is always a NULL character:


\0 = 0

  

Back to our topic.

This is what happens in the standard MSVC libraries:


int my_strlen(const char * str)
{
    const char *eos = str;
    while( *eos++ ) ;
    return( eos - str -1 );
}

int main() { // test return my_strlen(“hello!”); };

x86: Non-optimizing MSVC

Let us compile:


_eos$ = -4                               ; local variable eos
_str$ = 8                                ; parameter str
_strlen PROC
    push    ebp
    mov     ebp, esp
    push    ecx                          ; allocate space for eos
mov eax, DWORD PTR _str$[ebp] ; load pointer to string into EAX
mov DWORD PTR _eos$[ebp], eax ; store in eos

LN2@strlen:movecx,DWORDPTReosLN2@strlen_: mov ecx, DWORD PTR _eos[ebp] ; ECX = eos

; load 8-bit byte from address in ECX and sign-extend to 32-bit into EDX
movsx edx, BYTE PTR [ecx]
mov eax, DWORD PTR _eos$[ebp] ; EAX = eos
add eax, 1 ; increment EAX
mov DWORD PTR _eos$[ebp], eax ; store back to eos
test edx, edx ; is EDX zero?
je SHORT $LN1@strlen_ ; yes, exit loop
jmp SHORT $LN2@strlen_ ; continue loop

LN1@strlen:;calculatedifferencebetweenpointersmoveax,DWORDPTReosLN1@strlen_: ; calculate difference between pointers mov eax, DWORD PTR _eos[ebp] sub eax, DWORD PTR _str$[ebp] sub eax, 1 ; subtract 1 and return result

mov esp, ebp
pop ebp
ret 0

_strlen ENDP

We encountered two new instructions here: MOVSX and TEST.

The first — MOVSX — takes a byte from a memory address and stores the value in a 32-bit register.

MOVSX means MOV with Sign-Extend.

MOVSX sets the rest of the bits, from bit 8 to bit 31, to 1 if the loaded byte was negative, or to 0 if positive. This is the reason.

By default, the char type is signed in MSVC and GCC.

If we have two values, one of type char and the other of type int (also signed), and the first contains -2 (represented as 0xFE), and we simply copy this byte into int, it will become 0x000000FE, which from the perspective of signed int equals 254, not -2. In signed int, -2 is represented as 0xFFFFFFFE.

So if we need to transfer 0xFE from a variable of type char to a variable of type int, we need to identify its sign and extend it.

This is what MOVSX does. It is hard to say whether the compiler needed to place the char variable in EDX; it could have taken just the 8-bit part of the register (like DL). It is obvious that the register allocator in the compiler works this way.

Then we see TEST EDX, EDX. You can read more about the TEST

Here this instruction simply checks whether the value in EDX equals zero or not.

Non-optimizing GCC

Let us try GCC 4.4.1:


public strlen
strlen proc near

eos = dword ptr -4 arg_0 = dword ptr 8

push ebp
mov ebp, esp
sub esp, 10h
mov eax, [ebp+arg_0]
mov [ebp+eos], eax

loc_80483F0: mov eax, [ebp+eos] movzx eax, byte ptr [eax] ; zero-extend byte to 32-bit test al, al setnz al ; set AL to 1 if not zero add [ebp+eos], 1 test al, al jnz short loc_80483F0

mov edx, [ebp+eos]
mov eax, [ebp+arg_0]
mov ecx, edx
sub ecx, eax
mov eax, ecx
sub eax, 1
leave
retn

strlen endp

The result is almost the same as in MSVC, but here we see MOVZX instead of MOVSX.

MOVZX means MOV with Zero-Extend.

This instruction copies an 8-bit or 16-bit value into a 32-bit register and sets the rest of the bits to 0.

In fact, this instruction is useful only to replace a pair of instructions like this:


xor eax, eax
mov al, [...]

  

On the other hand, it is obvious that the compiler could generate code like this:


mov  al, byte ptr [eax]
test al, al

  

This is almost the same thing, but the upper bits in the EAX register would contain random noise.

The author said this is a flaw in the compiler — it is unable to generate clearer code. Honestly, the compiler is not obligated to generate code understandable to humans at all.

The new instruction for us next is SETNZ.

Here, if AL is not zero, test al, al sets the ZF flag to 0, but SETNZ, if ZF==0 (NZ means not zero), sets AL = 1.

In simple terms, if AL is not zero, jump to loc_80483F0.

The compiler generated somewhat redundant code, but let us not forget that optimizations are turned off.

Optimizing MSVC

Now let us compile all this in MSVC 2012 with optimizations turned on (/Ox):


_str$ = 8
_strlen PROC
    mov     edx, DWORD PTR _str$[esp-4]  ; EDX = pointer to string
    mov     eax, edx                     ; EAX = EDX (copy pointer)

LL2@strlen:movcl,BYTEPTR[eax];CL=currentbyteinceax;incrementpointertestcl,cl;iscurrentbytezero?jneSHORTLL2@strlen: mov cl, BYTE PTR [eax] ; CL = current byte inc eax ; increment pointer test cl, cl ; is current byte zero? jne SHORT LL2@strlen ; no, continue loop

sub eax, edx ; calculate difference between pointers
dec eax ; decrement result
ret 0

_strlen ENDP

Now everything is simpler.

No need to say that the compiler can use registers with such efficiency only in small functions with few local variables.

INC/DEC are increment/decrement instructions, meaning in other words: add or subtract 1 from a variable.

Optimizing MSVC + OllyDbg

First, we compile the C code, then open the EXE file in x32dbg and go to main.

Here we see that x32dbg detected the loop and for convenience wrapped its instructions in brackets.

If we right-click on EAX, we can choose "Follow in Dump", and the memory window will scroll to the correct place.

x32dbg dump showing 'hello!' string with NULL terminator

Here we can see the "hello!" string in memory. There is at least one zero byte after it, followed by random garbage. If x32dbg sees a register containing a valid address pointing to a string, it displays it as a string.

Let us press F8 (step over) a few times until we reach the beginning of the loop body:

x32dbg at loop start, EAX points to second character

We see that EAX contains the address of the second character in the string.

Then we continue pressing F8 a few more times until we exit the loop:

x32dbg after loop, EAX points after NULL terminator

Here we see that EAX now contains the address of the zero byte after the string plus 1 (because INC EAX was executed regardless of whether we exit the loop or not). At the same time, EDX did not change, so it still points to the beginning of the string.

The difference between these addresses is calculated now.

Here the SUB instruction has just been executed:

x32dbg after loop, EAX points after NULL terminator

The difference between the pointers is now in the EAX register — 7. Indeed, the length of the "hello!" string is 6, but with the zero byte it is 7. But strlen() must return the number of non-zero characters in the string.

So a decrement occurs, and then the function returns.

Optimizing GCC

Let us see GCC 4.4.1 with optimizations turned on (-O3):


public strlen
strlen proc near

arg_0 = dword ptr 8

push ebp
mov ebp, esp
mov ecx, [ebp+arg_0] ; ECX = input pointer (str)
mov eax, ecx ; EAX = ECX

loc_8048418: movzx edx, byte ptr [eax] ; zero-extend current byte to EDX add eax, 1 ; increment pointer test dl, dl ; is current byte zero? jnz short loc_8048418 ; no, continue loop

not ecx ; invert all bits in ECX
add eax, ecx ; EAX = EAX + ECX
pop ebp
retn

strlen endp

Here GCC is similar to MSVC, except for the presence of MOVZX.

But here MOVZX could be replaced with:


mov dl, byte ptr [eax]

  

Perhaps this is simpler for GCC's code generator to remember that the entire 32-bit EDX register is reserved for the char variable, thus ensuring the upper bits have no "noise" at any point.

Next we see a new instruction — NOT. This instruction inverts all bits in the operand.

You could say it is equivalent to the instruction:


XOR ECX, 0ffffffffh

  

The NOT instruction and the following ADD calculate the difference between the pointers and subtract 1, but in a different way.

Initially, ECX containing the str pointer is inverted and 1 is subtracted from it.

In other words, at the very end of the function right after the loop body, these operations are executed:


ecx=str;
eax=eos;
ecx=(-ecx)-1;
eax=eax+ecx;
return eax;

  

... which is functionally equivalent to:


ecx=str;
eax=eos;
eax=eax-ecx;
eax=eax-1;
return eax;

  

The author asked here why GCC decided this would be better?

His answer was that it is hard to guess. But perhaps both are equivalent in terms of efficiency.

ARM: 32-bit ARM

Non-optimizing Xcode 4.6.3 (LLVM) (ARM mode)


_strlen
eos = -8
str = -4
SUB SP, SP, #8 ; allocate 8 bytes for locals
STR R0, [SP,#8+str] ; store input pointer in str
LDR R0, [SP,#8+str]
STR R0, [SP,#8+eos] ; copy to eos

loc_2CB8: ; loop body start LDR R0, [SP,#8+eos] ; load eos ADD R1, R0, #1 ; increment STR R1, [SP,#8+eos] ; store back LDRSB R0, [R0] ; load signed byte and sign-extend CMP R0, #0 BEQ loc_2CD4 ; exit if zero B loc_2CB8 ; continue loop

loc_2CD4: LDR R0, [SP,#8+eos] ; load eos LDR R1, [SP,#8+str] ; load str SUB R0, R0, R1 ; R0 = eos - str SUB R0, R0, #1 ; R0 -= 1 ADD SP, SP, #8 ; deallocate BX LR ; return

Non-optimizing LLVM generates a lot of redundant code, but here we can see how the function works with local variables on the stack.

There are only two local variables in our function: eos and str. In this listing from IDA, we manually renamed var_8 and var_4 to eos and str.

The first instructions simply save the input value in both str and eos.

The loop body starts at label loc_2CB8.

The first 3 instructions in the loop body (LDR, ADD, STR) load the value of eos into R0. Then the value is incremented and stored back in eos on the stack.

The next instruction LDRSB R0, [R0] ("Load Register Signed Byte") loads a byte from memory at the address in R0 and sign-extends it to 32-bit. This is similar to the MOVSX instruction in x86.

The compiler treats this byte as signed because the char type is signed according to the C standard. This was written about before regarding x86.

It must be said that it is impossible to use an 8-bit or 16-bit part of a 32-bit register in ARM separately from the whole register, as is done in x86.

It is obvious that this is because x86 has a huge history of backward compatibility with its ancestors down to 16-bit 8086 and even 8-bit 8080, but ARM was created from scratch as a 32-bit RISC processor.

Therefore, to process individual bytes in ARM, 32-bit registers must still be used.

So LDRSB loads bytes from the string into R0 one by one.

The following CMP and BEQ instructions check whether the loaded byte equals 0 or not.

If not 0, control returns to the beginning of the loop body. If 0, the loop ends.

At the end of the function, the difference between eos and str is calculated, 1 is subtracted from it, and the resulting value is returned via R0.

Note: no registers are saved in this function.

This is because in the ARM calling convention, registers R0-R3 are called "scratch registers", designed for passing arguments, and we are not required to restore their values when the function ends, because the calling function will not use them again. Therefore, we can use them for anything we want.

No other registers were used here, so nothing is saved on the stack.

Therefore, control can return to the calling function with a simple jump (BX) to the address in the LR register.

Optimizing Xcode 4.6.3 (LLVM) (Thumb mode)


_strlen
    MOV     R1, R0                       ; R1 = input pointer (eos)

loc_2DF6: LDRB.W R2, [R1],#1 ; load byte and increment R1 CMP R2, #0 BNE loc_2DF6 ; continue if not zero MVNS R0, R0 ; invert bits (like NOT) ADD R0, R1 ; R0 = R1 + R0 BX LR ; return

As optimizing LLVM concluded, eos and str do not need stack space and can always be in registers.

Before the loop body starts, str is always in R0, and eos in R1.

The instruction LDRB.W R2, [R1],#1 loads a byte from memory at the address in R1 into R2, and sign-extends it to a 32-bit value, but not only that.

The #1 at the end of the instruction means "Post-indexed addressing", meaning 1 will be added to R1 after the byte is loaded.

Then we see CMP and BNE in the loop body, these instructions continue the loop until a 0 is found in the string.

MVNS (inverts all bits, like NOT in x86) and the ADD instruction calculate eos − str − 1.

In fact, these instructions calculate R0 = str + eos, which is functionally equivalent to what was in the source code

It is obvious that LLVM, like GCC, concluded that this code can be shorter (or faster).

Optimizing Keil 6/2013 (ARM mode)


_strlen
    MOV     R1, R0                       ; R1 = input pointer (eos)

loc_2C8: LDRB R2, [R1],#1 ; load byte and increment R1 CMP R2, #0 SUBEQ R0, R1, R0 ; if zero, R0 = R1 - R0 SUBEQ R0, R0, #1 ; R0 -= 1 BNE loc_2C8 ; continue if not zero BX LR ; return

Almost the same as what we saw before, with the difference that the expression str − eos − 1 can be calculated not at the end of the function, but inside the loop body itself.

The suffix -EQ as we remember means that the instruction executes only if the operands in the previous CMP were equal.

Thus, if R0 contains 0, the two SUBEQ instructions execute and the result remains in register R0.

ARM64

Optimizing GCC (Linaro) 4.9


my_strlen:
    mov     x1, x0                       ; X1 = input pointer (eos, cursor)

.L58: ldrb w2, [x1],1 ; load byte and increment X1

cbnz w2, .L58 ; continue if not zero
sub x0, x1, x0 ; calculate difference
sub w0, w0, #1 ; decrement lower 32 bits
ret

This algorithm we saw before: find zero byte, calculate difference between pointers, subtract 1 from result. Some comments added by the book author.

It is worth noting that our example is slightly wrong:

my_strlen() returns a 32-bit int, but it should return size_t or another 64-bit type.

The reason is that theoretically strlen() can be called on huge memory blocks larger than 4GB, so it must be able to return a 64-bit value on 64-bit platforms.

Due to my mistake, the last SUB instruction operates on the lower 32-bit part of the register, while the previous one operates on the entire 64-bit register (calculating the difference between pointers).

This is my mistake, and it is better to leave it as is as an example of what the code looks like in this case.

Non-optimizing GCC (Linaro) 4.9


my_strlen:
    sub     sp, sp, #32                  ; allocate stack space
str x0, [sp,8] ; store input (str)
ldr x0, [sp,8]
str x0, [sp,24] ; copy to eos
nop

.L62: ldr x0, [sp,24] ; load eos add x1, x0, 1 ; increment str x1, [sp,24] ; store back

ldrb w0, [x0] ; load byte
cmp w0, wzr ; compare with zero
bne .L62 ; continue if not zero
ldr x1, [sp,24] ; load eos
ldr x0, [sp,8] ; load str
sub x0, x1, x0 ; calculate difference
sub w0, w0, #1 ; decrement lower 32 bits
add sp, sp, 32 ; deallocate
ret

This is longer. Variables here are moved back and forth between memory (local stack) a lot.

Same mistake here: the decrement operation is performed on the lower 32-bit part of the register.

MIPS


my_strlen:
    move    $v1, $a0                     ; eos = str

loc_4: lb a1,0(a1, 0(v1) ; load byte into a1ora1 or at, $zero ; NOP (delay slot)

bnez $a1, loc_4 ; continue if not zero
addiu $v1, 1 ; increment eos (delay slot)
nor $v0, $zero, $a0 ; invert str bits
jr $ra ; return
addu $v0, $v1, $v0 ; result = eos + (-str-1) = eos - str - 1 (delay slot)

MIPS has no NOT instruction, but it has NOR which is OR + NOT.

This operation is used a lot in digital electronics.

For example, the Apollo Guidance Computer used in the Apollo program was built using only 5600 NOR gates.

But the NOR element is not very popular in computer programming.

Therefore, the NOT operation is implemented here as: NOR DST, $ZERO, SRC

(Inverting the bits of a signed number is the same as changing its sign and subtracting 1 from the result.)

So what NOT does here is take the value of str and turn it into −str − 1.

The following addition operation prepares the result.

Share

If this article helped you, please share it with others!

CH1.21 - More About Strings
https://v3nn00m.github.io/posts/re4b/chapter1_23/
Author
0xV3n0m
Published at
2026-01-30

Some information may be outdated

Table of Contents