[Reading glibc] The strlen implementation

辛昕


Last edited by 辛昕 on 2018-1-8 00:43

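A few words up front:
      Neither the implementation of strlen nor, for that matter, the implementation of glibc has much practical value for me. I simply realized that if I really want to learn to read code, I have to actually try reading code.
      I chose glibc because it is a general-purpose standard library and is closely tied to the C I have been using for years.
      For people who already know these techniques well (x86 assembly and gcc experts), the content of this thread is too elementary to be worth mentioning.
      Still, I believe that for most beginners like me, who know next to nothing about this, the process itself is a useful reference. Writing it down is also preparation for gradually building up the ability and experience to read larger code bases. I hope it doesn't spoil anyone's mood or appetite.

      My earlier thread analyzing the glibc implementation of strlen had been abandoned for a long time. Over the weekend I used what little spare time I had left to try reading it again.
I still don't fully understand it, but I'll make a record first.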

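      Environment:
      1. The glibc source version I am using is 2.14.
      2. I am reading it with Source Insight. I mention this because some of the terms in what follows come from that tool; without knowing that, the naming may look inconsistent and get in the way of understanding.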

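      How I am reading:
      Since I still don't understand how the glibc source tree is organized, I can't yet build and run even a small test program, so I am analyzing purely from the dependency relationships in the code.
My starting point is to right-click strlen() and choose go to definition......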

辛昕

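The search turned up four definitions in total. I have to say, for someone reading a fairly large library for the first time, that was an intimidating start.

Gritting my teeth and reading on: two of them are in .c source files and two are in .h header files, so I assumed the latter two were macro definitions. On closer inspection that is not the whole story; they are inline functions.

I see inline functions a lot in C++, but rarely in C, because in practice they often can't be used with IAR or MDK.

Enough chatter. The next four posts list the definitions one by one.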

辛昕

Last edited by 辛昕 on 2018-1-8 00:26

// I made a few very small adjustments to match the code formatting I prefer, without changing a single letter.

// 1 strlen - Function in Strlen.c (sysdeps\i386) at line 23 (13 lines)
#include <string.h>
size_t strlen (const char *str)
{
int cnt;
asm( "cld\n" /* Search forward. */
/* Some old versions of gas need `repne' instead of `repnz'. */
"repnz\n" /* Look for a zero byte. */
"scasb" /* %0, %1, %3 */ :
"=c" (cnt) : "D" (str), "0" (-1), "a" (0));
return -2 - cnt;
}
libc_hidden_builtin_def (strlen)
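// This code is clearly the x86-specific implementation. It is a pile of asm that I can't read and don't much care about; what I do care about is what the last line means.
// I am also curious about what the other implementations are for.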
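
For readers wondering where the odd-looking `-2 - cnt` comes from, here is a minimal sketch in portable C that models what `repnz scasb` does to the counter register. The names `scan_for_nul` and `strlen_model` are made up for this illustration and are not part of glibc; the model assumes the usual x86 behavior that the counter drops by one for every byte compared, including the terminating zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of "cld; repnz scasb" with ECX = -1, AL = 0, EDI = str.
   Each iteration compares one byte with AL and decrements ECX;
   the loop ends after the iteration that found the match. */
static int32_t scan_for_nul(const char *str)
{
    uint32_t ecx = UINT32_MAX;              /* the "0" (-1) input operand */
    const unsigned char *edi = (const unsigned char *) str;

    while (ecx != 0) {
        unsigned char byte = *edi++;        /* scasb: compare, then advance EDI */
        ecx--;                              /* rep prefix: one iteration used up */
        if (byte == 0)                      /* repnz: stop once the byte equals AL */
            break;
    }
    /* Reinterpret as signed, as the "=c" (cnt) output does; this relies on
       the two's-complement conversion that common compilers provide. */
    return (int32_t) ecx;
}

/* A string of length n costs n + 1 comparisons, so cnt = -1 - (n + 1),
   which rearranges to n = -2 - cnt. */
size_t strlen_model(const char *str)
{
    int32_t cnt = scan_for_nul(str);
    return (size_t) (-2 - cnt);
}
```

For example, with "abc" the loop runs four times (three letters plus the terminator), the counter ends at -5, and -2 - (-5) gives 3.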


辛昕

Last edited by 辛昕 on 2018-1-8 00:31

// *This file is string/strlen.c under the root of the source tree
//2 strlen - Function in Strlen.c (string) at line 29 (78 lines)
#include <string.h>
#include <stdlib.h>
#undef strlen // Note 1
/* Return the length of the null-terminated string STR. Scan for
the null terminator quickly by testing four bytes at a time. */
size_t strlen (str) const char *str;
{
const char *char_ptr;
const unsigned long int *longword_ptr;
unsigned long int longword, himagic, lomagic;
/* Handle the first few characters by reading one character at a time.
Do this until CHAR_PTR is aligned on a longword boundary. */
for (char_ptr = str; ((unsigned long int) char_ptr
& (sizeof (longword) - 1)) != 0;
++char_ptr)
if (*char_ptr == '\0')
return char_ptr - str;
/* All these elucidatory comments refer to 4-byte longwords,
but the theory applies equally well to 8-byte longwords. */
longword_ptr = (unsigned long int *) char_ptr;
/* Bits 31, 24, 16, and 8 of this number are zero. Call these bits
the "holes." Note that there is a hole just to the left of
each byte, with an extra at the end:
bits: 01111110 11111110 11111110 11111111
bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
The 1-bits make sure that carries propagate to the next 0-bit.
The 0-bits provide holes for carries to fall into. */
himagic = 0x80808080L;
lomagic = 0x01010101L;
if (sizeof (longword) > 4)
{
/* 64-bit version of the magic. */
/* Do the shift in two steps to avoid a warning if long has 32 bits. */
himagic = ((himagic << 16) << 16) | himagic;
lomagic = ((lomagic << 16) << 16) | lomagic;
}
if (sizeof (longword) > 8)
abort ();
/* Instead of the traditional loop which tests each character,
we will test a longword at a time. The tricky part is testing
if *any of the four* bytes in the longword in question are zero. */
for (;;)
{
longword = *longword_ptr++;
if (((longword - lomagic) & ~longword & himagic) != 0)
{
/* Which of the bytes was the zero? If none of them were, it was
a misfire; continue the search. */
const char *cp = (const char *) (longword_ptr - 1);
if (cp[0] == 0)
return cp - str;
if (cp[1] == 0)
return cp - str + 1;
if (cp[2] == 0)
return cp - str + 2;
if (cp[3] == 0)
return cp - str + 3;
if (sizeof (longword) > 4)
{
if (cp[4] == 0)
return cp - str + 4;
if (cp[5] == 0)
return cp - str + 5;
if (cp[6] == 0)
return cp - str + 6;
if (cp[7] == 0)
return cp - str + 7;
}
}
}
}
libc_hidden_builtin_def (strlen)
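// This looks like completely platform-independent code, so it can probably be regarded as the end point of our trace.
// The one thing I am curious about, and I expect you are too: strlen() can be implemented in a handful of lines, so why is it so complicated here?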
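
Part of the answer is in the code's own comments: it tests a whole longword at a time instead of one character at a time. The core of that is the expression `(longword - lomagic) & ~longword & himagic`. Below is a minimal sketch of just that test, fixed to 32-bit words; the names `LOMAGIC`, `HIMAGIC` and `has_zero_byte` are chosen for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define LOMAGIC UINT32_C(0x01010101)   /* lowest bit of every byte  */
#define HIMAGIC UINT32_C(0x80808080)   /* highest bit of every byte */

/* Returns 1 when at least one of the four bytes of w is 0x00.
   Subtracting 1 from a zero byte borrows and sets its top bit;
   "& ~w" discards bytes whose top bit was already set in w. */
static int has_zero_byte(uint32_t w)
{
    return ((w - LOMAGIC) & ~w & HIMAGIC) != 0;
}

/* A few hand-checked cases. */
static void demo_has_zero_byte(void)
{
    assert(has_zero_byte(UINT32_C(0x61626364)) == 0);  /* "abcd": no zero byte  */
    assert(has_zero_byte(UINT32_C(0x61620064)) == 1);  /* one zero byte inside  */
    assert(has_zero_byte(UINT32_C(0x80808080)) == 0);  /* high bits set, no zero */
    assert(has_zero_byte(UINT32_C(0x00000000)) == 1);  /* all bytes zero        */
}
```

Once the test fires, the original code falls back to checking `cp[0]` through `cp[3]` (or `cp[7]`) one by one to find which byte it was. The alignment loop at the top exists so that the word-sized reads start on a longword boundary.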
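// Also, see Note 1 (the #undef strlen line): what is that doing?
// It cancels any earlier (macro) definition of strlen, so it is a fair guess that two more definitions of strlen exist as macros (don't be fooled by the lowercase name).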
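
To see why such an `#undef` is needed before a function definition, here is a small sketch with hypothetical names (`my_len`, `my_len_g`), not the real glibc headers. It assumes a header has already defined a function-like macro with the same name as the function about to be defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inline helper provided by a header. */
static inline size_t my_len_g(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t) (p - s);
}

/* Hypothetical stand-in for a header macro that shadows the function name. */
#define my_len(s) my_len_g((s))

/* Without this #undef, the definition below would be macro-expanded into
   "size_t my_len_g((const char *s))", which does not compile. */
#undef my_len

/* The real, out-of-line function with the plain name. */
size_t my_len(const char *s)
{
    return my_len_g(s);
}
```

The same reasoning would apply to string/strlen.c: it includes <string.h>, which may have turned strlen into a macro, so the macro has to be removed before the function named strlen can be defined.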


辛昕

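Last edited by 辛昕 on 2018-1-8 00:35

//3 strlen - Macro in String.h (sysdeps\s390\bits) at line 43
// This #error guard seems to be hinting that the definition here is a substitute one, meant to be reached only through <string.h>.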

#ifndef _STRING_H
# error "Never use <bits/string.h> directly; include <string.h> instead."
#endif


// Immediately after it comes an implementation of strlen

#ifndef _FORCE_INLINES
#define strlen(str) __strlen_g ((str))
__STRING_INLINE size_t __strlen_g (__const char *) __asm__ ("strlen");
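// Explanation of this code:
// __STRING_INLINE is nothing much. A search shows it can have three possible definitions: 1. empty; 2. inline; 3. extern inline
// I don't really understand how inline functions work in C, but inline is just inline, nothing special.
// As for extern inline, I honestly don't know exactly what the extern part means, but either way it is still just inlining, and for
// this read-through I truly don't care.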


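// __asm__ is a point worth knowing. It is a gcc keyword, and my first guess was that it simply means "assembly code follows".
// What puzzled me is the ("strlen") that comes after it.
// Searching on Baidu turned up the inline-assembly syntax, which looks like this:
/*
__asm__("mov r0, #0\n");
*/
// So my first reading was that the declaration above is shorthand for
/*
inline size_t __strlen_g (__const char *)
{
__asm__ ("strlen");
}
*/
// That form is easier for me to read, but it cannot be right: I can understand a mov inside assembly,
// whereas strlen is not an instruction at all.
// The explanation is that __asm__ ("name") placed after a declaration, rather than written as a statement, is what gcc calls an asm label.
// It sets the symbol name used for the declared function in the generated assembly.
// So this line contains no assembly code: it only declares __strlen_g and says that its assembler-level name is strlen.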
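
For reference, gcc documents a form in which `__asm__ ("name")` follows a declaration instead of appearing as a statement: an asm label, which sets the assembler-level symbol name of the declared function. The sketch below illustrates that form with made-up names (`libc_len`, `measure`). It assumes GCC or Clang on an ELF target such as Linux, where the C library exports the symbol as plain `strlen`; on targets that prefix symbols with an underscore it would need adjusting.

```c
#include <assert.h>
#include <stddef.h>

/* Declaration only. The C-level name is libc_len (hypothetical), but the
   asm label makes the compiler emit references to the symbol "strlen". */
extern size_t libc_len(const char *s) __asm__("strlen");

/* Calling libc_len therefore ends up in the C library's strlen. */
size_t measure(const char *s)
{
    return libc_len(s);
}
```

Read this way, the one-line declaration in the s390 header contains no assembly at all: it gives `__strlen_g` the external name `strlen`.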


__STRING_INLINE size_t __strlen_g (__const char *__str)
{
char *__ptr, *__tmp;
__ptr = (char *) 0;
__tmp = (char *) __str;
__asm__ __volatile__ (" la 0,0\n"
"0: srst %0,%1\n"
" jo 0b\n"
: "+&a" (__ptr), "+&a" (__tmp) :
: "cc", "memory", "0" );
return (size_t) (__ptr - __str);
}
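// Nothing too special for me here: it is just a pile of assembly (s390 assembly rather than x86, judging by the sysdeps\s390 path). I can't read it, but it is evidently doing what strlen is supposed to do.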
#endif


辛昕

// Now for the last strlen macro definition
//4 strlen - Macro in String.h (sysdeps\i386\i486\bits) at line 549 (4 lines)
/* Return the length of S. */
#define _HAVE_STRING_ARCH_strlen 1
#define strlen(str) \
(__extension__ (__builtin_constant_p (str) \
? __builtin_strlen (str) \
: __strlen_g (str)))
// Note 2
__STRING_INLINE size_t __strlen_g (__const char *__str);
__STRING_INLINE size_t
__strlen_g (__const char *__str)
{
register char __dummy;
register __const char *__tmp = __str;
__asm__ __volatile__
("1:\n\t"
"movb (%0),%b1\n\t"
"leal 1(%0),%0\n\t"
"testb %b1,%b1\n\t"
"jne 1b"
: "=r" (__tmp), "=&q" (__dummy)
: "0" (__str),
"m" ( *(struct { char __x[0xfffffff]; } *)__str)
: "cc" );
return __tmp - __str - 1;
}
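/*
Overall, this has a structure very similar to the third definition.
The only thing left to understand is a new piece of syntax: Note 2 (the prototype placed right before the definition of __strlen_g).
*/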
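
The macro at the top of this fourth definition can be tried in isolation. Here is a minimal sketch, assuming GCC or Clang (both `__builtin_constant_p` and `__builtin_strlen` are compiler builtins). The names `len` and `len_g` are stand-ins for this illustration, and the plain C loop replaces the inline assembly of the real `__strlen_g`.

```c
#include <assert.h>
#include <stddef.h>

/* Runtime fallback: a plain C loop standing in for the asm version. */
static size_t len_g(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t) (p - s);
}

/* Same shape as the header macro: if the compiler can prove the argument
   is a compile-time constant, let the builtin compute the length;
   otherwise call the runtime helper. */
#define len(s) \
    (__extension__ (__builtin_constant_p (s) \
                    ? __builtin_strlen (s)   \
                    : len_g (s)))
```

With a string literal such as `len("abc")` the compiler can usually fold the whole expression to a constant, while an argument that is only known at run time goes through `len_g`. Which branch is taken for a given argument depends on the compiler and the optimization level.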


辛昕

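I eventually realized that if I really want to keep going, I have no choice but to dig into what is most likely x86 assembly syntax or gcc syntax.

Otherwise I have no idea how one small strlen ends up with four definitions. How am I supposed to know which one actually gets used in the end?
How do they relate to each other, and why make it so complicated?

Those are the things I will try to work out next.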