ADdress
If the regular expression is a$at the end of the expression, then the text matches only if it too is at its end.
Otherwise, if we are not at the end of the text string (that is,*text!='\0') and if the first character of the text string matches the first character of the regular expression, so far so good; we go on to test whether the next character of the regular expression matches the next character of the text by making a recursive call tomatchhere. This recursive call is the heart of the algorithm and is why the code is so compact and clean.
If all of these attempts to match fail, then there can be no match at this point in the regular expression and the text, somatchherereturns 0.
This code really uses C pointers. At each stage of the recursion, if something matches, the recursive call that follows uses pointer arithmetic (e.g.,regexp+1andtext+1) so that the next function is called with the next character of the regular expression and of the text. The depth of recursion is no more than the length of the pattern, which in normal use is quite short, so there is no danger of running out of space.
Alternatives This is a very elegant and well-written piece of code, but it's not perfect. What might one do differently? I might rearrangematchhereto deal with$before*. Although it makes no difference here, it feels a bit more natural, and a good rule is to do easy cases before hard ones.
In general, however, the order of tests is critical. For instance, in this test frommatchstar:
} while (*text != '\0' && (*text++ == c || c == '.'));
no matter what, we must advance over one more character of the text string, so the increment intext++must always be performed. This code is careful about termination conditions. Generally, the success of a match is determined by whether the regular expression runs out at the same time as the text does. If they do run out together, that indicates a match; if one runs out before the other, there is no match. This is perhaps most obvious in a line like
if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0'; but subtle termination conditions show up in other places as well.
其他方案 这是一段优雅的写的很好的代码,但是它并不完美。有人可能会按照其他的方式来写。在这里我可能重新组织匹配的顺序,将$匹配的探测置于*之前,尽管这两种处理方式本质上并没有区别,但是这样做看起来更自然些。在处理难题之前先做一些容易的事是一个很好的习惯。
一般情况下,匹配测试的顺序是非常重要的。比如下面的frommatchstar测试:
} while (*text != '\0' && (*text++ == c || c == '.')); 无论什么时候,我们必须向后取一个字符,所以增量intext++必须每次都被执行。
这段代码对于终止条件敏感。一般情况下,匹配成功与否是由正则表达式和文本是否同时结束决定,如果正则表达式和文本一起结束,就表明匹配成功,反之如则匹配失败。这和下面的这种写法很像:
if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0';但这种终止条件也可能出现在其他地方。
The version ofmatchstarthat implements leftmost longest matching begins by identifying a maximal sequence of occurrences of the input characterc. Then it usesmatchhereto try to extend the match to the rest of the pattern and the rest of the text. Each failure reduces the number ofc's by one and tries again, including the case of zero occurrences.
/* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;}
Consider the regular expression(.*), which matches arbitrary text within parentheses. Given the target text
for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
a longest match from the beginning will identify the entire parenthesized expression, while a shortest match will stop at the first right parenthesis. (Of course a longest match beginning from the second left parenthesis will extend to the end of the text.)
matchstar的这个版本实现了左边最长匹配,它开头定义了一个最长的序列,对输入字符进行检索。之后它使用matchhere去尝试将匹配扩展到模式的其余部分,以及文本的剩余部分。每次失败将使c的数值减少一,同时它会再次尝试,其中包含了零匹配的情形。
/* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;}考虑到通常的表达式(.*),它对括号内的任意字符进行匹配。我们给出目标文本
for (t = text; *t != '\0' && (*t == c || c == '.'); t++)最长的匹配是始自于对整个带括号表达式的定义,而最短的匹配则终止于第一个右括号。(当然从第二个左括号开始的最长匹配,将扩展到文本的最末端。)
Building On It The purpose of TPOP was to teach good programming. At the time the book was written, Rob and I were still at Bell Labs, so we did not have first-hand experience of how the book would be best used in a classroom. It has been gratifying to discover that some of the material does work well in classes. I have used this code since 2000 as a vehicle for teaching important points about programming.
First, it shows how recursion is useful and leads to clean code, in a new setting; it's not yet another version of Quicksort (or factorial!), nor is it some kind of tree walk.
It's also a good example for performance experiments. Its performance is not very different from the system versions of grep, which shows that the recursive technique is not too costly and that it's not worth trying to tune the code.
On the other hand, it is also a fine illustration of the importance of a good algorithm. If a pattern includes several.*s, the straightforward implementation requires a lot of backtracking, and in some cases will run very slowly indeed. (The standard Unix grep has the same properties.) For example, the command
grep 'a.*a.*a.*a.a'
takes about 20 seconds to process a 4 MB text file on a typical machine. An implementation based on converting a non-deterministic finite automaton to a deterministic automaton, as in egrep, will have much better performance on hard cases -- the same pattern and the same input is processed in less than a tenth of a second, and in general, the running time is independent of the pattern.
Extensions to the regular expression class can form the basis of a variety of assignments. For example:
(1) Add other metacharacters, like+for one or more occurrences of the previous character, or?for zero or one matches. Add some way to quote metacharacters, like\$to stand for a literal occurrence of$.
(2) Separate regular expression processing into a "compilation" phase and an "execution" phase. Compilation converts the regular expression into an internal form that makes the matching code simpler or such that subsequent matching runs faster. This separation is not necessary for the simple class of regular expressions in the original design, but it makes sense in grep-like applications where the class is richer and the same regular expression is used for a large number of input lines.
(3) Add character classes like[abc]and[0-9], which in conventional grep notation matchaorborcand a digit respectively. This can be done in several ways, the most natural of which seems to be replacing thechar*'s of the original code with a structure:
typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE; and modifying the basic code to handle an array of these instead of an array of characters. It's not strictly necessary to separate compilation from execution for this situation, but it turns out to be a lot easier. Students who follow the advice to pre-compile into such a structure invariably do better than those who try to interpret some complicated pattern data structure on the fly.
(3)增加像[abc]和[0-9]这样的字符类,在传统的grep表示里,它分别匹配a或b或c和一个数字。这可通过几种方式实现,其中最自然的方式似乎是使用下面的机构替代原始代码中的char*:
typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE;并且修改相应的代码来处理这样的结构数组而不是处理字符数组。在这种情况下,严格的来说,不需要把编译阶段从执行阶段中拆分出来,但是这证明拆分是非常简单的。遵循预编译成这样的结构建议的学生一定比哪些试图在运行过程中解释一些复杂模式的数据结构的学生做得更好。
Writing clear and unambiguous specifications for character classes is tough, and implementing them perfectly is worse, requiring a lot of tedious and uninstructive coding. I have simplified this assignment over time, and today most often ask for Perl-like shorthands such as\dfor digit and\Dfor non-digit instead of the original bracketed ranges.
(4) Use an opaque type to hide the RE structure and all the implementation details. This is a good way to show object-oriented programming in C, which doesn't support much beyond this. In effect, one makes a regular expression class but with function names likeRE_new()andRE_match()for the methods instead of the syntactic sugar of an object-oriented language.
(5) Modify the class of regular expressions to be like the wild cards in various shells: matches are implicitly anchored at both ends,*matches any number of characters, and?matches any single character. One can modify the algorithm or map the input into the existing algorithm.
(6) Convert the code to Java. The original code uses C pointers very well, and it's good practice to figure out the alternatives in a different language. Java versions use eitherString.charAt(indexing instead of pointers) orString.substring(closer to the pointer version). Neither seems as clear as the C code, and neither is as compact. Although performance isn't really part of this exercise, it is interesting to see that the Java implementation runs roughly six or seven times slower than the C versions.
(7) Write a wrapper class that converts from regular expressions of this class to Java's Pattern and Matcher classes, which separate the compilation and matching in a quite different way. This is a good example of the Adapter or Facade pattern, which puts a different face on an existing class or set of functions.
I've also used this code extensively to explore testing techniques. Regular expressions are rich enough that testing is far from trivial, but small enough that one can quickly write down a substantial collection of tests to be performed mechanically. For extensions like those just listed, I ask students to write a large number of tests in a compact language (yet another example of "notation") and use those tests on their own code; naturally I use their tests on other students' code as well.
Conclusions I was amazed by how compact and elegant this code was when Rob Pike first wrote it -- it was much smaller and more powerful than I had thought possible. In hindsight, one can see a number of reasons why the code is so small.
First, the features are well chosen to be the most useful and to give the most insight into implementation, without any frills. For example, the implementation of the anchored patterns^and$requires only 3 or 4 lines, but shows how to handle special cases cleanly before handling the general cases uniformly. The closure operation*is a fundamental notion in regular expressions and provides the only way to handle patterns of unspecified lengths, so it has to be present. But it would add no insight to also provide+and?so those are left as exercises.
Second, recursion is a win. This fundamental programming technique almost always leads to smaller, cleaner and more elegant code than the equivalent written with explicit loops, and that is the case here. The idea of peeling off one matching character from the front of the regular expression and from the text, then recursing for the rest, echoes the recursive structure of the traditional factorial or string length examples, but in a much more interesting and useful setting.
Rob has told me that the recursion was not so much an explicit design decision as a consequence of how he approached the problem: given a pattern and a text, write a function that looks for a match; that in turn needs a "matchhere" function; and so on.
"I have pretty vivid memories of watching the code almost write itself this way. The only challenge was getting the edge conditions right to break the recursion. Put another way, the recursion is not only the implementation method, it's also a reflection of the thought process taken when writing the code, which is partly responsible for the code's simplicity. Most important, perhaps, I didn't have a design when I started, I just began to code and saw what developed. Suddenly, I was done."
If the regular expression is a$at the end of the expression, then the text matches only if it too is at its end.
Otherwise, if we are not at the end of the text string (that is,*text!='\0') and if the first character of the text string matches the first character of the regular expression, so far so good; we go on to test whether the next character of the regular expression matches the next character of the text by making a recursive call tomatchhere. This recursive call is the heart of the algorithm and is why the code is so compact and clean.
If all of these attempts to match fail, then there can be no match at this point in the regular expression and the text, somatchherereturns 0.
This code really uses C pointers. At each stage of the recursion, if something matches, the recursive call that follows uses pointer arithmetic (e.g.,regexp+1andtext+1) so that the next function is called with the next character of the regular expression and of the text. The depth of recursion is no more than the length of the pattern, which in normal use is quite short, so there is no danger of running out of space.
Alternatives This is a very elegant and well-written piece of code, but it's not perfect. What might one do differently? I might rearrangematchhereto deal with$before*. Although it makes no difference here, it feels a bit more natural, and a good rule is to do easy cases before hard ones.
In general, however, the order of tests is critical. For instance, in this test frommatchstar:
} while (*text != '\0' && (*text++ == c || c == '.'));
no matter what, we must advance over one more character of the text string, so the increment intext++must always be performed. This code is careful about termination conditions. Generally, the success of a match is determined by whether the regular expression runs out at the same time as the text does. If they do run out together, that indicates a match; if one runs out before the other, there is no match. This is perhaps most obvious in a line like
if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0'; but subtle termination conditions show up in other places as well.
其他方案 这是一段优雅的写的很好的代码,但是它并不完美。有人可能会按照其他的方式来写。在这里我可能重新组织匹配的顺序,将$匹配的探测置于*之前,尽管这两种处理方式本质上并没有区别,但是这样做看起来更自然些。在处理难题之前先做一些容易的事是一个很好的习惯。
一般情况下,匹配测试的顺序是非常重要的。比如下面的frommatchstar测试:
} while (*text != '\0' && (*text++ == c || c == '.')); 无论什么时候,我们必须向后取一个字符,所以增量intext++必须每次都被执行。
这段代码对于终止条件敏感。一般情况下,匹配成功与否是由正则表达式和文本是否同时结束决定,如果正则表达式和文本一起结束,就表明匹配成功,反之如则匹配失败。这和下面的这种写法很像:
if (regexp[0] == '$' && regexp[1] == '\0') return *text == '\0';但这种终止条件也可能出现在其他地方。
The version ofmatchstarthat implements leftmost longest matching begins by identifying a maximal sequence of occurrences of the input characterc. Then it usesmatchhereto try to extend the match to the rest of the pattern and the rest of the text. Each failure reduces the number ofc's by one and tries again, including the case of zero occurrences.
/* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;}
Consider the regular expression(.*), which matches arbitrary text within parentheses. Given the target text
for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
a longest match from the beginning will identify the entire parenthesized expression, while a shortest match will stop at the first right parenthesis. (Of course a longest match beginning from the second left parenthesis will extend to the end of the text.)
matchstar的这个版本实现了左边最长匹配,它开头定义了一个最长的序列,对输入字符进行检索。之后它使用matchhere去尝试将匹配扩展到模式的其余部分,以及文本的剩余部分。每次失败将使c的数值减少一,同时它会再次尝试,其中包含了零匹配的情形。
/* matchstar: leftmost longest search for c*regexp */int matchstar(int c, char *regexp, char *text){ char *t; for (t = text; *t != '\0' && (*t == c || c == '.'); t++) ; do { /* * matches zero or more */ if (matchhere(regexp, t)) return 1; } while (t-- > text); return 0;}考虑到通常的表达式(.*),它对括号内的任意字符进行匹配。我们给出目标文本
for (t = text; *t != '\0' && (*t == c || c == '.'); t++)最长的匹配是始自于对整个带括号表达式的定义,而最短的匹配则终止于第一个右括号。(当然从第二个左括号开始的最长匹配,将扩展到文本的最末端。)
Building On It The purpose of TPOP was to teach good programming. At the time the book was written, Rob and I were still at Bell Labs, so we did not have first-hand experience of how the book would be best used in a classroom. It has been gratifying to discover that some of the material does work well in classes. I have used this code since 2000 as a vehicle for teaching important points about programming.
First, it shows how recursion is useful and leads to clean code, in a new setting; it's not yet another version of Quicksort (or factorial!), nor is it some kind of tree walk.
It's also a good example for performance experiments. Its performance is not very different from the system versions of grep, which shows that the recursive technique is not too costly and that it's not worth trying to tune the code.
On the other hand, it is also a fine illustration of the importance of a good algorithm. If a pattern includes several.*s, the straightforward implementation requires a lot of backtracking, and in some cases will run very slowly indeed. (The standard Unix grep has the same properties.) For example, the command
grep 'a.*a.*a.*a.a'
takes about 20 seconds to process a 4 MB text file on a typical machine. An implementation based on converting a non-deterministic finite automaton to a deterministic automaton, as in egrep, will have much better performance on hard cases -- the same pattern and the same input is processed in less than a tenth of a second, and in general, the running time is independent of the pattern.
Extensions to the regular expression class can form the basis of a variety of assignments. For example:
(1) Add other metacharacters, like+for one or more occurrences of the previous character, or?for zero or one matches. Add some way to quote metacharacters, like\$to stand for a literal occurrence of$.
(2) Separate regular expression processing into a "compilation" phase and an "execution" phase. Compilation converts the regular expression into an internal form that makes the matching code simpler or such that subsequent matching runs faster. This separation is not necessary for the simple class of regular expressions in the original design, but it makes sense in grep-like applications where the class is richer and the same regular expression is used for a large number of input lines.
(3) Add character classes like[abc]and[0-9], which in conventional grep notation matchaorborcand a digit respectively. This can be done in several ways, the most natural of which seems to be replacing thechar*'s of the original code with a structure:
typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE; and modifying the basic code to handle an array of these instead of an array of characters. It's not strictly necessary to separate compilation from execution for this situation, but it turns out to be a lot easier. Students who follow the advice to pre-compile into such a structure invariably do better than those who try to interpret some complicated pattern data structure on the fly.
(3)增加像[abc]和[0-9]这样的字符类,在传统的grep表示里,它分别匹配a或b或c和一个数字。这可通过几种方式实现,其中最自然的方式似乎是使用下面的机构替代原始代码中的char*:
typedef struct RE { int type; /* CHAR, STAR, etc. */ char ch; /* the character itself */ char *ccl; /* for [...] instead */ int nccl; /* true if class is negated [^...] */ } RE;并且修改相应的代码来处理这样的结构数组而不是处理字符数组。在这种情况下,严格的来说,不需要把编译阶段从执行阶段中拆分出来,但是这证明拆分是非常简单的。遵循预编译成这样的结构建议的学生一定比哪些试图在运行过程中解释一些复杂模式的数据结构的学生做得更好。
Writing clear and unambiguous specifications for character classes is tough, and implementing them perfectly is worse, requiring a lot of tedious and uninstructive coding. I have simplified this assignment over time, and today most often ask for Perl-like shorthands such as\dfor digit and\Dfor non-digit instead of the original bracketed ranges.
(4) Use an opaque type to hide the RE structure and all the implementation details. This is a good way to show object-oriented programming in C, which doesn't support much beyond this. In effect, one makes a regular expression class but with function names likeRE_new()andRE_match()for the methods instead of the syntactic sugar of an object-oriented language.
(5) Modify the class of regular expressions to be like the wild cards in various shells: matches are implicitly anchored at both ends,*matches any number of characters, and?matches any single character. One can modify the algorithm or map the input into the existing algorithm.
(6) Convert the code to Java. The original code uses C pointers very well, and it's good practice to figure out the alternatives in a different language. Java versions use eitherString.charAt(indexing instead of pointers) orString.substring(closer to the pointer version). Neither seems as clear as the C code, and neither is as compact. Although performance isn't really part of this exercise, it is interesting to see that the Java implementation runs roughly six or seven times slower than the C versions.
(7) Write a wrapper class that converts from regular expressions of this class to Java's Pattern and Matcher classes, which separate the compilation and matching in a quite different way. This is a good example of the Adapter or Facade pattern, which puts a different face on an existing class or set of functions.
I've also used this code extensively to explore testing techniques. Regular expressions are rich enough that testing is far from trivial, but small enough that one can quickly write down a substantial collection of tests to be performed mechanically. For extensions like those just listed, I ask students to write a large number of tests in a compact language (yet another example of "notation") and use those tests on their own code; naturally I use their tests on other students' code as well.
Conclusions I was amazed by how compact and elegant this code was when Rob Pike first wrote it -- it was much smaller and more powerful than I had thought possible. In hindsight, one can see a number of reasons why the code is so small.
First, the features are well chosen to be the most useful and to give the most insight into implementation, without any frills. For example, the implementation of the anchored patterns^and$requires only 3 or 4 lines, but shows how to handle special cases cleanly before handling the general cases uniformly. The closure operation*is a fundamental notion in regular expressions and provides the only way to handle patterns of unspecified lengths, so it has to be present. But it would add no insight to also provide+and?so those are left as exercises.
Second, recursion is a win. This fundamental programming technique almost always leads to smaller, cleaner and more elegant code than the equivalent written with explicit loops, and that is the case here. The idea of peeling off one matching character from the front of the regular expression and from the text, then recursing for the rest, echoes the recursive structure of the traditional factorial or string length examples, but in a much more interesting and useful setting.
Rob has told me that the recursion was not so much an explicit design decision as a consequence of how he approached the problem: given a pattern and a text, write a function that looks for a match; that in turn needs a "matchhere" function; and so on.
"I have pretty vivid memories of watching the code almost write itself this way. The only challenge was getting the edge conditions right to break the recursion. Put another way, the recursion is not only the implementation method, it's also a reflection of the thought process taken when writing the code, which is partly responsible for the code's simplicity. Most important, perhaps, I didn't have a design when I started, I just began to code and saw what developed. Suddenly, I was done."
Third, this code really uses the underlying language to good effect. Pointers can be mis-used, of course, but here they are used to create compact expressions that naturally express the extraction of individual characters and advancing to the next character. The same effect can be achieved by array indexing or substrings, but in this code, pointers do a better job, especially when coupled with C idioms for auto-increment and implicit conversion of truth values.
I don't know of another piece of code that does so much in so few lines, while providing such a rich source of insight and further ideas.