For a hand-written lexical analyzer, NFAs and DFAs are unavoidable unless your grammar is very simple.
Once you have a regular expression for a pattern in the source program (that is, the character stream you want to process), you can construct the corresponding NFA and then convert it into a DFA. That DFA can then be run over your source program to find every lexeme that matches the pattern. Following this workflow, for any programming language, whether a mainstream language or a scripting language, you can write your own lexical analyzer simply by constructing DFAs for all of its patterns.
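The NFA-to-DFA step described above is the classic subset construction. Below is a minimal Python sketch of it; the NFA representation (a transition dict plus an epsilon-move dict) and all names are my own illustrative choices, not taken from the articles linked below.

```python
# Sketch of subset construction (NFA -> DFA).
# NFA representation (illustrative): trans maps (state, symbol) -> set of
# states, eps maps state -> set of states reachable by one epsilon move.
from itertools import count

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` via epsilon moves alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(start, accept, trans, eps, alphabet):
    """Return (dfa_transitions, accepting_dfa_states); DFA start state is 0."""
    start_set = epsilon_closure({start}, eps)
    dfa_states = {start_set: 0}          # set of NFA states -> DFA state id
    dfa_trans, worklist, ids = {}, [start_set], count(1)
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            moved = set()
            for s in S:                  # move on symbol a from every state in S
                moved |= trans.get((s, a), set())
            T = epsilon_closure(moved, eps)
            if not T:
                continue                 # dead transition; omit it
            if T not in dfa_states:
                dfa_states[T] = next(ids)
                worklist.append(T)
            dfa_trans[(dfa_states[S], a)] = dfa_states[T]
    accepting = {i for S, i in dfa_states.items() if accept in S}
    return dfa_trans, accepting
```

For example, feeding it the textbook Thompson-construction NFA for `(a|b)*abb` yields the familiar five-state DFA for that pattern.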
Two articles on going from regular expressions to DFAs are well worth reading:
1. "Writing own regular expression parser" by Amer Gerzic (in English, with source code):
http://www.codeproject.com/KB/recipes/OwnRegExpressionsParser.aspx
2. "构造正则表达式引擎" (Building a Regular Expression Engine), hot off the press! In Chinese, by vczh of South China University of Technology:
http://www.cppblog.com/vczh/archive/2008/05/22/50763.html
After reading those two articles, writing a working lexer should be no problem.
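To make the end goal concrete, here is a toy sketch of the driver loop such a lexer might use once a DFA exists for each pattern: run every DFA from the current position, keep the longest match (maximal munch), and emit (token name, lexeme) pairs. The two hand-built DFAs and all names here are hypothetical, just to show the shape of the loop.

```python
def longest_match(dfa, text, pos):
    """Run one DFA from `pos`; return the length of the longest accepted prefix."""
    state, best = dfa["start"], 0
    for i in range(pos, len(text)):
        state = dfa["step"](state, text[i])
        if state is None:               # no transition: DFA is stuck
            break
        if state in dfa["accept"]:
            best = i - pos + 1          # remember the last accepting position
    return best

# Two tiny hand-built DFAs (illustrative): identifiers and integers.
IDENT = {"start": 0, "accept": {1},
         "step": lambda s, c: 1 if (c.isalpha() or c == "_"
                                    or (s == 1 and c.isdigit())) else None}
NUMBER = {"start": 0, "accept": {1},
          "step": lambda s, c: 1 if c.isdigit() else None}
PATTERNS = [("id", IDENT), ("number", NUMBER)]

def lex(text):
    """Maximal-munch scan: emit (token_name, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():         # skip whitespace between lexemes
            pos += 1
            continue
        name, length = max(((n, longest_match(d, text, pos)) for n, d in PATTERNS),
                           key=lambda t: t[1])
        if length == 0:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens
```

A real lexer would of course have one DFA per pattern of the language (keywords, operators, literals, and so on), but the loop itself stays this simple.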
As an appendix, here is a passage from the dragon book (Compilers: Principles, Techniques, and Tools) distinguishing the terms token, pattern, and lexeme:
1. A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit (lexeme), e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the
parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens,
the pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an
instance of that token.

Notes:
1. More than one lexeme can match a pattern.
2. See Example 3.1 in the dragon book.
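The three terms can be illustrated with Python's `re` module: one token name ("id"), one pattern (a regular expression), and several distinct lexemes that all match that pattern. The code below is my own small example, not taken from the book; the token here is a (name, attribute) pair whose attribute is simply the lexeme.

```python
import re

# The pattern for the token "id": the shape its lexemes may take.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

source = "count = count + increment"
tokens = [("id", m.group()) for m in ID_PATTERN.finditer(source)]
# The lexemes "count" (twice) and "increment" are three distinct
# character sequences in the source, all instances of the one token "id".
```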