随笔-341  评论-2670  文章-0  trackbacks-0
    字符集的正规化指的是让正则表达式的表达式树的所有节点中记录的字符集合的最小单元都是互不交叉的。举个例子,[a-g][h-n]没有交叉,但是[a-g][g-n]就交叉了。所以对[a-g][g-n]做字符集正规化的结果就是将表达式修改为([a-f]|g)(g|[h-n])。这样表达式里面出现的字符集合的最小单元[a-f]、g和[h-n]就没有交叉了。下面是正规化的代码:

    正规化包含两个步骤,第一步是检查所有的字符集表达式然后做出一张正规化列表,譬如从表达式[a-g][g-n]抽取出正规化列表[a-f]、g和[h-n]。第二步则使用这张列表重写表达式。[a-g]=[a-f]|g而[h-n]=h|[g-n],于是便改写成了([a-f]|g)(g|[h-n])。在这里我们使用上一篇文章的visitor模式来完成。第一步和第二步的共同点是遍历所有的节点,然后获取所有的CharSetExpression。他们的区别仅仅在于如何对待CharSetExpression上。所以我们先写一个算法基类:
 1         class CharSetAlgorithm : public RegexExpressionAlgorithm<void, NormalizedCharSet*>
 2         {
 3         public:
 4             void Apply(LoopExpression* expression, NormalizedCharSet* target)
 5             {
 6                 Invoke(expression->expression, target);
 7             }
 8 
 9             void Apply(SequenceExpression* expression, NormalizedCharSet* target)
10             {
11                 Invoke(expression->left, target);
12                 Invoke(expression->right, target);
13             }
14 
15             void Apply(AlternateExpression* expression, NormalizedCharSet* target)
16             {
17                 Invoke(expression->left, target);
18                 Invoke(expression->right, target);
19             }
20 
21             void Apply(BeginExpression* expression, NormalizedCharSet* target)
22             {
23             }
24 
25             void Apply(EndExpression* expression, NormalizedCharSet* target)
26             {
27             }
28 
29             void Apply(CaptureExpression* expression, NormalizedCharSet* target)
30             {
31                 Invoke(expression->expression, target);
32             }
33 
34             void Apply(MatchExpression* expression, NormalizedCharSet* target)
35             {
36             }
37 
38             void Apply(PositiveExpression* expression, NormalizedCharSet* target)
39             {
40                 Invoke(expression->expression, target);
41             }
42 
43             void Apply(NegativeExpression* expression, NormalizedCharSet* target)
44             {
45                 Invoke(expression->expression, target);
46             }
47 
48             void Apply(UsingExpression* expression, NormalizedCharSet* target)
49             {
50             }
51         };

    足够细心的话会发现Apply(CharSetExpression*)没有了。这是当然的,因为下面两个算法将补全之。首先是提取正规化列表。方法很简单,找出每一个字符集,用它来切割正规化列表就好了。举个例子,我们处理[a-g][g-h],首先获得[a-g],然后通过跟[g-h]比较知道他们有交集,于是提取交集g,然后切割一下就行了:
 1         class BuildNormalizedCharSetAlgorithm : public CharSetAlgorithm
 2         {
 3         public:
 4             void AddRange(NormalizedCharSet* target, CharRange range)
 5             {
 6                 int index=0;
 7                 while(index<target->ranges.Count())
 8                 {
 9                     CharRange current=target->ranges[index];
10                     if(current<range || current>range)
11                     {
12                         index++;
13                     }
14                     else if(current.begin<range.begin)
15                     {
16                         // range   :    [    ?
17                         // current : [       ]
18                         target->ranges.RemoveAt(index);
19                         target->ranges.Add(CharRange(current.begin, range.begin-1));
20                         target->ranges.Add(CharRange(range.begin, current.end));
21                         index++;
22                     }
23                     else if(current.begin>range.begin)
24                     {
25                         // range  :  [       ]
26                         // current  :   [    ?
27                         target->ranges.Add(CharRange(range.begin, current.begin-1));
28                         range.begin=current.begin;
29                     }
30                     else if(current.end<range.end)
31                     {
32                         // range   : [       ]
33                         // current : [    ]
34                         range.begin=current.end+1;
35                         index++;
36                     }
37                     else if(current.end>range.end)
38                     {
39                         // range   : [    ]
40                         // current : [       ]
41                         target->ranges.RemoveAt(index);
42                         target->ranges.Add(range);
43                         target->ranges.Add(CharRange(range.end+1, current.end));
44                         return;
45                     }
46                     else
47                     {
48                         // range   : [       ]
49                         // current : [       ]
50                         return;
51                     }
52                 }
53                 target->ranges.Add(range);
54             }

    于是,我们拿到了这张列表之后,就可以重写表达式了:
 1         class SetNormalizedCharSetAlgorithm : public CharSetAlgorithm
 2         {
 3         public:
 4             void Apply(CharSetExpression* expression, NormalizedCharSet* target)
 5             {
 6                 CharRange::List result;
 7                 for(int i=0;i<target->ranges.Count();i++)
 8                 {
 9                     CharRange targetRange=target->ranges[i];
10                     for(int j=0;j<expression->ranges.Count();j++)
11                     {
12                         CharRange range=expression->ranges[j];
13                         if(range.begin<=targetRange.begin && targetRange.end<=range.end)
14                         {
15                             result.Add(targetRange);
16                         }
17                     }
18                 }
19                 expression->ranges.Clear();
20                 CopyFrom(expression->ranges.Wrap(), result.Wrap());
21             }
22         };

    最后在Expression那里封装一下就大功告成了:
1         void Expression::NormalizeCharSet()
2         {
3             NormalizedCharSet normalized;
4             BuildNormalizedCharSetAlgorithm().Invoke(this&normalized);
5             SetNormalizedCharSetAlgorithm().Invoke(this&normalized);
6         }

    至于什么是NormalizedCharSet,这只是一个拥有成员SortedList<CharRange>的类罢了。至此我们还看到了Visitor的另一个优点:可以提取算法的公共部分。
posted on 2009-10-17 20:43 陈梓瀚(vczh) 阅读(1881) 评论(1)  编辑 收藏 引用 所属分类: VL++3.0开发纪事

评论:
# re: Vczh Library++3.0之正则表达式引擎(字符集正规化) 2009-10-19 00:26 | pp
楼主活在自己的世界里不亦乐呼啊  回复  更多评论
  

只有注册用户登录后才能发表评论。
网站导航: 博客园   IT新闻   BlogJava   知识库   博问   管理