No preamble. Let's start with a piece of code from LingosHook:
int CHtmlDictParser::HtmlDataType1Proc(const std::wstring &html, const std::wstring &dictid,
    const TinyHtmlParser::CDocumentObject &doc, const TinyHtmlParser::CElementObject *dict,
    const TinyHtmlParser::CElementObject *pdiv, const HtmlDictParser::TDictResult &res,
    TResultMap &result) const
{
    // Step down two levels from pdiv to reach the first entry element.
    const TinyHtmlParser::CElementObject *p = pdiv->child;
    if(p == NULL)
        return -1;
    p = p->child;
    while(p != NULL)
    {
        // Every link on the path to the headword must exist; otherwise stop quietly.
        if(p->child == NULL || p->child->child == NULL || p->child->child->sibling == NULL
            || p->child->child->sibling->child == NULL || p->child->child->sibling->child->child == NULL
            || p->child->child->sibling->child->child->child == NULL)
            return 0;
        // The headword is the value of the innermost element on that path.
        std::wstring word = p->child->child->sibling->child->child->child->value;
        if(PushResult(word, res, result) != 0)
            return -1;
        // Skip the element between two entries: advance two siblings at a time.
        if(p->sibling == NULL || p->sibling->sibling == NULL)
            break;
        p = p->sibling->sibling;
    }
    return 0;
}
This function extracts the words from the HTML data below. But doesn't that if statement make your head spin?
<div id="dict_body_7AB175CC5F622A44A0DECE976AF22A16">
<div id="dict_gls_7AB175CC5F622A44A0DECE976AF22A16">
<div style="MARGIN: 5px 0px">
<div style="WIDTH: 100%">
<div style="FLOAT: left; LINE-HEIGHT: normal">
<img height="11" src=
"file:///C:/Program%20Files/Lingoes/Translator2.7/dict/image/entry_p.png"
width="10" align="absmiddle" border="0">
</div>
<div style="OVERFLOW-X: hidden; WIDTH: 100%">
<div style=
"MARGIN: 0px 0px 5px; COLOR: #808080; LINE-HEIGHT: normal">
<span style=
"FONT-SIZE: 10.5pt; COLOR: #000000; LINE-HEIGHT: normal"><b>
AC</b></span>
</div>
<div style="MARGIN: 0px 0px 5px">
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
公元前
</div>
</div>
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
<font color="navy">[计]</font> 存取周期,
累加器, 声耦合器, 交流, 应用控制,
自动检查, 自动计算机
</div>
</div>
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
<font color="navy">[化]</font> 交流; 交变电流
</div>
</div>
</div>
</div>
</div>
</div>
<div style=
"PADDING-RIGHT: 0px; BORDER-TOP: #c7d4dc 1px solid; PADDING-LEFT: 0px; PADDING-BOTTOM: 0px; PADDING-TOP: 5px">
</div>
<div style="MARGIN: 5px 0px">
<div style="WIDTH: 100%">
<div style="FLOAT: left; LINE-HEIGHT: normal">
<img height="11" src=
"file:///C:/Program%20Files/Lingoes/Translator2.7/dict/image/entry_p.png"
width="10" align="absmiddle" border="0">
</div>
<div style="OVERFLOW-X: hidden; WIDTH: 100%">
<div style=
"MARGIN: 0px 0px 5px; COLOR: #808080; LINE-HEIGHT: normal">
<span style=
"FONT-SIZE: 10.5pt; COLOR: #000000; LINE-HEIGHT: normal"><b>
Ac.</b></span>
</div>
<div style="MARGIN: 0px 0px 5px">
<div style="MARGIN: 4px 0px">
<div style="MARGIN: 4px 0px">
<font color="navy">[医]</font> 锕(89号元素)
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
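One way to tame that if chain is to describe the route to the headword as a short string of steps and let a helper do the NULL checks. The following is only a minimal sketch, not LingosHook's actual code: `Node` is a hypothetical stand-in for `TinyHtmlParser::CElementObject` (only `child`, `sibling` and `value` are modeled), and `Follow` and `ExtractWord` are names made up for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for TinyHtmlParser::CElementObject; only the three
// members used by the function above (child, sibling, value) are modeled.
struct Node
{
    const Node *child = nullptr;    // first child element
    const Node *sibling = nullptr;  // next element at the same level
    std::wstring value;             // text content
};

// Follows a path of steps from `start`: 'c' moves to the first child,
// 's' moves to the next sibling. Returns nullptr as soon as a link is
// missing or an unknown step is met, so the caller needs one check only.
inline const Node *Follow(const Node *start, const char *path)
{
    assert(path != nullptr);
    const Node *p = start;
    for(; p != nullptr && *path != '\0'; ++path)
    {
        if(*path == 'c')
            p = p->child;
        else if(*path == 's')
            p = p->sibling;
        else
            return nullptr;
    }
    return p;
}

// Same route as the original if statement:
// child, child, sibling, child, child, child.
inline bool ExtractWord(const Node *entry, std::wstring &word)
{
    const Node *target = Follow(entry, "ccsccc");
    if(target == nullptr)
        return false;
    word = target->value;
    return true;
}
```

With a helper like this, the three-line condition and the assignment inside the loop would shrink to a single `ExtractWord(p, word)` call, and a dictionary with a slightly different layout would only need a different path string.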
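The result data from Lingoes looks very regular, but internally it contains tiny differences. To improve LingosHook's recognition ability, I have no choice but to analyze this data very carefully and work out its patterns. The whole process reminds me of last year, when I was cracking WOW's MPQ files. Painful! If you are interested, have a look at the screenshots posted here~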
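Most of the dictionaries tested so far fall into two categories, and I have already written a separate function to handle each. I am now optimizing the HTML processing based on the "decryption" results, trying to keep it as fast and as general as possible. Once I have checked the results from a few more dictionaries, an update should be ready in a couple of days. Phew, I am exhausted. Luckily nothing has "come up" at work these past few days, because "decryption" eats up a great deal of time~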