LingosHook: My Homemade Wheel Is Too Picky About the Road

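To get HTML under control, I have spent the past few days learning Tidylib. I finally figured it out, happily merged it into my code, ran a test, and was dumbfounded: character set problems...
Tidylib's input stream seems to accept only const char*, so I had to convert std::wstring from wide characters to multibyte. I tried the conversion several times, and it sometimes worked and sometimes did not. It took me until midnight to realize that my test HTML pages each used a different character set, which is where the charset problems came from, and it nearly killed me. In the end I gritted my teeth and gave up converting between code pages: treat everything as 'raw' data and go from wide to multibyte directly as UTF-8. Hence the code below.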

int CHtmlTidyObject::Tidy(const std::wstring &input, std::wstring &output)
{
    const UINT codepage = CP_UTF8; // 54936 (GB18030) was also tried
    int ret = -1;

    TidyDoc tdoc = tidyCreate();
    if(tdoc == NULL)
        return -1;

    // Configure Tidy. Evaluation stops at the first setter that fails.
    bool ok = tidyOptSetBool(tdoc, TidyMark, no) == yes
        && tidyOptSetInt(tdoc, TidyDoctypeMode, TidyDoctypeOmit) == yes
        && tidyOptSetBool(tdoc, TidyHideComments, yes) == yes
        && tidyOptSetInt(tdoc, TidyWrapLen, 0) == yes
        //&& tidyOptSetBool(tdoc, TidyMakeClean, yes) == yes //css
        && tidyOptSetBool(tdoc, TidyUpperCaseTags, yes) == yes
        && tidyOptSetBool(tdoc, TidyHtmlOut, yes) == yes
        && tidySetCharEncoding(tdoc, "raw") == 0
        && tidyOptSetBool(tdoc, TidyShowWarnings, no) == yes
        && tidyOptSetInt(tdoc, TidyShowErrors, 0) == yes
        && tidyOptSetBool(tdoc, TidyForceOutput, yes) == yes;

    // Wide -> UTF-8. WideCharToMultiByte returns 0 on failure, never -1.
    int sz = ok ? WideCharToMultiByte(codepage, 0, input.c_str(), (int)input.size(), NULL, 0, NULL, NULL) : 0;
    if(sz <= 0)
    {
        tidyRelease(tdoc); // do not leak the document on early exit
        return -1;
    }

    // std::string keeps the buffer null-terminated, which tidyParseString requires.
    std::string buf(sz, '\0');
    WideCharToMultiByte(codepage, 0, input.c_str(), (int)input.size(), &buf[0], sz, NULL, NULL);

    //TidyBuffer errbuf = {0};
    //tidySetErrorBuffer( tdoc, &errbuf );
    if(tidyParseString(tdoc, buf.c_str()) >= 0 && tidyCleanAndRepair(tdoc) >= 0)
    {
        //tidyRunDiagnostics( tdoc );
        TidyBuffer outbuf = { 0 };
        if(tidySaveBuffer(tdoc, &outbuf) >= 0 && outbuf.size > 0)
        {
            //std::cout << "OUTPUT->\n" << outbuf.bp << std::endl;
            // UTF-8 -> wide, sized by the byte count, so no terminator is needed.
            int wsz = MultiByteToWideChar(codepage, 0, (const char*)outbuf.bp, (int)outbuf.size, NULL, 0);
            if(wsz > 0)
            {
                std::wstring wbuf(wsz, L'\0');
                MultiByteToWideChar(codepage, 0, (const char*)outbuf.bp, (int)outbuf.size, &wbuf[0], wsz);
                output.swap(wbuf);
                ret = 0;
            }
        }
        tidyBufFree(&outbuf);
    }
    //std::cout << "ERROR->\n" << errbuf.bp << std::endl;
    //tidyBufFree(&errbuf);

    tidyRelease(tdoc);
    return ret;
}
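
The function above depends on the Win32 conversion calls. As a rough, portable sketch of why going through UTF-8 is safe for any page content, the round trip can be written by hand. The names ToUtf8 and FromUtf8 are hypothetical and not part of LingosHook or Tidylib, and the sketch works on char32_t code points rather than Windows wchar_t (UTF-16), so surrogate pairs are not covered. The decoder assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-8 bytes.
std::string ToUtf8(const std::u32string &in)
{
    std::string out;
    for (char32_t c : in)
    {
        if (c < 0x80)                       // 1 byte: plain ASCII
        {
            out += static_cast<char>(c);
        }
        else if (c < 0x800)                 // 2 bytes
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)               // 3 bytes (most CJK text)
        {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else                                // 4 bytes
        {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode well-formed UTF-8 back to code points.
// No validation is performed; malformed input gives meaningless output.
std::u32string FromUtf8(const std::string &in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        // The lead byte tells how many continuation bytes follow.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        char32_t c = extra == 0 ? b : (b & (0x3F >> extra));
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            c = (c << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out += c;
    }
    return out;
}
```

Every code point maps to a unique byte sequence and back, so FromUtf8(ToUtf8(s)) returns s for any text, whatever character set the page was originally saved in. A single legacy code page cannot promise that, which may be why the earlier conversions were hit-and-miss.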

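It still feels like something is wrong, but after Tidy's processing, TinyHtmlParser can indeed parse HTML that it previously could not, so I will leave it as is and keep testing. Sigh. HTML has been the most troublesome part of LingosHook from start to finish. Had I known, I would have looked harder for a stable parser; my homemade wheel is just too picky about the road...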

Posted on 2010-05-12 18:06 by codejie. Categories: C++, LingosHook

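# re: LingosHook: My Homemade Wheel Is Too Picky About the Road 2010-05-13 11:42 陈梓瀚(vczh)

Something like HTML, where a single string can be made up of several character sets, is really a structured binary file rather than a text file...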

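# re: LingosHook: My Homemade Wheel Is Too Picky About the Road 2010-05-13 12:12 codejie

@陈梓瀚(vczh)
That 'structured' label suits XML better; it does not really fit HTML. Otherwise there would be no such thing as 'browser error tolerance'. So it is still more convenient to treat HTML as text.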
