Jin Qing's Column


Batch HTML-to-Text Conversion

(When reposting, please credit Jin Qing's Column as the source.)

My original code followed "Recipe 12.11. Using MSHTML to Parse XML or HTML" and used htmlfile to extract the text.
It converts every HTML file in the current directory to a text file.

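import win32com.client

def extractHtmlFile(htmlFilePath):
    '''Extract html text and save to text file.
    '''
    with open(htmlFilePath, 'r') as f:
        htmlData = f.read()
    # 'htmlfile' is the ProgID of the MSHTML document object.
    html = win32com.client.Dispatch('htmlfile')
    html.writeln(htmlData)
    text = html.body.innerText.encode('gbk', 'ignore')
    # Assumption: the original excerpt stops before the save step; write
    # the text next to the source file, as the later script does.
    with open(htmlFilePath + '.txt', 'wb') as f:
        f.write(text)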
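
The 'ignore' argument to encode matters for pages containing characters that GBK cannot represent. The snippet below is a small illustration, separate from the original script; the helper name to_gbk_bytes is made up for this example.

```python
def to_gbk_bytes(text, errors='ignore'):
    """Encode text as GBK; with errors='ignore', unencodable characters are dropped."""
    return text.encode('gbk', errors)

# u'\u4e2d\u6587' is encodable in GBK; the emoji U+1F600 is not.
sample = u'\u4e2d\u6587 \U0001F600 abc'
encoded = to_gbk_bytes(sample)
print(repr(encoded))
```

With errors='strict' the same call raises UnicodeEncodeError instead, so 'ignore' trades silent character loss for a conversion that never aborts on a single odd character.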

However, I found that MSHTML can fail while parsing a file, in which case the text extraction fails.

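After testing more than 100,000 HTML files, jigloo concluded that htmlfile is far less fault-tolerant than InternetExplorer.Application.
Original post: http://groups.google.com/group/python-cn/msg/c9221764bcafbc21
His code is roughly as follows; driving IE is slightly more cumbersome: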

#!/usr/bin/env python

import sys, os, re, codecs
import time
import win32com.client

class htmlfile:
    def __init__(self):
        self.__ie = win32com.client.Dispatch('InternetExplorer.Application')
        self.__ie.Silent = True
        self.__filename  = ''
        self.__document  = None

    def __del__(self):
        self.__ie.Quit()

    def __getdocument(self, filename):
        filename = os.path.abspath(filename)
        if self.__filename != filename:
            self.__filename = filename
            self.__ie.Navigate2(filename)
            self.__ie.Document.close()
            while self.__ie.Document.Body is None:
                time.sleep(0.1)
            self.__document = self.__ie.Document
        return self.__document
    def gettext(self, filename):
        return self.__getdocument(filename).Body.innerText
    def gettitle(self, filename):
        return self.__getdocument(filename).title

if __name__ == '__main__':
    hf = htmlfile()
    for root, dirs, names in os.walk(u'.'):
        for name in names:
            # Match the dotted extension, case-insensitively.
            if name.lower().endswith(('.htm', '.html')):
                htmlpath = os.path.join(root, name)
                textpath = htmlpath + '.txt'
                with open(textpath, 'wb') as f:
                    f.write(hf.gettext(htmlpath).encode('mbcs'))
            # End of if.
        # End of for name.
    # End of for root.
    del hf
# End of if.

For my simple task, this is good enough.
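
The directory walk in the script can be factored into a small generator, which makes the file selection easy to check without starting IE. This is only a sketch: the names iter_html_files and text_path_for are hypothetical, and matching on the lower-cased dotted extension is a choice made here so that names such as INDEX.HTM are included while names that merely end in "htm" are not.

```python
import os

def iter_html_files(top):
    """Yield paths of .htm/.html files under top (case-insensitive match)."""
    for root, dirs, names in os.walk(top):
        for name in names:
            if name.lower().endswith(('.htm', '.html')):
                yield os.path.join(root, name)

def text_path_for(html_path):
    """Output path convention of the script: the HTML path plus '.txt'."""
    return html_path + '.txt'

if __name__ == '__main__':
    # Dry run: list what would be converted, without touching IE.
    for path in iter_html_files('.'):
        print(path, '->', text_path_for(path))
```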

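There is one problem: if a Windows Explorer window is open, running this code closes Explorer and then exits with an error. This is odd, but it should not be hard to fix; the IE control is probably still being used incorrectly.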

self.__ie.Document.close()
File "C:\Python25\Lib\site-packages\win32com\client\dynamic.py", line 496, in __getattr__
raise AttributeError, "%s.%s" % (self._username_, attr)
AttributeError: Document.close
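
One possible direction, not verified against the failure above: drop the Document.close() call and instead poll the browser's Busy and ReadyState properties after Navigate2 until loading has finished. Both properties belong to the InternetExplorer.Application automation interface, and READYSTATE_COMPLETE has the value 4. The helper name wait_until_loaded, the timeout, and the polling interval are assumptions for this sketch; it accepts any object exposing those two attributes, so it can be exercised without IE.

```python
import time

READYSTATE_COMPLETE = 4  # READYSTATE_COMPLETE in the IE automation interface

def wait_until_loaded(browser, timeout=30.0, poll=0.1):
    """Poll browser.Busy and browser.ReadyState until the page has loaded.

    Returns True once loaded, or False if `timeout` seconds pass first.
    """
    deadline = time.time() + timeout
    while True:
        if not browser.Busy and browser.ReadyState == READYSTATE_COMPLETE:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll)
```

In __getdocument this would be called as wait_until_loaded(self.__ie) right after Navigate2, replacing both the Document.close() call and the loop on Document.Body. Whether it also stops Explorer from being closed is not something this sketch establishes.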

Posted on 2008-03-13 11:55 by Jin Qing. Category: 6. Python

# re: Batch HTML-to-Text Conversion 2008-12-01 15:48 Hanqing Chen

Hello, I need the code for this program. Could you send me a copy?
Many thanks. My email is lychenhanqing@163.com
