Python Chinese word segmentation (pymmseg-cpp) and the garbled Chinese text problem

pymmseg-cpp

http://code.google.com/p/pymmseg-cpp/

pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is an implementation of the MMSEG Chinese word segmentation algorithm in C++ with a Ruby interface.

Download the binary release on the right sidebar and copy the pymmseg directory to your Python's path (e.g. /usr/lib/python2.5/site-packages/). Here's an example of usage:

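from pymmseg import mmseg

# Load the default dictionaries before segmenting
mmseg.dict_load_defaults()
text = '...'  # placeholder: the text to segment goes here
algor = mmseg.Algorithm(text)
# Each token exposes its text and its start/end positions
for tok in algor:
    print '%s [%d..%d]' % (tok.text, tok.start, tok.end)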

Or you can download the source tarball or check out the latest code from the git repo hosted at github. Then you'll need to build the mmseg-cpp module yourself: go to the mmseg-cpp subdirectory and run the build.py script. It will build the native module for you.

For more information, refer to the README file.

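Many people run into garbled output. The likely cause is that mmseg works with UTF-8, while the default local encoding on (Simplified Chinese) Windows is cp936, i.e. GBK, so printing a UTF-8 string directly to the console naturally comes out garbled.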
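
To see why the console gets confused, here is a small illustration that does not use pymmseg at all. It only shows that the same two Chinese characters (written as Unicode escapes so the snippet stays ASCII) have different byte sequences in UTF-8 and GBK. The variable names are made up for this sketch.

```python
# Illustration only: the same text has different bytes in UTF-8 and GBK.
text = u'\u4e2d\u6587'  # two Chinese characters, given as escapes

utf8_bytes = text.encode('utf-8')  # 3 bytes per character here
gbk_bytes = text.encode('gbk')     # 2 bytes per character here

print(len(utf8_bytes))  # 6
print(len(gbk_bytes))   # 4

# Reading the UTF-8 bytes as if they were GBK does not give the text back;
# 'replace' substitutes any byte sequence that is invalid in GBK.
misread = utf8_bytes.decode('gbk', 'replace')
print(misread == text)  # False
```

A GBK console that receives the six UTF-8 bytes interprets them in exactly this mistaken way, which is where the garbled characters come from.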
Solution:
Transcoding at the point where you print to the console is enough. Write the print statement like this:
print myname.decode('UTF-8').encode('GBK')

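from pymmseg import mmseg

# Load the default dictionaries before segmenting
mmseg.dict_load_defaults()
text = '...'  # placeholder: the UTF-8 text to segment goes here
algor = mmseg.Algorithm(text)
for tok in algor:
    # Re-encode each token from UTF-8 to GBK so the Windows console shows it correctly
    print '%s [%d..%d]' % (tok.text.decode('UTF-8').encode('GBK'), tok.start, tok.end)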
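
If you would rather not hard-code GBK, the same transcoding step can be wrapped in a small helper. This is only a sketch: `to_console_bytes` is a hypothetical name, not part of pymmseg, and it assumes the input really is UTF-8 encoded bytes. It uses `sys.stdout.encoding` when available and falls back to GBK otherwise.

```python
import sys

def to_console_bytes(utf8_bytes, console_encoding=None):
    """Re-encode UTF-8 bytes into the console's encoding (hypothetical helper)."""
    if console_encoding is None:
        # sys.stdout.encoding may be None (e.g. when output is piped),
        # so fall back to GBK, the encoding discussed in this post.
        console_encoding = getattr(sys.stdout, 'encoding', None) or 'gbk'
    # Decode to Unicode first, then encode for the console; characters the
    # console encoding cannot represent are replaced instead of raising.
    return utf8_bytes.decode('utf-8').encode(console_encoding, 'replace')

# Two Chinese characters as UTF-8 bytes, written as escapes to stay ASCII.
sample = b'\xe4\xb8\xad\xe6\x96\x87'
print(repr(to_console_bytes(sample, 'gbk')))
```

In the loop above, `tok.text.decode('UTF-8').encode('GBK')` could then become `to_console_bytes(tok.text)`, assuming `tok.text` is a UTF-8 byte string as described.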

posted on 2011-05-03 13:27 漂漂
