pymmseg-cpp
pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is an implementation of the MMSEG Chinese word segmentation algorithm in C++ with a Ruby interface.
Download the binary release on the right sidebar and copy the pymmseg directory to your Python's path (e.g. /usr/lib/python2.5/site-packages/). Here's an example of usage:
from pymmseg import mmseg
mmseg.dict_load_defaults()
text = # ...
algor = mmseg.Algorithm(text)
for tok in algor:
    print '%s [%d..%d]' % (tok.text, tok.start, tok.end)
Alternatively, download the source tarball or check out the latest code from the git repository hosted on GitHub. You will then need to build the mmseg-cpp module yourself: go to the mmseg-cpp subdirectory and run the build.py script, which builds the native module for you.
For more information, refer to the README file.
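Many people run into garbled output. The likely cause is that mmseg works with UTF-8, while the default local encoding on Windows is cp936, i.e. GBK, so printing a UTF-8 string directly to the console naturally comes out garbled. The fix:
transcode at the point where you print to the console, like this: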
print myname.decode('UTF-8').encode('GBK')
from pymmseg import mmseg
mmseg.dict_load_defaults()
text = # ...
algor = mmseg.Algorithm(text)
for tok in algor:
    print '%s [%d..%d]' % (tok.text.decode('UTF-8').encode('GBK'), tok.start, tok.end)
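Hard-coding 'GBK' only works on consoles that actually use cp936. As a rough sketch, the same transcoding can be wrapped in a small helper that falls back to whatever encoding the console reports. The name to_console_bytes is hypothetical and not part of pymmseg, and the sketch assumes the input is a UTF-8 byte string, as tok.text is treated in the example above:

```python
# -*- coding: utf-8 -*-
import sys


def to_console_bytes(utf8_bytes, console_encoding=None):
    """Re-encode UTF-8 bytes for the console.

    Hypothetical helper for illustration; not part of pymmseg.
    """
    if console_encoding is None:
        # sys.stdout.encoding may be None (e.g. when output is piped),
        # so fall back to UTF-8 in that case.
        console_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for characters the target encoding lacks,
    # instead of raising UnicodeEncodeError.
    return utf8_bytes.decode('utf-8').encode(console_encoding, 'replace')


# "中文" as UTF-8 bytes, re-encoded explicitly for a GBK (cp936) console.
sample = u'\u4e2d\u6587'.encode('utf-8')
gbk_bytes = to_console_bytes(sample, 'gbk')
```

In the loop above, this would stand in for tok.text.decode('UTF-8').encode('GBK'). Characters that the console encoding cannot represent are printed as '?' rather than aborting the loop with an exception.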
posted on 2011-05-03 13:27
漂漂