随笔 - 224  文章 - 41  trackbacks - 0
<2011年5月>
24252627282930
1234567
891011121314
15161718192021
22232425262728
2930311234

享受编程

常用链接

留言簿(11)

随笔分类(159)

随笔档案(224)

文章分类(2)

文章档案(4)

经典c++博客

搜索

  •  

最新评论

阅读排行榜

评论排行榜

 pymmseg-cpp
http://code.google.com/p/pymmseg-cpp/

pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is a MMSEG Chinese word segmenting algorithm implemented in C++ with a Ruby interface.

Download the binary release on the right sidebar and copy the pymmseg directory to your Python's path (e.g. /usr/lib/python2.5/site-packages/). Here's an example of usage:

from pymmseg import mmseg

mmseg
.dict_load_defaults()
text
= # ...
algor
= mmseg.Algorithm(text)
for tok in algor:
print '%s [%d..%d]' % (tok.text, tok.start, tok.end)

Or you can download the source tarball or check out the latest code from the git repo hosted at github. Then you'll need to build the mmseg-cpp module yourself: goto the mmseg-cpp subdirectory and run the build.py script. It will build the native module for you.

For more information, refer to the README file.


很多同学都会出现乱码的问题。可能是mmseg支持的是utf8, windows的本地默认编码是cp936,也就是gbk编码,所以在控制台直接打印utf-8的字符串当然是乱码了。 
解决方法:
在控制台打印的地方用一个转码就ok了,打印的时候这么写:
print myname.decode('UTF-8').encode('GBK') 


from pymmseg import mmseg

mmseg
.dict_load_defaults()
text
= # ...
algor
= mmseg.Algorithm(text)
for tok in algor:
print '%s [%d..%d]' % (tok.text.decode('UTF-8').encode('GBK') , tok.start, tok.end)

posted on 2011-05-03 13:27 漂漂 阅读(1144) 评论(0)  编辑 收藏 引用

只有注册用户登录后才能发表评论。
网站导航: 博客园   IT新闻   BlogJava   博问   Chat2DB   管理