清风竹林

ぷ雪飘绛梅映残红
ぷ花舞霜飞映苍松
----- Do more,suffer less

Python Challenge lv2: ocr

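Puzzle link: http://www.pythonchallenge.com/pc/def/ocr.html
According to the hint, the task is to find the rare characters in a block of text embedded in the HTML page source. What counts as "rare" is not clear yet, but that does not matter. First save the whole block of text to a file named fin.txt and do some preprocessing: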

if __name__ == '__main__':
    finpath = 'fin.txt'
    with open(finpath) as fin:
        # join all lines into a single string, dropping trailing whitespace
        text = ''.join(line.rstrip() for line in fin)

    # count how many times each character occurs
    d = {}
    for c in text:
        d[c] = d.get(c, 0) + 1
    for k, v in d.items():
        print(k, v)

Output:

! 6079
# 6115
% 6104
$ 6046
& 6043
) 6186
( 6154
+ 6066
* 6034
@ 6157
[ 6108
] 6152
_ 6112
^ 6030
a 1
e 1
i 1
l 1
q 1
u 1
t 1
y 1
{ 6046
} 6105

Now it is obvious: the rare characters are the letters that occur exactly once. A small change to the code prints the result:

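if __name__ == '__main__':
    finpath = 'fin.txt'
    with open(finpath) as fin:
        # join all lines into a single string, dropping trailing whitespace
        text = ''.join(line.rstrip() for line in fin)

    # count how many times each character occurs
    d = {}
    for c in text:
        d[c] = d.get(c, 0) + 1

    # keep only the characters that occur exactly once, in original order
    print(''.join(c for c in text if d[c] == 1))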

Program output: equality
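The same counting idea can be written more compactly with `collections.Counter` from the standard library. This is only a sketch, not the code used above: the helper name `rare_chars`, its `threshold` parameter, and the short sample string are made up for illustration.

```python
from collections import Counter


def rare_chars(text, threshold=1):
    """Return the characters of text that occur at most `threshold` times,
    in their original order. (Hypothetical helper, not from the post.)"""
    counts = Counter(text)  # maps each character to its number of occurrences
    return ''.join(c for c in text if counts[c] <= threshold)


if __name__ == '__main__':
    # A tiny made-up sample standing in for the real puzzle text.
    sample = '%%$@o$@@%k$%'
    print(rare_chars(sample))
```

Raising `threshold` would be one way to explore a text where the rare characters appear more than once.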

Since every character left out of the result is a non-letter, the puzzle can also be solved by keeping only the letters:

if __name__ == '__main__':
    finpath = 'fin.txt'
    with open(finpath) as fin:
        # join all lines into a single string, dropping trailing whitespace
        text = ''.join(line.rstrip() for line in fin)

    # only print letters
    print(''.join(c for c in text if c.isalpha()))

    # another method
    print(''.join(filter(lambda x: x.isalpha(), text)))
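The post saves the puzzle text into fin.txt by hand. As a rough sketch of how that step might be automated, the function below pulls the last HTML comment out of a page-source string with the standard `re` module. It assumes the block of characters is the last `<!-- ... -->` comment in the page source, which the post does not state; the name `last_html_comment` and the sample page are hypothetical.

```python
import re


def last_html_comment(html):
    """Return the body of the last <!-- ... --> comment in html, or '' if none.
    (Hypothetical helper; assumes the puzzle text is the last comment.)"""
    # re.DOTALL lets '.' match newlines, since the comment spans many lines.
    comments = re.findall(r'<!--(.*?)-->', html, re.DOTALL)
    return comments[-1] if comments else ''


if __name__ == '__main__':
    # Made-up page source standing in for the real one.
    page = '<html><!-- hint --><body></body>\n<!--\n%%a$\n@b^^\n--></html>'
    mess = last_html_comment(page)
    # Keep only the letters, as in the letter-filtering approach.
    print(''.join(c for c in mess if c.isalpha()))
```

The returned string could be written to fin.txt, or filtered directly as the `__main__` section does.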

Reference solution

Posted on 2009-05-11 15:40 by 李现民. Category: python


