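Problem link:
http://www.pythonchallenge.com/pc/def/ocr.html
According to the hint, the task is to find the rare characters in a block of text in the HTML page source. What counts as "rare" is not clear yet, but that doesn't matter: first save the whole block of text to a file called fin.txt and do some preprocessing: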
if __name__ == '__main__':
    finpath = 'fin.txt'
    with open(finpath) as fin:
        # collapse the text into a single string, dropping all whitespace (including newlines)
        text = ''.join(fin.read().split())
    # count how many times each character occurs
    d = {}
    for c in text:
        d[c] = d.get(c, 0) + 1
    for k, v in d.items():
        print(k, v)
Output:
! 6079
# 6115
% 6104
$ 6046
& 6043
) 6186
( 6154
+ 6066
* 6034
@ 6157
[ 6108
] 6152
_ 6112
^ 6030
a 1
e 1
i 1
l 1
q 1
u 1
t 1
y 1
{ 6046
} 6105
Now it is obvious: the rare characters are the letters that each occur exactly once. A small change to the code prints the result:
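if __name__ == '__main__':
    finpath = 'fin.txt'
    with open(finpath) as fin:
        # collapse the text into a single string, dropping all whitespace (including newlines)
        text = ''.join(fin.read().split())
    # count how many times each character occurs
    d = {}
    for c in text:
        d[c] = d.get(c, 0) + 1
    # keep only the characters that occur exactly once, in their original order
    print(''.join([c for c in text if d[c] == 1]))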
Program output: equality
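The count-then-filter step above can also be sketched with `collections.Counter` from the standard library. The helper name `rare_characters`, its `max_count` parameter, and the short `sample` string are made up for illustration; the real input is the text saved in fin.txt.

```python
from collections import Counter


def rare_characters(text, max_count=1):
    """Return the characters of `text` that occur at most `max_count` times, in order."""
    counts = Counter(text)  # maps each character to its number of occurrences
    return ''.join(c for c in text if counts[c] <= max_count)


# Hypothetical stand-in for the puzzle text: every symbol appears twice,
# every letter appears once.
sample = '%%$$e@@q^^u__a))l&&i!!t++y'
print(rare_characters(sample))  # prints: equality
```

Note that this sketch counts every character it is given, so whitespace should be removed from the text first, as in the preprocessing step.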
Since every character left out of the result is a non-letter, the puzzle can also be solved like this:
if __name__ == '__main__':
    finpath = 'fin.txt'
    with open(finpath) as fin:
        # collapse the text into a single string, dropping all whitespace (including newlines)
        text = ''.join(fin.read().split())
    # only print letters
    print(''.join([c for c in text if c.isalpha()]))
    # another method
    print(''.join(filter(lambda x: x.isalpha(), text)))
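A regular expression gives a third variant of the letter filter. One difference worth knowing: in Python 3, `str.isalpha()` is also true for non-ASCII letters, whereas the pattern `[A-Za-z]` only matches ASCII letters. For this puzzle both give the same answer. The helper name `ascii_letters_only` and the `sample` string are made up for illustration.

```python
import re


def ascii_letters_only(text):
    """Return only the ASCII letters of `text`, in their original order."""
    return ''.join(re.findall(r'[A-Za-z]', text))


# Hypothetical stand-in for the puzzle text saved in fin.txt.
sample = '%%$$e@@q^^u__a))l&&i!!t++y'
print(ascii_letters_only(sample))  # prints: equality
```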
Reference solution