ivy-jie

progress ...


Filtering Low-Frequency Words

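Problem: Write a program that removes the least frequent words from a text containing a large number of words. If several words are tied for the lowest number of occurrences, remove all of them.

Input: The program reads a large text file named corpus.txt. The file contains English and Chinese words, separated by one or more whitespace characters (tabs, spaces, and newlines are collectively called whitespace characters). (For debugging you can download a test corpus.txt; the actual run will use an input file with different contents.)

Output: Print to standard output the text of corpus.txt with its least frequent words removed. The remaining words keep their original order and are separated by spaces.

Grading criteria:
The output must be correct. The less memory the program uses and the faster it runs, the better.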
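
To make the tie rule concrete, here is a minimal sketch that applies it to an in-memory string instead of corpus.txt. The names `drop_rarest` and `check_tie_rule` are illustrative and not part of the task, and the sketch leaves out the punctuation and case handling that the full program performs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the text with every least-frequent word removed.
std::string drop_rarest(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> words;     // words in their original order
    std::map<std::string, int> counts;  // word -> number of occurrences
    std::string w;
    while (in >> w) {                   // operator>> skips any run of whitespace
        words.push_back(w);
        ++counts[w];
    }
    if (words.empty()) return "";

    // Find the lowest count over all distinct words.
    int least = counts.begin()->second;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
        if (it->second < least) least = it->second;

    // Keep every word whose count is above the minimum, joined by single spaces.
    std::string out;
    for (std::size_t k = 0; k < words.size(); ++k) {
        if (counts[words[k]] == least) continue;
        if (!out.empty()) out += ' ';
        out += words[k];
    }
    return out;
}

void check_tie_rule()
{
    // "c" and "d" both occur once, so both are removed.
    assert(drop_rarest("a b a c b d a") == "a b a b a");
    // Every word occurs once, so every word is least frequent.
    assert(drop_rarest("x y z") == "");
}
```

Note the second case: when all words share the same count, the rule as stated removes the whole text.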
#include<iostream>
#include<fstream>
#include<map>
#include<vector>
#include<string>
#include<cstring>
#include<cstdlib>
#include<iterator>
#include<algorithm>
#include<cctype>
using namespace std;

typedef map<string,int>::iterator mit;
typedef string::size_type sit;

int main()
{
map<string,int> words_count;
vector<string> sve;
string word;
string s=",!?.:\"\n;'"; // punctuation set; the quote must be escaped, since "" only joins two literals
ifstream fin("E:\\corpus.txt");
if(!fin)
{
   cerr<<"unable to open file"<<endl;
   exit(1); // a nonzero status signals the failure to the caller
}
//Read the words and count them
while(fin>>word)
   {
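    sit iter=word.find_first_of(s);
    if(iter!=string::npos)
      word=word.substr(0,iter); //cut the word at the first punctuation mark
    if(word.empty()) continue;  //the token was punctuation only, nothing to count
    //Lowercase in place. strlwr is non-standard, and writing through c_str() is undefined behavior.
    //Safe for UTF-8 input, where every byte of a non-ASCII character is >= 0x80.
    for(sit k=0;k<word.size();++k)
      word[k]=static_cast<char>(tolower(static_cast<unsigned char>(word[k])));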
    sve.push_back(word);
    ++words_count[word];
   }
fin.close();

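//Remove the words with the lowest count
if(!words_count.empty()) //an empty file leaves nothing to dereference
{
   int n=words_count.begin()->second;
   for(mit i=words_count.begin();i!=words_count.end();++i)
      if(i->second<n) n=i->second;
   //Compact sve in a single pass instead of one erase/remove pass per rare word
   vector<string>::size_type keep=0;
   for(vector<string>::size_type k=0;k<sve.size();++k)
      if(words_count[sve[k]]!=n)
      {
         if(keep!=k) sve[keep]=sve[k];
         ++keep;
      }
   sve.erase(sve.begin()+keep,sve.end());
}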
//Print the result to the screen
copy(sve.begin(),sve.end(),ostream_iterator<string>(cout," "));
cout<<endl;

return 0;
}
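
The grading criteria favor low memory use, and the program above keeps a full `string` copy of every token in `sve`. One possible alternative, sketched here under stated assumptions, stores each distinct word once as a hash-map key and records the text as a sequence of pointers to those entries. The names `filter_rarest_by_pointer` and `check_filter_rarest` are hypothetical, the sketch needs C++11, it omits the punctuation and case handling, and its actual speed and memory on a real corpus have not been measured.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<std::string, std::size_t> Counts;

// Hypothetical variant: reads words from `in`, writes the filtered text to `out`.
void filter_rarest_by_pointer(std::istream& in, std::ostream& out)
{
    Counts counts;                                   // word -> occurrences
    std::vector<const Counts::value_type*> sequence; // the text, one pointer per token

    std::string w;
    while (in >> w) {
        // emplace inserts only when the word is new; either way we get its entry.
        Counts::iterator it = counts.emplace(w, 0).first;
        ++it->second;
        // Pointers to unordered_map elements stay valid across rehashing.
        sequence.push_back(&*it);
    }
    if (sequence.empty()) return;

    // Lowest count over all distinct words.
    std::size_t least = sequence.front()->second;
    for (const auto& entry : counts)
        if (entry.second < least) least = entry.second;

    // Stream the surviving words directly; no second container is built.
    bool first = true;
    for (const Counts::value_type* p : sequence) {
        if (p->second == least) continue;
        if (!first) out << ' ';
        out << p->first;
        first = false;
    }
}

void check_filter_rarest()
{
    std::istringstream in("to be or not to be");
    std::ostringstream out;
    filter_rarest_by_pointer(in, out);
    assert(out.str() == "to be to be"); // "or" and "not" occur once each
}
```

Passing an `std::ifstream` opened on corpus.txt and `std::cout` would apply the same function to the real input. The design choice is that each token costs one pointer in `sequence` rather than a whole `string` object, and filtering happens while printing, so nothing is erased from a vector.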

posted on 2009-05-20 08:46 by ivy-jie. Category: arithmetic
