2006 Baidu Star Programming Contest Problem: Baidu Language Translator (Solution)

Problem:
Download the problem statement together with my solution:
http://www.cppblog.com/Files/zuroc/06_baidustar_translator.zip
Baidu Language Translator

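Baidu's engineers care a great deal about efficiency. Over a long period of development and testing they have gradually built up their own set of abbreviations, which they use heavily in everyday conversation, in meetings, and even in all kinds of technical documents.

To help new employees adapt to Baidu's culture more quickly and read the company's technical documents more easily, the Human Resources department has decided to develop a dedicated translation system that converts the abbreviations and proper nouns in these documents into everyday language.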

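Input requirements:
The input consists of three parts:
1. The first line contains an integer N (N<=10000), the total number of abbreviation entries;

2. The next N lines each contain two strings separated by a space. The first string is the abbreviation (uppercase English letters only, at most 10 bytes long); the second string is the everyday-language term (no spaces, at most 255 bytes long);

3. From line N+2 to the end of the input is the document that contains the abbreviations (at most 1000000 bytes in total). Example: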

6
PS 门户搜索部
NLP 自然语言处理
PM 产品市场部
HR 人力资源部
PMD 产品推广部
MD 市场发展部
百度的部门包括PS，PM，HR，PMD，MD等等，其中PS还包括NLP小组。
Sample: in.txt

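Output requirements:
Output the document with its abbreviations converted into everyday language. (Convert the abbreviations and leave every other character unchanged.) Example: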

百度的部门包括门户搜索部，产品市场部，人力资源部，产品推广部，市场发展部等等，其中门户搜索部还包括自然语言处理小组。
Sample: out.txt

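Scoring rules:

1. The program runs on a Linux machine (memory use is not strictly limited). It must not take more than 10 seconds on any test case; otherwise that test case scores nothing;

2. The program must read the data file in the format of the sample input and write its result to standard output in the format of the sample output. If it cannot read the input or write the output correctly, the problem scores nothing;

3. The problem has 4 test cases, each of which is one input file. The test cases account for 25%, 25%, 25% and 25% of the problem's score respectively;

4. The problem is worth 20 points.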

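Notes:
1. The input mixes Chinese and English; the Chinese text is GBK-encoded.
GBK is another Chinese character encoding standard; its full name is the "Chinese Internal Code Extension Specification" (汉字内码扩展规范). It uses a double-byte representation with an overall code range of 8140-FEFE: the lead byte is in 81-FE and the trail byte is in 40-FE, excluding xx7F. That gives 23940 code points in total, holding 21886 Chinese characters and graphic symbols, of which 21003 are Chinese characters (including radicals and components) and 883 are graphic symbols.

2. To keep the answer unique, abbreviations are converted by forward maximum matching (left to right is the forward direction). Note how PMD is translated in the sample.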

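Code:
/*
My approach

1. Abbreviations
vector< string >   // stores the abbreviations
Sort them by string length (longest first) to satisfy "abbreviations are converted by forward maximum matching".

2. Replace the text in a single pass, so that replaced content is never replaced again
map<pair<int,int>,string>       // position range -> abbreviation
vector<pair<int,int>>   // stores the position ranges
map<string,string>   // abbreviation -> everyday-language term
*/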

#include <fstream>
#include <sstream>
#include <iostream>

#include <vector>
#include <map>
#include <list>

#include <string>

#include <algorithm>
#include <functional>

using namespace std;

#define BEG_END(c)       (c.begin()),(c.end())

typedef string::size_type str_size;

/** Convert a string to the requested type. */
template<typename Target, typename Source>
Target lexical_cast(const Source& arg)
{
    Target result = Target();   // value-initialize in case parsing fails
    istringstream in(arg);
    in>>result;
    return result;
}

vector<str_size> find_all(const string& source , const string& aim)
{

    vector<str_size>   poses;

    str_size pos=0;
    str_size aim_len=aim.size();

    while ( (pos=source.find(aim, pos)) != string::npos)
    {
        poses.push_back(pos);
        pos += aim_len;
    }

    return poses;
}

bool is_long(const string& a , const string& b)
{
    return a.length()>b.length();
}

bool is_first_small(const pair<str_size,str_size>& a , const pair<str_size,str_size>& b)
{
    return a.first<b.first;
}

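/** Returns true if position aim lies outside every recorded (start, length) scope. */
template<class T,class I>
bool not_in_scope(I begin,const I& end,const T& aim)
{
    for (;begin!=end;++begin)
    {
        // A scope covers the half-open range [first, first+second). The byte
        // just past its end is free, so the upper bound must be strict;
        // otherwise an abbreviation directly after another one is skipped.
        if (
            (aim>=(begin->first) ) && (aim< (begin->first+begin->second) )
        )return false;
    }
    return true;
}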

int main()
{

    string infile_name="in.txt" , outfile_name="out.txt";

    ofstream outfile(outfile_name.c_str());
    // The contest expects the result on standard output; to submit, use this line instead:
    //ostream& outfile = cout;

    ifstream infile(infile_name.c_str());
    if (!infile)
    {
        cerr<<"Error : can't open input file "<<infile_name<<" .\n";
        return -1;
    }

    string line;
    vector<string> abbr_dict;
    map<string,string>   abbr_word;

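    getline(infile,line);
    for (int i=lexical_cast<int>(line);i>0;--i)
    {
        getline(infile,line);
        string abbr,word;
        // Use a named stream: extracting a string from a temporary
        // istringstream is only guaranteed to compile since C++11.
        istringstream entry(line);
        entry>>abbr>>word;
        abbr_dict.push_back(abbr);
        abbr_word[abbr]=word;
        //cout<<abbr<<' '<<word<<'\n';
    }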

    sort(BEG_END(abbr_dict),is_long);

    while (getline(infile,line))
    {
        typedef vector<pair<str_size,str_size> > replace_scope;

        replace_scope   to_replace_scope;
        map<pair<str_size,str_size>,string>   to_replace;

        for (
            vector<string>::iterator i=abbr_dict.begin(),end=abbr_dict.end();
            i!=end;
            ++i
        )
        {
            vector<str_size>   poses=find_all(line,*i);
            str_size aim_len=i->size();
            for (vector<str_size>::iterator j=poses.begin(),end=poses.end();j
                    !=end;++j)
            {
                pair<str_size,str_size> scope=make_pair(*j,aim_len);
                if (not_in_scope(BEG_END(to_replace_scope),*j))
                {
                    to_replace_scope.push_back(scope);
                    to_replace[scope]=*i;
                }
            }
        }

        sort(BEG_END(to_replace_scope),is_first_small);

        str_size offset=0;

        for (
            replace_scope::iterator i=to_replace_scope.begin(),end=to_replace_scope.end();
            i!=end;
            ++i
        )
        {
            str_size len=i->second ;
            string word=abbr_word[to_replace[*i]];
            line.replace(i->first+offset,len ,word);
            offset+=word.size()-len;
        }

        outfile<<line<<'\n';
    }

    return 0;
}

Posted on 2007-05-13 21:25 by 张沈鹏. Category: C++
