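Original article: http://my.chinaunix.net/space.php?uid=24488136&do=blog&id=64821
While browsing a bookstore I came across a shelf of books on search engines. I flipped through a few and found them interesting, so when I got home I searched Baidu and Google for how to build a search engine myself. The open-source nutch-1.0 looked good, so I learned how to configure it. I ran into a few problems along the way, but they were quickly solved.
Environment:
Linux **-desktop 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:26:08 UTC 2010 i686 GNU/Linux
Ubuntu 10.04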
|
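1. Install the JDK
Ubuntu 10.04 provides its own JDK (called OpenJDK), so I simply used that one. It can be installed from the Synaptic Package Manager. After installation, /usr/lib/jvm contains the three entries below (two of them are symlinks to the third). You can of course download the latest official JDK instead.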
├── default-java -> java-6-openjdk
├── java-1.6.0-openjdk -> java-6-openjdk
└── java-6-openjdk
|
2. Install and configure Tomcat
On Ubuntu 10.04 the packaged Tomcat version is tomcat6. I also installed the management package tomcat6-admin:
apt-get install tomcat6 tomcat6-admin
|
After Tomcat is installed, run /etc/init.d/tomcat6 start to start the Tomcat server. Open "http://localhost:8080" in a browser. If the page shows "It works", the Tomcat server is running:
It works !
If you're seeing this page via a web browser, it means you've setup Tomcat successfully. Congratulations!
This is the default Tomcat home page. It can be found on the local filesystem at: /var/lib/tomcat6/webapps/ROOT/index.html
Tomcat6 veterans might be pleased to learn that this system instance of Tomcat is installed with CATALINA_HOME in /usr/share/tomcat6 and CATALINA_BASE in /var/lib/tomcat6, following the rules from /usr/share/doc/tomcat6-common/RUNNING.txt.gz.
You might consider installing the following packages, if you haven't already done so:
tomcat6-docs: This package installs a web application that allows to browse the Tomcat 6 documentation locally. Once installed, you can access it by clicking here.
tomcat6-examples: This package installs a web application that allows to access the Tomcat 6 Servlet and JSP examples. Once installed, you can access it by clicking here.
tomcat6-admin: This package installs two web applications that can help managing this Tomcat instance. Once installed, you can access the manager webapp and the host-manager webapp.
NOTE: For security reasons, using the manager webapp is restricted to users with role "manager". The host-manager webapp is restricted to users with role "admin". Users are defined in /etc/tomcat6/tomcat-users.xml.
|
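A user has to be configured before you can enter the management interface. Edit /var/lib/tomcat6/conf/tomcat-users.xml (in this Ubuntu package, conf should be a symlink to /etc/tomcat6, so this is the file named in the note above).
For security I removed the default user tomcat and added my own user, for example hinutch, with a password, for example 3838438: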
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
<role rolename="manager"/>
<role rolename="admin"/>
<user username="hinutch" password="3838438" roles="admin,manager"/>
</tomcat-users>
|
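You can now log in to the management interface. If that does not work, restart the Tomcat service with /etc/init.d/tomcat6 restart
The management interface looks like this: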
Tomcat Web Application Manager
|
3. Install nutch 1.0
Download nutch-1.0.tar.gz from http://www.apache.org/dyn/closer.cgi/nutch/
apache-nutch-1.2-bin.zip          25-Sep-2010 05:38   164M
apache-nutch-1.2-bin.zip.asc      25-Sep-2010 05:37   203
apache-nutch-1.2-src.tar.gz       25-Sep-2010 05:37   50M    GZIP compressed document
apache-nutch-1.2-src.tar.gz.asc   25-Sep-2010 05:37   203    GZIP compressed document
apache-nutch-1.2-src.zip          25-Sep-2010 05:37   51M
apache-nutch-1.2-src.zip.asc      25-Sep-2010 05:37   203
nutch-0.9.tar.gz                  05-Apr-2007 10:17   68M    GZIP compressed document
nutch-0.9.tar.gz.asc              05-Apr-2007 10:17   186    GZIP compressed document
nutch-1.0.tar.gz                  28-Mar-2009 04:12   83M    GZIP compressed document
nutch-1.0.tar.gz.asc              28-Mar-2009 04:12   197    GZIP compressed document
After extracting the archive, my directory looks like this:
├── bin
├── build.xml
├── CHANGES.txt
├── conf
├── crawled
├── default.properties
├── docs
├── KEYS
├── lib
├── LICENSE.txt
├── logs
├── NOTICE.txt
├── nutch-1.0.jar
├── nutch-1.0.job
├── nutch-1.0.war
├── plugins
├── README.txt
├── src
├── url.txt (created by hand)
└── webapps
|
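First, create a text file named "url.txt" in the root of the extracted Nutch directory (any name will do). It holds the URLs you want to crawl.
My extraction root is /home/**/nutch-1.0. I created url.txt there with this content: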
http://bbs.chinaunix.net/
|
Next, update the configuration file crawl-urlfilter.txt. Open "conf/crawl-urlfilter.txt" and change the accept rule:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
# this is the line to change; it must match the URL in url.txt
+^http://bbs.chinaunix.net/
|
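The accept rule is a regular expression, so each unescaped dot in it matches any character, not only a literal dot. The rule still works, but it accepts slightly more than the intended host. A minimal sketch of generating a stricter rule from the seed URL follows. The function name `url_filter_rule` is my own and is not part of Nutch.

```shell
#!/bin/sh
# url_filter_rule URL
# Prints a crawl-urlfilter.txt accept rule for URL with regex dots escaped.
# (hypothetical helper, not part of Nutch)
url_filter_rule() {
    # Escape every "." so it matches a literal dot instead of any character.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf '+^%s\n' "$escaped"
}
```

Calling `url_filter_rule http://bbs.chinaunix.net/` prints a line that can be pasted into conf/crawl-urlfilter.txt in place of the rule above.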
Then open nutch-site.xml (also under conf) and edit it as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>my nutch agent</value> <!-- this value can be any name you choose -->
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
|
Then run the crawler to fetch the pages. In /home/**/nutch-1.0 (the Nutch root directory), run:
./bin/nutch crawl url.txt -dir crawled -depth 4 -topN 100 -threads 4
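-dir = crawled: directory where the crawl data is stored; it is created automatically if it does not exist
-depth = 4: crawl depth, that is, follow links up to 4 levels from the seed URLs
-topN = 100: fetch at most the top 100 qualifying pages at each level
-threads = 4: number of fetcher threads to start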
|
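The two limits combine into a rough ceiling on the crawl size. The sketch below assumes that -topN caps the pages fetched in each round and that -depth is the number of rounds, which is how I read the crawl options. The function name `max_fetched` is made up for illustration.

```shell
#!/bin/sh
# Settings taken from the crawl command above.
DEPTH=4
TOPN=100

# max_fetched DEPTH TOPN
# Upper bound on fetched pages: DEPTH rounds of at most TOPN pages each.
max_fetched() {
    echo $(( $1 * $2 ))
}

echo "at most $(max_fetched "$DEPTH" "$TOPN") pages"
```

With a single seed URL the first round can only fetch one page, so the real number is lower than this bound.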
The crawler prints a lot of output while it runs. When it finishes, the crawled directory has been created and contains several subdirectories:
├── crawldb
├── index
├── indexes
├── linkdb
└── segments
|
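A quick way to confirm that the crawl ran to completion is to check for the five subdirectories listed above. This is only a sketch; `check_crawl_dir` is a hypothetical helper name, and it checks for presence, not for valid contents.

```shell
#!/bin/sh
# check_crawl_dir DIR
# Prints each expected subdirectory that is missing from DIR and
# returns non-zero if at least one is absent. (hypothetical helper)
check_crawl_dir() {
    status=0
    for sub in crawldb index indexes linkdb segments; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $sub"
            status=1
        fi
    done
    return $status
}
```

For example, `check_crawl_dir crawled && echo "crawl output looks complete"` can be run from the Nutch root directory after the crawl.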
4. Deploy the Nutch web application in Tomcat
Copy nutch-1.0.war from the Nutch root directory into /var/lib/tomcat6/webapps, then visit http://localhost:8080. Tomcat unpacks the archive automatically.
root@**-desktop:/var/lib/tomcat6/webapps# ls
nutch-1.0 nutch-1.0.war ROOT
|
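The copy step can be wrapped in a small function that fails early when a path is wrong. This is a sketch under my own naming (`deploy_war` is not a Tomcat or Nutch command), and it only copies the file; unpacking is left to Tomcat as described above.

```shell
#!/bin/sh
# deploy_war WAR WEBAPPS_DIR
# Copies WAR into WEBAPPS_DIR after checking that both paths exist.
# (hypothetical helper)
deploy_war() {
    war=$1
    webapps=$2
    [ -f "$war" ] || { echo "no such file: $war" >&2; return 1; }
    [ -d "$webapps" ] || { echo "no such directory: $webapps" >&2; return 1; }
    cp "$war" "$webapps/"
}
```

On the setup described here it would be called as root, for example `deploy_war "$HOME/nutch-1.0/nutch-1.0.war" /var/lib/tomcat6/webapps`, where the first path is an assumption about where Nutch was extracted.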
The nutch-1.0 folder contains:
├── anchors.jsp
├── ca
├── cached.jsp
├── cluster.jsp
├── de
├── en
├── es
├── explain.jsp
├── fi
├── fr
├── hu
├── img
├── include
├── index.jsp
├── it
├── jp
├── META-INF
├── more.jsp
├── ms
├── nl
├── pl
├── pt
├── refine-query-init.jsp
├── refine-query.jsp
├── search.jsp
├── sh
├── sr
├── sv
├── text.jsp
├── th
├── WEB-INF (the contents of this folder need to be edited)
└── zh
|
Edit WEB-INF/classes/nutch-site.xml in this directory as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>/home/**/nutch-1.0/crawled</value>
</property>
</configuration>
|
Change the value above to the directory the crawler wrote its data to.
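Since a wrong searcher.dir is an easy mistake to make, it can help to read the value back out of the file and check that the directory exists. The sketch below uses plain sed rather than an XML parser, so it assumes the name and value tags each sit on one line as in the listing above. `searcher_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# searcher_dir FILE
# Prints the <value> that belongs to <name>searcher.dir</name> in FILE.
# (hypothetical helper; relies on one tag per line, not on real XML parsing)
searcher_dir() {
    sed -n '/<name>searcher\.dir<\/name>/,/<\/property>/s/.*<value>\(.*\)<\/value>.*/\1/p' "$1"
}
```

A possible check after editing: `dir=$(searcher_dir WEB-INF/classes/nutch-site.xml); [ -d "$dir" ] || echo "searcher.dir does not exist: $dir"`.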
5. Search with Nutch
Open http://localhost:8080/nutch-1.0 in a browser and the Nutch search page appears.
Type what you are looking for into the search box, for example linux, and you get:
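Results 1-1 (1 result in total):
Forum Home - China's largest Linux/Unix technical community - IT professionals' online community - bbs.ChinaUnix.net
... Unix operating systems ← Linux forum RSS feed ... by CU admin Linux Era home Linux ...
http://bbs.chinaunix.net/ (cached) (explain) (anchors)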
That completes the whole process.
------------------------------------------------
Problems encountered along the way
------------------------------------------------
1. An error says JAVA_HOME cannot be found
Solution: edit /etc/environment and add JAVA_HOME:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
JAVA_HOME="/usr/lib/jvm/java-6-openjdk"
|
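Before writing a path into /etc/environment, it may be worth confirming that it really contains a java binary. A minimal sketch, with `find_java_home` as a made-up helper name; the candidate directories passed to it are up to you.

```shell
#!/bin/sh
# find_java_home DIR...
# Prints the first DIR that contains an executable bin/java and returns 0,
# or returns 1 if none of the candidates qualifies. (hypothetical helper)
find_java_home() {
    for dir in "$@"; do
        if [ -x "$dir/bin/java" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}
```

For the layout shown in step 1, a call such as `find_java_home /usr/lib/jvm/java-6-openjdk /usr/lib/jvm/default-java` would print the directory to use as JAVA_HOME, assuming bin/java exists there.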
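2. The pages were crawled, but searches return nothing
Solution: besides the changes above, one more file needs attention: /home/**/nutch-1.0/conf/nutch-default.xml. Find the section below and edit it accordingly: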
<!-- searcher properties -->
<property>
<name>searcher.dir</name>
<value>/home/**/nutch-1.0/crawled</value> <!-- must be the path where the crawl data is stored -->
<description>
|
Sometimes the results still do not show up, and you also need to run:
/etc/init.d/tomcat6 restart
|
Haha, that's all!
posted on 2011-05-04 13:34 by 漂漂, category: linux