Nutch Installation and Search
Nutch Download
Grab the latest Nutch package from the Nutch site. Version 1.0 is the latest as of this writing. Unpack Nutch into /usr/local/nutch/.
Add PDF Support - Rebuild Nutch
Download pdf libraries
cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar
In src/plugin/parse-pdf/plugin.xml, uncomment the two library lines:
<!-- Uncomment the following two lines after you have downloaded the libraries,
     see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>
Rebuild Nutch:
cd /usr/local/nutch/
ant jar
ant compile-plugins
ant war
Copy nutch war file to tomcat directory, naming it ROOT.war:
cp /usr/local/nutch/nutch-1.0.war /usr/local/tomcat/webapps/ROOT.war
Update the Tomcat Nutch instance so that it uses the correct crawl directory by editing /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml:
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
Edit Nutch Conf Files
Edit regex-urlfilter.txt:
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# Do not follow backwards
-/\.\.(|/)$
# These confine nutch crawl to the following areas:
+^file:///FILES/
+^file:/FILES/
# skip everything else
-.
Edit crawl-urlfilter.txt:
# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# These confine nutch crawl to the following areas:
+^file:///FILES/.*
+^file:/FILES/.*
# skip everything else
-.
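Because these filter files are just regular expressions (one rule per line), you can sanity-check individual rules outside of Nutch before running a crawl. A rough sketch using grep -E; note that Nutch applies these as Java regexes, so grep is only an approximation, but it is close enough for the simple rules above:

```shell
#!/bin/sh
# url_matches URL PATTERN: exits 0 if PATTERN matches URL.
url_matches() {
  printf '%s\n' "$1" | grep -qE "$2"
}

# Image-suffix rule: .JPG at end of URL should be rejected.
url_matches 'file:///FILES/photo.JPG' '\.(gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG)$' \
  && echo 'photo.JPG would be skipped'

# Probable-query rule: URLs containing ? * ! @ = are rejected.
url_matches 'file:///FILES/page?id=3' '[?*!@=]' \
  && echo 'page?id=3 would be skipped'

# Include rule: URLs under file:///FILES/ are accepted.
url_matches 'file:///FILES/report.pdf' '^file:///FILES/' \
  && echo 'report.pdf would be crawled'
```

Remember that Nutch applies the rules top to bottom and uses the first match, so a URL must get past every `-` rule above the `+` rules to be crawled.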
Edit conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enable file plugin.</description>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>No length limit for crawled content.</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch Crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
Prepare for a Crawl
First create the urls/files seed list so that Nutch knows what to crawl. This file can live anywhere, but I'll put it in the main crawl_data directory for now.
mkdir /usr/wwwapps/crawl_data/urls/
Create/edit a file called files in the urls directory, and add the following line to crawl the /FILES/ directory:
file:///FILES/
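The steps above (create the directory, create the seed file, add the seed URL) can be done in one short sh snippet; the paths are the ones used throughout this page, so adjust them if your layout differs:

```shell
# Create the seed directory and the "files" seed list in one step.
mkdir -p /usr/wwwapps/crawl_data/urls
printf 'file:///FILES/\n' > /usr/wwwapps/crawl_data/urls/files

# Verify the seed entry was written.
cat /usr/wwwapps/crawl_data/urls/files
```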
Add a nutch symlink to /usr/bin/ (just for convenience). Note the argument order for ln -s: the existing binary first, then the link to create:
ln -s /usr/local/nutch/bin/nutch /usr/bin/nutch
Run a Crawl
nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/
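When the crawl finishes, the -dir directory should contain the standard crawl output subdirectories (crawldb, linkdb, segments, indexes, index in Nutch 1.0's layout). A quick sketch to check for them; the path matches the crawl command above:

```shell
# Count how many of the expected crawl output directories are present.
missing=0
for d in crawldb linkdb segments indexes index; do
  if [ -d "/usr/wwwapps/crawl_data/$d" ]; then
    echo "$d: present"
  else
    echo "$d: missing"
    missing=$((missing + 1))
  fi
done
echo "$missing of 5 directories missing"
```

If any are missing, check the crawl log output for errors (filter rules that reject every seed URL are a common cause of an empty crawl).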
Check for results using NutchBean:
nutch org.apache.nutch.searcher.NutchBean <query>
Helpful Articles
Here are some helpful articles about setting up Nutch, crawling, and recrawling.
Introduction to Nutch, Part 1: Crawling
Introduction to Nutch, Part 2: Searching
Crawling the local filesystem with nutch
Note: the hack in the above article for keeping Nutch from crawling parent directories is not necessary. This can be handled by the "Do not follow backwards" rule (-/\.\.(|/)$) in regex-urlfilter.txt shown above.