Nutch Installation and Search
Nutch Download
Grab the latest Nutch package from the Nutch svn repository: http://svn.apache.org/repos/asf/nutch/ Version 1.1 is the latest stable branch as of this writing.
svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.1
Unpack Nutch into /usr/local/nutch/.
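Since the svn checkout produces a branch-1.1 working copy rather than a tarball, "unpacking" here just means moving the checkout into place. A minimal sketch, using /tmp to stand in for the real paths so the directory names are illustrative only:

```shell
# Illustration only: /tmp stands in for the real filesystem locations
rm -rf /tmp/nutch /tmp/branch-1.1
mkdir -p /tmp/branch-1.1             # stands in for the svn checkout
mv /tmp/branch-1.1 /tmp/nutch        # i.e. mv branch-1.1 /usr/local/nutch
ls -d /tmp/nutch
```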
Tomcat Installation
Go to http://tomcat.apache.org/download-60.cgi to get tomcat from a mirror
> cd /usr/local
> wget <<mirror>>/apache-tomcat-6.0.xx.tar.gz
> tar zxvf apache-tomcat-6.0.xx.tar.gz
> ln -s apache-tomcat-6.0.xx tomcat
Start and stop Tomcat:
/usr/local/tomcat/bin/catalina.sh stop
/usr/local/tomcat/bin/catalina.sh start
Add PDF Support - Rebuild Nutch
Download pdf libraries
cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar
In src/plugin/parse-pdf/plugin.xml
<!-- Uncomment the following two lines after you have downloaded the libraries,
     see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>
Make sure ant/compiler is installed:
apt-get install ant
apt-get install default-jdk
Rebuild Nutch:
cd /usr/local/nutch/
ant jar
ant compile-plugins
ant war
Copy nutch war file to tomcat directory, naming it ROOT.war:
cp /usr/local/nutch/build/nutch-1.2-dev.war /usr/local/tomcat/webapps/ROOT.war
Update tomcat nutch instance so that it uses the correct crawl directory: Edit /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
Edit Nutch Conf Files
Edit regex-urlfilter.txt:
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# Do not follow backwards
-/\.\.(|/)$
# These confine nutch crawl to the following areas:
+^file:///FILES/
+^file:/FILES/
# skip everything else
-.
Edit crawl-urlfilter.txt:
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# These confine nutch crawl to the following areas:
+^file:///FILES/.*
+^file:/FILES/.*
# skip everything else
-.
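Nutch evaluates these rules top to bottom, and the first pattern that matches decides the URL's fate (+ means crawl, - means skip). A rough shell illustration of that first-match-wins behavior, using a hypothetical check_url helper and a deliberately abbreviated rule set (this is not how Nutch itself reads the file, just a sketch of the logic):

```shell
# Hypothetical helper mimicking first-match-wins filter evaluation;
# the rule set in the heredoc is abbreviated for illustration.
check_url() {
  url=$1
  while IFS= read -r rule; do
    case $rule in ''|\#*) continue ;; esac        # skip blanks and comments
    sign=$(printf '%s' "$rule" | cut -c1)         # leading + or -
    pattern=${rule#?}                             # the regex itself
    if printf '%s\n' "$url" | grep -Eq "$pattern"; then
      echo "$sign"
      return
    fi
  done <<'EOF'
-^(http|ftp|mailto):
+^file:///FILES/
-.
EOF
}

check_url 'file:///FILES/report.pdf'   # +  (crawled)
check_url 'http://example.com/'        # -  (skipped)
```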
Edit conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enable file plugin.</description>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>No length limit for crawled content.</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch Crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
Prepare for a Crawl
First, create a urls/files seed file so that Nutch knows what to crawl. This file can live anywhere, but I'll put it in the main crawl_data directory for now.
mkdir /usr/wwwapps/crawl_data/urls/
Create/edit a file called files in the urls directory. Add the following line to crawl the /FILES/ directory:
file:///FILES/
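The two steps above amount to the following, sketched with /tmp/crawl_data standing in for /usr/wwwapps/crawl_data:

```shell
# /tmp/crawl_data stands in for /usr/wwwapps/crawl_data
mkdir -p /tmp/crawl_data/urls
echo 'file:///FILES/' > /tmp/crawl_data/urls/files
cat /tmp/crawl_data/urls/files
```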
Add a nutch symlink to /usr/bin/ (just for convenience):
ln -s /usr/local/nutch/bin/nutch /usr/bin/nutch
Run a Crawl
nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/
Check for results using NutchBean
nutch org.apache.nutch.searcher.NutchBean <Search Term>
Test Nutch
Test Nutch to see if it is behaving correctly. We want to make sure it only crawls the directories we specify and doesn't jump to parent directories. We also want to make sure it doesn't follow symlinks.
Let's crawl the /usr/share/doc directory and create some test files.
- Create haystack.txt in /usr/share/doc/sed/
Nutch has found the haystack. haystack9
- Create needle.txt in / and then create a symlink to it in /usr/share/doc/wget/
Symlink to /needle.txt Here is a needle: needle008
Symlink command:
cd /usr/share/doc/wget/
ln -s /needle.txt needle.txt
- Create jump.txt in /usr/share/
nutch... you naughty crawler! the secret code is meatball33
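The three test files above can be set up like this; /tmp/share stands in for /usr/share (and /tmp/needle.txt for /needle.txt), purely for illustration:

```shell
# /tmp/share stands in for /usr/share, /tmp/needle.txt for /needle.txt
rm -rf /tmp/share /tmp/needle.txt
mkdir -p /tmp/share/doc/sed /tmp/share/doc/wget
echo 'Nutch has found the haystack. haystack9' > /tmp/share/doc/sed/haystack.txt
echo 'Here is a needle: needle008' > /tmp/needle.txt
ln -s /tmp/needle.txt /tmp/share/doc/wget/needle.txt   # the symlink to test
echo 'nutch... you naughty crawler! the secret code is meatball33' > /tmp/share/jump.txt
readlink /tmp/share/doc/wget/needle.txt
```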
Do a crawl on /usr/share/doc:
- Edit crawl-urlfilter.txt and regex-urlfilter.txt. Change the top-level directory filter to:
{{{
+file:///usr/share/doc/
+file:/usr/share/doc/
}}}
- Create a file called docurls with this line:
file:///usr/share/doc/
- Make sure we are using the unaltered nutch:
which nutch
Output is /usr/bin/nutch, which is a symlink to /usr/local/nutch/bin/nutch.
- Run the crawl:
nutch crawl urls.doc -dir crawl.doc > crawl.doc.log
Results of Test
First, looking through crawl.doc.log, we see no sign that nutch crawled either the symlinked file "needle.txt" or the jump file "jump.txt". Searching for those two names with the log open in vim finds neither. The haystack.txt file does appear in the log:
1203 fetching file:/usr/share/doc/sed/haystack.txt
From visually scanning the file, it doesn't appear that nutch is crawling any higher directories.
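Instead of scanning the log by eye, grep can count the hits directly. Here a one-line sample stands in for the real crawl.doc.log:

```shell
# A one-line sample standing in for the real crawl.doc.log
printf '1203 fetching file:/usr/share/doc/sed/haystack.txt\n' > /tmp/crawl.doc.log
grep -c 'haystack\.txt' /tmp/crawl.doc.log         # prints 1
grep -c 'needle\.txt' /tmp/crawl.doc.log || true   # prints 0 (no hits)
```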
Next, try searching the crawldb using NutchBean. Here are the results for a few different searches. Remember to point the searcher.dir in nutch-site.xml to the crawl directory, crawl.doc.
Look for haystack:
> nutch org.apache.nutch.searcher.NutchBean haystack9
Total hits: 1
 0 20090417072918/file:/usr/share/doc/sed/haystack.txt ... found the haystack. haystack9
Look for needle:
> nutch org.apache.nutch.searcher.NutchBean needle.txt
Total hits: 0
> nutch org.apache.nutch.searcher.NutchBean needle008
Total hits: 0
Look for jump:
> nutch org.apache.nutch.searcher.NutchBean jump.txt
Total hits: 0
> nutch org.apache.nutch.searcher.NutchBean meatball33
Total hits: 0
In conclusion, nutch does not appear to follow symlinks, nor does it jump into higher directories.
Helpful Articles
Here are some helpful articles about setting up Nutch, crawling, and recrawling.
Introduction to Nutch, Part 1: Crawling
Introduction to Nutch, Part 2: Searching
Crawling the local filesystem with nutch
Note- The hack in the above article for keeping nutch from crawling parent directories is not necessary. This can be