
Version 15 (modified by hank, 12 years ago)


Nutch Installation and Search

Nutch Download

Grab the latest nutch package from the nutch svn repository: http://svn.apache.org/repos/asf/nutch/. Version 1.1 is the latest stable branch as of this writing.

svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.1

Unpack nutch into /usr/local/nutch/.
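
Taken together, the checkout and install might look like this (the branch-1.1 directory name is what svn co creates by default; adjust if yours differs):

```shell
cd /usr/local
svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.1
mv branch-1.1 nutch
```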

Tomcat Installation

Go to http://tomcat.apache.org/download-60.cgi to get tomcat from a mirror

> cd /usr/local
> wget <<mirror>>/apache-tomcat-6.0.xx.tar.gz
> tar zxvf apache-tomcat-6.0.xx.tar.gz
> ln -s apache-tomcat-6.0.xx tomcat

Stop and start Tomcat:

/usr/local/tomcat/bin/catalina.sh stop
/usr/local/tomcat/bin/catalina.sh start

Add PDF Support - Rebuild Nutch

Download the PDF libraries:

cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar

In src/plugin/parse-pdf/plugin.xml, uncomment the two library lines:

<!-- Uncomment the following two lines after you have downloaded the
     libraries, see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>

Make sure ant and a JDK are installed:

apt-get install ant
apt-get install default-jdk

Rebuild Nutch:

cd /usr/local/nutch/
ant jar
ant compile-plugins
ant war

Copy nutch war file to tomcat directory, naming it ROOT.war:

cp /usr/local/nutch/build/nutch-1.2-dev.war /usr/local/tomcat/webapps/ROOT.war

Update the Tomcat nutch instance so that it uses the correct crawl directory. Edit /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml (this path appears once Tomcat has expanded ROOT.war):

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
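
Tomcat only reads this file at startup, so restart it after editing (same scripts as above):

```shell
/usr/local/tomcat/bin/catalina.sh stop
/usr/local/tomcat/bin/catalina.sh start
```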

Edit Nutch Conf Files

Edit regex-urlfilter.txt:

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# Do not follow backwards
-/\.\.(|/)$

# These confine nutch crawl to the following areas:
+^file:///FILES/
+^file:/FILES/

# skip everything else
-.
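
The simpler patterns above can be spot-checked from the shell with grep -E, which accepts the same extended-regex syntax for these cases (a rough check only; Nutch itself evaluates the filters with Java regular expressions):

```shell
# A URL under /FILES/ should pass the include rule
echo 'file:///FILES/report.doc' | grep -Eq '^file:///FILES/' && echo included

# An image suffix should be caught by the skip rule
echo 'file:///FILES/logo.gif' | grep -Eq '\.(gif|GIF|jpg|JPG|png|PNG)$' && echo skipped
```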

Edit crawl-urlfilter.txt:

# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# These confine nutch crawl to the following areas:
+^file:///FILES/.*
+^file:/FILES/.*

# skip everything else
-.

Edit conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>plugin.includes</name> 
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enable file plugin.</description>
  </property> 

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>No length limit for crawled content.</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Nutch Crawler</value>
  </property>

  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>

  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>

</configuration>

Prepare for a Crawl

First create crawl_urls/files so that nutch knows what to crawl. This file can live anywhere.

mkdir /usr/wwwapps/crawl_urls/

Create (or edit) a file called files in the crawl_urls directory. Add the following line to crawl the /FILES/ directory:

file:///FILES/
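
Both steps can be scripted (assuming /usr/wwwapps/ is writable):

```shell
mkdir -p /usr/wwwapps/crawl_urls/
printf 'file:///FILES/\n' > /usr/wwwapps/crawl_urls/files
```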

Add a nutch link to /usr/bin/ (just for convenience):

ln -s /usr/local/nutch/bin/nutch /usr/bin/nutch 

Run a Crawl

nutch crawl /usr/wwwapps/crawl_urls -dir /usr/wwwapps/crawl_data/
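
In Nutch 1.x the crawl command also takes depth and breadth limits; the values below are illustrative, not recommendations:

```shell
# -depth caps the number of link-following rounds; -topN caps pages fetched per round
nutch crawl /usr/wwwapps/crawl_urls -dir /usr/wwwapps/crawl_data/ -depth 3 -topN 1000
```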

Check for results using NutchBean

nutch org.apache.nutch.searcher.NutchBean <Search Term>

Test Nutch

Test nutch to see if it is performing correctly. We want to make sure it doesn't jump out of the directories we specify, only crawls the directories we specify, and doesn't follow symlinks.

Let's crawl the /usr/share/doc directory and create some test files.

  1. Create haystack.txt in /usr/share/doc/sed/ containing:

    Nutch has found the haystack.
    haystack9

  2. Create needle.txt in / containing:

    Here is a needle:
    needle008

    Then symlink it into /usr/share/doc/wget/:

    cd /usr/share/doc/wget/ && ln -s /needle.txt needle.txt

  3. Create jump.txt in /usr/share/ containing:

    nutch... you naughty crawler!
    the secret code is meatball33

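The three test files above can be created with commands along these lines (run as root, since the targets are system directories):

```shell
# Haystack: a file inside a directory nutch should crawl
printf 'Nutch has found the haystack.\nhaystack9\n' > /usr/share/doc/sed/haystack.txt

# Needle: a file outside the crawl root, symlinked into it
printf 'Here is a needle:\nneedle008\n' > /needle.txt
cd /usr/share/doc/wget/ && ln -s /needle.txt needle.txt

# Jump: a file one level above the crawl root
printf 'nutch... you naughty crawler!\nthe secret code is meatball33\n' > /usr/share/jump.txt
```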
Do a crawl on /usr/share/doc:

  1. Edit crawl-urlfilter.txt and regex-urlfilter.txt, changing the top-level directory filters to:

    +file:///usr/share/doc/
    +file:/usr/share/doc/

  2. Create a file called docurls with this line:

    file:///usr/share/doc/

  3. Make sure we are using the unaltered nutch:

    which nutch

    Output is /usr/bin/nutch, which is a symlink to /usr/local/nutch/bin/nutch.

  4. Run the crawl:

    nutch crawl docurls -dir crawl.doc > crawl.doc.log

Results of Test

First, looking through crawl.doc.log, we see no sign that nutch crawled either the symlinked file "needle.txt" or the jump file "jump.txt": searching the log for either name in vim finds nothing. The haystack.txt file does appear in the log:

1203 fetching file:/usr/share/doc/sed/haystack.txt

Visually scanning the log, nutch does not appear to be crawling any higher directories.

Next, try searching the crawldb using NutchBean. Here are the results for a few different searches. Remember to point the searcher.dir in nutch-site.xml to the crawl directory, crawl.doc.

Look for haystack:

> nutch org.apache.nutch.searcher.NutchBean haystack9
Total hits: 1
 0 20090417072918/file:/usr/share/doc/sed/haystack.txt
 ... found the haystack.
haystack9

Look for needle:

> nutch org.apache.nutch.searcher.NutchBean needle.txt
Total hits: 0
> nutch org.apache.nutch.searcher.NutchBean needle008
Total hits: 0

Look for jump:

>  nutch org.apache.nutch.searcher.NutchBean jump.txt
Total hits: 0
>  nutch org.apache.nutch.searcher.NutchBean meatball33
Total hits: 0

In conclusion, nutch appears to ignore the symlinks and does not jump into a higher directory.

Helpful Articles

Here are some helpful articles about setting up Nutch, crawling, and recrawling.

Introduction to Nutch, Part 1: Crawling
Introduction to Nutch, Part 2: Searching
Crawling the local filesystem with nutch

Note: the hack in the above article for keeping nutch from crawling parent directories is not necessary; the same effect is achieved by specifying the correct regular expressions in the conf files.