Nutch Installation and Search

Nutch Download

Grab the latest nutch package from the nutch site. Version 1.0 is the latest as of this writing. Unpack nutch into /usr/local/nutch/.
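
A minimal sketch of the download and unpack steps (the archive URL below is an assumption; use whichever mirror the Nutch download page lists):

cd /tmp
wget http://archive.apache.org/dist/lucene/nutch/nutch-1.0.tar.gz
tar xzf nutch-1.0.tar.gz
mkdir -p /usr/local
mv nutch-1.0 /usr/local/nutch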

Add PDF Support - Rebuild Nutch

Download pdf libraries

cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar

In src/plugin/parse-pdf/plugin.xml

<!-- Uncomment the following two lines after you have downloaded the
     libraries, see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>

Rebuild Nutch:

cd /usr/local/nutch/
ant jar
ant compile-plugins
ant war

Copy the nutch war file to the tomcat webapps directory, naming it ROOT.war (if ant placed the rebuilt war under build/, copy that one so the PDF changes are included):

cp /usr/local/nutch/nutch-1.0.war /usr/local/tomcat/webapps/ROOT.war

Update the tomcat nutch instance so that it uses the correct crawl directory. Once tomcat has expanded the war, edit /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
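
Tomcat only picks up this change on a restart, so bounce it after editing. A minimal sketch, assuming the stock scripts that ship with tomcat:

/usr/local/tomcat/bin/shutdown.sh
/usr/local/tomcat/bin/startup.sh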

Edit Nutch Conf Files

Edit conf/regex-urlfilter.txt:

# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# Do not follow parent-directory ("..") links
-/\.\.(|/)$

# These confine nutch crawl to the following areas:
+^file:///FILES/
+^file:/FILES/

# skip everything else
-.

Edit conf/crawl-urlfilter.txt:

# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# These confine nutch crawl to the following areas:
+^file:///FILES/.*
+^file:/FILES/.*

# skip everything else
-.
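
Before running a full crawl, the filter rules can be sanity-checked. Nutch includes a URLFilterChecker class that reads URLs from stdin and prints a leading + for accepted and - for rejected; whether the 1.0 build includes it is an assumption, and the test path below is made up:

echo "file:///FILES/docs/report.pdf" | /usr/local/nutch/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined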

Edit conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>plugin.includes</name> 
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enable file plugin.</description>
  </property> 

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>No length limit for crawled content.</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Nutch Crawler</value>
  </property>

  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>

  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>

</configuration>

Prepare for a Crawl

First create a urls/files seed file so that nutch knows what to crawl. This file can live anywhere, but I'll put it in the main crawl_data directory for now.

mkdir /usr/wwwapps/crawl_data/urls/

Create/edit a file called files in the urls directory. Add the following line to crawl the /FILES/ directory:

file:///FILES/
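
Equivalently, from the shell:

echo "file:///FILES/" > /usr/wwwapps/crawl_data/urls/files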

Add a nutch symlink to /usr/bin/ (just for convenience). Note the argument order: the real script is the target and the link name comes second:

ln -s /usr/local/nutch/bin/nutch /usr/bin/nutch

Run a Crawl

nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/
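
The crawl command also accepts tuning flags (-depth, -topN, -threads); the values below are only illustrative:

nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/ -depth 5 -threads 4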

Check for results using NutchBean:

nutch org.apache.nutch.searcher.NutchBean <Search Term>
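
For example (the search term is arbitrary; searcher.dir is read from conf/nutch-site.xml):

nutch org.apache.nutch.searcher.NutchBean invoice

This should print the total hit count followed by the top matching documents.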

Helpful Articles

Here are some helpful articles about setting up Nutch, crawling, and recrawling.

Introduction to Nutch, Part 1: Crawling
Introduction to Nutch, Part 2: Searching
Crawling the local filesystem with nutch

Note: the hack in the above article for keeping nutch from crawling parent directories is not necessary; the same thing can be done by specifying the correct regular expressions in the conf files (see the -/\.\.(|/)$ rule above).