
Nutch Installation and Search

Nutch Download

Grab the latest Nutch package from the Nutch site. Version 1.0 is the latest as of this writing. Unpack Nutch into /usr/local/nutch/.
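
A minimal sketch of the download and unpack steps; the mirror URL below is an assumption, so check the Nutch downloads page for the current location:

wget http://archive.apache.org/dist/lucene/nutch/nutch-1.0.tar.gz
tar -xzf nutch-1.0.tar.gz
mv nutch-1.0 /usr/local/nutch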

Add PDF Support - Rebuild Nutch

Download the PDF libraries:

cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar
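
A quick sanity check (my addition) that both jars actually downloaded before rebuilding:

ls -l jai_codec.jar jai_core.jar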

In src/plugin/parse-pdf/plugin.xml, uncomment the two library lines so they read:

<!-- Uncomment the following two lines after you have downloaded the
     libraries, see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>

Rebuild Nutch:

cd /usr/local/nutch/
ant jar
ant compile-plugins
ant war
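
The rebuilt war typically ends up under the build/ directory rather than at the top level; the exact path is an assumption here, so locate it before copying:

find /usr/local/nutch -name '*.war'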

Copy the Nutch war file to the Tomcat webapps directory, naming it ROOT.war (if the rebuilt war landed under build/, copy that one instead):

cp /usr/local/nutch/nutch-1.0.war /usr/local/tomcat/webapps/ROOT.war

Update the Tomcat Nutch instance so that it uses the correct crawl directory. Edit /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml (Tomcat must have expanded ROOT.war first):

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
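
Restart Tomcat so the change takes effect; a sketch assuming the stock Tomcat control scripts:

/usr/local/tomcat/bin/shutdown.sh
/usr/local/tomcat/bin/startup.sh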

Edit Nutch Conf Files

Edit conf/regex-urlfilter.txt:

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# Do not follow backwards
-/\.\.(|/)$

# These confine nutch crawl to the following areas:
+^file:///FILES/
+^file:/FILES/

# skip everything else
-.
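
To verify the filters behave as intended, Nutch ships a URLFilterChecker utility that reads URLs on stdin and prints + (accepted) or - (rejected) for each; the exact invocation below is an assumption for the 1.0 release:

echo "file:///FILES/test.pdf" | nutch org.apache.nutch.net.URLFilterChecker -allCombined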

Edit conf/crawl-urlfilter.txt (this file is applied by the one-step crawl command used below, while regex-urlfilter.txt applies to the step-by-step fetch tools):

# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# These confine nutch crawl to the following areas:
+^file:///FILES/.*
+^file:/FILES/.*

# skip everything else
-.

Edit conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>plugin.includes</name> 
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enable file plugin.</description>
  </property> 

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>No length limit for crawled content.</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Nutch Crawler</value>
  </property>

  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>

  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>

</configuration>
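
A quick well-formedness check on the edited config (assumes xmllint is installed; any XML validator will do):

xmllint --noout conf/nutch-site.xml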

Prepare for a Crawl

First create a seed list so that Nutch knows what to crawl. This file can live anywhere, but I'll put it under the main crawl_data directory for now.

mkdir /usr/wwwapps/crawl_data/urls/

Create/edit a file called files in the urls directory, and add the following line to crawl the /FILES/ directory:

file:///FILES/
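
Equivalently, as a one-liner:

echo "file:///FILES/" > /usr/wwwapps/crawl_data/urls/files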

Add a nutch symlink to /usr/bin/ (just for convenience):

ln -s /usr/local/nutch/bin/nutch /usr/bin/nutch

Run a Crawl

nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/
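
The crawl command also accepts depth and breadth limits; the values below are examples, not recommendations:

nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/ -depth 10 -topN 1000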

Check for results from the command line using NutchBean (the class path below is from the Nutch 1.0 source; the query term is just an example, and searcher.dir is picked up from conf/nutch-site.xml):

nutch org.apache.nutch.searcher.NutchBean test

Helpful Articles

Here are some helpful articles about setting up Nutch, crawling, and recrawling.

Introduction to Nutch, Part 1: Crawling
Introduction to Nutch, Part 2: Searching
Crawling the local filesystem with nutch

Note: the hack in the above article for keeping Nutch from crawling parent directories is not necessary. The same effect is achieved by specifying the correct regular expressions in the conf files (the -/\.\.(|/)$ rule shown above).