
Changes between Version 2 and Version 3 of administration/search

Timestamp: Apr 14, 2009, 2:22:03 PM
Author: hank
Comment: Update nutch search install

= Nutch Installation and Search =

== Nutch Download ==
Grab the latest nutch package from the [http://lucene.apache.org/nutch/ nutch site].  Version 1.0
is the latest as of this writing. Unpackage nutch into '''/usr/local/nutch/'''.
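For example (the mirror URL below is an assumption; check the nutch site for the current download link, and the tarball is assumed to unpack into a ''nutch-1.0'' directory):
{{{
cd /usr/local/
# adjust the URL/version to whatever the nutch site currently offers
wget http://archive.apache.org/dist/lucene/nutch/nutch-1.0.tar.gz
tar xzf nutch-1.0.tar.gz
mv nutch-1.0 nutch
}}}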

== Add PDF Support - Rebuild Nutch ==
Download pdf libraries:
{{{
cd /usr/local/nutch/src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar
}}}

In src/plugin/parse-pdf/plugin.xml, list the downloaded jars as plugin libraries.
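The exact library entries depend on which jars end up in '''lib/'''; a rough sketch of the kind of change made to the plugin's {{{<runtime>}}} section (the jar file names here are assumptions, match them to the files actually downloaded):
{{{
<runtime>
   <library name="parse-pdf.jar">
      <export name="*"/>
   </library>
   <!-- jar names below are assumptions; use the names of the jars now in lib/ -->
   <library name="PDFBox-0.7.3.jar"/>
   <library name="jai_core.jar"/>
</runtime>
}}}
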
Rebuild Nutch:
{{{
cd /usr/local/nutch/
ant jar
ant compile-plugins
ant war
}}}

Copy the nutch war file to the tomcat webapps directory, naming it ROOT.war:
{{{
cp /usr/local/nutch/nutch-1.0.war /usr/local/tomcat/webapps/ROOT.war
}}}
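Restart tomcat so it unpacks the new ROOT.war before editing the files under it (assuming tomcat is managed by an init script at /etc/init.d/tomcat):
{{{
/etc/init.d/tomcat restart
}}}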

Update the tomcat nutch instance so that it uses the correct crawl directory.
Edit /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml (and restart tomcat afterwards so the change takes effect):
{{{
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>
}}}

== Edit Nutch Conf Files ==

Edit conf/regex-urlfilter.txt:
{{{
# skip http:, ftp:, and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# Do not follow backwards
-/\.\.(|/)$

# These confine the nutch crawl to the following areas:
+^file:///FILES/
+^file:/FILES/

# skip everything else
-.
}}}

Edit conf/crawl-urlfilter.txt (used by the one-step ''crawl'' command):
{{{
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# These confine the nutch crawl to the following areas:
+^file:///FILES/.*
+^file:/FILES/.*

# skip everything else
-.
}}}

Edit conf/nutch-site.xml:
{{{
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Enable file plugin.</description>
  </property>

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>No length limit for crawled content.</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Nutch Crawler</value>
  </property>

  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>

  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/usr/wwwapps/crawl_data/</value>
    <description>Path to root of crawl</description>
  </property>

</configuration>
}}}

== Prepare for a Crawl ==
First create the '''urls/files''' seed file so that nutch knows what to crawl.  This file can
be anywhere, but I'll put it in the main crawl_data directory for now.
{{{
mkdir /usr/wwwapps/crawl_data/urls/
}}}
Create/edit a file called '''files''' in the urls directory.  Add the following line
to search the /FILES/ directory:
{{{
file:///FILES/
}}}
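One way to create it (using the urls directory made above):
{{{
echo "file:///FILES/" > /usr/wwwapps/crawl_data/urls/files
}}}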
Add a nutch link in /usr/bin/ (just for convenience):
{{{
ln -s /usr/local/nutch/bin/nutch /usr/bin/nutch
}}}
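To check that the link works, run nutch with no arguments; it should just print its usage/command list:
{{{
nutch
}}}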

== Run a Crawl ==
{{{
nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/
}}}
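The crawl can also be bounded; assuming this version supports the usual '''-depth''' and '''-topN''' options, something like:
{{{
nutch crawl /usr/wwwapps/crawl_data/urls -dir /usr/wwwapps/crawl_data/ -depth 10 -topN 50000
}}}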

== Check for results using NutchBean ==
Run a query from the command line to confirm the index is searchable (replace ''term'' with a word expected in the crawled files):
{{{
nutch org.apache.nutch.searcher.NutchBean term
}}}

== Helpful Articles ==

Here are some helpful articles about setting up Nutch, crawling, and
recrawling.

|| [http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html Introduction to Nutch, Part 1: Crawling] ||
|| [http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Introduction to Nutch, Part 2: Searching] ||
|| [http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch Crawling the local filesystem with nutch] [[BR]]Note: The hack in the above article for keeping nutch from crawling parent directories is not necessary; the same thing can be done with the correct regular expressions in the conf files. ||