wiki:administration/search

Version 2 (modified by hank, 13 years ago)

Work on Nutch

/etc/init.d/tomcat is set to /usr/wwwapps/crawldir

http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

In nutch-site.xml, add:

<property>
	<name>searcher.dir</name>
	<value>/usr/wwwapps/crawl_data/crawl</value>
	<description>Path to root of crawl</description>
</property>
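
Before pointing searcher.dir at a directory, it can help to confirm that the directory looks like a finished crawl. The following is a minimal sketch, not part of the original notes: the function name check_crawl_dir is hypothetical, and the subdirectory names (crawldb, segments, index) are an assumption based on what a Nutch 1.0 crawl usually writes.

```shell
#!/bin/sh
# Sketch: report which expected crawl subdirectories are missing.
# The list of names is an assumption about the Nutch 1.0 crawl layout.
check_crawl_dir() {
    dir="$1"
    status=0
    for sub in crawldb segments index; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            status=1
        fi
    done
    # 0 if every subdirectory exists, 1 otherwise
    return "$status"
}
```

For the path used above, this would be called as `check_crawl_dir /usr/wwwapps/crawl_data/crawl`.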

Recreate the crawl dir:

mkdir /usr/wwwapps/crawl_data/
mkdir urls
create the seed file(s) in urls
add file:///FILES/ to them as the URL to crawl
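
The steps above could be scripted roughly as follows. This is a sketch under assumptions: the urls directory is assumed to live inside the crawl data directory, and both the function name make_seed_dir and the file name seed.txt are hypothetical (Nutch reads the text files it finds in the urls directory, so the name itself should not matter).

```shell
#!/bin/sh
# Sketch: create the crawl data directory with a urls/ seed directory
# and write one seed URL into it.
make_seed_dir() {
    base="$1"      # e.g. /usr/wwwapps/crawl_data
    seed_url="$2"  # e.g. file:///FILES/
    mkdir -p "$base/urls" || return 1
    # seed.txt is a hypothetical file name chosen for this example
    printf '%s\n' "$seed_url" > "$base/urls/seed.txt"
}
```

With the values from these notes, the call would be `make_seed_dir /usr/wwwapps/crawl_data file:///FILES/`.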

New Nutch: wget http://nutch

edit regex-urlfilter.txt [post here]

edit crawl-urlfilter.txt [post here]

edit nutch-site.xml [post here]
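
The actual filter edits were never posted here, so the following is only a guess at the kind of change a filesystem crawl needs. It assumes the filter file still contains the stock skip rule `-^(file|ftp|mailto):` and rewrites it so that file: URLs are no longer rejected. The function name allow_file_urls is hypothetical, and crawl-urlfilter.txt will probably need an explicit accept rule as well, which this sketch does not add.

```shell
#!/bin/sh
# Sketch: stop a Nutch URL filter file from skipping file: URLs.
# Assumes the stock rule "-^(file|ftp|mailto):" is present; any other
# line is left untouched.
allow_file_urls() {
    filter="$1"
    tmp="$filter.tmp"
    # In basic regex syntax "(", "|" and ")" are literal; "\^" is a literal caret.
    sed 's/^-\^(file|ftp|mailto):/-^(ftp|mailto):/' "$filter" > "$tmp" &&
        mv "$tmp" "$filter"
}
```

It would be run once per filter file, for example `allow_file_urls conf/regex-urlfilter.txt`, where the conf/ location is also an assumption.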

Download the PDF libraries:

cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar
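
A web checkout URL like these can return an HTML error page instead of the archive, so a quick sanity check after the download may be worthwhile. This sketch relies on the fact that JAR files are ZIP archives and therefore start with the bytes "PK"; the function name looks_like_jar is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only if the file is non-empty and starts with the
# ZIP magic bytes "PK", as every JAR file does.
looks_like_jar() {
    file="$1"
    [ -s "$file" ] || return 1
    # Read just the first two bytes of the file
    magic=$(dd if="$file" bs=2 count=1 2>/dev/null)
    [ "$magic" = "PK" ]
}
```

For example, `looks_like_jar jai_core.jar || echo "jai_core.jar is not a JAR" >&2` would flag a bad download.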

In src/plugin/parse-pdf/plugin.xml, uncomment the two library lines:

<!-- Uncomment the following two lines after you have downloaded the
     libraries, see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>

Rebuild Nutch:

cd ..nutch-1.0/
ant jar
ant compile-plugins
ant war
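
The three ant targets above can be wrapped so that the rebuild stops at the first failure. This is a sketch only: the function name rebuild_nutch and the ANT override variable are hypothetical, and the source directory is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# Sketch: run the ant targets from these notes in order, stopping at
# the first one that fails. The body runs in a subshell so the caller's
# working directory is left unchanged.
rebuild_nutch() (
    cd "$1" || exit 1
    for target in jar compile-plugins war; do
        # ANT is a hypothetical override; it defaults to the ant on PATH
        "${ANT:-ant}" "$target" || { echo "ant $target failed" >&2; exit 1; }
    done
)
```

It takes the Nutch source directory as its only argument, for example `rebuild_nutch /path/to/nutch-1.0`.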