/etc/init.d/tomcat
is set to /usr/wwwapps/crawldir
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
nutch-site.xml
add:
<property>
  <name>searcher.dir</name>
  <value>/usr/wwwapps/crawl_data/crawl</value>
  <description>Path to root of crawl</description>
</property>
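To confirm the edit, the value can be read back from the file. This is a minimal sketch, assuming a POSIX shell with tr and sed. get_property is a hypothetical helper, not part of Nutch, and it only handles the simple layout above where <value> directly follows <name>.

```shell
# Hypothetical helper (not part of Nutch): print the <value> of the property
# named $2 in the Hadoop-style config file $1.
# Assumes <value> directly follows <name>, as in the property shown above.
get_property() {
  # Join the file into one line so the pattern can span line breaks.
  tr -d '\n' < "$1" |
    sed -n "s:.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p"
}

# Usage (the path to nutch-site.xml is an assumption):
# get_property conf/nutch-site.xml searcher.dir
```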
recreate crawl dir:
mkdir /usr/wwwapps/crawl_data/
mkdir urls
create files
add file:///FILES/
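The directory steps above can be sketched as one function. Assumptions: urls sits inside the crawl data directory, and the seed file is called seed.txt (a hypothetical name, since the notes do not name the file).

```shell
# Sketch: create the crawl data directory and a seed list under root $1.
# seed.txt is a hypothetical file name; the notes do not name the file.
setup_crawl_dir() {
  root=$1
  mkdir -p "$root/urls" || return 1
  # One seed URL per line; file:///FILES/ is the entry from the notes.
  printf '%s\n' 'file:///FILES/' > "$root/urls/seed.txt"
}

# Usage: setup_crawl_dir /usr/wwwapps/crawl_data
```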
New Nutch: wget http://nutch
edit regex-urlfilter.txt [post here]
edit crawl-urlfilter.txt [post here]
edit nutch-site.xml [post here]
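The edits themselves are not posted here. To check a filter file after editing, the sketch below applies the rule order described in the comments of Nutch's default filter files: the first matching pattern decides, + accepts, - rejects, and a URL that matches no pattern is ignored. url_allowed is a hypothetical helper. It uses grep -E, which only approximates the Java regex syntax Nutch uses, so treat the result as a rough check.

```shell
# Hypothetical helper: succeed if URL $2 would be accepted by filter file $1.
# Approximation only: patterns are evaluated as POSIX EREs, not Java regexes.
url_allowed() {
  filter_file=$1
  url=$2
  while IFS= read -r rule; do
    case $rule in
      ''|'#'*) continue ;;   # skip blank lines and comments
      +*) if printf '%s\n' "$url" | grep -Eq -- "${rule#+}"; then return 0; fi ;;
      -*) if printf '%s\n' "$url" | grep -Eq -- "${rule#-}"; then return 1; fi ;;
    esac
  done < "$filter_file"
  return 1                   # no pattern matched: the URL is ignored
}

# Usage (file location is an assumption):
# url_allowed conf/crawl-urlfilter.txt file:///FILES/ && echo accepted
```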
download pdf libraries
cd src/plugin/parse-pdf/lib
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar
wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar
In src/plugin/parse-pdf/plugin.xml
<!-- Uncomment the following two lines after you have downloaded the libraries, see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>
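Before rebuilding, it may be worth checking that both jars actually arrived, since a failed download can leave a missing or empty file. This is a minimal sketch; check_pdf_libs is a hypothetical helper, and it only tests for presence and non-zero size, not for a valid jar.

```shell
# Hypothetical helper: verify both JAI jars exist and are non-empty.
# $1 is the parse-pdf plugin directory, e.g. src/plugin/parse-pdf
check_pdf_libs() {
  plugin_dir=$1
  for jar in jai_codec.jar jai_core.jar; do
    if [ ! -s "$plugin_dir/lib/$jar" ]; then
      echo "missing or empty: $plugin_dir/lib/$jar" >&2
      return 1
    fi
  done
}

# Usage: check_pdf_libs src/plugin/parse-pdf && echo "jars present"
```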
Rebuild Nutch:
cd ..nutch-1.0/
ant jar
ant compile-plugins
ant war
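The rebuild can be wrapped so that it stops at the first failing target instead of continuing with a half-built tree. This is a sketch: rebuild_nutch is a hypothetical helper, the Nutch directory is passed in as an argument, and the ANT variable is an assumption added so that a different ant binary can be substituted.

```shell
# Hypothetical helper: run the three ant targets from the notes in order,
# stopping at the first failure. $1 is the Nutch source directory.
rebuild_nutch() {
  nutch_home=$1
  ant_cmd=${ANT:-ant}        # ANT override is an assumption, defaults to ant
  (
    # Subshell keeps the caller's working directory unchanged.
    cd "$nutch_home" || exit 1
    for target in jar compile-plugins war; do
      "$ant_cmd" "$target" || exit 1
    done
  )
}

# Usage (path is a placeholder): rebuild_nutch /path/to/nutch-1.0
```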