
Version 2 (modified by hank, 13 years ago)

Work on Nutch

is set to /usr/wwwapps/crawldir



	<description>Path to root of crawl</description>

Recreate the crawl dir:

mkdir /usr/wwwapps/crawl_data/
mkdir urls
create a file in urls
add file:///FILES/ to it
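
The steps above could be wrapped in a small shell function. This is only a sketch: it assumes urls lives inside the crawl dir, and seed.txt is a made-up file name (the notes do not name the file).

```shell
# Sketch only: recreate a crawl directory holding one seed URL.
# Assumptions: "urls" sits inside the crawl root, and "seed.txt" is a
# hypothetical file name that does not come from the notes.
recreate_crawl_dir() {
    crawl_root="$1"    # e.g. /usr/wwwapps/crawl_data
    mkdir -p "$crawl_root/urls" || return 1
    # One seed URL per line; file:///FILES/ is the seed given in the notes.
    printf '%s\n' 'file:///FILES/' > "$crawl_root/urls/seed.txt"
}
```

It would be called as recreate_crawl_dir /usr/wwwapps/crawl_data.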

New Nutch: wget http://nutch

edit regex-urlfilter.txt [post here]

edit crawl-urlfilter.txt [post here]

edit nutch-site.xml [post here]
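
The actual edits are not posted yet. As a placeholder, here is one change that is probably needed for a file:/// seed: the filter files must stop skipping file: URLs. This sketch assumes the file still contains the stock skip rule -^(file|ftp|mailto): and is not necessarily the edit that was made here; allow_file_urls is a made-up helper name.

```shell
# Sketch only: stop a Nutch URL filter file from skipping file: URLs.
# Assumes the file contains the stock rule "-^(file|ftp|mailto):".
allow_file_urls() {
    filter_file="$1"    # e.g. conf/crawl-urlfilter.txt
    # Keep a .bak copy, then drop "file" from the skip rule.
    sed -i.bak 's/^-\^(file|ftp|mailto):/-^(ftp|mailto):/' "$filter_file"
}
```

Depending on the rest of the file, an accept rule such as +^file:///FILES/ may also be needed before any final catch-all rule.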

Download the PDF libraries (jai_codec.jar and jai_core.jar) into the parse-pdf plugin's lib directory:

cd src/plugin/parse-pdf/lib

In src/plugin/parse-pdf/plugin.xml, uncomment the two library lines:

<!-- Uncomment the following two lines after you have downloaded the
     libraries, see README.txt for more details.-->
<library name="jai_codec.jar"/>
<library name="jai_core.jar"/>

Rebuild Nutch:

cd ..nutch-1.0/
ant jar
ant compile-plugins
ant war
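
The three Ant targets can be chained so the rebuild stops at the first failure. A sketch only: rebuild_nutch and the ANT override variable are made-up names, not part of Nutch.

```shell
# Sketch only: run the three Ant targets from the notes in order,
# stopping at the first one that fails.
# ANT is a hypothetical override for the ant executable (defaults to "ant").
rebuild_nutch() {
    nutch_root="$1"    # path to the nutch-1.0 source tree
    (
        cd "$nutch_root" || exit 1
        "${ANT:-ant}" jar &&
        "${ANT:-ant}" compile-plugins &&
        "${ANT:-ant}" war
    )
}
```

The subshell keeps the cd from changing the caller's working directory.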