- Timestamp: Apr 12, 2009, 3:59:58 PM (14 years ago)
- Author: hank
- Comment: work on nutch
v1 | v2 |
2 | 2 | is set to /usr/wwwapps/crawldir |
3 | 3 | |
4 | | http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html[[BR]] |
| 4 | http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html |
| 5 | |
5 | 6 | http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html |
6 | 7 | |
… | … |
23 | 24 | create files[[BR]] |
24 | 25 | add file:///FILES/ |
| 26 | |
| 27 | |
| 28 | |
| 29 | New Nutch: |
| 30 | wget http://nutch |
| 31 | |
| 32 | edit regex-urlfilter.txt [post here] |
| 33 | |
| 34 | edit crawl-urlfilter.txt [post here] |
| 35 | |
| 36 | edit nutch-site.xml [post here] |
| 37 | |
| 38 | download pdf libraries |
| 39 | {{{ |
| 40 | cd src/plugin/parse-pdf/lib |
| 41 | wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_codec.jar |
| 42 | wget http://pdfbox.cvs.sourceforge.net/viewvc/*checkout*/pdfbox/pdfbox/external/jai_core.jar |
| 43 | }}} |
| 44 | In src/plugin/parse-pdf/plugin.xml |
| 45 | {{{ |
| 46 | <!-- Uncomment the following two lines after you have downloaded the |
| 47 | libraries, see README.txt for more details.--> |
| 48 | <library name="jai_codec.jar"/> |
| 49 | <library name="jai_core.jar"/> |
| 50 | }}} |
| 51 | |
| 52 | Rebuild Nutch: |
| 53 | {{{ |
| 54 | cd ..nutch-1.0/ |
| 55 | ant jar |
| 56 | ant compile-plugins |
| 57 | ant war |
| 58 | }}} |
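
The PDF-library steps added in v2 (download the two jars, then enable them in plugin.xml) can be checked before rebuilding. The following is only a sketch under stated assumptions: the function name `check_parse_pdf` is hypothetical and is not part of Nutch, and the check looks only for the two jar files and the matching `<library>` entries quoted above. It does not detect whether those entries are still inside an XML comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): verify the parse-pdf prerequisites.
# $1 is the plugin directory, e.g. src/plugin/parse-pdf as used in the steps above.
check_parse_pdf() {
    plugin_dir="$1"
    for jar in jai_codec.jar jai_core.jar; do
        # The jar must exist and be non-empty (a failed wget can leave an empty file).
        if [ ! -s "$plugin_dir/lib/$jar" ]; then
            echo "missing or empty: $plugin_dir/lib/$jar" >&2
            return 1
        fi
        # plugin.xml must mention the jar in a <library> entry.
        # Limitation: this also matches an entry that is still commented out.
        if ! grep -qF "<library name=\"$jar\"" "$plugin_dir/plugin.xml"; then
            echo "no <library> entry for $jar in $plugin_dir/plugin.xml" >&2
            return 1
        fi
    done
    return 0
}
```

Run from the Nutch source directory, `check_parse_pdf src/plugin/parse-pdf && ant jar` would start the rebuild only when both jars and both entries are present.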