- Timestamp: Apr 17, 2009, 2:51:38 PM (14 years ago)
- Author: hank
- Comment: --
v6 | v7 | |
167 | 167 | }}} |
168 | 168 | |
169 | | == Check for results using NutchBean == |
| 169 | == Check for results using !NutchBean == |
170 | 170 | {{{ |
171 | 171 | nutch org.apache.nutch.searcher.NutchBean <Search Term> |
… |
… |
|
187 | 187 | }}} |
188 | 188 | |
189 | | 1. Create needle.txt in / and then create a symlink to it in /usr/share/doc/wget/ |
| 189 | 2. Create needle.txt in / and then create a symlink to it in /usr/share/doc/wget/ |
190 | 190 | {{{ |
191 | 191 | Symlink to /needle.txt |
… |
… |
|
199 | 199 | }}} |
200 | 200 | |
201 | | 1. Create jump.txt in /usr/share/ |
| 201 | 3. Create jump.txt in /usr/share/ |
202 | 202 | {{{ |
203 | 203 | nutch... you naughty crawler! |
… |
… |
|
207 | 207 | Do a crawl on /usr/share/doc: |
208 | 208 | 1. Edit crawl-urlfilter.txt and regex-urlfilter.txt |
209 | | Change top level directory to : |
210 | | +^file:///usr/share/doc/ |
211 | | +^file:/usr/share/doc/ |
212 | | 1. Create file docurls with: |
213 | | file:///usr/share/doc/ |
| 209 | Change top level directory filter to: |
| 210 | {{{ +^file:///usr/share/doc/ |
| 211 | +^file:/usr/share/doc/ }}} |
| 212 | 1. Create a file called docurls with this line: |
| 213 | {{{ file:///usr/share/doc/ }}} |
| 214 | 1. Make sure we are using the unaltered nutch: |
| 215 | {{{ |
| 216 | which nutch |
| 217 | }}} |
| 218 | Output is /usr/bin/nutch which is a symlink to /usr/local/nutch/bin/nutch. |
214 | 219 | 1. Run the crawl: |
215 | 220 | {{{ |
216 | 221 | nutch crawl urls.doc -dir crawl.doc > crawl.doc.log |
217 | 222 | }}} |
| 223 | |
| 224 | == Results of Test == |
| 225 | First, looking through the crawl.doc.log we don't see any signs that nutch has crawled either the symlink file "needle.txt" or the |
| 226 | jump file "jump.txt". If you search for these two names with the log open in vim, neither name is found. The haystack.txt file |
| 227 | is found in the log: |
| 228 | {{{ |
| 229 | 1203 fetching file:/usr/share/doc/sed/haystack.txt |
| 230 | }}} |
| 231 | From visually scanning the file, it doesn't appear that nutch is crawling any higher directories. |
| 232 | |
| 233 | Next, try searching the crawldb using !NutchBean. Here are the results for a few different searches. Remember to point the |
| 234 | searcher.dir in nutch-site.xml to the crawl directory, crawl.doc. |
| 235 | |
| 236 | Look for haystack: |
| 237 | {{{ |
| 238 | > nutch org.apache.nutch.searcher.NutchBean haystack9 |
| 239 | Total hits: 1 |
| 240 | 0 20090417072918/file:/usr/share/doc/sed/haystack.txt |
| 241 | ... found the haystack. |
| 242 | haystack9 |
| 243 | }}} |
| 244 | |
| 245 | Look for needle: |
| 246 | {{{ |
| 247 | > nutch org.apache.nutch.searcher.NutchBean needle.txt |
| 248 | Total hits: 0 |
| 249 | > nutch org.apache.nutch.searcher.NutchBean needle008 |
| 250 | Total hits: 0 |
| 251 | }}} |
| 252 | |
| 253 | Look for jump: |
| 254 | {{{ |
| 255 | > nutch org.apache.nutch.searcher.NutchBean jump.txt |
| 256 | Total hits: 0 |
| 257 | > nutch org.apache.nutch.searcher.NutchBean meatball33 |
| 258 | Total hits: 0 |
| 259 | }}} |
| 260 | |
| 261 | In conclusion, it doesn't appear that nutch is interested in the symlinks, nor is it jumping into a higher directory. |
| 262 | |
218 | 263 | |
219 | 264 | |
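
The log check described under "Results of Test" was done by searching the log in vim. As a rough sketch only, the same check could be scripted with grep. The function name `check_crawl_log` is hypothetical and not part of nutch; the log name `crawl.doc.log` and the three file names are taken from the test above.

```shell
#!/bin/sh
# Hypothetical helper (not part of nutch): report whether each given
# file name appears anywhere in a crawl log.
# Usage: check_crawl_log LOGFILE NAME...
check_crawl_log() {
    log=$1
    shift
    for name in "$@"; do
        # -F matches the name as a fixed string, so the "." in
        # "needle.txt" is literal rather than a regex wildcard.
        if grep -qF -- "$name" "$log"; then
            echo "$name: found"
        else
            echo "$name: not found"
        fi
    done
}

# Run against the log from the crawl step, if it is in the current directory.
if [ -f crawl.doc.log ]; then
    check_crawl_log crawl.doc.log needle.txt jump.txt haystack.txt
fi
```

This only confirms what the log mentions. It does not replace the !NutchBean searches, which check what actually ended up in the index.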