
Changes between Version 6 and Version 7 of administration/search


Timestamp:
Apr 17, 2009, 2:51:38 PM
Author:
hank
Comment:

--

  • administration/search

    v6 v7  
    167167}}}
    168168
    169 == Check for results using NutchBean ==
     169== Check for results using !NutchBean ==
    170170{{{
    171171nutch org.apache.nutch.searcher.NutchBean <Search Term>
     
    187187  }}}
    188188
    189  1. Create needle.txt in / and then create a symlink to it in /usr/share/doc/wget/
     189 2. Create needle.txt in / and then create a symlink to it in /usr/share/doc/wget/
    190190 {{{
    191191Symlink to /needle.txt
     
    199199  }}}
    200200
    201  1. Create jump.txt in /usr/share/
     201 3. Create jump.txt in /usr/share/
    202202  {{{
    203203nutch... you naughty crawler!
     
    207207Do a crawl on /usr/share/doc:
    208208 1. Edit crawl-urlfilter.txt and regex-urlfilter.txt
    209     Change top level directory to :
    210     +^file:///usr/share/doc/
    211     +^file:/usr/share/doc/
    212  1. Create file docurls with:
    213     file:///usr/share/doc/
     209    Change top level directory filter to:
     210    {{{ +^file:///usr/share/doc/
     211    +^file:/usr/share/doc/ }}}
     212 1. Create a file called docurls with this line:
     213    {{{ file:///usr/share/doc/ }}}
     214 1. Make sure we are using the unaltered nutch:
     215    {{{
     216    which nutch
     217    }}}
     218    Output is /usr/bin/nutch which is a symlink to /usr/local/nutch/bin/nutch.
    214219 1. Run the crawl:
    215220    {{{
    216221    nutch crawl urls.doc -dir crawl.doc > crawl.doc.log
    217222    }}}
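The `which nutch` check above only shows the first match on the PATH. A minimal sketch of following the symlink to the real script is below. It assumes GNU `readlink -f` is available, and `resolve_cmd` is a hypothetical helper name, not part of nutch.

```shell
#!/bin/sh
# resolve_cmd NAME: print the fully resolved path of a command on the PATH.
# Hypothetical helper for illustration; relies on GNU readlink -f.
resolve_cmd() {
    path=$(command -v "$1") || return 1   # same lookup that `which` performs
    readlink -f "$path"                   # follow every symlink to the target
}

# Only run the check if nutch is actually installed on this machine.
if command -v nutch >/dev/null 2>&1; then
    resolve_cmd nutch
fi
```

On the setup described above, this should print the target under /usr/local/nutch/bin/ rather than /usr/bin/nutch.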
     223
     224== Results of Test ==
     225First, looking through crawl.doc.log, there is no sign that nutch crawled either the symlink file "needle.txt" or the
     226jump file "jump.txt". Searching the log for these two names (for example, in vim) finds neither. The haystack.txt file
     227does appear in the log:
     228{{{
     2291203 fetching file:/usr/share/doc/sed/haystack.txt
     230}}}
     231A visual scan of the log shows no sign that nutch crawled any directory above /usr/share/doc.
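The same log check can be done without opening vim. The following is a rough sketch using `grep`; `check_log` is a hypothetical helper name, and the log file name is the one used in the crawl step above.

```shell
#!/bin/sh
# check_log LOG NAME...: report whether each NAME occurs anywhere in LOG.
# Hypothetical helper for illustration only.
check_log() {
    log="$1"; shift
    for name in "$@"; do
        # -F matches the name literally, so the "." in "needle.txt" is not a wildcard
        if grep -q -F -- "$name" "$log"; then
            echo "$name: found"
        else
            echo "$name: not found"
        fi
    done
}

# Only run against the real log if it exists in the current directory.
if [ -f crawl.doc.log ]; then
    check_log crawl.doc.log needle.txt jump.txt haystack.txt
fi
```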
     232
     233Next, try searching the crawl using !NutchBean.  Here are the results for a few different searches.  Remember to point the
     234searcher.dir property in nutch-site.xml at the crawl directory, crawl.doc.
     235
     236Look for haystack:
     237{{{
     238> nutch org.apache.nutch.searcher.NutchBean haystack9
     239Total hits: 1
     240 0 20090417072918/file:/usr/share/doc/sed/haystack.txt
     241 ... found the haystack.
     242haystack9
     243}}}
     244 
     245Look for needle:
     246{{{
     247> nutch org.apache.nutch.searcher.NutchBean needle.txt
     248Total hits: 0
     249> nutch org.apache.nutch.searcher.NutchBean needle008
     250Total hits: 0
     251}}}
     252
     253Look for jump:
     254{{{
     255>  nutch org.apache.nutch.searcher.NutchBean jump.txt
     256Total hits: 0
     257>  nutch org.apache.nutch.searcher.NutchBean meatball33
     258Total hits: 0
     259}}}
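The searches above can be repeated in one pass. This is only a sketch: it assumes the searcher prints a `Total hits: N` line exactly as in the transcripts above, and `total_hits` is a hypothetical helper name.

```shell
#!/bin/sh
# total_hits: read searcher output on stdin and print the number that
# follows "Total hits:". Hypothetical helper; assumes the output format
# shown in the transcripts above.
total_hits() {
    sed -n 's/^Total hits: \([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Search terms taken from the searches above; skipped if nutch is not installed.
if command -v nutch >/dev/null 2>&1; then
    for term in haystack9 needle008 meatball33; do
        hits=$(nutch org.apache.nutch.searcher.NutchBean "$term" | total_hits)
        echo "$term: ${hits:-unknown} hits"
    done
fi
```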
     260
     261In conclusion, nutch does not appear to follow the symlinks, nor does it jump into a higher directory.
     262
    218263
    219264