KazMuzik.net
Music / Technology / Healthcare / Immigration / America

Nutch subcommands - Kaz Muzik Blog Backup Project #25 - KazMuzik Blog
2007-07-04 21:44

Yesterday morning's fetch failed to retrieve 54 of the 400 pages, so I will recover them here.

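First, I use the updatedb subcommand to update the crawl database with the information from the fetched segment, checking the statistics with readdb before and after.
$ bin/nutch readdb tmp-crawldb -stats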
CrawlDb statistics start: tmp-crawldb
Statistics for CrawlDb: tmp-crawldb
TOTAL urls:     400
retry 0:        400
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        400
CrawlDb statistics: done
$ touch tmp-segments/20070703074227/fetcher.done
$ bin/nutch updatedb tmp-crawldb -dir tmp-segments -filter -noAdditions
...
$ bin/nutch readdb tmp-crawldb -stats
CrawlDb statistics start: tmp-crawldb
Statistics for CrawlDb: tmp-crawldb
TOTAL urls:     400
retry 0:        399
retry 1:        1
min score:      1.0
avg score:      1.014
max score:      1.092
status 1 (db_unfetched):        54
status 2 (db_fetched):  346
CrawlDb statistics: done
$
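
The unfetched count can also be pulled out of the statistics by a script instead of being read by eye. The following is a minimal sketch, assuming the statistics arrive on standard input in the same layout as the transcript above; unfetched_count is a hypothetical helper name, not part of Nutch.

```shell
# Hypothetical helper: print the db_unfetched count found in the output of
# `bin/nutch readdb <crawldb> -stats`, read from standard input.
# Prints 0 when the output contains no db_unfetched line.
unfetched_count() {
  # The count is the last whitespace-separated field of the matching line.
  awk '/db_unfetched/ { n = $NF } END { print n + 0 }'
}
```

It would be used as `bin/nutch readdb tmp-crawldb -stats | unfetched_count`.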

54 pages are now marked unfetched, so I generate a new segment and fetch it.
$ bin/nutch generate tmp-crawldb tmp-segments
...
Generator: segment: tmp-segments/20070704204912
...
$ bin/nutch fetch tmp-segments/20070704204912
Fetcher: starting
Fetcher: segment: tmp-segments/20070704204912
Fetcher: threads: 10
fetching http://kazuomik.livejournal.com/88261.html
...
fetching http://kazuomik.livejournal.com/7327.html
Fetcher: done
$ bin/nutch readseg -list -dir tmp-segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20070703074227  400        2007-07-03T07:43:29  2007-07-03T08:18:01  400     346
20070704204912  54         2007-07-04T20:50:09  2007-07-04T20:54:51  54      50
$ 
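
The number of pages still missing in each segment can be computed from this listing. The sketch below assumes, as the numbers in this post suggest, that FETCHED minus PARSED is the count of pages that need another attempt, and that the listing has the six-column layout shown above; unparsed_per_segment is a hypothetical helper name.

```shell
# Hypothetical helper: for each segment row of `readseg -list` output
# (read from standard input), print the segment name followed by
# FETCHED minus PARSED.
unparsed_per_segment() {
  # Segment rows start with an all-digit timestamp name; this skips the
  # header line. FETCHED and PARSED are the last two columns.
  awk '$1 ~ /^[0-9]+$/ && NF >= 6 { print $1, $(NF-1) - $NF }'
}
```

It would be used as `bin/nutch readseg -list -dir tmp-segments | unparsed_per_segment`.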

Four pages remain. I repeat the same steps once more.
$ bin/nutch updatedb tmp-crawldb -dir tmp-segments -filter -noAdditions
...
$ bin/nutch generate tmp-crawldb tmp-segments
...
Generator: segment: tmp-segments/20070704205943
...
$ bin/nutch fetch tmp-segments/20070704205943
Fetcher: starting
Fetcher: segment: tmp-segments/20070704205943
Fetcher: threads: 10
fetching http://kazuomik.livejournal.com/88261.html
fetching http://kazuomik.livejournal.com/69271.html
fetching http://kazuomik.livejournal.com/39676.html
fetching http://kazuomik.livejournal.com/17012.html
Fetcher: done
$ bin/nutch readseg -list -dir tmp-segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20070703074227  400        2007-07-03T07:43:29  2007-07-03T08:18:01  400     346
20070704204912  54         2007-07-04T20:50:09  2007-07-04T20:54:51  54      50
20070704205943  4          2007-07-04T21:00:16  2007-07-04T21:00:33  4       4
$ 
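
Since the same updatedb, generate, fetch cycle was run twice by hand, it could be wrapped in a loop. This is only a sketch: the bin/nutch arguments are the ones used in the transcripts above, while NUTCH, refetch_until_done and the round limit are hypothetical names, and it assumes that readdb prints its statistics to standard output in the layout shown earlier.

```shell
# Sketch: repeat updatedb -> generate -> fetch until readdb reports no
# db_unfetched pages, or give up after max_rounds rounds.
# NUTCH may be overridden to point at a different launcher.
NUTCH=${NUTCH:-bin/nutch}

refetch_until_done() {
  crawldb=$1 segments=$2 max_rounds=${3:-5}
  round=0
  while [ "$round" -lt "$max_rounds" ]; do
    "$NUTCH" updatedb "$crawldb" -dir "$segments" -filter -noAdditions || return 1
    # Last field of the db_unfetched line; 0 when there is no such line.
    left=$("$NUTCH" readdb "$crawldb" -stats |
      awk '/db_unfetched/ { n = $NF } END { print n + 0 }')
    [ "$left" -eq 0 ] && return 0
    "$NUTCH" generate "$crawldb" "$segments" || return 1
    # Segment names are timestamps, so the newest one sorts last.
    newest=$(ls -d "$segments"/* | sort | tail -n 1)
    "$NUTCH" fetch "$newest" || return 1
    round=$((round + 1))
  done
  return 1  # still unfetched pages after max_rounds rounds
}
```

For the directories in this post the call would be `refetch_until_done tmp-crawldb tmp-segments`. The round limit is there so that a page that never fetches does not keep the loop running forever.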

All 400 pages have now been fetched. To be safe, I update the crawl database and merge the segments.
$ bin/nutch updatedb tmp-crawldb -dir tmp-segments -filter -noAdditions
...
$ bin/nutch readdb tmp-crawldb -stats
...
TOTAL urls:     400
...
status 2 (db_fetched):  400
CrawlDb statistics: done
$ bin/nutch mergesegs tmp-merged -dir tmp-segments
...
$ bin/nutch readseg -list -dir tmp-merged
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20070704210407  400        2007-07-03T07:43:29  2007-07-04T21:00:33  400     400
$

I also bring in today's updates.
$ mkdir tmp-urls
$ cp /home/kaz/kazmuzikblog/urls-for-nutch.txt tmp-urls
$ bin/nutch inject tmp-crawldb tmp-urls
...
$ bin/nutch generate tmp-crawldb tmp-segments
...
Generator: segment: tmp-segments/20070704211457
...
$ bin/nutch fetch tmp-segments/20070704211457
Fetcher: starting
Fetcher: segment: tmp-segments/20070704211457
Fetcher: threads: 10
fetching http://kazuomik.livejournal.com/104659.html
fetching http://kazuomik.livejournal.com/103902.html
fetching http://kazuomik.livejournal.com/104752.html
fetching http://kazuomik.livejournal.com/104291.html
fetching http://kazuomik.livejournal.com/104057.html
Fetcher: done
$ bin/nutch readseg -list -dir tmp-segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20070703074227  400        2007-07-03T07:43:29  2007-07-03T08:18:01  400     346
20070704204912  54         2007-07-04T20:50:09  2007-07-04T20:54:51  54      50
20070704205943  4          2007-07-04T21:00:16  2007-07-04T21:00:33  4       4
20070704211457  5          2007-07-04T21:15:27  2007-07-04T21:15:48  5       5
$ rm -rf tmp-merged
$ bin/nutch mergesegs tmp-merged -dir tmp-segments
...
$ bin/nutch readseg -list -dir tmp-merged
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20070704211743  405        2007-07-03T07:43:29  2007-07-04T21:15:48  405     405
$ bin/nutch updatedb tmp-crawldb -dir tmp-segments -filter -noAdditions
...
$ bin/nutch readdb tmp-crawldb -stats
...
TOTAL urls:     405
...
status 2 (db_fetched):  405
CrawlDb statistics: done
$

The fetched content of all 405 pages has been merged into a single segment, and the crawl database has been updated as well.

I create a new directory and save everything there.
$ mkdir crawl-kazmuzikblog
$ mv tmp-crawldb crawl-kazmuzikblog/crawldb
$ mv tmp-merged crawl-kazmuzikblog/segments
$ ls -l crawl-kazmuzikblog
total 3
drwxr-xr-x+ 3 kaz  kaz  3 2007-07-04 21:20 crawldb
drwxr-xr-x+ 3 kaz  kaz  3 2007-07-04 21:18 segments
$ rm -rf tmp-segments
$ cat tmp-urls/urls-for-nutch.txt urls-kazmuzikblog/nutch \
  > crawl-kazmuzikblog/urls-kazmuzikblog.txt
$ wc -l crawl-kazmuzikblog/urls-kazmuzikblog.txt
405 crawl-kazmuzikblog/urls-kazmuzikblog.txt
$
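
The line count matches the 405 URLs in the crawl database, but wc -l alone would not notice a URL that appears in both input lists. A minimal sketch of an extra check, assuming one URL per line; duplicate_urls is a hypothetical helper name.

```shell
# Hypothetical helper: print every URL that occurs more than once in the
# given seed list file (one URL per line). Prints nothing when all lines
# are unique.
duplicate_urls() {
  sort "$1" | uniq -d
}
```

Running `duplicate_urls crawl-kazmuzikblog/urls-kazmuzikblog.txt` should print nothing if the combined list has no repeated URLs.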

Tags: computer_technology