Nutch segments folder grows every day
I have configured Nutch/Solr 1.6 to crawl and index an intranet of about 4000 documents and HTML pages every 12 hours.

If I run the crawler against an empty database, the whole process takes about 30 minutes. After the crawl has been running for several days, it becomes very slow. Looking at the log file, it seems that last night the final step (SolrIndexer) only started after 1 hour and 20 minutes and then took a bit more than 1 hour.

Since the number of indexed documents doesn't grow, I'm wondering why it has become so slow.

Nutch is executed with the following command:

```
bin/nutch crawl -urlDir urls -solr http://localhost:8983/solr -dir nutchdb -depth 15 -topN 3000
```
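For reference, the individual phases that this all-in-one crawl command wraps (and which show up as steps in the log) can also be run one by one to see where the time goes. A rough sketch of a single round, assuming the same urls directory and nutchdb layout as above; exact option syntax may vary slightly between Nutch 1.x versions:

```
# Seed the crawldb with the start URLs
bin/nutch inject nutchdb/crawldb urls

# Generate a fetch list; each round creates a new segment under nutchdb/segments
bin/nutch generate nutchdb/crawldb nutchdb/segments -topN 3000

# Pick the segment that was just generated (segment names are timestamps, so the newest sorts last)
SEGMENT=$(ls -d nutchdb/segments/* | tail -1)

# Fetch and parse it
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"

# Fold the fetch results back into the crawldb
bin/nutch updatedb nutchdb/crawldb "$SEGMENT"

# Rebuild the link database and send the parsed documents to Solr
bin/nutch invertlinks nutchdb/linkdb -dir nutchdb/segments
bin/nutch solrindex http://localhost:8983/solr nutchdb/crawldb -linkdb nutchdb/linkdb "$SEGMENT"
```

Timing each of these steps separately would show whether the extra time is spent fetching, updating the crawldb, or in the SolrIndexer.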
The nutch-site.xml contains:

```
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>

  <property>
    <name>http.agent.name</name>
    <value>Internet Site Agent</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata|more|http-header)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Used only if plugin parse-metatags is enabled. -->
  <property>
    <name>metatags.names</name>
    <value>description;keywords;published;modified</value>
    <description>Names of the metatags to extract, separated by ';'.
    Use '*' to extract all metatags. Prefixes the names with 'metatag.'
    in the parse-metadata. For instance, to index description and keywords
    you need to activate the plugin index-metadata and set the value of the
    parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
    </description>
  </property>

  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords,metatag.published,metatag.modified</value>
    <description>Comma-separated list of keys to be taken from the parse
    metadata to generate fields. Can be used e.g. for 'description' or
    'keywords', provided that these values are generated by a parser
    (see the parse-metatags plugin).
    </description>
  </property>

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Set this to false if you start crawling your website from,
    for example, http://www.example.com but would also like to crawl
    xyz.example.com. Set it to true if you want to exclude external links.
    </description>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>10000000</value>
    <description>The length limit for downloaded content using the http
    protocol, in bytes. If this value is nonnegative (>=0), content longer
    than it will be truncated; otherwise, no truncation at all. Do not
    confuse this setting with the file.content.limit setting.
    </description>
  </property>

  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>1</value>
    <description>If the Crawl-Delay in robots.txt is set to greater than this
    value (in seconds), the fetcher will skip the page, generating an error
    report. If set to -1, the fetcher will never skip such pages and will
    wait the amount of time retrieved from the robots.txt Crawl-Delay,
    however long that might be.
    </description>
  </property>

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>The number of FetcherThreads the fetcher should use. This
    also determines the maximum number of requests that are made at once
    (each FetcherThread handles one connection). The total number of threads
    running in distributed mode will be the number of fetcher threads *
    number of nodes, as the fetcher has one map task per node.
    </description>
  </property>

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>The number of FetcherThreads the fetcher should use. This
    also determines the maximum number of requests that are made at once
    (each FetcherThread handles one connection). The total number of threads
    running in distributed mode will be the number of fetcher threads *
    number of nodes, as the fetcher has one map task per node.
    </description>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
    <description>The number of seconds the fetcher will delay between
    successive requests to the same server.</description>
  </property>

  <property>
    <name>http.redirect.max</name>
    <value>0</value>
    <description>The maximum number of redirects the fetcher will follow
    when trying to fetch a page. If set to negative or 0, the fetcher won't
    immediately follow redirected URLs; instead it will record them for
    later fetching.
    </description>
  </property>

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
    <description>This number is the maximum number of threads that should be
    allowed to access a queue at one time. Replaces the deprecated parameter
    'fetcher.threads.per.host'.
    </description>
  </property>

  <property>
    <name>link.delete.gone</name>
    <value>true</value>
    <description>Whether to delete gone pages from the web graph.</description>
  </property>

  <property>
    <name>link.loops.depth</name>
    <value>20</value>
    <description>The depth for the loops algorithm.</description>
  </property>

  <!-- moreindexingfilter plugin properties -->
  <property>
    <name>moreIndexingFilter.indexMimeTypeParts</name>
    <value>false</value>
    <description>Determines whether the index-more plugin will split the
    mime-type into sub parts; this requires the type field to be multi-valued.
    Set to true for backward compatibility. False will not split the mime-type.
    </description>
  </property>

  <property>
    <name>moreIndexingFilter.mapMimeTypes</name>
    <value>false</value>
    <description>Determines whether MIME-type mapping is enabled. It takes a
    plain text file with mapped MIME-types. With it the user can map both
    application/xhtml+xml and text/html to the same target MIME-type so they
    can be treated equally in an index. See conf/contenttype-mapping.txt.
    </description>
  </property>

  <!-- Fetch Schedule Configuration -->
  <property>
    <name>db.fetch.interval.default</name>
    <!-- for now always re-fetch everything -->
    <value>10</value>
    <description>The default number of seconds between re-fetches of a page
    (less than 1 day).
    </description>
  </property>

  <property>
    <name>db.fetch.interval.max</name>
    <!-- for now always re-fetch everything -->
    <value>10</value>
    <description>The maximum number of seconds between re-fetches of a page
    (less than one day). After this period every page in the db will be
    re-tried, no matter what its status is.
    </description>
  </property>

  <!--property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    <description>The implementation of fetch schedule. DefaultFetchSchedule
    simply adds the original fetchInterval to the last fetch time, regardless
    of page changes.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.4</value>
    <description>If a page is unmodified, its fetchInterval will be increased
    by this rate. This value should not exceed 0.5, otherwise the algorithm
    becomes unstable.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.dec_rate</name>
    <value>0.2</value>
    <description>If a page is modified, its fetchInterval will be decreased
    by this rate. This value should not exceed 0.5, otherwise the algorithm
    becomes unstable.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>60.0</value>
    <description>Minimum fetchInterval, in seconds.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>31536000.0</value>
    <description>Maximum fetchInterval, in seconds (365 days).
    NOTE: this is limited by db.fetch.interval.max. Pages with fetchInterval
    larger than db.fetch.interval.max will be fetched anyway.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.sync_delta</name>
    <value>true</value>
    <description>If true, try to synchronize with the time of page change by
    shifting the next fetchTime by a fraction (sync_rate) of the difference
    between the last modification time and the last fetch time.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
    <value>0.3</value>
    <description>See sync_delta for description. This value should not exceed
    0.5, otherwise the algorithm becomes unstable.</description>
  </property>

  <property>
    <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
    <value>0.3</value>
    <description>See sync_delta for description. This value should not exceed
    0.5, otherwise the algorithm becomes unstable.</description>
  </property-->

  <property>
    <name>fetcher.threads.fetch</name>
    <value>1</value>
    <description>The number of FetcherThreads the fetcher should use. This
    also determines the maximum number of requests that are made at once
    (each FetcherThread handles one connection). The total number of threads
    running in distributed mode will be the number of fetcher threads *
    number of nodes, as the fetcher has one map task per node.
    </description>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/apache-nutch/tmp/</value>
  </property>

  <!-- Boilerpipe -->
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>

  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>

</configuration>
```
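With db.fetch.interval.default and db.fetch.interval.max both set to 10 seconds, every URL is already due for re-fetch whenever the 12-hourly crawl runs. The schedule stored for an individual URL in the crawldb can be inspected with the readdb tool; a quick sketch (the URL is just a placeholder):

```
# Dump the CrawlDatum (status, fetch time, fetch interval, ...) for one URL
bin/nutch readdb nutchdb/crawldb -url http://intranet.example.com/some-page.html
```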
As you can see, I have configured Nutch to always re-fetch every document. Because the site is small, re-fetching everything should be fine for now (the first crawl takes only 30 minutes...).

I have noticed that in the folder crawldb/segments roughly 40 new segments are created every day, so the disk size of the database is of course growing very fast.

Is this the expected behaviour? Is there something wrong with the configuration?
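For completeness, a rough way to see how much crawl state is piling up, using the standard Nutch 1.x command-line tools (paths assume the nutchdb directory from the crawl command above; segments_merged is just an illustrative output name):

```
# Summary of the crawldb: counts of fetched / unfetched / gone URLs
bin/nutch readdb nutchdb/crawldb -stats

# Number of segments and the disk space they occupy
ls nutchdb/segments | wc -l
du -sh nutchdb/segments/*

# Segments that have already been indexed can be merged into a single one to
# reclaim space (writes to a new directory; the old segment directories can
# then be removed)
bin/nutch mergesegs nutchdb/segments_merged -dir nutchdb/segments
```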