solr - How do you configure Apache Nutch 2.3 to honour the robots metatag?


I have a Nutch 2.3 setup with an HBase backend, and I run a crawl cycle that includes indexing to Solr and Solr deduplication.
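For context, the cycle is driven by the stock crawl script, along these lines (the seed directory, crawl id and Solr URL below are placeholders for my actual values):

bin/crawl urls/ mycrawl http://localhost:8983/solr/ 2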

I have noticed that the Solr index contains unwanted webpages.

In order for Nutch to ignore these webpages, I set the following metatag:

<meta name="robots" content="noindex,follow">  

I have visited the official Apache Nutch website, which explains the following:

If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots meta tag.

Searching the web for answers, I found recommendations to set protocol.check_robots or to set the protocol.plugin.check.robots property in nutch-site.xml. Neither of these appears to work.
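For reference, here is roughly what those recommendations amount to in nutch-site.xml (the property names are taken from those posts, not from the official documentation, and setting them made no difference for me):

<property>
  <name>protocol.check_robots</name>
  <value>true</value>
</property>
<property>
  <name>protocol.plugin.check.robots</name>
  <value>true</value>
</property>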

At present, Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.

The question is: how do you configure Nutch 2.3 to honour robots metatags?

Also, suppose Nutch 2.3 was configured to ignore the robots metatag and indexed a webpage during a previous crawl cycle. Provided the robots metatag rules are now correct, will that result in the page being removed from the Solr index in future crawls?

I've created a plugin to overcome the problem of Apache Nutch 2.3 not honouring the noindex rule of the robots metatag. The metarobots plugin forces Nutch to discard qualifying documents during indexing, which prevents them from being indexed to the external datastore, i.e. Solr.

Please note: the plugin prevents the indexing of documents that contain the noindex robots metatag rule; it does not remove documents already indexed in the external datastore.
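To give an idea of the approach, here is a minimal sketch of such an indexing filter. It is not the exact plugin source: the package and class names are illustrative, and it assumes the robots metatag has already been parsed into the page metadata under the key metatag.robots (as the parse-metatags plugin does when configured to extract it).

package org.example.nutch.metarobots; // illustrative package, not the plugin's real one

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

public class MetaRobotsIndexingFilter implements IndexingFilter {

  // assumed metadata key; parse-metatags stores tags as "metatag.<name>"
  private static final Utf8 ROBOTS_KEY = new Utf8("metatag.robots");

  // declare which WebPage fields this filter needs loaded from the datastore
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.METADATA);
  }

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    ByteBuffer raw = page.getMetadata().get(ROBOTS_KEY);
    if (raw != null) {
      byte[] bytes = new byte[raw.remaining()];
      raw.duplicate().get(bytes);
      String robots = new String(bytes, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
      if (robots.contains("noindex")) {
        return null; // returning null discards the document, so it never reaches Solr
      }
    }
    return doc;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The filter class is then declared as an org.apache.nutch.indexer.IndexingFilter extension in the plugin's plugin.xml and activated by adding the plugin id to plugin.includes in nutch-site.xml.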

Visit this link for instructions.

