
...

Add these classes to the CSS file and apply any style you like (such as centering the text or the image).

Recognizing Web Spiders (Bots, Crawlers, etc.)

DSpace can often recognize that a given access request comes from a web spider that is indexing your repository.  Such accesses can be flagged for separate treatment (perhaps exclusion) in usage statistics.  Recognition relies on patterns matched against incoming requests; these patterns live in files that you will find in config/spiders.
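As a rough sketch of the layout described in the rest of this section (the specific iplists.com-*.txt file names are illustrative only; the set shipped with your DSpace version may differ):

    config/spiders/
        iplists.com-google.txt    <- example address-pattern file from iplists.com
        iplists.com-misc.txt      <- example; actual file names will vary
        agents/                   <- regular expressions tested against User-Agent headers
        domains/                  <- regular expressions tested against request domain names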

In the spiders directory itself, you will find a number of files provided by iplists.com.  These files contain network address patterns that have been found to identify a number of known indexing services and other spiders.  You can add your own files here if you wish to exclude further addresses that you know about.  You will need to include your files' names in the list configured in config/modules/solr-statistics.cfg.  The iplists.com-*.txt files can be updated using a tool provided by DSpace; see SOLR Statistics for details.
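As a minimal sketch of what that configured list might look like, here is a hypothetical excerpt.  The property name solr-statistics.spiderips.urls and the entries shown are assumptions for illustration; confirm the exact key against the comments in your own copy of solr-statistics.cfg:

    # config/modules/solr-statistics.cfg -- the key name below is an
    # assumption; check your own copy of the file for the exact property
    solr-statistics.spiderips.urls = http://iplists.com/google.txt, \
                                     http://iplists.com/misc.txt, \
                                     http://example.org/my-spiders.txt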

In the spiders directory you will also find two subdirectories.  agents contains files of regular expressions, one per line.  An incoming request's User-Agent header is tested against each expression found in any of these files until one matches; if there is a match, the request is marked as coming from a spider, and otherwise it is not.  domains similarly contains files of regular expressions, which are used to test the domain name from which the request comes.  You may add your own files of regular expressions to either directory if you wish to test requests with patterns of your own devising.
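For illustration, here is what a hypothetical agents file might contain.  Each line is exactly one regular expression and nothing else; the file name and the patterns themselves are made up for this example and are not the stock rules shipped with DSpace:

    config/spiders/agents/example-local.txt:

    [Bb]ot
    crawler
    ^ExampleSpider/\d+\.\d+

A request whose User-Agent header contains "crawler" would match the second pattern (assuming, as is common, that expressions are matched anywhere within the header rather than anchored to its start), and the request would therefore be counted as spider traffic.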