
...

Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms) and descriptions:

-b or --reindex-bitstreams
Reindex the bitstreams to ensure we have the bundle name.

-r or --remove-deleted-bitstreams
While indexing the bundle names, remove the statistics for deleted bitstreams.

-u or --update-spider-files
Update spider IP files from the internet into [dspace]/config/spiders. Downloads the spider files identified in dspace.cfg under the property solr.spiderips.urls. See Configuration settings for Statistics.

-f or --delete-spiders-by-flag
Delete spiders in Solr by the isBot flag. Will prune out all records that have isBot:true.

-i or --delete-spiders-by-ip
Delete spiders in Solr by IP address, DNS name, or agent name. Will prune out all records whose IPs match the spider identification patterns.

-m or --mark-spiders
Update the isBot flag in Solr. Marks any records currently stored in statistics whose IP addresses match those in the spider files.

-h or --help
Calls up this brief help table at the command line.

...

The usage of these options is open for the user to choose. If you want to keep spider entries in your repository, you can mark them using "-m" and they will be excluded from statistics queries when "solr.statistics.query.filter.isBot = true" is set in dspace.cfg. If you want to keep the spiders out of the Solr repository entirely, use the "-i" option and they will be removed immediately.
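For example, a typical maintenance pass might mark spider records first and only prune flagged records later. The commands below simply combine the documented options; [dspace] is a placeholder for your installation directory, so they are not runnable as-is:

```
# Mark existing statistics records whose IPs match the spider files
[dspace]/bin/dspace stats-util -m

# Later, permanently remove all records flagged isBot:true
[dspace]/bin/dspace stats-util -f
```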

Spider IPs are specified in files containing one pattern per line.  A line may be a comment (starting with "#" in column 1), empty, or a single IP address or DNS name.  If a name is given, it will be resolved to an address.  Unresolvable names are discarded and will be noted in the log.

There are guards in place to control what can be defined as an IP range for a bot. In [dspace]/config/spiders, spider IP address ranges must be at least three subnet sections in length (e.g. 123.123.123), and IP ranges can only span the smallest subnet [123.123.123.0 - 123.123.123.255]. Otherwise, loading that row will cause exceptions in the DSpace logs and that IP entry will be excluded.
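A spider IP file following the rules above might look like the sketch below (all entries are hypothetical examples, not real spiders). Comments and blank lines are ignored, which can be checked with a simple filter:

```shell
# Create a hypothetical spider IP file (example entries only)
cat > /tmp/example-spiders.txt <<'EOF'
# Example spider list
192.168.2
10.0.5.12
crawler.example.com

EOF

# Show only the lines DSpace would treat as patterns (drop comments and blanks)
grep -Ev '^#|^[[:space:]]*$' /tmp/example-spiders.txt
```

Note that 192.168.2 is a valid three-section range per the guard described above, while a two-section entry such as 192.168 would be rejected.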

Spiders may also be excluded by DNS name or user agent header value. Place files of patterns in [dspace]/config/spiders/domains and/or [dspace]/config/spiders/agents. Each line in a pattern file should be either empty, a comment starting with "#" in column 1, or a regular expression matching names to be recognized as spiders.
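To illustrate how such regular expressions match agent names, the sketch below filters a few user agent strings through a hypothetical pattern file (these patterns are examples, not DSpace's shipped lists):

```shell
# Hypothetical agent pattern file: one regular expression per line
cat > /tmp/example-agents.txt <<'EOF'
# Example agent patterns
[Gg]ooglebot
.*crawler.*
EOF

# Strip comments and blank lines, as DSpace does when loading patterns
grep -Ev '^#|^[[:space:]]*$' /tmp/example-agents.txt > /tmp/example-agents-clean.txt

# Agents matching any pattern would be recognized as spiders
printf '%s\n' 'Googlebot/2.1' 'Mozilla/5.0 (Windows NT 10.0)' 'mycrawler/1.0' \
  | grep -E -f /tmp/example-agents-clean.txt
```

Here only 'Googlebot/2.1' and 'mycrawler/1.0' are matched; the plain browser agent passes through unflagged.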

Routine Solr Index Maintenance

...

Technical implementation details

After sharding, the Solr data cores are located in the [dspace.dir]/solr directory. There is no need to define the location of each individual core in solr.xml because they are automatically retrieved at runtime. This retrieval happens in a static method of the org.dspace.statistics.SolrLogger class, and the cores are stored in the statisticYearCores list. Each time a query is made to Solr, these cores are added as shards by the addAdditionalSolrYearCores method. The cores share a common configuration copied from your original statistics core, so no issues should result from subsequent ant updates.

The actual sharding of the original Solr core into individual cores by year is done in the shardSolrIndex method of the org.dspace.statistics.SolrLogger class. The sharding is done by first running a facet on the time field to get the facets split by year. Once we have the years from our logs, we query the main Solr data server for all information on each year and download it as CSV. When we have all the data for one year, we upload it to the newly created core for that year using the CSV update handler. Once all data for a year has been uploaded, that data is removed from the main Solr core (this way, if Solr crashes mid-process, we do not need to start from scratch).
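The same facet / CSV-export / CSV-import flow can be sketched with Solr's HTTP API. This is an illustrative sketch only, assuming a Solr instance at localhost:8080 with a "statistics" core and a yearly "statistics-2013" core; DSpace performs these steps internally via SolrJ, not via curl:

```
# 1. Facet on the time field to find which years hold data
curl 'http://localhost:8080/solr/statistics/select?q=*:*&rows=0&facet=true&facet.range=time&facet.range.start=NOW/YEAR-10YEARS&facet.range.end=NOW&facet.range.gap=%2B1YEAR'

# 2. Export one year's records as CSV
curl 'http://localhost:8080/solr/statistics/select?q=time:[2013-01-01T00:00:00Z%20TO%202014-01-01T00:00:00Z]&rows=100000&wt=csv' > statistics-2013.csv

# 3. Load the CSV into the yearly core via the CSV update handler
curl 'http://localhost:8080/solr/statistics-2013/update/csv?commit=true' -H 'Content-Type: text/plain' --data-binary @statistics-2013.csv
```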