DSpace Log Converter

With the release of DSpace 1.6, a new statistics software component was added. The use of Solr for statistics in DSpace makes it possible to maintain a database of usage statistics. This raises the question of what to do with older log files and how a site can make use of them. The commands described below convert the existing log files and then import them into Solr. This conversion only needs to be performed once.
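For example, a typical run converts each old DSpace log file and then imports the converted output into Solr. The flag names shown here (-i/--in, -o/--out) are illustrative and should be checked against the full argument tables for your DSpace version:

[dspace]/bin/dspace stats-log-converter -i [dspace]/log/dspace.log.2010-01-01 -o [dspace]/log/dspace.log.2010-01-01.converted
[dspace]/bin/dspace stats-log-importer -i [dspace]/log/dspace.log.2010-01-01.converted

Conversion and import can be run file by file, so sites with many years of logs can process them in batches.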

...

Spiders may also be excluded by DNS name or Agent header value. Place one or more files of patterns in the directories [dspace]/config/spiders/domains and/or [dspace]/config/spiders/agents. Each line in a pattern file should be empty, a comment starting with "#" in column 1, or a regular expression matching names that should be recognized as spiders.
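As an illustration only (these patterns are examples, not a recommended block list), a file such as [dspace]/config/spiders/agents/example.txt could contain:

# user agents to be treated as spiders
(?i).*bot.*
(?i).*crawler.*
(?i).*spider.*

Any request whose Agent header matches one of these regular expressions would then be recognized as spider traffic.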

Export SOLR records to intermediate format for import into another tool/instance

Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms) and descriptions:

-e or --export

Export SOLR view statistics data to usage statistics intermediate format

This exports the records to [dspace]/temp/usagestats_0.csv, starting a new file after every 10,000 records. The resulting files can be imported into SOLR Statistics with stats-log-importer, or into the deprecated Elasticsearch Usage Statistics with stats-log-importer-elasticsearch.
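Putting the command and argument together, the export described above is run as:

[dspace]/bin/dspace stats-util -e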

Export SOLR statistics, for backup and moving to another server

...

Technical implementation details

After sharding, the Solr data cores are located in the [dspace.dir]/solr directory. There is no need to define the location of each individual core in solr.xml because they are automatically retrieved at runtime. This retrieval happens in a static method of the org.dspace.statistics.SolrLogger class, and the cores are stored in the statisticYearCores list. Each time a query is made to Solr, these cores are added as shards by the addAdditionalSolrYearCores method. The cores share a common configuration copied from your original statistics core, so no issues should result from subsequent ant updates.
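For illustration only, assuming the per-year cores follow the statistics-<year> naming produced by the sharding process, the directory layout might look like:

[dspace.dir]/solr/statistics          (the original statistics core)
[dspace.dir]/solr/statistics-2021
[dspace.dir]/solr/statistics-2022

Each per-year core carries the configuration copied from the original statistics core, which is why the cores can be discovered and queried together without extra entries in solr.xml.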

The actual sharding of the original Solr core into individual cores by year is done in the shardSolrIndex method in the org.dspace.statistics.SolrLogger class. The sharding starts by running a facet on the time field to split the records by year. Once we have the years present in the data, we query the main Solr data server for all information on each year and download it as CSV files. When we have all the data for one year, we upload it to the newly created core for that year using the CSV update handler. Once all the data for a year has been uploaded, it is removed from the main Solr core (this way, if Solr crashes mid-process, we do not need to start from scratch).
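As a rough manual sketch of what shardSolrIndex does for a single year (the host, port, core names, and year range below are assumptions for illustration; the real code performs these steps through SolrJ rather than curl):

# 1. Download one year's records from the main statistics core as CSV
curl -G "http://localhost:8983/solr/statistics/select" \
     --data-urlencode 'q=time:[2022-01-01T00:00:00Z TO 2023-01-01T00:00:00Z]' \
     --data-urlencode 'rows=10000000' \
     --data-urlencode 'wt=csv' \
     -o statistics-2022.csv

# 2. Upload the CSV into the per-year core using the CSV update handler
curl "http://localhost:8983/solr/statistics-2022/update/csv?commit=true" \
     -H 'Content-Type: text/csv; charset=utf-8' \
     --data-binary @statistics-2022.csv

# 3. Only after the upload succeeds, delete that year's records from the main core
curl "http://localhost:8983/solr/statistics/update?commit=true" \
     -H 'Content-Type: text/xml' \
     --data-binary '<delete><query>time:[2022-01-01T00:00:00Z TO 2023-01-01T00:00:00Z]</query></delete>'

Deleting only after a successful upload mirrors the crash-safety described above: if Solr fails partway through, the main core still contains the data and the process can simply be rerun.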

...

Testing Solr Shards