...

NOTE: solr-reindex-statistics is safe to run on a live site. The script stores incoming usage data in a temporary SOLR core, and then merges that new data into the reindexed data when the reindex process completes.

Upgrade Legacy DSpace Object Identifiers (pre-6x statistics) to DSpace 6x UUID Identifiers

Info: This feature has not yet been released.

This command will be introduced in the DSpace 6.4 and DSpace 7.0 releases.

It is recommended that all DSpace instances with legacy identifiers perform this one-time upgrade of legacy statistics records.

This action is safe to run on a live site. As a precaution, it is recommended that you back up your statistics shards before performing this action.

Note: a link to this section of the documentation should be added to the DSpace 6.4 and DSpace 7.0 Release Notes.


The DSpace 6x code base changed the primary key for all DSpace objects from an integer ID to a UUID. Statistics records that were created before upgrading to DSpace 6x contain the legacy identifiers.

While the DSpace user interfaces make some attempt to correlate legacy identifiers with UUID identifiers, it is recommended that users perform this one-time upgrade of legacy statistics records.

If you have sharded your statistics repository, this action must be performed on each shard.


Command used:

[dspace]/bin/dspace solr-upgrade-statistics-6x

Java class:

org.dspace.util.SolrUpgradePre6xStatistics

Arguments (short and long forms), each followed by its description:

-i or --index-name
    Optional. The name of the index to process. "statistics" is the default.

-n or --num_rec
    Optional. Total number of records to update (default=100,000).
    To process all records, set -n to 10000000 or to 100000000 (10M or 100M).
    If possible, please allocate 2GB of memory to this process (e.g. -Xmx2000m).

-b or --batch_size
    Number of records to batch update to SOLR at one time (default=10,000).
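
Since the upgrade must be run against each shard, a small wrapper loop can help. The following is a minimal sketch, not an official script. It assumes a hypothetical install directory of /dspace, hypothetical shard names (statistics-2015, statistics-2014), and that the dspace launcher honors the JAVA_OPTS environment variable for the memory setting. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: run the one-time upgrade once per statistics core (main core plus shards).
# Assumptions (not from the DSpace documentation): the install directory,
# the shard names, and the use of JAVA_OPTS to pass -Xmx2000m.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"                            # your [dspace] directory
CORES="${CORES:-statistics statistics-2015 statistics-2014}"   # hypothetical core names
NUM_REC=10000000     # large enough to process all records (10M)
BATCH_SIZE=10000     # the documented default batch size
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_upgrade() {
  core="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "JAVA_OPTS=-Xmx2000m $DSPACE_DIR/bin/dspace solr-upgrade-statistics-6x -i $core -n $NUM_REC -b $BATCH_SIZE"
  else
    JAVA_OPTS="-Xmx2000m" "$DSPACE_DIR/bin/dspace" solr-upgrade-statistics-6x \
      -i "$core" -n "$NUM_REC" -b "$BATCH_SIZE"
  fi
}

for core in $CORES; do
  run_upgrade "$core"
done
```

Run it once with the default dry run to review the printed commands, then set DRY_RUN=0 to execute them.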


NOTE: This process will rewrite most Solr statistics records and may temporarily double the size of your statistics repositories. Consider optimizing your Solr repositories when the process completes.

If a UUID value cannot be found for a legacy id, the legacy id will be converted to the form "xxxx-unmigrated" where xxxx is the legacy id.  
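
The naming rule for unmigrated identifiers can be illustrated with a small helper. This is only a sketch of the rule described above, not the DSpace implementation. The function name to_migrated_id is hypothetical, and the sample UUID is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper that mimics the naming rule only.
# $1 = legacy integer id, $2 = UUID found for it (empty when none was found).
to_migrated_id() {
  legacy_id="$1"
  uuid="$2"
  if [ -n "$uuid" ]; then
    printf '%s\n' "$uuid"
  else
    printf '%s-unmigrated\n' "$legacy_id"
  fi
}

# A legacy id with a known UUID (made-up value) keeps the UUID.
to_migrated_id 1234 "0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0"
# A legacy id without a UUID receives the "-unmigrated" suffix.
to_migrated_id 5678 ""
```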


Routine Solr Index Maintenance

...

Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms), each followed by its description:

-s or --shard-solr-index
    Splits the data in the main Solr core into a separate Solr core for each year. This will improve the performance of Solr.

Notes:

Yearly Solr sharding is a routine that can drastically improve the performance of your DSpace SOLR statistics. It was introduced in DSpace 3.0 and is not backwards compatible. The routine decreases the load created by the logging of new usage events by reducing the size of the SOLR core in which new usage data are being logged. By running the script, you effectively split your current SOLR core, containing all of your usage events, into different SOLR cores that each contain the data for one year. If your DSpace has been logging usage events for less than one year, you will see no notable performance improvement until you run the script after the start of a new year. Both writing new usage events and read operations should perform better over several smaller SOLR shards than over one monolithic core.

...

Code Block
# At 12:00AM on January 1, "shard" the DSpace Statistics Solr index.  Ensures each year has its own Solr index - this improves performance.
0 0 1 1 * [dspace]/bin/dspace stats-util -s


Note: You MUST restart Tomcat after sharding

After running the statistics shard process, the "View Usage Statistics" page(s) in DSpace will not automatically recognize the new shard.

Restart Tomcat to ensure that the new shard is recognized and included in usage statistics queries.

...

Technical implementation details

After sharding, the Solr data cores are located in the [dspace.dir]/solr directory. There is no need to define the location of each individual core in solr.xml because they are automatically retrieved at runtime. This retrieval happens in the static method located in the org.dspace.statistics.SolrLogger class. These cores are stored in the statisticYearCores list. Each time a query is made to Solr, these cores are added as shards by the addAdditionalSolrYearCores method. The cores share a common configuration copied from your original statistics core. Therefore, subsequent ant updates should not cause any issues.

The actual sharding of the original Solr core into individual cores by year is done in the shardSolrIndex method in the org.dspace.statistics.SolrLogger class. The sharding first runs a facet on the time to get the records split by year. Once the years present in the logs are known, the main Solr data server is queried for all information on each year, and the results are downloaded as CSVs. When all data for one year has been downloaded, it is uploaded to the newly created core for that year using the update CSV handler. Once all data for one year has been uploaded, that data is removed from the main Solr core. Because the work proceeds one year at a time, a Solr crash does not require starting from scratch.
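
The steps above can be sketched as a sequence of Solr HTTP requests. The following is a rough illustration only and is not taken from the SolrLogger source. The Solr base URL, the per-year core naming (statistics-YYYY), the facet parameters, and the helper names are assumptions. The script only prints the planned requests (with the query left unencoded for readability) and does not contact Solr.

```shell
#!/bin/sh
# Sketch only: prints requests that mirror the facet, export, upload and delete steps.
SOLR_BASE="${SOLR_BASE:-http://localhost:8080/solr}"   # assumed Solr URL
MAIN_CORE="statistics"

# Solr range query matching every record whose time falls in the given year.
year_query() {
  printf 'time:[%s-01-01T00:00:00Z TO %s-12-31T23:59:59.999Z]' "$1" "$1"
}

# Name of the per-year core (assumed "statistics-YYYY" naming).
year_core() {
  printf '%s-%s' "$MAIN_CORE" "$1"
}

# Print the facet request that discovers which years hold data.
plan_facet() {
  echo "facet:  GET  $SOLR_BASE/$MAIN_CORE/select?q=*:*&rows=0&facet=true&facet.range=time&facet.range.start=NOW/YEAR-10YEARS&facet.range.end=NOW/YEAR%2B1YEAR&facet.range.gap=%2B1YEAR"
}

# Print the export, upload and delete requests for one year.
plan_year() {
  q="$(year_query "$1")"
  echo "export: GET  $SOLR_BASE/$MAIN_CORE/select?wt=csv&q=$q"
  echo "upload: POST $SOLR_BASE/$(year_core "$1")/update/csv?commit=true"
  echo "delete: POST $SOLR_BASE/$MAIN_CORE/update?commit=true  <delete><query>$q</query></delete>"
}

plan_facet
plan_year 2015   # example year
```

Deleting a year from the main core only after its upload has finished is what makes the procedure resumable: an interrupted run leaves the remaining years untouched in the main core.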

...

Testing Solr Shards
