...

Command used:

[dspace]/bin/dspace solr-export-statistics

Java class:

org.dspace.util.SolrImportExport

Arguments (short and long forms):

Description

-i or --index-name

optional, the name of the index to process. "statistics" is the default

-l or --last integer

optional, export only the last integer days' worth of statistics

-d or --directory

optional, directory to use for storing the exported files. By default, [dspace]/solr-export is used. If that is not appropriate (due to storage concerns), we recommend you use this option to specify a more appropriate location.

-f or --force-overwrite

optional, overwrite export file if it exists (DSpace 6.1 and later)
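
As an illustration only, the export options above might be combined as in the following sketch. The /dspace and /var/tmp/solr-export paths and the DSPACE_DIR, EXPORT_DIR and RUN variables are assumptions made for this example and are not part of DSpace. By default the script only prints the command so that it can be reviewed first.

```shell
#!/bin/sh
# Assumption: DSPACE_DIR stands in for your [dspace] directory (/dspace is a placeholder).
DSPACE_DIR="${DSPACE_DIR:-/dspace}"
# Assumption: EXPORT_DIR is a hypothetical location with enough free space.
EXPORT_DIR="${EXPORT_DIR:-/var/tmp/solr-export}"

# Print the command by default; execute it only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Export the last 30 days of the default "statistics" index to EXPORT_DIR.
run "$DSPACE_DIR/bin/dspace" solr-export-statistics -i statistics -l 30 -d "$EXPORT_DIR"
```

Setting RUN=1 in the environment executes the command instead of printing it.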

Import SOLR statistics, for restoring lost data or moving to another server

Command used:

[dspace]/bin/dspace solr-import-statistics

Java class:

org.dspace.util.SolrImportExport

Arguments (short and long forms):

Description

-i or --index-name

optional, the name of the index to process. "statistics" is the default

-c or --clear

optional, clears the contents of the existing stats core before importing

-d or --directory

optional, directory which contains the files for importing. By default, [dspace]/solr-export is used. If that is not appropriate (due to storage concerns), we recommend you use this option to specify a more appropriate location.
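
Because --clear empties the existing core before importing, it can be worth checking that the import directory actually contains files first. The following is a minimal sketch of such a guard. The has_export_files helper, the /dspace placeholder path and the DSPACE_DIR, IMPORT_DIR and RUN variables are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Assumption: DSPACE_DIR stands in for your [dspace] directory (/dspace is a placeholder).
DSPACE_DIR="${DSPACE_DIR:-/dspace}"
# [dspace]/solr-export is the documented default location for exported files.
IMPORT_DIR="${IMPORT_DIR:-$DSPACE_DIR/solr-export}"

# Succeeds only if the directory exists and contains at least one entry.
has_export_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if has_export_files "$IMPORT_DIR"; then
    if [ "${RUN:-0}" = "1" ]; then
        # Clear the existing core, then import everything found in IMPORT_DIR.
        "$DSPACE_DIR/bin/dspace" solr-import-statistics -i statistics -c -d "$IMPORT_DIR"
    else
        echo "would run: $DSPACE_DIR/bin/dspace solr-import-statistics -i statistics -c -d $IMPORT_DIR"
    fi
else
    echo "no export files found in $IMPORT_DIR; not importing" >&2
fi
```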

Reindex SOLR statistics, for upgrades or whenever the Solr schema for statistics is changed

Command used:

[dspace]/bin/dspace solr-reindex-statistics

Java class:

org.dspace.util.SolrImportExport

Arguments (short and long forms):

Description

-i or --index-name

optional, the name of the index to process. "statistics" is the default

-k or --keep

optional, tells the script to keep the intermediate export files for possible later use (by default all exported files are removed at the end of the reindex process).

-d or --directory

optional, directory to use for storing the exported files (temporarily, unless you also specify --keep, see above). By default, [dspace]/solr-export is used. If that is not appropriate (due to storage concerns), we recommend you use this option to specify a more appropriate location. Not sure about your space requirements? You can estimate the space required by looking at the current size of [dspace]/solr/statistics.

-f or --force-overwrite

optional, overwrite export file if it exists (DSpace 6.1 and later)

NOTE: solr-reindex-statistics is safe to run on a live site. The script stores incoming usage data in a temporary SOLR core, and then merges that new data into the reindexed data when the reindex process completes.
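
The space estimate suggested for --directory can be sketched as a comparison between the current size of [dspace]/solr/statistics and the free space at the chosen working directory. This is only a rough check under stated assumptions: the dir_size_kb and free_kb helpers, the /dspace placeholder path and the DSPACE_DIR and WORK_DIR variables are hypothetical and not part of DSpace.

```shell
#!/bin/sh
# Assumption: DSPACE_DIR stands in for your [dspace] directory (/dspace is a placeholder).
DSPACE_DIR="${DSPACE_DIR:-/dspace}"
# Directory that would be passed to --directory.
WORK_DIR="${WORK_DIR:-$DSPACE_DIR/solr-export}"

# Size of a directory tree in KiB (0 if the directory does not exist).
dir_size_kb() {
    if [ -d "$1" ]; then
        du -sk "$1" | awk '{print $1}'
    else
        echo 0
    fi
}

# Free space in KiB on the filesystem that holds the given path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

needed=$(dir_size_kb "$DSPACE_DIR/solr/statistics")

if [ -d "$WORK_DIR" ] && [ "$(free_kb "$WORK_DIR")" -gt "$needed" ]; then
    echo "enough space; run: $DSPACE_DIR/bin/dspace solr-reindex-statistics -d $WORK_DIR"
else
    echo "check $WORK_DIR before reindexing (need roughly $needed KiB)" >&2
fi
```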

Upgrade Legacy DSpace Object Identifiers (pre-6x statistics) to DSpace 6x UUID Identifiers

Info: This feature has not yet been released.

This command will be introduced in the DSpace 6.4 and DSpace 7.0 releases.

It is recommended that all DSpace instances with legacy identifiers perform this one-time upgrade of legacy statistics records.

This action is safe to run on a live site. As a precaution, it is recommended that you back up your statistics shards before performing this action.

Note: a link to this section of the documentation should be added to the DSpace 6.4 and DSpace 7.0 Release Notes.

Note: https://groups.google.com/forum/#!topic/dspace-tech/HbdmAGw2C1E gives instructions for running SolrUpgradePre6xStatistics.


The DSpace 6x code base changed the primary key for all DSpace objects from an integer id to UUID identifiers.  Statistics records that were created before upgrading to DSpace 6x contain the legacy identifiers.  

While the DSpace user interfaces make some attempt to correlate legacy identifiers with UUID identifiers, it is recommended that users perform this one-time upgrade of legacy statistics records.

If you have sharded your statistics repository, this action must be performed on each shard.


Command used:

[dspace]/bin/dspace solr-upgrade-statistics-6x

Java class:

org.dspace.util.SolrUpgradePre6xStatistics

Arguments (short and long forms):

Description

-i or --index-name

Optional, the name of the index to process. "statistics" is the default

-n or --num_rec

Optional. Total number of records to update (default=100,000).

To process all records, set -n to 10000000 or to 100000000 (10M or 100M)
If possible, please allocate 2GB of memory to this process (e.g. -Xmx2000m)

-b or --batch_size

Number of records to batch update to SOLR at one time (default=10,000).


NOTE: This process will rewrite most solr statistics records and may temporarily double the size of your statistics repositories. Consider optimizing your solr repos when complete.

If a UUID value cannot be found for a legacy id, the legacy id will be converted to the form "xxxx-unmigrated" where xxxx is the legacy id.  
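
Since the upgrade must be performed on each shard, one way to avoid missing a shard is to generate one command per core. The sketch below assumes that the cores live in directories named statistics and statistics-xxxx under [dspace]/solr, and that the dspace launcher passes the JAVA_OPTS environment variable on to the JVM. The list_stat_cores helper, the /dspace placeholder path and the DSPACE_DIR and SOLR_DIR variables are hypothetical. The script only prints the commands so that they can be reviewed before being run.

```shell
#!/bin/sh
# Assumption: DSPACE_DIR stands in for your [dspace] directory (/dspace is a placeholder).
DSPACE_DIR="${DSPACE_DIR:-/dspace}"
SOLR_DIR="${SOLR_DIR:-$DSPACE_DIR/solr}"

# Print the main core name plus any "statistics-xxxx" shard directories found.
list_stat_cores() {
    for dir in "$1"/statistics "$1"/statistics-*; do
        [ -d "$dir" ] && basename "$dir"
    done
    return 0
}

# Assumption: the launcher honours JAVA_OPTS; 2GB follows the recommendation above.
JAVA_OPTS="-Xmx2000m"
export JAVA_OPTS

# One upgrade command per core; -n is set high enough to cover all records.
for core in $(list_stat_cores "$SOLR_DIR"); do
    echo "$DSPACE_DIR/bin/dspace solr-upgrade-statistics-6x -i $core -n 10000000 -b 10000"
done
```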



Routine Solr Index Maintenance

Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms):

Description

-o or --optimize

Run maintenance on the SOLR index. Recommended to run daily, to prevent your servlet container from running out of memory

Notes:

Using this option is strongly recommended. Run this script daily (from crontab or your system's scheduler) to prevent your servlet container from running out of memory.

Solr Sharding By Year

Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms):

Description

-s or --shard-solr-index

Splits the data in the main Solr core up into a separate core for each year.  This will improve the performance of Solr.

Notes:

Yearly Solr sharding is a routine that can drastically improve the performance of your DSpace Solr statistics. It was introduced in DSpace 3.0 and is not backwards compatible. The routine decreases the load created by the logging of new usage events by reducing the size of the Solr core in which new usage data are being logged. By running the script, you effectively split your current Solr core, containing all of your usage events, into different Solr cores that each contain the data for one year. If your DSpace has been logging usage events for less than one year, you will see no notable performance improvement until you run the script after the start of a new year. Both writing new usage events and reading existing ones should perform better over several smaller Solr shards than over one monolithic core.

It is highly recommended that you execute this script once at the start of every year. To ensure this is not forgotten, you can include it in your crontab or other system scheduling software.  Here's an example cron entry (just replace [dspace] with the full path of your DSpace installation):

Code Block
# At 12:00AM on January 1, "shard" the DSpace Statistics Solr index.  Ensures each year has its own Solr index - this improves performance.
0 0 1 1 * [dspace]/bin/dspace stats-util -s


Note: You MUST restart Tomcat after sharding

After running the statistics shard process, the "View Usage Statistics" page(s) in DSpace will not automatically recognize the new shard.

Restart Tomcat to ensure that the new shard is recognized and included in usage statistics queries.


Warning: Repair of Shards Created Before DSpace 5.7 or DSpace 6.1

If you ran the shard process before upgrading to DSpace 5.7 or DSpace 6.1, multi-valued fields such as owningComm and owningColl are likely to be corrupted. Previous versions of the shard process lost the multi-valued nature of these fields. Without it, it is difficult to query for statistics records by community / collection / bundle.

You can verify this problem in the solr admin console by looking at the owningComm field on existing records and looking for the presence of "\\," within that field.

The following process may be used to repair these records.

  1. Back up your solr statistics-xxxx directories while Tomcat is down.
  2. Back up and delete the contents of the dspace-install/solr-export directory.
  3. For each "statistics-xxxx" shard that exists, export the repository

    Code Block (bash)
    dspace solr-export-statistics -i statistics-xxxx -f


  4. Run the following to repair records in the dspace-install/solr-export directory

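    Code Block (bash)
    # Run from inside the dspace-install/solr-export directory.
    for file in *
    do
        # Replace each run of backslashes followed by a comma with a plain comma.
        sed -E -e "s/[\\]+,/,/g" -i "$file"
    done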


  5. For each shard that was exported, run the following import

    Code Block (bash)
    dspace solr-import-statistics -i statistics-xxxx -f


If you repeat the query that was run previously, the fields containing "\\," should now contain an array of owning community ids.
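
To see what the sed expression in step 4 does before running it on real export files, it can be tried on a single made-up line. The sample value below is hypothetical and only imitates a field whose separator commas were escaped. The exact layout of a real export file may differ.

```shell
#!/bin/sh
# Hypothetical sample line: the commas between the ids are preceded by backslashes.
sample='owningComm,"12\\,34\\,56"'

# Same expression as in step 4, applied to standard input instead of a file.
repaired=$(printf '%s\n' "$sample" | sed -E -e "s/[\\]+,/,/g")

printf 'before: %s\n' "$sample"
printf 'after:  %s\n' "$repaired"
```

Commas that are not preceded by a backslash, such as the one after owningComm, are left untouched.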


Info: Shard Naming

Prior to the release of DSpace 6.1, the shard names created were off by one year in timezones with a positive offset from GMT.

Shards created subsequent to this release may appear to skip by one year. See DS-3437 in the DuraSpace JIRA.

...

Technical implementation details

After sharding, the Solr data cores are located in the [dspace.dir]/solr directory. There is no need to define the location of each individual core in solr.xml because they are automatically retrieved at runtime. This retrieval happens in the static method located in the org.dspace.statistics.SolrLogger class. These cores are stored in the statisticYearCores list. Each time a query is made to Solr, these cores are added as shards by the addAdditionalSolrYearCores method. The cores share a common configuration copied from your original statistics core. Therefore, no issues should result from subsequent ant updates.

The actual sharding of the original Solr core into individual cores by year is done in the shardSolrIndex method in the org.dspace.statistics.SolrLogger class. The sharding is done by first running a facet on the time to get the facets split by year. Once we have our years from our logs, we query the main Solr data server for all information on each year and download these as CSVs. When we have all data for one year, we upload it to the newly created core of that year by using the update CSV handler. Once all data of one year have been uploaded, those data are removed from the main Solr core (by doing it this way, if Solr crashes we do not need to start from scratch).
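
The partitioning idea behind this process can be illustrated outside of DSpace with a few lines of shell. The sketch below is not the shardSolrIndex implementation and does not talk to Solr. It only splits a made-up CSV file into one file per year, assuming a header row and an ISO 8601 timestamp in the first column. The split_by_year function, the column layout and the sample rows are all hypothetical.

```shell
#!/bin/sh
# Split CSV rows into one file per year, named statistics-<year>.csv.
# Assumes line 1 is a header and column 1 starts with a four-digit year.
split_by_year() {
    in_file=$1
    out_dir=$2
    awk -F, -v out="$out_dir" '
        NR == 1 { header = $0; next }
        {
            year = substr($1, 1, 4)
            file = out "/statistics-" year ".csv"
            # Write the header once, the first time a year is seen.
            if (!(file in seen)) { print header > file; seen[file] = 1 }
            print > file
        }
    ' "$in_file"
}

# Demo with made-up rows in a temporary directory.
work=$(mktemp -d)
cat > "$work/all.csv" <<'EOF'
time,type,id
2015-03-01T10:00:00Z,2,10
2016-01-05T09:30:00Z,2,11
2015-12-31T23:59:59Z,0,12
EOF

split_by_year "$work/all.csv" "$work"
wc -l "$work"/statistics-*.csv
```

Processing one year at a time, as the description above does, has the advantage that an interrupted run leaves the completed years in place.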

Info: Multiple Shard Fix (DSpace 6.1)

A bug exists in the DSpace 6.0 release that prevents Tomcat from starting when multiple shards are present.

To address this issue, the initialization of Solr shards is deferred until the first Solr-related requests are processed.

See DS-3457 in the DuraSpace JIRA.

 

Testing Solr Shards