...

Solr can be set up in "cloud mode" (SolrCloud), which supports redundancy and scaling out.  SolrCloud also activates new APIs, notably the Collections API, which we might leverage in place of the code that we now provide for manipulating Solr cores.  We need to decide whether the new APIs are useful enough to require all sites to run in cloud mode.

A SolrCloud can consist of a single instance with an internal copy of Apache ZooKeeper (which is used to orchestrate multiple instances), so it may be relatively simple for a site with modest requirements to run that way and gain support for whichever APIs we choose to use.  It does mean more moving parts, including new ports to be secured.

We currently use an older mode of sharding which is now considered "legacy" and probably won't get much attention in the future, so we might adopt the Collections API now as an attempt at future-proofing DSpace's use of Solr.  On the other hand, advice about running SolrCloud mostly assumes a large installation with multiple instances and multiple external ZooKeeper nodes to manage them, so there may not be much help available for single-instance production SolrCloud sites.  We should find out why the legacy mode still exists, whether it is intended to disappear some day, and whether a minimal cloud setup is significantly harder to manage than a stand-alone one.
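To give a feel for what relying on the Collections API might involve, here is a minimal sketch that only builds the request URL for creating a single-shard collection.  The host, port and collection name are assumptions chosen for illustration, and nothing here reflects code that DSpace currently ships.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API request URL; nothing is sent over the network."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# Hypothetical host and collection name, for illustration only.
url = collections_api_url(
    "http://localhost:8983/solr",
    "CREATE",
    name="statistics",
    numShards=1,
    replicationFactor=1,
)
print(url)
```

A single shard with a single replica corresponds to the minimal one-instance cloud discussed above; a larger site would raise `numShards` or `replicationFactor`.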

  • If multiple shards are already in use, how should those be migrated into the new version of Solr?
  • As a Solr instance grows (specifically statistics), what scaling options exist?  If SolrCloud is the solution, how difficult will it be to make that migration later?

Our use of sharding is, well, a bit eccentric.  Sharding was introduced into Solr to spread the work of searching a large index across multiple storage drives and/or host nodes, and most support for it is aimed at randomly distributing records across shards.  DSpace instead defines shards by year: all records timestamped within the same year go into a single shard, and the administrator is expected to create new yearly shards as needed.  Recent Solr versions implement Time Routed Aliases, which we should consider as a replacement.
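The contrast can be sketched roughly as follows.  The first function mimics the yearly clumping described above; the second builds (but does not send) a Collections API `CREATEALIAS` request for a Time Routed Alias with a one-year interval.  The shard prefix, alias name, `time` field, start date, configset name and host are all assumptions made for illustration, and the parameters should be checked against the Solr Reference Guide for whichever version we target.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def yearly_shard_for(timestamp, prefix="statistics"):
    """Name of the yearly shard a record would land in (assumed naming scheme)."""
    return f"{prefix}-{timestamp.year}"


def time_routed_alias_url(base_url, alias, field, start,
                          interval="+1YEAR", config="statistics"):
    """Build a CREATEALIAS request URL for a Time Routed Alias (not sent)."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",        # selects time-based routing
        "router.field": field,        # date field used to route each document
        "router.start": start,        # start of the first collection's time slice
        "router.interval": interval,  # date-math span covered by each collection
        "create-collection.collection.configName": config,  # hypothetical configset
        "create-collection.numShards": 1,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Current approach: the shard is derived from the record's year.
shard = yearly_shard_for(datetime(2019, 6, 1, tzinfo=timezone.utc))

# Possible replacement: Solr creates the per-interval collections itself.
alias_url = time_routed_alias_url(
    "http://localhost:8983/solr",   # hypothetical host
    alias="statistics",
    field="time",                   # assumed name of the timestamp field
    start="2010-01-01T00:00:00Z",   # arbitrary example start date
)
print(shard)
print(alias_url)
```

With a Time Routed Alias, documents are sent to the alias and Solr routes them by the date field, which would take over the job of creating a new shard each year by hand.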

...

  • solr-export-statistics and solr-import-statistics
  • solr-reindex-statistics
  • stats-log-importer
  • stats-util
  • solr-upgrade-statistics-6x (we will need to provide this for folks upgrading from 4x or 5x to 7x)

Related Tickets and Pull Requests

  • DS-3691 (DuraSpace JIRA) - Improve search stemming
  • https://github.com/DSpace/DSpace/pull/2058 - Upgrade Solr Client
  • DS-4066 (DuraSpace JIRA) - Upgrade statistics from id to uuid