Design - Indexing

Creating, modifying or deleting objects in the repository generates JMS events. The indexer listens to those events, retrieves the RDF from the repository once and then passes the RDF to a series of workers that process it in various ways, such as copying to a triplestore or Solr indexing.

One of the major goals of this event-based indexing approach is to reduce the impact of indexing on core repository functionality. The repository just creates a JMS event (containing only the object pid and the event type, which are already in memory), and does not need to do any extra work for indexing before moving on to its next task. When repository updates happen at a faster rate than the indexer can match, JMS events can wait in the queue until the indexer catches up, and the updates can continue without waiting. When processing large batches of updates, you can even disable the indexer.
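The decoupling described above can be illustrated with a minimal sketch in which an in-memory queue stands in for the JMS broker. The names RepoEvent and EventBuffer and the event type strings are hypothetical and exist only for this illustration; the real system uses ActiveMQ rather than a LinkedBlockingQueue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event: carries only the object pid and the event type.
class RepoEvent {
    final String pid;
    final String eventType;

    RepoEvent(String pid, String eventType) {
        this.pid = pid;
        this.eventType = eventType;
    }
}

// Hypothetical stand-in for the JMS destination between repository and indexer.
class EventBuffer {
    private final BlockingQueue<RepoEvent> queue = new LinkedBlockingQueue<>();

    // Repository side: enqueue the event and return immediately.
    void publish(String pid, String eventType) {
        queue.offer(new RepoEvent(pid, eventType));
    }

    // Number of events still waiting for the indexer.
    int pending() {
        return queue.size();
    }

    // Indexer side: take everything that has accumulated, in arrival order.
    List<RepoEvent> drain() {
        List<RepoEvent> batch = new ArrayList<>();
        queue.drainTo(batch);
        return batch;
    }
}

class EventBufferDemo {
    public static void main(String[] args) {
        EventBuffer buffer = new EventBuffer();
        // Updates continue while the indexer is busy or disabled.
        buffer.publish("obj1", "create");
        buffer.publish("obj1", "update");
        buffer.publish("obj2", "delete");
        System.out.println("pending: " + buffer.pending());
        // Later, the indexer catches up.
        for (RepoEvent e : buffer.drain()) {
            System.out.println(e.eventType + " " + e.pid);
        }
    }
}
```

The publishing side never waits on the consuming side, which is the property the event-based design relies on.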

The indexer can have any number of workers configured to process the events. The main indexer process retrieves the object RDF from the repository, and that content is reused by all of the workers. If you want to process the events in several ways (triplestore, Solr, archive to disk, update a remote repository, etc.), the metadata is still retrieved from the repository only once each time the object is updated.
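As a rough sketch of this retrieve-once, fan-out pattern: the Worker, CountingRepository and FanOutIndexer types below are hypothetical stand-ins written for this example, not the actual org.fcrepo.indexer classes, and the RDF string is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical worker contract; not the actual org.fcrepo.indexer API.
interface Worker {
    void update(String pid, String rdf);
}

// Hypothetical repository stand-in that counts how often RDF is fetched.
class CountingRepository {
    int fetches = 0;

    String fetchRdf(String pid) {
        fetches++;
        return "<info:fedora/" + pid + "> <http://purl.org/dc/elements/1.1/title> \"example\" .";
    }
}

// Retrieves the RDF once per event and hands the same content to every worker.
class FanOutIndexer {
    private final CountingRepository repository;
    private final List<Worker> workers;

    FanOutIndexer(CountingRepository repository, List<Worker> workers) {
        this.repository = repository;
        this.workers = workers;
    }

    void onUpdate(String pid) {
        String rdf = repository.fetchRdf(pid); // single retrieval per event
        for (Worker worker : workers) {
            worker.update(pid, rdf);           // same content reused by each worker
        }
    }
}

class FanOutDemo {
    public static void main(String[] args) {
        CountingRepository repository = new CountingRepository();
        List<Worker> workers = new ArrayList<>();
        workers.add((pid, rdf) -> System.out.println("triplestore worker got " + pid));
        workers.add((pid, rdf) -> System.out.println("file worker got " + pid));
        new FanOutIndexer(repository, workers).onUpdate("obj1");
        System.out.println("repository fetches: " + repository.fetches);
    }
}
```

Adding a third worker to the list does not change the number of repository fetches per event.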

Configuration

The indexer is configured using Spring. Here is a sample configuration fragment showing two workers and the framework for listening to events and connecting them with the workers:

  <!-- Worker #1: Copy object RDF to a Fuseki triplestore using SPARQL Update -->
  <bean id="sparqlUpdate" class="org.fcrepo.indexer.SparqlIndexer">
    <!-- base URL for triplestore subjects, PID will be appended -->
    <property name="prefix" value="http://localhost:${test.port:8080}/rest/objects/"/>
    <property name="queryBase" value="http://localhost:3030/test/query"/>
    <property name="updateBase" value="http://localhost:3030/test/update"/>
    <property name="formUpdates">
      <value type="java.lang.Boolean">false</value>
    </property>
  </bean>

  <!-- Worker #2: Save object RDF to timestamped files on disk -->
  <bean id="fileSerializer" class="org.fcrepo.indexer.FileSerializer">
    <property name="path" value="./target/test-classes/fileSerializer/"/>
  </bean>

  <!-- Main indexer class that processes events, gets RDF from the repository and calls the workers -->
  <bean id="indexerGroup" class="org.fcrepo.indexer.IndexerGroup">
    <property name="repositoryURL" value="http://localhost:${test.port:8080}/rest/objects/" />
    <property name="indexers">
      <set>
        <ref bean="fileSerializer"/>
        <ref bean="sparqlUpdate"/>
      </set>
    </property>
  </bean>

  <!-- ActiveMQ topic to listen for events -->
  <bean id="destination" class="org.apache.activemq.command.ActiveMQTopic">
    <constructor-arg value="fedora" />
  </bean>

  <!-- Message listener container to connect the JMS topic to the indexer -->
  <bean id="jmsContainer" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
    <property name="connectionFactory" ref="connectionFactory"/>
    <property name="destination" ref="destination"/>
    <property name="messageListener" ref="indexerGroup" />
    <property name="sessionTransacted" value="true"/>
  </bean>

Extending the Indexer

To implement a new kind of indexer:

1. Implement the indexing functionality using the org.fcrepo.indexer.Indexer interface, which consists of only two methods (one to handle new/updated records, and another to handle deleted records). Any configuration required should be done using Java bean setter methods.
2. Update the Spring configuration to add a bean referencing the new class and providing the configuration properties needed.
3. Add the bean to the list of workers invoked by the indexer.
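The first step might look roughly like the sketch below. The HypotheticalIndexer interface is an assumed shape for a two-method contract and is defined here only so the example compiles on its own; consult org.fcrepo.indexer.Indexer for the actual method names and signatures. InMemoryIndexer and its prefix property are likewise invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assumed shape of a two-method indexer contract. This is NOT the real
// org.fcrepo.indexer.Indexer interface; check the source for its signatures.
interface HypotheticalIndexer {
    void update(String pid, String content); // new or updated record
    void remove(String pid);                 // deleted record
}

// Example worker that keeps the latest content for each object in memory.
class InMemoryIndexer implements HypotheticalIndexer {
    private String prefix = "";
    private final Map<String, String> index = new LinkedHashMap<>();

    // Bean-style setter so that Spring can inject the value with
    // a <property name="prefix" value="..."/> element.
    public void setPrefix(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public void update(String pid, String content) {
        index.put(prefix + pid, content);
    }

    @Override
    public void remove(String pid) {
        index.remove(prefix + pid);
    }

    // Copy of the current index contents, for inspection.
    public Map<String, String> snapshot() {
        return new LinkedHashMap<>(index);
    }
}

class InMemoryIndexerDemo {
    public static void main(String[] args) {
        InMemoryIndexer indexer = new InMemoryIndexer();
        indexer.setPrefix("http://localhost:8080/rest/objects/");
        indexer.update("obj1", "first version");
        indexer.update("obj1", "second version");
        indexer.update("obj2", "other object");
        indexer.remove("obj2");
        System.out.println(indexer.snapshot());
    }
}
```

Because all configuration goes through setters, a class like this can be declared as a Spring bean and referenced from the indexers set of the indexerGroup bean in the same way as the two sample workers.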

Trying Out the Indexer

The easiest way to get hands-on experience with the indexer and see updates synced with an external triplestore is to use the kitchen sink project. The kitchen sink offers a Fedora4 repository with the indexer pre-configured to sync to a Fuseki triplestore. To set this up, first download and run the Fuseki triplestore. Then build and run the pre-configured Fedora4:

$ git clone https://github.com/futures/fcrepo-kitchen-sink.git
$ cd fcrepo-kitchen-sink
$ git checkout fuseki
$ MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=1024m" mvn install
$ MAVEN_OPTS="-Xmx512m" mvn jetty:run

Using the default settings, Fedora4 will be running at http://localhost:8080/rest/ – you can create, update and delete objects and datastreams using your browser. Each event will trigger the indexer and be synced to Fuseki, which you can access at http://localhost:3030/.
