Creating, modifying or deleting resources in the repository generates JMS events. The indexer listens for those events and retrieves the RDF from the repository. The indexer can be configured to process each event in various ways, such as copying the resource RDF to a triplestore or indexing it in Solr.

One of the major goals of this event-based indexing approach is to reduce the impact of indexing on core repository functionality. The repository just creates a JMS event (containing only the resource identifier and the event type, which are already in memory) and does not need to do any extra work for indexing before moving on to its next task. When repository updates arrive faster than the indexer can process them, JMS events wait in the queue until the indexer catches up, and the updates continue without waiting. When processing large batches of updates, you can even disable the indexer.
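
The decoupling described above can be illustrated with a small sketch. It is written in Python purely for brevity (the indexer itself is Java), and it uses an in-process `queue.Queue` as a stand-in for the JMS destination. The function names and the event dictionary layout are assumptions made for this example, not part of the indexer.

```python
import queue

# Stand-in for the JMS destination; the real system uses a message broker.
events = queue.Queue()


def repository_update(identifier, event_type):
    """Repository side: record a small event and return immediately."""
    events.put({"id": identifier, "type": event_type})


def drain(handler):
    """Indexer side: process whatever has accumulated, in arrival order."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        handler(event)
        handled += 1


# No indexer is running yet: the updates still complete and events accumulate.
for n in range(3):
    repository_update(f"/rest/objects/obj{n}", "update")
backlog = events.qsize()

# The indexer catches up later and works through the backlog.
seen = []
processed = drain(seen.append)
print(backlog, processed)
```

The point of the sketch is only that the producer never waits on the consumer; delivery guarantees in a real deployment depend on how the broker and destination are configured.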

The indexer can have any number of workers configured to process the events. The main indexer process retrieves the resource RDF from the repository once, and that content is reused by all of the workers. If you want to process the events in several ways (triplestore, Solr, archive to disk, update remote repository, etc.), this limits retrieval of the metadata from the repository to once per resource update.
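
A minimal sketch of this "retrieve once, reuse for every worker" pattern follows. It is a conceptual illustration in Python, not the indexer's Java code: `Dispatcher`, `RecordingWorker`, `fake_fetch` and the event type strings are all hypothetical names chosen for the example.

```python
class RecordingWorker:
    """Hypothetical worker that just remembers what it was asked to index."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def update(self, identifier, rdf):
        self.records[identifier] = rdf

    def remove(self, identifier):
        self.records.pop(identifier, None)


class Dispatcher:
    """Fetches the RDF once per event and hands the same content to each worker."""

    def __init__(self, fetch_rdf, workers):
        self.fetch_rdf = fetch_rdf
        self.workers = list(workers)

    def on_event(self, identifier, event_type):
        if event_type == "delete":
            for worker in self.workers:
                worker.remove(identifier)
            return
        rdf = self.fetch_rdf(identifier)  # single retrieval, shared below
        for worker in self.workers:
            worker.update(identifier, rdf)


fetch_calls = []


def fake_fetch(identifier):
    """Stand-in for an HTTP request to the repository; counts its calls."""
    fetch_calls.append(identifier)
    return f"<{identifier}> a <http://example.org/Thing> ."


workers = [RecordingWorker(n) for n in ("triplestore", "solr", "disk")]
dispatcher = Dispatcher(fake_fetch, workers)
dispatcher.on_event("/rest/objects/obj1", "update")
dispatcher.on_event("/rest/objects/obj2", "update")
dispatcher.on_event("/rest/objects/obj1", "delete")
print(len(fetch_calls), [sorted(w.records) for w in workers])
```

With three workers and two update events, the repository stand-in is called twice rather than six times, which is the saving the paragraph above describes.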

Indexer Modules

...

No Format
  <!-- Worker #1: Copy resource RDF to a Fuseki triplestore using SPARQL Update -->
  <bean id="sparqlUpdate" class="org.fcrepo.indexer.SparqlIndexer">
    <!-- base URL for triplestore subjects, PID will be appended -->
    <property name="prefix" value="http://localhost:${test.port:8080}/rest/objects/"/>
    <property name="queryBase" value="http://localhost:3030/test/query"/>
    <property name="updateBase" value="http://localhost:3030/test/update"/>
    <property name="formUpdates">
      <value type="java.lang.Boolean">false</value>
    </property>
  </bean>

  <!-- Worker #2: Save resource RDF to timestamped files on disk -->
  <bean id="fileSerializer" class="org.fcrepo.indexer.FileSerializer">
    <property name="path" value="./target/test-classes/fileSerializer/"/>
  </bean>
 
  <!-- jcr/xml persistence Indexer -->
  <bean id="jcrXmlPersist" class="org.fcrepo.indexer.persistence.JcrXmlPersistenceIndexer">
    <constructor-arg value="${fcrepo.jcrxml.storage:fcrepo4-jcrxml}" />
  </bean>
 
  <!-- Main indexer class that processes events, gets RDF from the repository and calls the workers -->
  <bean id="indexerGroup" class="org.fcrepo.indexer.IndexerGroup">
    <constructor-arg name="repositoryURL" value="http://${fcrepo.host:localhost}:${fcrepo.port:8080}${fcrepo.context:/}rest" />
    <constructor-arg name="indexers">
      <set>
        <ref bean="jcrXmlPersist"/>
        <ref bean="fileSerializer"/>
        <ref bean="sparqlUpdate"/>
      </set>
    </constructor-arg>
    <!-- If your Fedora instance requires authentication, enter the credentials here. Leave blank if your repo is open. -->
    <constructor-arg name="fedoraUsername" value="${fcrepo.username:}" />
    <constructor-arg name="fedoraPassword" value="${fcrepo.password:}" />
  </bean>

  <!-- ActiveMQ topic to listen for events -->
  <bean id="destination" class="org.apache.activemq.command.ActiveMQTopic">
    <constructor-arg value="fedora" />
  </bean>

  <!-- Message listener container to connect the JMS destination to the indexer -->
  <bean id="jmsContainer" class="org.springframework.jms.listener.DefaultMessageListenerContainer">
    <property name="connectionFactory" ref="connectionFactory"/>
    <property name="destination" ref="destination"/>
    <property name="messageListener" ref="indexerGroup" />
    <property name="sessionTransacted" value="true"/>
  </bean>
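
The `${name:default}` placeholders in the configuration above take their value from a property of that name and fall back to the text after the colon when the property is not set. The following sketch shows how the `repositoryURL` value resolves under that convention. It is a simplified Python illustration with a hypothetical `resolve` helper, not Spring's implementation, and the host name in the second call is invented.

```python
import re

# Matches ${name} or ${name:default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def resolve(template, properties):
    """Replace each placeholder with the property value or its default."""

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in properties:
            return properties[name]
        if default is None:
            raise KeyError(f"no value or default for {name}")
        return default

    return _PLACEHOLDER.sub(substitute, template)


template = "http://${fcrepo.host:localhost}:${fcrepo.port:8080}${fcrepo.context:/}rest"

# No properties set: every default applies.
print(resolve(template, {}))
# Overriding host and port, as -Dfcrepo.host=... -Dfcrepo.port=... would.
print(resolve(template, {"fcrepo.host": "repo.example.org", "fcrepo.port": "8983"}))
```

Note that `${fcrepo.username:}` has an empty default, so leaving the property unset yields an empty string, which matches the "leave blank if your repo is open" comment in the configuration.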

...

No Format
  <!-- Worker #1: Copy resource RDF to a Sesame triplestore using SPARQL Update -->
  <bean id="sparqlUpdate" class="org.fcrepo.indexer.SparqlIndexer">
    <!-- base URL for triplestore subjects, PID will be appended -->
    <property name="prefix" value="http://localhost:${test.port:8080}/rest/objects/"/>
    <property name="queryBase" value="http://localhost:8081/openrdf-sesame/repositories/test"/>
    <property name="updateBase" value="http://localhost:8081/openrdf-sesame/repositories/test/statements"/>
    <property name="formUpdates">
      <value type="java.lang.Boolean">true</value>
    </property>
  </bean>

...

  1. Implement the indexing functionality using the org.fcrepo.indexer.Indexer interface, which consists of only two methods (one to handle new/updated records, and another to handle deleted records).  Any configuration required should be done using Java bean setter methods.
  2. Update the Spring configuration to add a bean referencing the new class and providing the configuration properties needed.
  3. Add the bean to the list of workers invoked by the indexer.

...

The triplestore and Fedora 4 do not need to be aware of each other or of the JMS listener. However, the event listener needs to know the web endpoints of both the triplestore and Fedora 4. It is therefore important that you start the three components on different ports.

Instructions on how to start up and configure the three components follow:

...

You can deploy Fedora 4 either by downloading the latest war file and dropping it into an application container (e.g. Tomcat 7), or by cloning the fcrepo4 Git project and running fcrepo-webapp directly within the code base.

See the following pages for details on either approach:

...

To configure the JMS indexer to connect to the Fedora repository, you can set the following system properties:

Code Block
-Dfcrepo.host=<defaults.to.localhost>
-Dfcrepo.port=<defaults.to.8080>

 

To configure the JMS indexer to connect to the triplestore, you can set the following system properties:

Code Block
-Dfuseki.host=<defaults.to.localhost>
-Dfuseki.port=<defaults.to.3030>


 

 

... or if you are using Sesame:

 

Code Block
-Dsesame.host=<defaults.to.localhost>
-Dsesame.port=<defaults.to.8081>

 

Finally, you may need to set the output directory for the FileSerializer (a testing class that shows what is being indexed):

Code Block
-Dfile.serializer.dir=<defaults.to.webcontainer.target>

 

Below is an example of how to download, build, and start the JMS indexer.

...

If the Fedora repository is running at http://localhost:8080/rest/, you can create, update and delete resources using your browser or the REST API (see SPARQL Recipes). Each event will trigger the indexer and be synced to Fuseki (or Sesame), which you can access at http://localhost:3030/ (if Fuseki is running on its default port).

Reindexing

If you have a repository with existing content that you want to index, or have changed your indexing logic and want to reindex content, you can use the reindex REST API call in the indexer webapp.

...

Code Block
$ curl -X POST -d baseURI=http://localhost:8080/rest/objects/ -d recursive=false http://localhost:8082/reindex
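
For scripted reindexing, the same form-encoded POST can be built with Python's standard library. This is a sketch that mirrors the curl call above: the parameter names `baseURI` and `recursive` and the port 8082 come from that call, while the helper name `build_reindex_request` is made up for this example. The request is only constructed here; sending it requires a running indexer webapp.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_reindex_request(indexer_url, base_uri, recursive=False):
    """Build (but do not send) the form-encoded POST that triggers a reindex."""
    form = urlencode({"baseURI": base_uri, "recursive": str(recursive).lower()})
    return Request(
        indexer_url,
        data=form.encode("ascii"),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_reindex_request(
    "http://localhost:8082/reindex",
    "http://localhost:8080/rest/objects/",
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))

# To send it while the indexer webapp is running:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.status)
```

Unlike `curl -d`, `urlencode` percent-encodes the `baseURI` value; a form-decoding server should read both forms as the same URI.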

Indexing Multiple Repositories to a Single Triplestore

In some situations it is desirable to have multiple Fedora repositories all feeding into a single external triplestore. To accomplish this, install and set up the three components (triplestore, Fedora 4 repository, and JMS event listener/indexer) as follows:

  • Follow the instructions above to install the triplestore (Fuseki or Sesame) on one machine and start it.

  • Follow the instructions above to install two or more Fedora 4 repositories on different machines and start them.

  • Install the JMS event listener/indexer (https://github.com/fcrepo4/fcrepo-message-consumer) for each Fedora 4 repository installation and start the indexer with the following command:

    Code Block
    $ mvn -Djetty.port=9999 -Dfuseki.host=<triplestore.host.name> -Dfcrepo.host=<repository.host.name> jetty:run
  • Notes

    • To make a resource indexable in the triplestore, the resource needs to include the indexable mixin type, http://fedora.info/definitions/v4/indexing#indexable, which can be inserted through a SPARQL insert:

      Code Block
      INSERT { <> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://fedora.info/definitions/v4/indexing#indexable> } WHERE { }
    • Start the triplestore first. If the triplestore is restarted, then the JMS event listener/indexer needs to be restarted, too.
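
The SPARQL insert from the notes above has to be sent to the resource that should become indexable. The sketch below builds such a request with Python's standard library, assuming the repository accepts SPARQL Update via HTTP PATCH with the `application/sparql-update` content type. The helper name and the resource URL `http://localhost:8080/rest/objects/obj1` are hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
INDEXABLE = "http://fedora.info/definitions/v4/indexing#indexable"


def build_mark_indexable_request(resource_url):
    """Build (but do not send) a PATCH that adds the indexable mixin type."""
    # <> is a relative IRI that refers to the resource being patched.
    sparql = f"INSERT {{ <> <{RDF_TYPE}> <{INDEXABLE}> }} WHERE {{ }}"
    return Request(
        resource_url,
        data=sparql.encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/sparql-update"},
    )


# Hypothetical resource URL, used only for illustration.
patch = build_mark_indexable_request("http://localhost:8080/rest/objects/obj1")
print(patch.get_method(), patch.full_url)
print(patch.data.decode("utf-8"))

# To send it while the repository is running:
#   from urllib.request import urlopen
#   with urlopen(patch) as response:
#       print(response.status)
```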