The VIVO Harvester is a collection of small Java tools which are meant to be strung together in various ways to create a harvest custom-tailored to your needs. This architecture makes the Harvester extremely versatile, but at the same time presents a steep learning curve.

Included in the Harvester's scripts/ directory are several tested sample scripts that perform different types of harvests. Finding one that is close to what you need, trying it out on a test server or virtual machine, and tweaking it until it fits is one way to get started.

This page describes the steps of a "typical" harvest.

Fetch

The first step of a typical harvest is the fetch. This is where we retrieve data from whatever sources we want to harvest. For example, suppose we have a VIVO installation containing researchers at our university, and we want to harvest publication information from Pubmed for those researchers. In this case we would use Harvester's PubmedFetch tool to send a query to Pubmed, which returns the results of that query in its own XML format.
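
In the sample scripts each tool is run from the shell, typically through a wrapper such as harvester-pubmedfetch with a task configuration file passed via -X. The sketch below follows that pattern, but the wrapper name, the option, and the config file path are assumptions based on the sample scripts and may differ between Harvester versions; the config file name here is purely illustrative.

  # Fetch publication records from Pubmed into a raw record store.
  # The config file (name illustrative) holds the Pubmed query and the
  # record handler the raw XML results are written to.
  harvester-pubmedfetch -X config/tasks/PubmedFetch.config.xml

In some versions the sample scripts invoke the tools directly with java rather than through wrapper commands; check the scripts shipped with your release for the exact form.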

Translate

The next step of a typical harvest is the translation. The fetched data will be in the source's own format and needs to be converted into VIVO-compatible triples in RDF/XML. If the input is XML, this is done using the XSLTranslator tool and a .xsl file containing XSLT specific to the data format, which maps it to RDF/XML triples.

Included with Harvester in the config/datamaps/ directory are several pre-written XSLT files for frequently-needed formats (including for example Pubmed).
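
A corresponding sketch for the translation step, using the same assumed invocation pattern as above; the task config would point the translator at the raw records, the output record store, and the Pubmed datamap from config/datamaps/ (config name illustrative).

  # Translate the raw Pubmed XML records into VIVO-compatible RDF/XML
  # triples using the pre-written Pubmed datamap.
  harvester-xsltranslator -X config/tasks/PubmedXSLTranslator.config.xml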

Score and Matching

The next step attempts to match incoming data with data already in VIVO. For example, if you have just pulled in some publication information from Pubmed, you might want to compare the author names with people in your VIVO, so that you can link the publications with the authors. This comparison is done via the Score tool, which compares any values you want between VIVO and the input data, and assigns a number to the comparison.
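
A sketch of the Score invocation, again assuming the wrapper-plus-config pattern above. Which fields are compared, which comparison algorithms are used, and how each comparison is weighted would be specified in the task configuration (or as extra command-line options, depending on the Harvester version); the config name is illustrative.

  # Compare selected values (e.g. author names) in the translated input
  # against entities already in VIVO and record a score for each pairing.
  harvester-score -X config/tasks/PubmedScoreAuthors.config.xml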

The immediate next step is to call the Match tool, which looks at the numbers generated by Score and compares them to a threshold value. Input entities whose scores meet or exceed the threshold have their URIs changed to the URI of the matching person in VIVO, so that when the data is finally pulled into VIVO the new data is linked to the existing data. In this way you can fetch publications for your existing researchers.
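
And a sketch of the Match invocation; the threshold value is assumed to be set in the task configuration, and the config name is illustrative.

  # Rewrite the URIs of input entities whose score meets or exceeds the
  # threshold to the URI of the matching person already in VIVO.
  harvester-match -X config/tasks/PubmedMatchAuthors.config.xml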

Namespace Change

The next step is to give a proper URI to all unmatched data (matched data already has the URI of the matching VIVO person). This is done via the ChangeNamespace tool. Prior to this step, URIs are arbitrary values assigned by the XSLT translation (typically built from aspects of the raw data that are expected to be unique, such as an ISBN). After this step all data has a proper VIVO URI and is ready for import into VIVO.
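
A sketch of this step, under the same assumptions as the earlier examples (config name illustrative):

  # Give every remaining (unmatched) entity a proper URI in VIVO's
  # default namespace, replacing the temporary URIs from the translation.
  harvester-changenamespace -X config/tasks/PubmedChangeNamespace.config.xml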

Updates

This step allows successive Harvester runs to recognize data that has been modified since the previous run and to update VIVO accordingly. A "previous harvest model" is created, which on the first run contains all the data imported in that run. On subsequent runs, this model is compared with the new data to determine which triples have been added or removed since the last run. The comparison is made by the Diff tool, and its output is an "Additions file" and a "Subtractions file" containing RDF/XML data that should be added to and removed from VIVO, respectively.
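
In the sample scripts the Diff tool is typically run twice, once in each direction; the sketch below assumes that layout, with config names that are purely illustrative.

  # Triples in the new harvest but not in the previous harvest model -> Additions file
  harvester-diff -X config/tasks/PubmedDiffAdditions.config.xml
  # Triples in the previous harvest model but not in the new harvest -> Subtractions file
  harvester-diff -X config/tasks/PubmedDiffSubtractions.config.xml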

Add data

The data from the Additions file is added to both VIVO and the previous harvest model in two separate calls to the Transfer tool. Then the data from the Subtractions file is removed from both VIVO and the previous harvest model in two more Transfer calls.
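
The four Transfer calls might look roughly like the following; whether the target model and the add/remove behaviour are chosen in the config file or through extra options depends on the Harvester version, and the config names are illustrative.

  # Apply the Additions file to VIVO and to the previous harvest model.
  harvester-transfer -X config/tasks/TransferAdditionsToVivo.config.xml
  harvester-transfer -X config/tasks/TransferAdditionsToPrevHarvest.config.xml
  # Remove the Subtractions file's triples from VIVO and from the previous harvest model.
  harvester-transfer -X config/tasks/TransferSubtractionsFromVivo.config.xml
  harvester-transfer -X config/tasks/TransferSubtractionsFromPrevHarvest.config.xml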

At this point a harvest is complete.
