Overview

Diff takes two data models (called a "minuend" model and a "subtrahend" model), compares them, and returns all triples in the minuend model that are not in the subtrahend model.

For example, given a model M1 containing records (A, B, C, D, E) and a model M2 containing (C, D, E, F, G), a Diff with M1 as the minuend and M2 as the subtrahend returns the set (A, B). Reversing the arguments returns (F, G).
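The M1/M2 example can be reproduced with ordinary Java sets. This is only a minimal sketch of the set-difference idea: the class name `DiffSketch` and its `diff` method are hypothetical, and records are modeled as plain strings rather than as triples in a Jena model.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for illustration: records are plain strings, not Jena triples.
class DiffSketch {
    // Returns every element of the minuend that is absent from the subtrahend.
    static Set<String> diff(Set<String> minuend, Set<String> subtrahend) {
        Set<String> result = new TreeSet<>(minuend); // copy, so the inputs stay untouched
        result.removeAll(subtrahend);
        return result;
    }

    public static void main(String[] args) {
        Set<String> m1 = new TreeSet<>(Arrays.asList("A", "B", "C", "D", "E"));
        Set<String> m2 = new TreeSet<>(Arrays.asList("C", "D", "E", "F", "G"));
        System.out.println(diff(m1, m2)); // prints [A, B]
        System.out.println(diff(m2, m1)); // prints [F, G]
    }
}
```

The shared records (C, D, E) appear in neither result, which is why the order of the arguments matters.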

Usage

Diff is commonly used in Harvester scripts immediately before newly harvested data is entered into VIVO. When the results of the previous harvest run are the minuend and the new input is the subtrahend, the output of Diff is the set of items that are already in VIVO from the previous harvest but are absent from the new harvest. These are assumed to have been removed or replaced, so this result (called the "subtraction file") is used to remove data from VIVO. When the arguments are reversed, the result is the set of items in the new harvest that were not in the previous harvest. This result (called the "addition file") is used to add data to VIVO.

The subtraction file and the addition file are then applied to VIVO (the -m switch for Transfer is used for the subtractions), and also to the "previous harvest model", which keeps it up to date so that Diff works properly on future runs.

Data in the intersection of the two sets is not output by either execution of Diff: it is identical in the input and in VIVO, so there is nothing to change. In this way Diff avoids unnecessary load on the production VIVO database.

Selective Diff

When diffing to create subtraction models, some processes may need to preserve entities in the minuend (previous harvest) that are entirely absent from the subtrahend (new input). Selective diff mode generates a diff model as in the normal process, then prunes it so that core elements completely absent from the subtrahend are not included. This lets changes to properties pass through, but prevents the removal of large swaths of data when the minuend contains a large number of entries and the subtrahend contains updates to only a few of them.
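The pruning idea can be sketched with plain Java collections. This is an illustration under stated assumptions, not the Harvester's implementation: the class `SelectiveDiffSketch` and its methods are hypothetical, a triple is modeled as a three-element list, and a "core element" is taken to mean a subject that appears anywhere in the subtrahend. The real tool works on Jena models and SPARQL queries keyed on rdf:type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a triple is modeled as a (subject, predicate, object) list.
class SelectiveDiffSketch {
    static List<String> triple(String s, String p, String o) {
        return Arrays.asList(s, p, o);
    }

    // Ordinary difference, pruned so that subjects entirely absent from the
    // subtrahend are preserved (their triples are left out of the result).
    static Set<List<String>> selectiveDiff(Set<List<String>> minuend,
                                           Set<List<String>> subtrahend) {
        // Assumption: "present in the subtrahend" means the subject occurs there.
        Set<String> presentSubjects = new HashSet<>();
        for (List<String> t : subtrahend) {
            presentSubjects.add(t.get(0));
        }
        Set<List<String>> result = new LinkedHashSet<>();
        for (List<String> t : minuend) {
            boolean changed = !subtrahend.contains(t);              // normal diff
            boolean subjectKnown = presentSubjects.contains(t.get(0)); // pruning rule
            if (changed && subjectKnown) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<List<String>> previous = new LinkedHashSet<>(Arrays.asList(
            triple("p1", "label", "Old Name"),
            triple("p1", "type", "Person"),
            triple("p2", "label", "Smith"),
            triple("p2", "type", "Person")));
        Set<List<String>> update = new LinkedHashSet<>(Arrays.asList(
            triple("p1", "label", "New Name"),
            triple("p1", "type", "Person")));
        // p2 is untouched by the update, so none of its triples are scheduled for removal.
        System.out.println(selectiveDiff(previous, update)); // prints [[p1, label, Old Name]]
    }
}
```

A plain diff over the same inputs would also return both p2 triples, which is the large-scale removal that selective diff is meant to prevent.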

Diff.java

Parameters

| Short Option | Long Option | Parameter Value Map | Description | Required |
|---|---|---|---|---|
| m | minuend | CONFIG_FILE | config file for input jena model | true |
| M | minuendOverride | | override the JENA_PARAM of input jena model config using VALUE | false |
| s | subtrahend | CONFIG_FILE | config file for input jena model | true |
| S | subtrahendOverride | | override the JENA_PARAM of input jena model config using VALUE | false |
| o | output | CONFIG_FILE | config file for output jena model | true |
| O | outputOverride | | override the JENA_PARAM of output jena model config using VALUE | false |
| d | dumptoFile | FILENAME | filename for output | true |
| e | selective-diff | | use selective diff | false |
| U | update-types | | type to be updated with selective diff | false |

Properties

| Name | Type | Visibility | Description |
|---|---|---|---|
| log | Logger | private | SLF4J Logger statically set at the top of the file |
| minuendJC | JenaConnect | private | |
| subtrahendJC | JenaConnect | private | |
| output | JenaConnect | private | |
| dumpFile | String | private | String of the file name as set by the user |
| updateTypes | List<String> | private | List of types to be updated in selectiveDiff |
| bUsingSelectiveDiff | boolean | private | Whether or not we are using selectiveDiff. Implicitly true if updateTypes specified. |
| bHasUpdateTypes | boolean | private | Whether or not types are specified for use by selectiveDiff |

Execution

-m config/jenaModels/h2.xml -M dbUrl="jdbc:h2:XMLVault/h2Pubmed/score/store" -M modelName=pubmedScore -s config/jenaModels/VIVO.xml -S modelName="http://vivoweb.org/ingest/pubmed" -d XMLVault/update_Additions.rdf.xml

Methods

getParser

  1. parse the arguments from the parameter list above

diff

  1. create a diff model
  2. get the minuend and subtrahend models
  3. run Jena's difference method: minuend.difference(subtrahend)
  4. check for null dumpfile
    1. write out if not null to file
  5. check for null output
    1. write out if not null to output JenaConnect

selectiveDiff

  1. create a diff model
  2. get the minuend and subtrahend models
  3. run Jena's difference method: minuend.difference(subtrahend)
  4. load subtrahend and diff models into temp memJena for multi-graph SPARQL query.
  5. if updateTypes are specified:
    1. for each type: create a difference model fragment containing the changes to properties of the type, preserving the root node.
    2. append fragment to new diff model.
  6. if types are not specified:
    1. create a new difference model containing changes to properties of all types, preserving all root nodes.
  7. check for null dumpfile
    1. write out if not null to file
  8. check for null output
    1. write out if not null to output JenaConnect

buildPreservationQuery

  1. build and return SPARQL query string, taking an rdf:type from the updateTypes list.

execute

  1. call diff sending it the minuend, the subtrahend, the outputs

main

  1. Start Logger
  2. Run diff.execute passing the args[] to the constructor
  3. catch errors
    1. IllegalArgumentException
    2. IOException
    3. Exception

traceModel

  1. build a ResultSet from a ?s ?p ?o query on the model.
  2. for each query solution, write solution.toString() to the logger at trace level.

Example

Update

How Diff is used to do "graph math" updating

V = VIVO model

H = newly harvested model

P = previous harvest model

A = additions

S = subtractions

A = Diff (H → P)

The additions are the triples that remain in the new harvest (minuend H) after the previous harvest's triples (subtrahend P) are removed.

S = Diff (P → H)

The subtractions are the triples that remain in the previous harvest (minuend P) after the new harvest's triples (subtrahend H) are removed.

P = P – S

P = P + A

Applying the subtractions and additions to the previous harvest model makes it equal to the new harvest.

V = V – S

V = V + A

Applying the same subtractions and additions to the VIVO model brings it into agreement with the new harvest without disturbing the rest of the data in VIVO.
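The update cycle can be checked on a small worked case. The sketch below is illustrative only: `UpdateSketch` and its helpers are hypothetical names, triples are reduced to single-letter strings, and "x" stands for data in VIVO that did not come from this harvest.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the "graph math" update using plain sets of strings.
class UpdateSketch {
    static Set<String> setOf(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // a - b: everything in a that is not in b (the Diff operation).
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    // a + b: union of both sets.
    static Set<String> plus(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    public static void main(String[] args) {
        Set<String> v = setOf("x", "a", "b", "c"); // VIVO: harvested data plus unrelated "x"
        Set<String> p = setOf("a", "b", "c");      // previous harvest
        Set<String> h = setOf("b", "c", "d");      // new harvest

        Set<String> additions = minus(h, p);       // A = Diff (H -> P)
        Set<String> subtractions = minus(p, h);    // S = Diff (P -> H)

        Set<String> newP = plus(minus(p, subtractions), additions);
        Set<String> newV = plus(minus(v, subtractions), additions);

        System.out.println(newP); // prints [b, c, d]
        System.out.println(newV); // prints [b, c, d, x]
    }
}
```

After the update the previous harvest model equals the new harvest, and VIVO has gained "d" and lost "a" while "x" is left alone.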