Overview

The Match tool compares the scores generated by Score against a threshold value. Input entities whose score meets or exceeds the threshold are given the URI of the matching individual in VIVO, so that when the data is finally loaded into VIVO the new data is linked to the existing data. In this way you can, for example, fetch publications for your existing researchers.

Match.java takes a model generated by Score and renames matches, creates links, or removes literals based on the associated scores.

Match compares a threshold value with the score that Score produced for each individual to determine whether action should be taken. If the score equals or exceeds the threshold, action is taken. Typically this action is a rename: the matched input is given the same URI as the individual in VIVO that it matches, which identifies the two as the same entity. The renamed input can then be merged into VIVO, so an existing individual in VIVO can have more data added to it. For example, a publication can have a previously missing publisher added to it.

Match config

Match - Uses the score-data produced by Score to evaluate pairs of records from the input and VIVO.

Records whose score is at or above a user-specified threshold are considered to be a matching set. Matching records from the input can be renamed to the matching VIVO URI, or can be linked to the matching VIVO record via user-specified triples. Additionally, the input data can be sanitized to remove all rdf:type statements and all literal value statements, preserving only the node relationship triples.
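As a rough illustration of what the sanitizing step keeps and drops, the sketch below filters an in-memory list of statements. The SanitizeSketch class, its Statement type, and the sanitize method are hypothetical stand-ins and are not part of the Harvester or Jena API; the real tool works on a Jena model, and which nodes it applies the filter to is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a tiny in-memory stand-in for RDF statements (not the Jena API). */
class SanitizeSketch {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Hypothetical statement: the object is either a URI node or a literal value. */
    static final class Statement {
        final String subject;
        final String predicate;
        final String object;
        final boolean objectIsLiteral;

        Statement(String subject, String predicate, String object, boolean objectIsLiteral) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectIsLiteral = objectIsLiteral;
        }
    }

    /** Keep only node-to-node relationships: drop rdf:type and literal-valued statements. */
    static List<Statement> sanitize(List<Statement> input) {
        List<Statement> kept = new ArrayList<>();
        for (Statement st : input) {
            boolean isType = RDF_TYPE.equals(st.predicate);
            if (!isType && !st.objectIsLiteral) {
                kept.add(st);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up example data: one type statement, one literal, one relationship.
        List<Statement> input = new ArrayList<>();
        input.add(new Statement("http://example.org/pub1", RDF_TYPE, "http://example.org/Article", false));
        input.add(new Statement("http://example.org/pub1", "http://example.org/title", "A title", true));
        input.add(new Statement("http://example.org/pub1", "http://example.org/hasAuthor", "http://example.org/author1", false));
        for (Statement st : sanitize(input)) {
            System.out.println(st.subject + " " + st.predicate + " " + st.object);
        }
    }
}
```

Only the hasAuthor statement survives the filter in this made-up data, because it is the only one that relates two nodes.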

Match Parameters

wordiness - (optional) sets the lowest level of log messages to be displayed to the console. The lower the log level, the more detailed the messages.

Possible Values:

threshold - match records with a cumulative weighted score at or above this value

Example:

link - (optional) link the two matched entities together using two provided predicates

Example:

rename - (optional) rename the input node to the matching VIVO URI

Possible Value:

clear-type-and-literals - (optional) sanitize the input data to remove all rdf:type statements and all literal value statements, preserving only the node relationship triples

Example:

batch-size - (optional) number of records to process per batch (default 150). Lower this value if you get StackOverflowError or OutOfMemoryError.

Example:

input-config - (optional - at least one of this and/or inputOverride) the configuration file that describes the input jena model. The parameters for this config file are described in the Models section below.

Example:

inputOverride - (optional - at least one of this and/or input-config) specify the parameters for the jena model without a config file and/or override specific parameters from the given config file. The parameters that can be set/overridden are described in the Models section below.

Example:

score-config - (optional - at least one of this and/or scoreOverride) the configuration file that describes the score jena model. The parameters for this config file are described in the Models section below.

Example:

scoreOverride - (optional - at least one of this and/or score-config) specify the parameters for the jena model without a config file and/or override specific parameters from the given config file. The parameters that can be set/overridden are described in the Models section below.

Example:

output-config - (optional - will contain all related nodes for matches) the configuration file that describes the output jena model. The parameters for this config file are described in the Models section below.

Example:

outputOverride - (optional - will contain all related nodes for matches) specify the parameters for the jena model without a config file and/or override specific parameters from the given config file. The parameters that can be set/overridden are described in the Models section below.

Example:

Arguments

Short Option | Long Option             | Parameter Value Map | Description                                                                                         | Required
------------ | ----------------------- | ------------------- | --------------------------------------------------------------------------------------------------- | --------
i            | inputJena-config        | CONFIG_FILE         | inputJena JENA configuration filename                                                               | true
I            | inputOverride           |                     | override the JENA_PARAM of inputJena jena model config using VALUE                                  | false
o            | output-config           | CONFIG_FILE         | outputConfig JENA configuration filename                                                            | true
V            | vivoOverride            |                     | override the JENA_PARAM of vivoJena jena model config using VALUE                                   | false
s            | score-config            | CONFIG_FILE         | score data JENA configuration filename                                                              | true
S            | scoreOverride           |                     | override the JENA_PARAM of score jena model config using VALUE                                      | false
t            | threshold               | THRESHOLD           | match records with a score over THRESHOLD                                                           | true
l            | link                    |                     | link the two matched entities together using INPUT_TO_VIVO_PREDICATE and VIVO_TO_INPUT_PREDICATE    | false
r            | rename                  |                     | rename or remove the matched entity from scoring                                                    | false
c            | clear-type-and-literals |                     | clear all rdf:type and literal values out of the nodes matched                                      | false

Usage

# from the env file
Match="java $OPTS -Dprocess-task=Match org.vivoweb.harvester.score.Match"

# from the script file
SCOREINPUT="-i $H2MODEL -ImodelName=$MODELNAME -IdbUrl=$MODELDBURL -IcheckEmpty=$CHECKEMPTY"
SCOREDATA="-s $H2MODEL -SmodelName=$SCOREDATANAME -SdbUrl=$SCOREDATADBURL -ScheckEmpty=$CHECKEMPTY"
MATCHOUTPUT="-o $H2MODEL -OmodelName=$MATCHEDNAME -OdbUrl=$MATCHEDDBURL -OcheckEmpty=$CHECKEMPTY"
# no spaces around "=" in a shell variable assignment
MATCHTHRESHOLD=1.0

$Match $SCOREINPUT $SCOREDATA $MATCHOUTPUT -t $MATCHTHRESHOLD -r -c

This call renames the entries in $SCOREINPUT whose weighted score in $SCOREDATA is at or above $MATCHTHRESHOLD, clears their literals and types, and sends a copy of the matched pieces to $MATCHOUTPUT.
Scripts such as the pubmed script, which are interested only in matches, use the match output.

Methods

match

Find all nodes in the given namespace matching on the given predicates

  1. Create a query string on the score data to get the results with scores greater than or equal to the threshold.
  2. Execute the query, which binds the input URI to sInput and the VIVO URI to sVivo
  3. For each solution to the query
    1. Create a string from sInput
    2. Create a string from sVivo
    3. Pair up the sInput URI and the sVivo URI in a uriMatchMap
  4. Return the uriMatchMap
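The steps above can be sketched without Jena by summing weighted scores per pair in memory and keeping the pairs that reach the threshold. Everything here (MatchSketch, ScoreRow, the match signature) is a hypothetical illustration, not the code in Match.java, which obtains the same information from a query on the score model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: mimics the match steps on in-memory rows instead of a Jena query. */
class MatchSketch {

    /** Hypothetical row: one weighted score for one (input URI, VIVO URI) pair. */
    static final class ScoreRow {
        final String inputUri;
        final String vivoUri;
        final double weightedScore;

        ScoreRow(String inputUri, String vivoUri, double weightedScore) {
            this.inputUri = inputUri;
            this.vivoUri = vivoUri;
            this.weightedScore = weightedScore;
        }
    }

    static Map<String, String> match(List<ScoreRow> rows, double threshold) {
        // Sum the weighted scores per pair; TreeMap keeps the input URIs sorted.
        Map<String, Map<String, Double>> sums = new TreeMap<>();
        for (ScoreRow row : rows) {
            sums.computeIfAbsent(row.inputUri, k -> new TreeMap<>())
                .merge(row.vivoUri, row.weightedScore, Double::sum);
        }
        // Keep pairs whose summed score is greater than or equal to the threshold.
        // In this sketch a later qualifying pair for the same input URI overwrites an earlier one.
        Map<String, String> uriMatchMap = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> byInput : sums.entrySet()) {
            for (Map.Entry<String, Double> byVivo : byInput.getValue().entrySet()) {
                if (byVivo.getValue() >= threshold) {
                    uriMatchMap.put(byInput.getKey(), byVivo.getKey());
                }
            }
        }
        return uriMatchMap;
    }

    public static void main(String[] args) {
        // Made-up URIs and scores for demonstration.
        List<ScoreRow> rows = new ArrayList<>();
        rows.add(new ScoreRow("http://example.org/in1", "http://example.org/vivo1", 0.5));
        rows.add(new ScoreRow("http://example.org/in1", "http://example.org/vivo1", 0.5));
        rows.add(new ScoreRow("http://example.org/in2", "http://example.org/vivo2", 0.4));
        System.out.println(match(rows, 1.0));
    }
}
```

With a threshold of 1.0, the first pair qualifies because its two partial scores sum to exactly 1.0, while the second pair (0.4) is left out.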

rename

Rename each resource whose URI is a key in the map to the matched URI stored as its value

  1. For each entry in a <String,String> map
    1. Get the resource related to the old URI
    2. Get the new URI from the map.
    3. Rename the old resource to the new resource
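A minimal sketch of the rename loop, assuming triples are held as plain strings and that renaming a resource means replacing its URI wherever it appears as a subject or an object. RenameSketch and Triple are hypothetical names for illustration; the real method operates on Jena resources.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: renames URIs in an in-memory triple list (not the Jena API). */
class RenameSketch {

    /** Hypothetical triple; objects are assumed to be URIs, not literals. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Returns a copy of the triples with every old URI (map key) replaced by its new URI (map value). */
    static List<Triple> rename(List<Triple> triples, Map<String, String> uriMatchMap) {
        List<Triple> renamed = new ArrayList<>();
        for (Triple t : triples) {
            String subject = uriMatchMap.getOrDefault(t.subject, t.subject);
            String object = uriMatchMap.getOrDefault(t.object, t.object);
            renamed.add(new Triple(subject, t.predicate, object));
        }
        return renamed;
    }

    public static void main(String[] args) {
        // Made-up URIs for demonstration.
        Map<String, String> uriMatchMap = new LinkedHashMap<>();
        uriMatchMap.put("http://example.org/in1", "http://example.org/vivo1");

        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple("http://example.org/in1", "http://example.org/authorOf", "http://example.org/pub1"));
        triples.add(new Triple("http://example.org/pub1", "http://example.org/hasAuthor", "http://example.org/in1"));

        for (Triple t : rename(triples, uriMatchMap)) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```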

link

Link matched scoreResources to vivoResources using given linking predicates

  1. Create vivoToInput property
  2. Create inputToVivo property
  3. For each uri in the match set
    1. Get the resource for the input URI
    2. Get the resource for the vivo URI
    3. Add the inputToVivo property as applied to the resources
    4. Add the vivoToInput property as applied to the resources
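The linking steps can be sketched the same way. This example assumes that inputToVivo points from the input resource to the VIVO resource and vivoToInput points back; that direction, along with the LinkSketch and Triple names and the predicate URIs, is an assumption made for illustration and is not taken from Match.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: builds the link triples for a match map (not the Jena API). */
class LinkSketch {

    /** Hypothetical triple of three URIs. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** For each matched pair, add one triple in each direction using the given predicates. */
    static List<Triple> link(Map<String, String> uriMatchMap, String inputToVivo, String vivoToInput) {
        List<Triple> added = new ArrayList<>();
        for (Map.Entry<String, String> match : uriMatchMap.entrySet()) {
            String inputUri = match.getKey();
            String vivoUri = match.getValue();
            // Assumed direction: input --inputToVivo--> VIVO, VIVO --vivoToInput--> input.
            added.add(new Triple(inputUri, inputToVivo, vivoUri));
            added.add(new Triple(vivoUri, vivoToInput, inputUri));
        }
        return added;
    }

    public static void main(String[] args) {
        // Made-up URIs and predicates for demonstration.
        Map<String, String> uriMatchMap = new LinkedHashMap<>();
        uriMatchMap.put("http://example.org/in1", "http://example.org/vivo1");
        List<Triple> links = link(uriMatchMap, "http://example.org/matchesVivo", "http://example.org/matchesInput");
        for (Triple t : links) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```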

SPARQL

The Match class runs a SPARQL query on the score data. The same query can be adapted to read the score data for other purposes.

?sInput = Input URI
?sVivo  = Vivo URI

PREFIX scoreValue: <http://vivoweb.org/harvester/scoreValue/>
SELECT DISTINCT ?sVivo ?sInput (sum(?weightValue) AS ?sum)
WHERE {
  ?s scoreValue:InputRes ?sInput . 
  ?s scoreValue:VivoRes ?sVivo .
  ?s scoreValue:hasScoreValue ?value .
  ?value scoreValue:WeightedScore ?weightValue .
}
GROUP BY ?sVivo ?sInput 
HAVING (?sum >= threshold ) 
ORDER BY ?sInput
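The word threshold in the HAVING clause is a placeholder that has to be replaced with a number before the query can run. A minimal sketch of that substitution, assuming plain string concatenation; MatchQuerySketch and buildQuery are hypothetical names, and how Match.java actually assembles its query string is not shown on this page.

```java
/** Illustrative only: fills the threshold placeholder of the score query. */
class MatchQuerySketch {

    /** Returns the query text with the given threshold in the HAVING clause. */
    static String buildQuery(double threshold) {
        return String.join("\n",
            "PREFIX scoreValue: <http://vivoweb.org/harvester/scoreValue/>",
            "SELECT DISTINCT ?sVivo ?sInput (sum(?weightValue) AS ?sum)",
            "WHERE {",
            "  ?s scoreValue:InputRes ?sInput .",
            "  ?s scoreValue:VivoRes ?sVivo .",
            "  ?s scoreValue:hasScoreValue ?value .",
            "  ?value scoreValue:WeightedScore ?weightValue .",
            "}",
            "GROUP BY ?sVivo ?sInput",
            "HAVING (?sum >= " + threshold + ")",
            "ORDER BY ?sInput");
    }

    public static void main(String[] args) {
        // Prints the query text for a threshold of 1.0.
        System.out.println(buildQuery(1.0));
    }
}
```

The resulting string could then be handed to whatever SPARQL engine holds the score model.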