The Match tool will look at the numbers generated by Score and compare them to a threshold value. Input entities compared by Score that meet or exceed the threshold will have their identities changed to the URI of the person in VIVO, so that when the data is finally pulled into VIVO the new data will be linked to existing data. In this way you can fetch publications for your existing researchers.
Match.java takes a model generated by Score and renames matches, creates links, or removes literals based on the associated scores.
Match uses a threshold value, which is compared with the values for each individual produced by Score to determine if action should be taken. If the Score value equals or exceeds the threshold, then action is taken. Typically this action is a rename. A rename means that the matched input is given the same URI as the individual in VIVO matching it, which identifies the two as the same entity. This can then be merged into VIVO. In this way an existing individual in VIVO can have more data added to it. For example, a publication can have a previously missing publisher added to it.
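The threshold rule can be tried on toy data outside the Harvester. The following is a minimal sketch, not the Match implementation: it assumes a hypothetical whitespace-separated listing of input URI, VIVO URI, and weighted score, sums the scores for each pair, and prints the pairs whose total equals or exceeds the threshold. The function name `matches_at_or_above` and the `in:` and `vivo:` identifiers are made up for illustration.

```shell
#!/bin/sh
# Toy illustration of the threshold rule; not part of the Harvester.
# Each input line: <input URI> <VIVO URI> <weighted score> (hypothetical format).
matches_at_or_above() {
    threshold="$1"
    awk -v t="$threshold" '
        { sum[$1 " " $2] += $3 }          # accumulate weighted scores per pair
        END {
            for (pair in sum)
                if (sum[pair] + 0 >= t + 0)
                    print pair            # keep pairs that reach the threshold
        }' | sort
}

printf '%s\n' \
    "in:pub1 vivo:pub9 0.5" \
    "in:pub1 vivo:pub9 0.5" \
    "in:pub2 vivo:pub7 0.5" |
    matches_at_or_above 1.0
# prints: in:pub1 vivo:pub9
```

Here `in:pub1` reaches 1.0, so under a rename it would take the URI `vivo:pub9`; `in:pub2` stays below the threshold and is left alone.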
Match - Uses the score-data produced by Score to evaluate pairs of records from the input and VIVO.
Records that match above a user-specified threshold are considered to be a matching set. Matching records from the input can be renamed to the matching VIVO URI, or can be linked to them via user-specified triples. Additionally, the input data can be sanitized to remove all rdf:type statements and all literal value statements, preserving only the node relationship triples.
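To make the sanitizing step concrete, here is a rough sketch of which statements survive it. Match itself works on Jena models; this example instead filters N-Triples text, assuming one well-formed triple per line with whitespace-separated terms. The helper name `keep_node_relationships` and the `example.org` URIs are hypothetical.

```shell
#!/bin/sh
# Toy illustration of what clear-type-and-literals keeps; not Harvester code.
# Reads N-Triples on stdin (one triple per line, assumed well-formed).
keep_node_relationships() {
    awk '
        $2 == "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>" { next }  # drop rdf:type statements
        $3 ~ /^"/ { next }                                                   # drop literal-valued statements
        { print }                                                            # keep node-to-node triples
    '
}

printf '%s\n' \
    '<http://example.org/pub1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Article> .' \
    '<http://example.org/pub1> <http://www.w3.org/2000/01/rdf-schema#label> "A title" .' \
    '<http://example.org/pub1> <http://example.org/hasAuthor> <http://example.org/author1> .' |
    keep_node_relationships
```

Only the last triple, which relates two nodes, is printed.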
wordiness - (optional) sets the lowest level of log messages to be displayed to the console. The lower the log level, the more detailed the messages.
Possible Values:
threshold - match records with a cumulative weighted score at or above this value
Example:
link - (optional) link the two matched entities together using two provided predicates
Example:
rename - (optional) rename the input node to the matching VIVO URI
Possible Value:
clear-type-and-literals - (optional) sanitize the input data to remove all rdf:type statements and all literal value statements, preserving only the node relationship triples
Example:
batch-size - (optional) number of records to process in batch - default 150 - lower this if getting StackOverflow or OutOfMemory Exceptions
Example:
input-config - (optional - at least one of this and/or inputOverride) the configuration file that describes the input jena model. The parameters for this config file are described in the Models section below.
Example:
inputOverride - (optional - at least one of this and/or input-config) specify the parameters for the jena model without a config file and/or override specific parameters from the given config file. The parameters that can be set/overridden are described in the Models section below.
Example:
score-config - (optional - at least one of this and/or scoreOverride) the configuration file that describes the score jena model. The parameters for this config file are described in the Models section below.
Example:
scoreOverride - (optional - at least one of this and/or score-config) specify the parameters for the jena model without a config file and/or override specific parameters from the given config file. The parameters that can be set/overridden are described in the Models section below.
Example:
output-config - (optional - will contain all related nodes for matches) the configuration file that describes the output jena model. The parameters for this config file are described in the Models section below.
Example:
outputOverride - (optional - will contain all related nodes for matches) specify the parameters for the jena model without a config file and/or override specific parameters from the given config file. The parameters that can be set/overridden are described in the Models section below.
Example:
Short Option | Long Option | Parameter Value Map | Description | Required |
---|---|---|---|---|
i | inputJena-config | CONFIG_FILE | inputJena JENA configuration filename | true |
I | inputOverride | JENA_PARAM=VALUE | override the JENA_PARAM of inputJena jena model config using VALUE | false |
o | output-config | CONFIG_FILE | outputConfig JENA configuration filename | true |
V | vivoOverride | JENA_PARAM=VALUE | override the JENA_PARAM of vivoJena jena model config using VALUE | false |
s | score-config | CONFIG_FILE | score data JENA configuration filename | true |
S | scoreOverride | JENA_PARAM=VALUE | override the JENA_PARAM of score jena model config using VALUE | false |
t | threshold | THRESHOLD | match records with a score over THRESHOLD | true |
l | link | | link the two matched entities together using INPUT_TO_VIVO_PREDICATE and INPUT_TO_VIVO_PREDICATE | false |
r | rename | | rename or remove the matched entity from scoring | false |
c | clear-type-and-literals | | clear all rdf:type and literal values out of the nodes matched | false |

    # from the env file
    Match="java $OPTS -Dprocess-task=Match org.vivoweb.harvester.score.Match"

    # from the script file
    SCOREINPUT="-i $H2MODEL -ImodelName=$MODELNAME -IdbUrl=$MODELDBURL -IcheckEmpty=$CHECKEMPTY"
    SCOREDATA="-s $H2MODEL -SmodelName=$SCOREDATANAME -SdbUrl=$SCOREDATADBURL -ScheckEmpty=$CHECKEMPTY"
    MATCHOUTPUT="-o $H2MODEL -OmodelName=$MATCHEDNAME -OdbUrl=$MATCHEDDBURL -OcheckEmpty=$CHECKEMPTY"
    MATCHTHRESHOLD=1.0
    $Match $SCOREINPUT $SCOREDATA $MATCHOUTPUT -t $MATCHTHRESHOLD -r -c

This call in the scripts renames the entries in $SCOREINPUT whose weighted scores in $SCOREDATA reach $MATCHTHRESHOLD, clears their literals and types, and sends a copy of the matched pieces to $MATCHOUTPUT.
Scripts such as pubmed, which are interested only in matches, use the match output.
Find all nodes in the given namespace matching on the given predicates
Rename each resource given as a key to the value it matched
Link matched scoreResources to vivoResources using given linking predicates
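As a rough picture of what linking produces, the sketch below prints two N-Triples statements for one matched pair, one in each direction. It assumes that the first predicate points from the input node to the VIVO node and the second points back; the helper `link_pair` and all URIs are hypothetical, and Match writes such statements into a Jena model rather than printing text.

```shell
#!/bin/sh
# Toy illustration of linking a matched pair; not Harvester code.
# $1 input URI, $2 VIVO URI, $3 input-to-VIVO predicate, $4 VIVO-to-input predicate
link_pair() {
    printf '<%s> <%s> <%s> .\n' "$1" "$3" "$2"   # input node points to the VIVO node
    printf '<%s> <%s> <%s> .\n' "$2" "$4" "$1"   # VIVO node points back to the input node
}

link_pair "http://example.org/input/pub1" "http://example.org/vivo/pub9" \
    "http://example.org/matchesVivo" "http://example.org/matchedByInput"
```

Unlike a rename, both nodes keep their own URIs and are only connected by the two new statements.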
The Match class runs a SPARQL query on the score data. The same query can be reused to access the score data for other purposes.

    # ?sInput = Input URI
    # ?sVivo = Vivo URI
    PREFIX scoreValue: <http://vivoweb.org/harvester/scoreValue/>
    SELECT DISTINCT ?sVivo ?sInput (sum(?weightValue) AS ?sum)
    WHERE {
      ?s scoreValue:InputRes ?sInput .
      ?s scoreValue:VivoRes ?sVivo .
      ?s scoreValue:hasScoreValue ?value .
      ?value scoreValue:WeightedScore ?weightValue .
    }
    GROUP BY ?sVivo ?sInput
    HAVING (?sum >= threshold )
    ORDER BY ?sInput
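To reuse the query above, the `threshold` placeholder has to be replaced with a number. One possible way to do that from a script is sketched here; `build_score_query` is a hypothetical helper, not something the Harvester provides, and how the resulting text is run against the score model is left open.

```shell
#!/bin/sh
# Print the score query with a concrete threshold in the HAVING clause.
# build_score_query is a hypothetical helper, not part of the Harvester.
build_score_query() {
    threshold="$1"
    # Unquoted here-document: only $threshold is expanded by the shell.
    cat <<EOF
PREFIX scoreValue: <http://vivoweb.org/harvester/scoreValue/>
SELECT DISTINCT ?sVivo ?sInput (sum(?weightValue) AS ?sum)
WHERE {
  ?s scoreValue:InputRes ?sInput .
  ?s scoreValue:VivoRes ?sVivo .
  ?s scoreValue:hasScoreValue ?value .
  ?value scoreValue:WeightedScore ?weightValue .
}
GROUP BY ?sVivo ?sInput
HAVING (?sum >= $threshold)
ORDER BY ?sInput
EOF
}

build_score_query 1.0
```

Keeping the query text in one function means a script can print it for several thresholds without editing the SPARQL by hand.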