Background

The Current Workflow

See Typical harvest

The Issue

While this process works, it processes every record on every run of the harvester, including a potentially huge portion of records that have not been modified at all. Additionally, it compares the current harvest with a 'last-harvest' model that must be kept synchronized with any modifications made in the live VIVO model (vitro-kb-2). This either creates a confusing amount of extra work for those manually maintaining the data, or lets updates from the source create anomalous 'extra' triples.

The Solution

The solution is to separate out changed data as early as possible and discard unchanged data, since nothing needs to be done to it. By comparing the raw data from the source, we can isolate changes before putting the data through the entire harvest process. This could drastically decrease the time taken by successive runs of the harvester (for updates). Additionally, since we are comparing the raw data from the source, we can isolate the type of change as well. New records, deleted records, and modified fields in a record can be isolated and handled in separate, appropriate ways: new records are sent through a process similar to the current harvest workflow; deleted records are matched to VIVO and their data is purged from VIVO; updated fields are updated in place, which even allows the source's authoritativeness to be taken into account.
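As a rough illustration of the idea, the sketch below compares two snapshots of raw source data and sorts the differences into new records, deleted records, and modified fields. It is not harvester code: the names `ChangeSet` and `diff_snapshots`, the sample data, and the representation of a snapshot as a mapping from record ID to a dictionary of fields are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeSet:
    """Hypothetical container for the three kinds of change."""
    new: dict = field(default_factory=dict)       # record id -> fields
    deleted: dict = field(default_factory=dict)   # record id -> last known fields
    modified: dict = field(default_factory=dict)  # record id -> {field: (old, new)}


def diff_snapshots(previous, current):
    """Compare two raw snapshots (record id -> dict of fields).

    Unchanged records are simply not reported, so they never enter
    the rest of the harvest process.
    """
    changes = ChangeSet()
    for rec_id, fields in current.items():
        if rec_id not in previous:
            changes.new[rec_id] = fields
            continue
        old = previous[rec_id]
        # Assumption: a missing field and a field set to None are equivalent.
        field_changes = {
            name: (old.get(name), fields.get(name))
            for name in sorted(old.keys() | fields.keys())
            if old.get(name) != fields.get(name)
        }
        if field_changes:
            changes.modified[rec_id] = field_changes
    for rec_id, fields in previous.items():
        if rec_id not in current:
            changes.deleted[rec_id] = fields
    return changes


# Made-up sample data: p1 is unchanged, p2 changes title, p3 disappears, p4 is new.
previous = {
    "p1": {"name": "Ada", "title": "Professor"},
    "p2": {"name": "Bob", "title": "Lecturer"},
    "p3": {"name": "Cy", "title": "Dean"},
}
current = {
    "p1": {"name": "Ada", "title": "Professor"},
    "p2": {"name": "Bob", "title": "Professor"},
    "p4": {"name": "Dee", "title": "Postdoc"},
}
changes = diff_snapshots(previous, current)
```

Each of the three groups could then be handed to its own step: new records to the usual harvest workflow, deleted records to a match-and-purge step, and modified fields to a targeted update.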

To implement this concept using the harvester, it would be best to add a few tools to the toolset. By leveraging some of the advantages of our toolset (the concept of Records being unique, comparable objects), we can create these tools fairly easily.