
back up to How to plan data ingest for VIVO

previous topic: Ingest tools: home brew or off the shelf?

Note: this is an approach that has been used at Cornell. Other approaches are used at other sites.

  • Data in VIVO comes from ingested sources and from manual editing.
    • Some VIVO sites do not allow manual editing by users. This can simplify the task.
    • A separate VIVO instance is used for ingest.
      • This instance is populated from the nightly backup of the production instance.
      • The use of a separate VIVO means that the production instance is not loaded down by the ingest process.
      • Ingest processes run at night.
        • Since ingested data is largely separate from manually edited data, conflicts are unlikely; the main concern is the additional load on the system.
        • Ingest processes are run that compare the new data to the data in VIVO.
        • They generate the RDF triples that must be added to or removed from VIVO to represent the new data.
        • Because we do not apply these triples immediately, we can inspect them for correctness before committing them (a sketch of this compare-and-inspect step follows this list).
        • The RDF triples are applied to the production VIVO system.
        • These processes are ad hoc and idiosyncratic to Cornell’s data sources and ontology extensions. They change constantly and are not packaged for release.
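
Here is a minimal sketch of the compare-and-diff step described above, using Python's rdflib library. The file names and graph contents are hypothetical, and VIVO does not ship this script; it only illustrates the pattern of computing additions and retractions rather than overwriting data.

```python
from rdflib import Graph
from rdflib.compare import to_isomorphic, graph_diff

# Hypothetical inputs: the triples currently in VIVO for this source,
# and the freshly transformed triples from the new extract.
current = Graph().parse("vivo_current_positions.nt", format="nt")
incoming = Graph().parse("hr_extract_positions.nt", format="nt")

# graph_diff returns (triples in both, triples only in the first graph,
# triples only in the second graph); to_isomorphic canonicalizes blank
# nodes so the two graphs compare sensibly.
in_both, only_current, only_incoming = graph_diff(
    to_isomorphic(current), to_isomorphic(incoming)
)

# Stale triples to retract from VIVO, and new triples to add, written
# out as N-Triples files so they can be inspected before being applied.
only_current.serialize("retractions.nt", format="nt")
only_incoming.serialize("additions.nt", format="nt")

print(f"{len(only_current)} triples to remove, {len(only_incoming)} to add")
```

Writing the additions and retractions to files rather than applying them directly is what makes the inspection step described above possible.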

Walking through a repeatable ingest process that tests new data against what is already in VIVO

Concepts

The process of developing a data ingest plan for VIVO often treats each data source independently, but in practice there is overlap among sources, whether those sources represent different types of data or different sources of the same type of data.

For example, people will probably come first from a human resources database – employees, departments, and the positions that connect them. But a grants ingest process will also bring in new people, since investigators from other organizations may be listed. And when publications are ingested, a large institution may find it has tens of thousands of person records to keep straight.

In some future world that organizations like ORCID are working to achieve, every researcher will have a unique international identifier, and this identifier will help determine whether the John Doe who co-authored with a researcher at your institution is the same John H. Doe serving as an investigator on a grant. For now, the mechanisms of identifiers and the heuristics of disambiguation are important to recognize but not to solve; in planning your ingest processes it is primarily important to know that these questions are out there.
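
To make that planning point concrete, here is a hedged sketch of the kind of matching heuristic a site might adopt while identifiers remain incomplete: trust an ORCID iD when both records carry one, accept a shared email plus the rest of the evidence, and route bare name matches to manual review. The record fields, helper names, and example values are all hypothetical, not part of VIVO or ORCID.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonRecord:
    # Hypothetical fields; real sources will differ.
    name: str
    email: Optional[str] = None
    orcid: Optional[str] = None

def normalize(name: str) -> str:
    """Crude normalization: lowercase, strip punctuation and extra spaces."""
    kept = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def same_person(a: PersonRecord, b: PersonRecord) -> Optional[bool]:
    """Return True/False when we can decide, None when a human must review."""
    # An ORCID iD, when present on both sides, settles the question.
    if a.orcid and b.orcid:
        return a.orcid == b.orcid
    # A shared institutional email is strong evidence of a match.
    if a.email and b.email and a.email.lower() == b.email.lower():
        return True
    # A bare name match is not enough to merge records automatically.
    if normalize(a.name) == normalize(b.name):
        return None  # flag for manual disambiguation
    return False

# Hypothetical records: John Doe from a grant feed vs. John H. Doe
# from a publication feed, both carrying the same (made-up) ORCID iD.
print(same_person(
    PersonRecord("John Doe", orcid="0000-0002-1825-0097"),
    PersonRecord("John H. Doe", orcid="0000-0002-1825-0097"),
))  # True: the identifier disambiguates
```

The important design choice here is the None branch: an automated matching process should know when it cannot decide and hand the record to a person.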

Addressing identity

We don't recommend using a person's name as part of their URI, for the simple reason that the name may change. In fact, many data architects recommend always using completely randomized, meaningless identifiers within URIs (for the part after the last /, known as the local name).
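
One common way to get such an opaque local name is to mint a random UUID when a new individual is created, as in this sketch. The namespace and label are placeholders; a real site would substitute its own VIVO base URI.

```python
import uuid

from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import FOAF

# Placeholder namespace; substitute your site's own VIVO base URI.
BASE = Namespace("http://vivo.example.edu/individual/")

def mint_person_uri():
    """Mint a URI whose local name (the part after the last /) is a
    meaningless random string, so the URI never has to change when
    the person's name does."""
    return BASE["n" + uuid.uuid4().hex]

g = Graph()
person = mint_person_uri()
g.add((person, RDF.type, FOAF.Person))
g.add((person, RDFS.label, Literal("Doe, John")))
print(person)  # e.g. http://vivo.example.edu/individual/n3f2c9...
```

The name lives only in the rdfs:label, where it can be corrected without disturbing the URI or any triples that reference it.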

next topic: Challenges for data ingest
