Note: these discussions reflect primarily the approaches and workflow that have been used at Cornell. Other approaches are used at other sites; please update or annotate as appropriate to point out different requirements and/or solutions.
Some VIVO sites do not allow manual editing by users, but instead reflect data from one or more other systems of record, with VIVO serving as a point of integration and as a way to syndicate integrated data to other websites or reporting tools. This can simplify data management once the data are in VIVO, but it still very likely requires data alignment unless all the sources are internally consistent and share common unique identifiers.
When data in VIVO have been created or augmented by interactive editing, and when users can edit their own pages (typically called self-editing), there are more complexities to plan for.
Ideally, ingest processes are made repeatable and incremental so that changes do not require removing and then re-adding large amounts of data, but sometimes a source is updated only annually, or the source system goes through changes that require large batch updates.
The process of developing a data ingest plan for VIVO often focuses on each data source independently, but in fact there may be overlap among sources, whether those sources represent different types of data or different sources of the same type of data.
For example, people will probably come first from a human resources database – employees, departments, and the positions that connect employees and departments. But a grants ingest process will also bring in new people, as there may be investigators from other organizations listed. And when publications are ingested, a large institution may find there are tens of thousands of people records to keep straight.
In some future world that organizations like ORCID are working to achieve, every researcher will have a unique international identifier, and this identifier will help disambiguate whether the John Doe who co-authored with a researcher at your institution is the same John H. Doe serving as an investigator on a grant. For now, the mechanisms of identifiers and the heuristics of disambiguation are important to recognize but not to solve – in planning your ingest processes, it's primarily important to recognize that these questions are out there.
We don't recommend using a person's name as part of their URI, for the simple reason that their name may change. In fact, many data architects recommend always using completely randomized, meaningless identifiers within URIs (for the part after the last / in the URI, known as the local name).
When performing extract, transform, and load (ETL) tasks to get data into VIVO for the first time, it will be necessary to create URIs for each new person, organization, and other type of data ingested. These URIs can be created by VIVO's ingest processes or generated by the ETL process itself and loaded into VIVO. The ETL process can create any arbitrary URI as long as the local name begins with a letter – some RDF processors are not happy with URIs having local names that begin with a number or other symbol. The ETL process can also create a URI based on an institutional or other identifier, which has the advantage of being predictable and repeatable. However, you need to be sure that the identifier is unique and will not be re-used in the future should the person leave the institution or an organization identifier be recycled.
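The two minting strategies above can be sketched in a few lines. This is an illustrative sketch only – the namespace, function names, and the `n` prefix convention are assumptions for the example, not part of VIVO's API:

```python
import uuid

# Base namespace for minted individuals; substitute your site's VIVO namespace.
NAMESPACE = "http://vivo.example.edu/individual/"


def mint_random_uri():
    """Mint a URI with a completely meaningless local name.

    The local name is prefixed with 'n' so it always begins with a
    letter, since some RDF processors reject local names that start
    with a digit or other symbol.
    """
    return NAMESPACE + "n" + uuid.uuid4().hex


def mint_identifier_uri(prefix, identifier):
    """Mint a predictable, repeatable URI from an institutional identifier.

    Only safe when the identifier is unique and never recycled.
    """
    return NAMESPACE + prefix + str(identifier)


person_uri = mint_random_uri()                # random, opaque local name
dept_uri = mint_identifier_uri("org", 4172)   # repeatable across ingest runs
```

The identifier-based form is convenient for incremental updates because re-running the ETL process regenerates the same URI for the same record.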
The goal with subsequent ingests – whether of new types of data or updates to existing sources – is to match new incoming data against the existing URIs and their contents to avoid creating duplicates. This means having some way of checking new data against existing data.
Creating nightly accumulators
At Cornell, we have found it advantageous to run a nightly process that extracts a list of all people, and all instances of several other types of entities, along with their URIs and key identifying properties such as name parts, email addresses, and so on. These lists serve as a source against which to match incoming data, to avoid having to query our production VIVO instance every time we encounter a co-author's name, a journal, or an organization name. We call the lists accumulators, and store them in an XML format because our largest source of updates about researcher activities comes from an XML web service.
These accumulator lists help ensure that new data are matched against existing data, reducing, though not eliminating, false positives and false negatives. We will discuss disambiguation in more detail further along in the process, in connection with How to manage data cleanup in VIVO.
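A minimal sketch of the accumulator lookup described above. The XML structure, element names, and matching keys here are hypothetical assumptions for illustration – Cornell's actual accumulator schema is not documented on this page:

```python
import xml.etree.ElementTree as ET

# Hypothetical accumulator format: one <person> per existing VIVO individual,
# carrying the URI plus key identifying properties.
ACCUMULATOR_XML = """
<people>
  <person uri="http://vivo.example.edu/individual/n1">
    <lastName>Doe</lastName><firstName>John</firstName>
    <email>jd100@example.edu</email>
  </person>
</people>
"""


def load_accumulator(xml_text):
    """Index existing people by email and by a normalized (last, first) key."""
    by_email, by_name = {}, {}
    for p in ET.fromstring(xml_text).findall("person"):
        uri = p.get("uri")
        email = p.findtext("email", "").strip().lower()
        name_key = (p.findtext("lastName", "").strip().lower(),
                    p.findtext("firstName", "").strip().lower())
        if email:
            by_email[email] = uri
        by_name[name_key] = uri
    return by_email, by_name


def match_person(record, by_email, by_name):
    """Return the existing URI for an incoming record, or None if it is new.

    Email is treated as the stronger key; exact name match is the fallback.
    Real disambiguation needs more heuristics than this sketch shows.
    """
    email = record.get("email", "").strip().lower()
    if email and email in by_email:
        return by_email[email]
    name_key = (record.get("last", "").strip().lower(),
                record.get("first", "").strip().lower())
    return by_name.get(name_key)
```

Records that match nothing in the accumulator are the candidates for minting new URIs; everything else reuses the existing URI, which is what keeps duplicates out of production.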