

This page covers how VIVO differs from a spreadsheet, where VIVO data typically comes from, cleaning data prior to loading, matching against data already in VIVO, and doing further cleanup once it's in VIVO.

 

 

How VIVO differs from a spreadsheet

VIVO stores data as RDF individuals – entities that are instances of OWL classes, that relate to each other through OWL object properties, and that have attributes represented as OWL datatype property statements. Put very simply:

  • Classes are types – Person, Event, Book, Organization
  • Individuals are instances of types – Joe DiMaggio, the 2014 AAAS Conference, Origin of Species, or the National Science Foundation
  • Object properties express relationships between two individual entities, whether of the same or different types – a book has a chapter, a person attends an event
  • Datatype properties (data properties for short) express simple attribute relationships for one individual – a time, date, short string, or full page of text

Every class, property, and individual has a URI that serves as an identifier but is also resolvable on the Web as Linked Data.

A triple in RDF has a subject, a predicate, and an object – think back to sentence diagramming in junior high school if you go back that far.  

(Figure: an RDF triple diagrammed as subject – predicate – object)

So far all this would fit in a spreadsheet – one row per statement, but never more than 3 columns.
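
To make the triple structure concrete, here is a minimal sketch using the rdflib Python library (any RDF toolkit would do). The example.org namespace and the attendsEvent property are invented for illustration; only rdf:type and rdfs:label are standard terms.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS

# A made-up namespace for this illustration; a real VIVO would use its own default namespace
EX = Namespace("http://example.org/individual/")

g = Graph()
g.bind("ex", EX)
g.bind("foaf", FOAF)

# Each add() is one triple: subject, predicate, object
g.add((EX.person123, RDF.type, FOAF.Person))                  # the individual is an instance of a class
g.add((EX.person123, RDFS.label, Literal("Jones, Sally")))    # datatype property: a simple attribute value
g.add((EX.person123, EX.attendsEvent, EX.conference2014))     # object property (made up) relating two individuals
g.add((EX.conference2014, RDFS.label, Literal("2014 AAAS Conference")))

print(g.serialize(format="turtle"))   # rdflib 6+ returns a string
```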

This may not be the most useful analogy, however, since you can't say very much in a single three-part statement, and your data will be much more complex than that. A person has a first name, middle name, and last name; a title; a position linking them to a department; many research interests; and sometimes hundreds of publications. In a spreadsheet world you can keep adding columns to represent more attributes, but that approach soon breaks down.

But let's stay simple and say you only want to load basic directory information in VIVO – name, title, department, email address, and phone number.

name         | title            | department      | email                | phone
Sally Jones  | Department Chair | Entomology      | sj24@university.edu  | 888 777-6666
Ruth Ginsley | Professor        | Classics        | rbg12@university.edu | 888 772-1357
Sam Snead    | Therapist        | Health Services | ss429@university.edu | 888 772-7831

Piece of cake – until you have a person with two (or six) positions (it happens), or two offices and hence two work phone numbers.

VIVO breaks data apart into chunks of information that belong together, in much the same way that relational databases store information about different types of things in different tables. There's no right or wrong way to do it, but VIVO stores the person independently of the position and the department – the position carries information about a person's title and their beginning and ending dates, while the department will be connected to multiple people through their positions but also to grants, courses, and other information.

VIVO even stores a person's detailed name and contact information as a vCard, a W3C standard ontology that itself contains multiple chunks of information. More on this later.

Storing information in small units removes the need to specify how many 'slots' to allow in the data model while also allowing information to be assembled in different ways for different purposes – a familiar concept from the relational database world, but accomplished through an even more granular structure of building blocks – the RDF triple.  There are other important differences as well – if you want to learn more, we recommend The Semantic Web for the Working Ontologist, by Dean Allemang and Jim Hendler.
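
As a rough sketch of what this looks like in practice, the Sally Jones row from the table above might break apart into a person, a position, a department, and a vCard for contact details. The class and property names below follow the VIVO-ISF and vCard pattern as best recalled here – before relying on them, create one person through the VIVO editor and compare against the RDF it exports.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS

EX    = Namespace("http://example.org/individual/")   # made-up instance namespace
VIVO  = Namespace("http://vivoweb.org/ontology/core#")
VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")
OBO   = Namespace("http://purl.obolibrary.org/obo/")

g = Graph()
for prefix, ns in [("ex", EX), ("vivo", VIVO), ("vcard", VCARD), ("obo", OBO)]:
    g.bind(prefix, ns)

# The person, the position, and the department are separate individuals
g.add((EX.n1234, RDF.type, FOAF.Person))
g.add((EX.n1234, RDFS.label, Literal("Jones, Sally")))

g.add((EX.position1, RDF.type, VIVO.Position))
g.add((EX.position1, RDFS.label, Literal("Department Chair")))
g.add((EX.position1, VIVO.relates, EX.n1234))            # the position relates the person ...
g.add((EX.position1, VIVO.relates, EX.dept_entomology))  # ... and the department

g.add((EX.dept_entomology, RDF.type, VIVO.AcademicDepartment))
g.add((EX.dept_entomology, RDFS.label, Literal("Entomology")))

# Contact details live in a vCard individual attached to the person
g.add((EX.n1234, OBO.ARG_2000028, EX.vcard1234))         # "has contact info" in VIVO-ISF
g.add((EX.vcard1234, RDF.type, VCARD.Individual))
g.add((EX.vcard1234, VCARD.hasEmail, EX.email1234))
g.add((EX.email1234, RDF.type, VCARD.Email))
g.add((EX.email1234, VCARD.email, Literal("sj24@university.edu")))

print(g.serialize(format="turtle"))
```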

Where VIVO data typically comes from

It's perfectly possible, if laborious, to add all data to VIVO through interactive editing. For a small research institution this may be the preferred method, and many VIVO institutions employ students or staff to add and update information for which no reliable system of record exists. If VIVO has been hooked up to the institutional single sign-on, self-editing by faculty members or researchers has been used effectively, especially if basic information has already been populated and the focus of self-editing is on research interests, teaching statements, professional service, or other more straightforward information.

This approach does not scale well to larger institutions, and full reliance on researchers to do their own editing brings its own problems of training, consistency in data entry, and motivating people to keep content up to date. Many VIVOs are supported through libraries that are more comfortable providing carrots than sticks and want the VIVO outreach message to focus on positive benefits rather than threats about stale content or mandates to enter content for annual reporting purposes.

VIVO is all about sharing local data both locally and globally. Much of the local data typically resides in "systems of record" – formerly entirely locally hosted and often homegrown, but more recently starting to migrate to open source software (e.g., Kuali) or to cloud solutions.

These systems of record are often silos used for a defined set of business purposes such as personnel and payroll, grants administration, course registration and management, an institutional repository, news and communications, event calendar(s), or extension.  Even when the same software platform is used, local metadata requirements and functional customizations may make any data source unique.

For this reason, coupled with variations in local technology skills and support environments, the VIVO community has not developed cookie-cutter, one-size-fits-all ingest solutions.

Here's a rough outline of the approach we recommend for people new to VIVO when they start thinking about data:

  1. Look at other VIVOs to see what data other people have loaded and how it appears
  2. Learn the basics about RDF and the Semantic Web (there's an excellent book) – write 10 lines of RDF
  3. Download and install VIVO on a PC or Mac, most easily through the latest VIVO Vagrant instance
  4. Add one person, a position, a department, and some keywords using the VIVO interactive editor
  5. Export the data as RDF and study it – what did you think you were entering, and how different is the structure that VIVO creates?
  6. Add three more people, their positions, and a couple more departments. Try exporting again to get used to the RDF VIVO is creating.
  7. Try typing a small RDF file with a few new records, reusing the URI VIVO created if you are referring to an entity that already exists (see the sketch after this list). Load that RDF into VIVO through the Add/Remove RDF command on the Site Admin menu – does it look right? If not, double-check your typing.
  8. Repeat this with a publication to learn about the Authorship structure – enter by hand, study, hand edit a couple of new records, ingest, test
  9. Don't be afraid to start over with a clean database – you are learning, not going directly to production.
  10. When you start to feel comfortable, think about what data you want to start with – perhaps people and their positions and titles. Don't start with the most challenging and complex data first.
  11. Work with a tool like Karma to learn semantic modeling and produce RDF, however simple, without writing your own code
  12. Load the data from Karma into your VIVO – does it look right? If not, look for differences in the RDF – they may be subtle
  13. Then start looking more deeply at the different ingest methods described later in this section of the wiki and elsewhere
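
For step 7, the hand-typed file can be as small as a couple of statements. Here is a minimal sketch that writes such a file with rdflib; the namespace and URI below are placeholders, so reuse the URIs your own VIVO minted.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS

# Use the URI your VIVO already created for this person (check the exported RDF);
# the one below is a placeholder.
EX = Namespace("http://example.org/individual/")

g = Graph()
g.bind("ex", EX)
g.add((EX.n4321, RDF.type, FOAF.Person))
g.add((EX.n4321, RDFS.label, Literal("Ginsley, Ruth")))

# Save as N3/Turtle, then load it via Site Admin > Add/Remove RDF data
g.serialize(destination="new-person.n3", format="n3")
```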

Cleaning data prior to loading

If you have experience with data you'll no doubt be familiar with having to clean your data before importing it into software designed to show that data publicly. There will be duplicates, uncontrolled name variants, missing data, misspellings, capitalization differences, oddball characters, international differences in names, broken links, and very likely other problems.

VIVO is not smart enough to fix these problems, since in almost all cases it has no means of distinguishing an important subtle variation from an error. In the Semantic Web, 'inference' does not refer to the ability to second-guess intent, nor does 'reasoning' mean invoking artificial intelligence.

Furthermore, it's almost always easier to fix dirty data before it goes into VIVO than to find and fix it after loading.

There's another important consideration – the benefit of fixing data at its source. Putting data into a VIVO makes that data much more discoverable through both searching and browsing. People will find errors that were most likely hidden in the source system of record – but if you fix them only in VIVO, they won't be corrected in the source system, and the next ingest runs the risk of overwriting your fixes.

There are many ways to clean data, but in the past few years OpenRefine has emerged as a remarkably capable tool, including the ability to develop changes interactively and then save them as scripts to run repeatedly on larger batches.
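
OpenRefine is the interactive route, but the same kinds of fixes can also be scripted. The sketch below uses pandas rather than OpenRefine, with invented file names, column names, and rules, just to illustrate the sort of cleanup worth doing before ingest.

```python
import pandas as pd

# Hypothetical export from a personnel system
people = pd.read_csv("people.csv")

# Trim stray whitespace and normalize obvious variants before loading
people["name"] = people["name"].str.strip()
people["department"] = people["department"].str.strip().replace(
    {"Dept. of Entomology": "Entomology"}   # collapse known name variants
)
people["email"] = people["email"].str.lower()

# Flag rows with missing required fields and drop exact duplicates
missing = people[people["email"].isna()]
print(f"{len(missing)} rows are missing an email address")
people = people.drop_duplicates(subset=["email"])

people.to_csv("people_clean.csv", index=False)
```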

Matching against data already in VIVO

The first data ingest from a clean slate is the easiest one.  After the first time there's a high likelihood that updates will refer to many of the same people, organizations, events, courses, publications, etc.  You don't want to introduce duplicates, so you need to check new information against what's already there.

Sometimes this is straightforward because you have a unique identifier to definitively distinguish new data from existing – your institutional identifier for people, but also potentially an ORCID iD, a DOI on a publication, or a FundRef registry identifier.

Often the matching is more challenging – is 'Smith, D' a match with Delores Smith or David Smith?  Names will be misspelled, abbreviated, changed over time – and while a lot can be done with relatively simple matching algorithms, this remains an area of research, and of homegrown, ad-hoc solutions embedded in local workflows.

One way to check each new entry is to query VIVO, directly or through a derivative SPARQL endpoint using Fuseki or another tool, to see whether the identifier or name matches data already in VIVO. As VIVO scales, there can be performance reasons for extracting a list of names of people, organizations, and other common data types for ready reference during an ingest process. It's also sometimes possible to develop an algorithm for URI construction that is reliably repeatable and hence avoids the need to query – for example, if you embed the unique identifier for a person in their VIVO URI through some simple transformation, you should be able to predict a person's URI without querying VIVO. If that person has not yet been added to VIVO, adding them through a later ingest is not a problem as long as the same URI would have been generated. This requires that the identifier embedded in the URI is not private or sensitive (e.g., not the U.S. social security number) and will not change.
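
Both approaches can be sketched in a few lines of Python. The endpoint URL, hashing scheme, and identifier below are hypothetical; VIVO's own SPARQL API requires credentials, so many sites query a read-only Fuseki copy of the triple store instead, as noted above.

```python
import hashlib
from SPARQLWrapper import SPARQLWrapper, JSON

DEFAULT_NS = "http://vivo.example.edu/individual/"   # must match Vitro.defaultNamespace

def person_uri(netid: str) -> str:
    """Derive a repeatable, non-reversible URI from an internal identifier."""
    return DEFAULT_NS + "per-" + hashlib.sha1(netid.encode()).hexdigest()[:12]

# Hypothetical endpoint, e.g. a read-only Fuseki copy of the VIVO triple store
sparql = SPARQLWrapper("http://fuseki.example.edu/vivo/query")
sparql.setReturnFormat(JSON)
sparql.setQuery(f"""
    ASK {{ <{person_uri('sjones24')}> ?p ?o }}
""")
exists = sparql.query().convert()["boolean"]
print("already in VIVO" if exists else "new individual")
```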

One caution here – it's important to think carefully about the default namespace you use for URIs in VIVO if you want linked data requests to work. Please see A simple installation and look for the basic settings, in particular Vitro.defaultNamespace.

Doing further cleanup once in VIVO

VIVO's interactive editing can be helpful in fixing problems in relatively small datasets, but it's important to remember that if data originate outside of VIVO and are not corrected at the source, subsequent updates will likely re-introduce the errors.

It's also common to have discrepancies in the source data – for example, the naming conventions and identifiers used for departments in a personnel database vs. those used in a grants administration system. There is a command to merge two individuals in VIVO, specifying which URI to retain, but that will combine the statements associated with both, leading to duplicate labels and potentially other duplicates.

VIVO has only limited support for owl:sameAs reasoning due to the performance implications of having to query for all statements about more than one URI whenever rendering information about any one URI declared to be sameAs another.

Many VIVO installations have developed workflows for checking VIVO data, ranging from broken link checkers to nightly SPARQL queries to detect malformed data such as publications without authors, people without identifiers, 'orphaned' dates no longer referenced by any property statements, and so forth.  These tools have been discussed on previous Apps and Tools Interest Group calls that have been recorded and uploaded to YouTube.
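
A nightly check of this kind can be as simple as a SPARQL query run on a schedule. The sketch below looks for publications with no authorship attached; the endpoint URL is hypothetical, and the class and property names should be verified against your VIVO version and any local ontology extensions.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical read-only endpoint (e.g., a Fuseki copy of the VIVO triple store)
sparql = SPARQLWrapper("http://fuseki.example.edu/vivo/query")
sparql.setReturnFormat(JSON)

# Publications with no authorship attached
sparql.setQuery("""
    PREFIX vivo: <http://vivoweb.org/ontology/core#>
    PREFIX bibo: <http://purl.org/ontology/bibo/>
    SELECT ?pub WHERE {
        ?pub a bibo:Document .
        FILTER NOT EXISTS {
            ?pub vivo:relatedBy ?authorship .
            ?authorship a vivo:Authorship .
        }
    } LIMIT 100
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["pub"]["value"])
```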

 


Introduction


You've looked at VIVO, you've seen VIVO in action at other universities or organizations, you've downloaded and installed the code.  What next? How do you get information about your institution into your VIVO?

The answer may be different everywhere – it depends on a number of factors.

  • How big is your organization? Some smaller ones have implemented VIVO only through interactive editing – they enter every person, publication, organizational unit, grant, and event they wish to show up, and they keep up with changes "manually" as well. This approach works well for organizations with under 100 people or so, especially if you have staff or student employees who are good at data entry and enjoy learning more about the people and the research. There's something of an inverse correlation with age – students can be blazingly fast with data entry, employing multiple windows and copying and pasting content. The site takes shape before your eyes and it's easy to measure progress and, after a bit of practice, predict how long the process will take.
    • This approach may also be a good way to develop a working prototype with local data to use in making your case for a full-scale effort.  The process of data entry is tedious but a very good way to learn the structure inherent in VIVO.
    • We recommend that people new to RDF and ontologies enter representative sample data by hand and then export it in one of the more readable RDF formats such as n3, n-triples, or turtle.  This is an excellent way to compare what you see on the screen with the data VIVO will actually produce – and when you know your target, it's easier to decide how best to develop a more automated ingest process.
  • The interactive approach will obviously not work with big institutions or where staff time or a ready pool of student editors is not available.  There are also many advantages to developing more automated means of ingest and updating, including data consistency and the ability to replace data quickly and on a predictable timetable.
  • What are your available data sources? Some organizations have made good institutional data a priority, and others struggle with legacy systems lacking consistent identifiers or common definitions for important categorizations such as distinct types of units or employment positions. You may have to make some inquiries to find the right people to contact to find out what data are available, and the stakeholders on your VIVO project may need to request access to that data.

Next – what is different about data in VIVO?

As we've described, it's well worth learning the VIVO editing environment and creating sample data even if you know you will require an automated approach to data ingest and update.

VIVO makes certain assumptions about data based largely on the types of data, relationships, and attributes described in the VIVO ontology.  These assumptions do not always follow traditional row and column data models, primarily because the application almost always allows for arbitrarily repeating values rather than holding strictly to a fixed number of values per record.  Publications may most frequently have fewer than five authors, but in some fields such as experimental physics it's common to see hundreds of authors – not very workable in a one-row-per-publication, one-column-per-author spreadsheet model.

In VIVO, data about people, organizations, events, courses, places, dates, grants, and everything else are stored in one very simple, three-part structure – the RDF statement.  A statement, or triple, has a subject (any entity), a predicate or property, and an object that can be either another related entity or a simple data value such as a number, text string, or date.  While users will see VIVO data expressed in larger aggregations as web pages, internally VIVO is storing its data as RDF statements or triples.  

This is not the place to explain everything about RDF – there are many good tutorials available and other sections of this wiki explain the VIVO ontology and the more technical aspects of RDF. For now, just bear in mind that while the data you receive may come to you in one format, much of the work of data ingest involves decomposing that data into simple statements that will then be re-assembled by the VIVO application, guided by the ontology, into a coherent web page or a packet of Linked Open Data.

What data can VIVO accept?

With VIVO, your destination will be RDF but you may receive the data in a variety of formats. A first stage in planning ingest involves analyzing what data you have access to and mapping on paper how it needs to be transformed for VIVO.

It's probably most common for data to be provided in spreadsheet format, which can be very simple to transform into RDF if each column of every row refers to attributes of the same entity, usually identified by a record identifier. The process becomes more complicated if different cells in the same row of the spreadsheet refer to different entities.

The following spreadsheet would be very easy to load into a VIVO describing cartoon characters:

id | name       | height | age
1  | Goofy      | 89 cm  | 11
2  | Elmer Fudd | 60 cm  | 45
3  | Roadrunner | 140 cm | 2

You can readily imagine storing the information about each cartoon character – id, name, height, and age – in one entity per character.
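
A minimal sketch of that transformation with rdflib, assuming the table above has been saved as characters.csv; the namespaces, class, and property names are invented for illustration.

```python
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

EX     = Namespace("http://example.org/individual/")   # made-up instance namespace
SCHEMA = Namespace("http://example.org/ontology/")     # made-up class and properties

g = Graph()
g.bind("ex", EX)

with open("characters.csv", newline="") as f:
    for row in csv.DictReader(f):
        char = EX["character" + row["id"]]              # one individual per row, keyed on the id column
        g.add((char, RDF.type, SCHEMA.CartoonCharacter))
        g.add((char, RDFS.label, Literal(row["name"])))
        g.add((char, SCHEMA.height, Literal(row["height"])))
        g.add((char, SCHEMA.age, Literal(row["age"], datatype=XSD.integer)))

g.serialize(destination="characters.n3", format="n3")
```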

A spreadsheet of books, however, would be more complicated:

id     | title                 | publication date | author           | publisher                     | pages
497531 | Cartoon Animation     | 1967             | Wilcox, George   | HB Press                      | 237
501378 | Animation Techniques  | 1989             | Smith, Charlotte | Cinema Press                  | 359
391783 | Digital Animation     | 2005             | Ivar, Samuel     | Digital Logic, Inc.           | 327
34682  | Dairy Barn Automation | 2011             | Wilcox, G.P.     | University of Minnesota Press | 403

VIVO stores the book, each author, and the publisher as independent entities related to one another. This enables information about the book, authors, and publisher to be queried and displayed independently, a key feature of the semantic data model.
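
A rough sketch of how the first row of the books table decomposes into separate, linked individuals; the URIs are invented, and the class and property names are indicative only, so check them against RDF exported from your own VIVO.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS

EX   = Namespace("http://example.org/individual/")   # made-up instance namespace
BIBO = Namespace("http://purl.org/ontology/bibo/")
VIVO = Namespace("http://vivoweb.org/ontology/core#")

g = Graph()

# One spreadsheet row becomes several linked individuals
book      = EX.book497531
author    = EX.personWilcoxGeorge   # in practice, match against people already in VIVO first
publisher = EX.orgHBPress

g.add((book, RDF.type, BIBO.Book))
g.add((book, RDFS.label, Literal("Cartoon Animation")))

g.add((author, RDF.type, FOAF.Person))
g.add((author, RDFS.label, Literal("Wilcox, George")))

g.add((publisher, RDF.type, VIVO.Publisher))
g.add((publisher, RDFS.label, Literal("HB Press")))
g.add((book, VIVO.publisher, publisher))

# VIVO links authors to publications through an intermediate Authorship individual
g.add((EX.authorship1, RDF.type, VIVO.Authorship))
g.add((EX.authorship1, VIVO.relates, book))
g.add((EX.authorship1, VIVO.relates, author))

print(g.serialize(format="turtle"))
```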

This example also points out another challenge in working with data – it's not always clear when values that appear similar actually represent the same entity, whether a person, organization, title, journal, or event.  It would be easy to assume the George Wilcox in the first entry is the same as G.P. Wilcox in the 4th, but they are writing about very different topics. For a small organization, it may be easy to disambiguate authors, but this becomes a major challenge at the scale of a major research university.

Data cleanup and disambiguation are challenges for any system and will be a common theme in documenting VIVO data ingest along with semantic data modeling that is more specific to working with VIVO.
