Implementation and Development Call 20141030

Shortcut to temporary Google doc for meeting notes: http://goo.gl/tCuNFm

Announcements

Ontology Working Group: next call November 4 at 1pm ET

Mailing list discussion about next week’s call -- main topic: the move to GitHub, Shahim describing ISF multiple ontology files vs single VIVO OWL file
On the Tuesdays when there is not a call, there are Ontology Office Hours using the same call-in information as the main Ontology calls

Apps & Tools

November 18 -- Josh Hanna from UF will detail how he hooked a test VIVO 1.7 vagrant at UF up to an alternative triple store (StarDog)

you have to get your own download of StarDog as a zip file and can then use Vagrant to build one
Josh has put in all 18 million triples

SPARQL queries that ran in the past are faster

Paul -- Eliza did the same thing with Virtuoso

working on tuning
going from 300 to 500 publications quintuples the query time
SPARQL queries are 5-10x as fast

Chris -- could Eliza join the call on the 18th?

Chris -- have produced an ntriples file of the UF dataset so have been able to produce a coverage document detailing which classes and properties are actually populated -- come to the ontology call to hear more

VIVO Hackathon - any other thoughts here? updates? questions?
VIVO Implementation Fest hold the dates -- please hold the week of March 16-20, 2015 as the date window for the 2015 implementation fest -- hosted by Oregon Health and Science University in Portland, OR

Theme: Inferencing and Reinferencing in VIVO

Guest presenter: Brian Lowe

Introduction

What is inferencing?

Brian -- will start at a basic level
A broad topic that can be used for a lot of things
how we use it in VIVO is very simple
basically, if you make a certain assertion, an inference is the ability to then say that other statements are true
e.g., if you assert that something is an Academic Department, a reasoner can infer that it is also a foaf:Organization
not very exciting, but very useful -- when rendering a page of all organizations, don’t have to cycle through all the different type assertions to figure out which types map up to foaf:Organization, but can just look for the statements of rdf:type foaf:Organization that have been added by the inferencing process
these extra triples get added into a separate graph in the Jena triplestore, called the inference graph -- this is referred to as being “materialized” into the graph
allows you to get what you’re looking for more quickly
in the biomedical world you may be doing much more advanced reasoning, such as working off of associations with genes to determine what diseases may be implicated by certain symptoms in organisms
Similar to how Solr is used… inferencing makes it faster to access information... whenever triples are added through VIVO interfaces (user interface editing, API ingest) the reasoner is used to infer additional triples
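
A minimal sketch of the type inferencing described above, assuming a tiny hand-written class hierarchy. The hierarchy, the ex:dept1 individual, and the function names are illustrative only and are not taken from the VIVO code base; triples are plain Python tuples rather than a Jena model.

```python
# Illustrative class hierarchy: class -> direct superclasses (an assumption).
SUBCLASS_OF = {
    "vivo:AcademicDepartment": ["vivo:Department"],
    "vivo:Department": ["foaf:Organization"],
    "foaf:Organization": ["foaf:Agent"],
}
RDF_TYPE = "rdf:type"


def superclasses(cls):
    """Return every ancestor of cls by walking SUBCLASS_OF transitively."""
    found, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def materialize_types(assertions):
    """Compute the rdf:type triples implied by, but absent from, the assertions."""
    inferred = set()
    for s, p, o in assertions:
        if p == RDF_TYPE:
            inferred |= {(s, RDF_TYPE, parent) for parent in superclasses(o)}
    return inferred - set(assertions)


# The asserted graph holds one type statement; the inference graph holds the rest.
assertions = {("ex:dept1", RDF_TYPE, "vivo:AcademicDepartment")}
inference_graph = materialize_types(assertions)


def instances_of(cls):
    """A simple lookup over both graphs; no hierarchy walk needed at query time."""
    return sorted(
        s for s, p, o in assertions | inference_graph if p == RDF_TYPE and o == cls
    )
```

With the inferred triples materialized, instances_of("foaf:Organization") is a flat lookup; without them, the same question would require walking the class hierarchy for every query.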

If you only have a single RDF type triple, you would need more complex queries to browse or search in VIVO -- inferred RDF type triples simplify these queries
Solr is fundamentally doing this in a very different way, however -- it’s flattening the information in the graph into “documents”

A separate piece of reasoning in VIVO related to the ontology

Used in part to determine what properties should appear on a page for a foaf:Person as opposed to a bibo:Document
The code that decides what properties to display on the page is dependent on the TBox reasoning -- the entire ontology as loaded into memory
It’s sort of arbitrary -- in OWL it’s possible to make an assertion about anything, including statements that are inconsistent with the ontology
But OWL and RDF are not like a schema language

One chunk of code that uses the Pellet reasoner

dates back to the first fully semantic version of VIVO that had an in-memory ontology model - all queries ran against the in-memory store but needed to be able to persist that on disk (in a database, via Jena)
had to move away from this when memory became a problem
was logical to employ a reasoner that would work in memory

a listener picked up the triples added to or removed from the model and sent them over to the Pellet reasoner and its knowledge base
as those inferences were added or removed via Pellet, those were synchronized with the database store in the background

when editing, you would see your direct assertions first, and then see additional statements added by Pellet in a few seconds

we limited what we asked Pellet to do -- e.g., not data property reasoning

When we moved inferencing of all the data out of memory, we retained the reasoning of the ontology itself

what is a subclass of other classes
because OWL is description logic, set up to define a class and what properties it has, the reasoner can then figure out where that class fits in the class tree
Pellet does that reasoning using all the axioms, many of which are now added in other parts of the ISF -- that piece is still necessary
But the triples about all the data can’t be sent to Pellet because there isn’t enough memory for that
We created the VIVO simple reasoner as a limited way to get in all the triples that VIVO depends on to accurately render pages -- looks at the types on an individual, for example rdf:type vivo:AcademicDepartment, and adds in the extra types such as foaf:Organization, and that incremental reasoning is done right away so that the page displays correctly right away

don’t ever see the statement in its uninferred state
just the statements that VIVO depends on get inferred right away

Pellet does real OWL reasoning on the ontology itself, while the simple reasoner does only the inferencing necessary for VIVO to display correctly -- a hybrid system
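
As a rough illustration of the incremental, listener-style reasoning described above, the sketch below keeps a set of inferred type triples in step with each added or removed assertion. It is a toy under stated assumptions, not the actual SimpleReasoner: the class name IncrementalTypeReasoner, its methods, and the one-entry hierarchy are all hypothetical, and only rdf:type subsumption is handled.

```python
RDF_TYPE = "rdf:type"


class IncrementalTypeReasoner:
    """Toy listener that updates inferred rdf:type triples on every change."""

    def __init__(self, subclass_of):
        self.subclass_of = subclass_of  # class -> list of direct superclasses
        self.assertions = set()
        self.inferences = set()

    def _ancestors(self, cls):
        found, stack = set(), [cls]
        while stack:
            for parent in self.subclass_of.get(stack.pop(), []):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    def _inferred_types_for(self, subject):
        """Recompute the inferred types of one subject from what is still asserted."""
        inferred = set()
        for s, p, o in self.assertions:
            if s == subject and p == RDF_TYPE:
                inferred |= {(s, RDF_TYPE, a) for a in self._ancestors(o)}
        return inferred - self.assertions

    def statement_added(self, triple):
        self.assertions.add(triple)
        self.inferences.discard(triple)  # an asserted triple is no longer "inferred"
        s, p, o = triple
        if p == RDF_TYPE:
            new = {(s, RDF_TYPE, a) for a in self._ancestors(o)}
            self.inferences |= new - self.assertions

    def statement_removed(self, triple):
        self.assertions.discard(triple)
        s, p, _ = triple
        if p == RDF_TYPE:
            # Drop this subject's inferred types, then rebuild them from the remainder.
            others = {t for t in self.inferences if not (t[0] == s and t[1] == RDF_TYPE)}
            self.inferences = others | self._inferred_types_for(s)


reasoner = IncrementalTypeReasoner({"vivo:AcademicDepartment": ["foaf:Organization"]})
dept = ("ex:dept1", RDF_TYPE, "vivo:AcademicDepartment")
reasoner.statement_added(dept)
```

Because the inferred triple is written in the same step as the assertion, a page rendered right after the edit already sees ex:dept1 as a foaf:Organization.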

Kind of different from the way some other applications use inferencing -- e.g. in a medical application, it may use inferencing to look for candidates of genes that might be implicated in some disease.

What is a reasoner and how is this used in VIVO?

2007 when running everything in memory -- fast but size was limited
Pellet is a complete OWL reasoner
Pellet still classifies the ontology (the triples in the T-Box) -- not really taken advantage of in the original VIVO ontology, but it matters for other OWL ontologies that are not designed in the same way, including some parts of ISF
when moved away from everything in memory… no longer sending all individuals (people, orgs, pubs, etc) to Pellet, just the ontology management
newer VIVO SimpleReasoner (class name?) for a hybrid solution -- a limited way of getting in particular triples that VIVO depends on to accurately render pages

What are the different types of inferencing?
What is reinferencing?

When you start a reinferencing, Pellet is not doing anything, no changes in T-box -- just the Simple Reasoner
It’s just like rebuilding your search index -- the idea of a listener infrastructure is that hopefully you only need a full re-indexing or re-inferencing if your database got corrupted somehow

but with the Harvester, a lot of changes were introduced straight into the database, without VIVO being able to listen, so this became something that had to be done much more frequently than we originally thought

the simple reasoner has never been optimized to reason on the entire database --

does new inserts instantly
but when statements are deleted, it kicks off a small batch job to accomplish the deletion
the effort to date has been to optimize the experience of interactive editing rather than support large batch operations
to the simple reasoner, redoing the entire thing for the entire database looks as if the entire database has been rebuilt from scratch

so the simple reasoner adds these triples into a temporary graph and at the end tries to reconcile that against the main application graph
it does a lot of scut work that it wouldn’t need to do if the complete reinferencing were re-designed to do that entire replacement
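
One way to picture the reconciliation step is as a set difference between the inferences currently stored and the freshly recomputed ones, so that only the differences are written. The sketch below is an assumption-laden illustration with triples as Python tuples; plan_reconcile is a hypothetical name and does not reflect how the simple reasoner is actually coded.

```python
def plan_reconcile(current_inferences, recomputed_inferences):
    """Return (to_add, to_remove) so that only changed triples are written."""
    current = set(current_inferences)
    recomputed = set(recomputed_inferences)
    return recomputed - current, current - recomputed


# Hypothetical data: one inference is stale, one is new, one is unchanged.
current = {
    ("ex:dept1", "rdf:type", "foaf:Organization"),
    ("ex:old", "rdf:type", "foaf:Person"),
}
recomputed = {
    ("ex:dept1", "rdf:type", "foaf:Organization"),
    ("ex:new", "rdf:type", "foaf:Person"),
}
to_add, to_remove = plan_reconcile(current, recomputed)
```

When the base assertions have barely changed, both result sets stay small, which is where the saving over a full copy would come from.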

the reinference of your database does not use Pellet

we assume your database is frozen at that point and is responding to what’s there -- Pellet is just at a temporary steady state with its knowledge of which classes are subclasses of which other classes and which properties are sub-properties -- that just sits in the TBox model and is not changed

Patrick -- if you had changed the ontology since the last re-inferencing, would Pellet know about it?

Brian -- Pellet would reflect the changes to the ontology in the model the simple reasoner uses, but if you change the ontology in the middle of reinferencing the data, would not be sure of the results
Patrick -- we put in our own extra triples so don’t need the simple reasoner
Brian -- would be helpful to be able to configure more granularly

e.g., you can now switch on or off whether it pays attention to sameAs statements
could potentially turn off inverse property reasoning if you want to insert your own inverse statements
that would give people more flexibility

Brian -- has opened some issues for 1.8 and hopes to take care of them in the next couple weeks

doesn’t look too bad to make some headway
two basic approaches

1 get rid of the temporary rebuild graph that the simple reasoner infers the current state in
if your base assertions haven’t changed very much it will only have to make the writes for what has changed rather than doing everything and having to copy it over
and 2, just batching up the inserts should have a relatively big effect

because it was set up to slot in new triples one-by-one, it should be a relatively simple task to put statements together into chunks, which should yield a significant improvement
on his dev machine was getting a fixed time of 25 ms to do an insert of one triple
now in batch each extra triple added into the batch only adds ⅓ or ½ millisecond

will be pursuing those two options and thinks that will help a lot with 1.8
still more that could be done
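
A back-of-the-envelope cost model for the batching figures quoted above. It assumes (this is an interpretation of the notes, not a measurement) that the 25 ms is a fixed cost paid once per insert call and that each additional triple in the same batch adds between 1/3 and 1/2 ms; the function names and the 0.5 ms default are illustrative.

```python
def one_by_one_ms(n_triples, fixed_ms=25.0):
    """Total time if every triple is inserted in its own call."""
    return n_triples * fixed_ms


def batched_ms(n_triples, batch_size, fixed_ms=25.0, per_extra_ms=0.5):
    """Total time if triples are grouped into batches of batch_size.

    Each batch pays fixed_ms once, plus per_extra_ms for every triple after the first.
    """
    full_batches, remainder = divmod(n_triples, batch_size)
    total = full_batches * (fixed_ms + (batch_size - 1) * per_extra_ms)
    if remainder:
        total += fixed_ms + (remainder - 1) * per_extra_ms
    return total


# Example: compare the two strategies for 1,000 triples.
single = one_by_one_ms(1000)
batch = batched_ms(1000, batch_size=1000)
speedup = single / batch
```

Under these assumptions, 1,000 single inserts cost 25,000 ms, while one batch of 1,000 costs 25 + 999 × 0.5 = 524.5 ms, so the fixed per-call cost dominates the one-by-one approach.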

Jim -- are inferencing and reinferencing the same thing?

if I’m using the SPARQL update API and am adding 300K triples, that is inferencing -- not a full re-inferencing
The use of the temporary model only applies to re-inferencing, so we won’t get advantage from removing it
will the batching yield benefit with the API?

Brian -- if the batching does prove much more efficient we should try it in other circumstances than reinferencing, such as the API

but start there
Jim -- so you have a path forward to improving the full re-inferencing, but not to the incremental addition of triples, which can still be large

Could the batching be done in the RDF API, or is it more upstream?

Brian -- not immediately obvious how that would be done in the RDF service, where it’s more complex
Jim -- not too difficult to handle with “add RDF”
but with the SPARQL API we’ve ceded a lot of the parsing to Jena, since it has to interpret the SPARQL

This “Simple Reasoner” we are talking about -- is it implemented here? edu.cornell.mannlib.vitro.webapp.reasoner.SimpleReasoner
when you reinference, does it start from scratch? (clear the kb-inf?)

Are there different types of reinferencing? No -- seems to be a single form that relies on SimpleReasoner
When should you do inferencing and/or reinferencing?
How do you start it? from the Site Admin menu
Where are the inferences stored? kb-inf graph?
Curious why mysql from the beginning rather than a dedicated triple store?
Are there any special considerations for how a third party triple store could handle reinferencing?

Jon: Probably yes, but we don’t have a lot of experience
Brian: I’ve used Sesame with materialized triples generated from reinferencing.
With OWLIM it does a similar kind of materializing of inferred statements. For the most part, it works nicely.
One gotcha is that Vitro uses something called mostSpecificType but Sesame uses a different property for that concept -- so have to make that configurable
and other triple stores may not do the most specific type piece
you might need to have VIVO handle that bit of reasoning but have the triple store do 90% of the class subsumption reasoning
might get data repeated on the page inadvertently -- as has happened in the past when a more capable reasoner was thrown at the data

super-properties inferred as well -- e.g., hasPart when you are using sub-properties like hasSubOrganization
can be addressed through more sophisticated application logic, as is done with the “faux” properties in VIVO as a first step in an application ontology

we need to experiment with the capabilities of a number of external triple stores to see what advantages or disadvantages they offer
and we will need to modify VIVO to have its inferencing much more configurable
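
If a triple store handles class subsumption but not the most-specific-type piece mentioned above, that bit could in principle be computed separately. The sketch below shows the idea on plain Python data, assuming an acyclic class hierarchy; the function name and the example hierarchy are hypothetical and this is not how Vitro implements mostSpecificType.

```python
def most_specific_types(types, subclass_of):
    """Keep only the types that are not a superclass of another type in the set."""

    def ancestors(cls):
        found, stack = set(), [cls]
        while stack:
            for parent in subclass_of.get(stack.pop(), []):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    types = set(types)
    covered = set()
    for t in types:
        covered |= ancestors(t)  # everything implied by a more specific type
    return types - covered


# Illustrative hierarchy and a fully materialized set of types for one individual.
hierarchy = {
    "vivo:AcademicDepartment": ["foaf:Organization"],
    "foaf:Organization": ["foaf:Agent"],
}
all_types = {"vivo:AcademicDepartment", "foaf:Organization", "foaf:Agent"}
specific = most_specific_types(all_types, hierarchy)
```

Here the triple store would supply all_types through its own reasoning, and the application would only narrow them down to the most specific ones.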

what is a faux property?

they are properties never defined in your ontology or in your database, but a contextual configuration for a real property
e.g., the “bearer of” property links a person to a role, and we can break that out in VIVO with a context-specific label such as “has investigator role” or “has leadership role” depending on the object involved
a little complex
Jim -- a faux property is not represented by a triple in the triple store? Yes, but it’s stored differently in the triple store than you might expect looking at the application
Jon -- there is a triple there, but under a very generic property
Jim -- the triple looks to be there for display purposes, and we don’t want to see both it and the more generic one
Stephan - ‘faux’ properties are just a poor implementation of the qualified relation pattern http://patterns.dataincubator.org/book/qualified-relation.html

(Jon, later) -- qualified relations look to be very much like the vivo:Relationship class, which is not the same thing as a faux property, which is about contextual labeling of a direct object property between two entities

Brian - Faux properties have nothing to do with how the data is actually modeled or the patterns used in the ontology. It’s just a way of configuring the VIVO application to apply a different label (or other settings such as rank position, editability, etc.) to a predicate in a certain context.
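
To make the idea of contextual labeling concrete, here is a small sketch in which a display label is chosen from the predicate plus the type of the object. The ex: identifiers, the table layout, and display_label are all hypothetical; VIVO stores this configuration differently, and the sketch only mirrors the concept described above.

```python
# Hypothetical configuration: (real predicate, object type) -> context-specific label.
FAUX_PROPERTIES = {
    ("ex:bearerOf", "ex:InvestigatorRole"): "has investigator role",
    ("ex:bearerOf", "ex:LeadershipRole"): "has leadership role",
}

# Fallback labels for the real predicates themselves.
DEFAULT_LABELS = {"ex:bearerOf": "bearer of"}


def display_label(predicate, object_types):
    """Pick a context-specific label if one is configured, else the generic one."""
    for object_type in object_types:
        label = FAUX_PROPERTIES.get((predicate, object_type))
        if label is not None:
            return label
    return DEFAULT_LABELS.get(predicate, predicate)


investigator_label = display_label("ex:bearerOf", ["ex:InvestigatorRole"])
generic_label = display_label("ex:bearerOf", ["ex:SomeOtherRole"])
```

The stored triple uses the same generic predicate in both cases; only the label shown to the user changes with the context.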

Does all of this apply to both Vitro and VIVO, or is there inferencing/reinferencing specific source code/logic for VIVO (separate from anything specified in the VIVO-ISF ontology itself)?

not seeing any direct references to SimpleReasoner in VIVO source code project, so inferencing/reinferencing seems to be fully handled by core Vitro code
Could we bring up GitHub and do a quick drive-by of some of the directories or classes involved here?

Patrick -- in our Ruby ingest, we map faux properties to a human-readable string

Bugs and fixes

Performance
Brian Lowe: "I spent some time yesterday investigating the benefit of doing the inferred triple inserts in larger batches rather than one-by-one, and at least on my machine this looks like it will offer a very significant improvement in speed. I opened a few issues for myself for 1.8. As part of the batching change, I’ll also modify it so it uses the RDFService directly for getting access to the triple store, instead of going through the additional legacy Jena model layer. This should avoid waits for model locks."

What are some efficiencies to be had?

Duke’s workflow: https://gist.github.com/patrickmcelwee/7377676

Notable List Traffic

See the vivo-dev-all archive and vivo-imp-issues archive for complete email threads

Call-in Information

Calls are held every Thursday at 1 pm eastern time – convert to your time at http://www.thetimezoneconverter.com

Date: Every Thursday, no end date
Time: 1:00 pm, Eastern Daylight Time (New York, GMT-04:00)
Meeting Number: 641 825 891

To join the online meeting

Go to https://cornell.webex.com/cornell/e.php?AT=WMI&EventID=167096322&RT=MiM2
If requested, enter your name and email address.
Click "Join".

1. Call in to the meeting:

1-855-244-8681 (Call-in toll-free number (US/Canada))

1-650-479-3207 (Call-in toll number (US/Canada))

2. Enter the access code:

641 825 891 #

3. Enter your Attendee ID:

8173 #
