Shortcut to temporary Google doc for meeting notes: http://goo.gl/tCuNFm

Announcements

  • Ontology Working Group: next call November 4 at 1pm ET

    • Mailing list discussion about next week’s call -- main topic: the move to GitHub, with Shahim describing the ISF’s multiple ontology files vs. the single VIVO OWL file

    • On the Tuesdays when there is not a call, there are Ontology Office Hours using the same call-in information as the main Ontology calls

  • Apps & Tools

    • November 18 -- Josh Hanna from UF will detail how he hooked a test VIVO 1.7 vagrant at UF up to an alternative triple store (StarDog)

      • have to get your own download of StarDog in a zip file and then can use Vagrant to build one

      • Josh has put all 18 million triples in

        • SPARQL queries that ran in the past are faster

      • Paul -- Eliza did the same thing with Virtuoso

        • working on tuning

        • going from 300 to 500 publications quintuples the query time

        • SPARQL queries are 5-10x as fast

      • Chris - could Eliza join the call on the 18th

    • Chris -- have produced an ntriples file of the UF dataset so have been able to produce a coverage document detailing which classes and properties are actually populated -- come to the ontology call to hear more

  • VIVO Hackathon - any other thoughts here? updates? questions?

  • VIVO Implementation Fest hold the dates -- please hold the week of March 16-20, 2015 as the date window for the 2015 implementation fest -- hosted by Oregon Health and Science University in Portland, OR

Theme: Inferencing and Reinferencing in VIVO

Guest presenter: Brian Lowe

 
  • Introduction

    • What is inferencing?

      • Brian -- will start at a basic level

      • A broad topic that can be used for a lot of things

      • how we use it in VIVO is very simple

      • basically, if you make a certain assertion, an inference is the ability to then say that other statements are true

      • e.g., if you assert that something is an Academic Department, a reasoner can infer that it is also a foaf:Organization

      • not very exciting, but very useful -- when rendering a page of all organizations, you don’t have to cycle through all the different type assertions to figure out which types map to foaf:Organization, but can just look for the rdf:type foaf:Organization statements that have been added by the inferencing process

      • these extra triples get added to a separate graph in the Jena triplestore, called the inference graph -- this is referred to as the inferences being “materialized” into the graph

      • allows you to get what you’re looking for more quickly

      • in the biomedical world you may be doing much more advanced reasoning, such as working off of associations with genes to determine what diseases may be implicated by certain symptoms in organisms

      • Similar to how Solr is used… inferencing makes it faster to access information... whenever triples are added through VIVO interfaces (user interface editing, API ingest) the reasoner is used to infer additional triples

        • If you only had a single RDF type triple, you would need more complex queries to browse or search in VIVO -- inferred RDF type triples simplify these queries

        • Solr is fundamentally doing this in a very different way, however -- it’s flattening the information in the graph into “documents”

      • A separate piece of reasoning in VIVO related to the ontology

        • Used in part to determine what properties should appear on a page for a foaf:Person rather than for a bibo:Document

        • The code that decides what properties to display on the page is dependent on the TBox reasoning -- the entire ontology as loaded into memory

        • It’s sort of arbitrary -- in OWL it’s possible to make an assertion about anything, including statements that are inconsistent with the ontology

        • But OWL and RDF are not like a schema language

      • One chunk of code that uses the Pellet reasoner

        • dates back to the first fully semantic version of VIVO that had an in-memory ontology model - all queries ran against the in-memory store but needed to be able to persist that on disk (in a database, via Jena)

        • had to move away from this when memory became a problem

        • was logical to employ a reasoner that would work in memory

          • a listener would send the triples added to or removed from the model over to the Pellet reasoner and its knowledge base

          • as those inferences were added or removed via Pellet, those were synchronized with the database store in the background

            • when editing, you would see your direct assertions first, and then see additional statements added by Pellet in a few seconds

          • we limited what we sent Pellet to do -- e.g., not data property reasoning

        • When we moved inferencing of all the data out of memory, we retained the reasoning of the ontology itself

          • what is a subclass of other classes

          • because OWL is description logic, set up to define a class and what properties it has, the reasoner can then figure out where that class fits in the class tree

          • Pellet does that reasoning using all the axioms, many of which are now added in other parts of the ISF -- that piece is still necessary

          • But the triples about all the data can’t be sent to Pellet because there isn’t enough memory for that

          • We created the VIVO simple reasoner as a limited way to get in all the triples that VIVO depends on to accurately render pages -- looks at the types on an individual, for example rdf:type vivo:AcademicDepartment, and adds in the extra types such as foaf:Organization, and that incremental reasoning is done right away so that the page displays correctly right away

            • don’t ever see the statement in its uninferred state

            • just the statements that VIVO depends on get inferred right away

          • Pellet does real OWL reasoning on the ontology itself, while the simple reasoner does only the inferencing necessary for VIVO to display correctly -- a hybrid system

      • Kind of different from the way some other applications use inferencing -- e.g. in a medical application, it may use inferencing to look for candidates of genes that might be implicated in some disease.
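The type inference described above can be sketched in a few lines. This is only an illustration in plain Python, not VIVO's Java implementation: the small class hierarchy, the `ex:chem` individual, and the function names are made up for the example, and triples are modeled as tuples rather than Jena statements.

```python
# Hypothetical TBox fragment: each class maps to its direct superclasses.
SUBCLASS_OF = {
    "vivo:AcademicDepartment": {"vivo:Department"},
    "vivo:Department": {"foaf:Organization"},
    "foaf:Organization": {"foaf:Agent"},
}

def superclasses(cls):
    """Return every direct and indirect superclass of cls."""
    found = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def materialize_types(assertions):
    """Compute inferred rdf:type triples, kept apart from the assertions."""
    inferences = set()
    for s, p, o in assertions:
        if p == "rdf:type":
            for parent in superclasses(o):
                inferred = (s, "rdf:type", parent)
                if inferred not in assertions:
                    inferences.add(inferred)
    return inferences

# The asserted data and the materialized inferences live in separate sets,
# mirroring the separate assertion and inference graphs.
assertion_graph = {("ex:chem", "rdf:type", "vivo:AcademicDepartment")}
inference_graph = materialize_types(assertion_graph)

# Finding all organizations is now a single pattern match over both graphs.
organizations = {
    s for s, p, o in assertion_graph | inference_graph
    if p == "rdf:type" and o == "foaf:Organization"
}
```

With this toy hierarchy, `inference_graph` ends up holding three extra rdf:type triples for `ex:chem` (Department, Organization, Agent), so a page listing organizations never needs to know about the more specific classes.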

    • What is a reasoner and how is this used in VIVO?

      • 2007 when running everything in memory -- fast but size was limited

      • Pellet is a complete OWL reasoner

      • Pellet still classifies the ontology (the triples in the T-Box) -- not really taken advantage of in the original VIVO ontology, but other OWL ontologies, including some parts of the ISF, are not designed in the same way and do rely on it

      • when moved away from everything in memory… no longer sending all individuals (people, orgs, pubs, etc) to Pellet, just the ontology management

      • newer VIVO SimpleReasoner (class name?) for a hybrid solution -- a limited way of getting in particular triples that VIVO depends on to accurately render pages
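The listener idea behind the incremental reasoning can be illustrated roughly as follows. This is a simplified sketch, not the actual SimpleReasoner code: the class name `InferenceListener`, its methods, and the precomputed superclass table are all assumptions made for the example. Additions are handled immediately, while a removal has to check whether a remaining assertion still supports each inference (done here by a brute-force recomputation).

```python
class InferenceListener:
    """Toy listener that keeps an inference set in step with edits."""

    def __init__(self, superclasses_of):
        # Assumed input: class -> set of ALL its superclasses (already closed).
        self.superclasses_of = superclasses_of
        self.assertions = set()
        self.inferences = set()

    def _entailed(self, triple):
        s, p, o = triple
        if p != "rdf:type":
            return set()
        return {(s, p, sup) for sup in self.superclasses_of.get(o, ())}

    def added(self, triple):
        """React to a new assertion by adding its inferences right away."""
        self.assertions.add(triple)
        self.inferences |= self._entailed(triple) - self.assertions
        self.inferences.discard(triple)  # it is now asserted directly

    def removed(self, triple):
        """React to a retraction; keep only inferences still supported."""
        self.assertions.discard(triple)
        still_entailed = set()
        for remaining in self.assertions:
            still_entailed |= self._entailed(remaining)
        self.inferences = still_entailed - self.assertions


closure = {
    "vivo:AcademicDepartment": {"vivo:Department", "foaf:Organization"},
    "vivo:Department": {"foaf:Organization"},
}
listener = InferenceListener(closure)
listener.added(("ex:chem", "rdf:type", "vivo:AcademicDepartment"))
listener.added(("ex:chem", "rdf:type", "vivo:Department"))
listener.removed(("ex:chem", "rdf:type", "vivo:AcademicDepartment"))
```

After the removal, the foaf:Organization inference for `ex:chem` is still present because the remaining vivo:Department assertion supports it. That extra bookkeeping is one reason deletions cost more than inserts.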

    • What are the different types of inferencing?

    • What is reinferencing?

      • When you start a reinferencing, Pellet is not doing anything, no changes in T-box  -- just the Simple Reasoner

      • It’s just like rebuilding your search index -- the idea of a listener infrastructure is that hopefully you only need a full re-indexing or re-inferencing if your database got corrupted somehow

        • but with the Harvester, a lot of changes were introduced straight into the database, without VIVO being able to listen, so this became something that had to be done much more frequently than we originally thought

      • the simple reasoner has never been optimized to reason on the entire database --

        • does new inserts instantly

        • but when statements are deleted, it kicks off a small batch job to accomplish the deletion

        • the effort to date has been to optimize the experience of interactive editing rather than support large batch operations

        • redoing the entire thing for the entire database looks to the simple reasoner as though the entire database has been rebuilt from scratch

          • so the simple reasoner adds these triples into a temporary graph and at the end tries to reconcile that against the main application graph

          • it does a lot of scut work that it wouldn’t need to do if the complete reinferencing were redesigned to do that entire replacement

        • the reinference of your database does not use Pellet

          • we assume your database is frozen at that point and is responding to what’s there -- Pellet is just at a temporary steady state with its knowledge of which classes are subclasses of which other classes and which properties are sub-properties -- that just sits in the TBox model and is not changed

        • Patrick -- if you had changed the ontology since the last re-inferencing, would Pellet know about it?

          • Brian -- Pellet would reflect the changes to the ontology in the model the simple reasoner uses, but if you change the ontology in the middle of reinferencing the data, would not be sure of the results

          • Patrick -- we put in our own extra triples so don’t need the simple reasoner

          • Brian -- would be helpful to be able to configure more granularly

            • e.g., you can now switch on or off whether it pays attention to sameAs statements

            • could potentially turn off inverse property reasoning if you want to insert your own inverse statements

            • that would give people more flexibility

        • Brian -- has opened some issues for 1.8 and hopes to take care of them in the next couple weeks

          • doesn’t look too bad to make some headway

          • two basic approaches

            • 1 get rid of the temporary rebuild graph that the simple reasoner infers the current state in

            • if your base assertions haven’t changed very much it will only have to make the writes for what has changed rather than doing everything and having to copy it over

            • and 2, just batching up the inserts should have a relatively big effect

              • because it was set up to slot in new triples one-by-one, it should be a relatively simple task to put statements together into chunks, which should yield a significant improvement

              • on his dev machine was getting a fixed time of 25 ms to do an insert of one triple

              • now in batch each extra triple added into the batch only adds ⅓ or ½ millisecond

            • will be pursuing those two options and thinks that will help a lot with 1.8

            • still more that could be done
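A rough back-of-the-envelope model shows why batching could matter so much. The sketch below assumes a linear cost model that was not stated in the discussion: every write pays the fixed 25 ms reported above, and each additional triple in the same batch adds about 0.5 ms (the upper figure quoted). The batch size of 1,000 and the 300,000-triple workload are arbitrary illustrative choices, and the timings come from a single dev machine.

```python
FIXED_MS = 25.0      # reported cost of inserting a single triple
PER_EXTRA_MS = 0.5   # reported upper estimate for each extra triple in a batch

def one_by_one_ms(n_triples):
    """Total time if every triple is written individually."""
    return n_triples * FIXED_MS

def batched_ms(n_triples, batch_size):
    """Total time if triples are grouped into batches (assumed linear model)."""
    full, rest = divmod(n_triples, batch_size)
    total = full * (FIXED_MS + (batch_size - 1) * PER_EXTRA_MS)
    if rest:
        # A final, partially filled batch still pays the fixed cost once.
        total += FIXED_MS + (rest - 1) * PER_EXTRA_MS
    return total

n = 300_000
single = one_by_one_ms(n)      # 7,500,000 ms, about 125 minutes
batched = batched_ms(n, 1_000) # 157,350 ms, under 3 minutes
```

Under these assumptions the speedup approaches 25 / 0.5 = 50x as batches grow, since the fixed cost is spread over more and more triples. Real numbers will depend on the triple store and hardware.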

        • Jim -- are inferencing and reinferencing the same thing?

          • if I’m using the SPARQL update API and am adding 300K triples, that is inferencing -- not a full re-inferencing

          • The use of the temporary model only applies to re-inferencing, so we won’t get advantage from removing it

          • will the batching yield benefit with the API?

        • Brian -- if the batching does prove much more efficient we should try it in other circumstances than reinferencing, such as the API

          • but start there

          • Jim -- so you have a path forward to improving the full re-inferencing, but not to the incremental addition of triples, which can still be large

        • Could the batching be done in the RDF API, or is it more upstream?

          • Brian -- not immediately obvious how that would be done in the RDF service, where it’s more complex

          • Jim -- not too difficult to handle with “add RDF”

          • but with the SPARQL API we’ve ceded a lot of the parsing to Jena, since it has to interpret the SPARQL

      • This “Simple Reasoner” we are talking about -- is it implemented here? edu.cornell.mannlib.vitro.webapp.reasoner.SimpleReasoner

      • when you reinference, does it start from scratch? (clear the kb-inf?)
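The idea of writing only what has changed during a full reinference, rather than rebuilding and copying everything, can be sketched with set differences. This is an illustration of the general approach only, not a description of how SimpleReasoner works today or will work in 1.8. The function names, the `SUPERCLASSES` table, and the sample data are hypothetical.

```python
def full_reinference(assertions, infer):
    """Recompute every inference from the asserted triples."""
    rebuilt = set()
    for triple in assertions:
        rebuilt |= infer(triple)
    return rebuilt - assertions

def reconcile(current_inferences, rebuilt_inferences):
    """Return only the writes needed to bring the stored graph up to date."""
    to_add = rebuilt_inferences - current_inferences
    to_remove = current_inferences - rebuilt_inferences
    return to_add, to_remove

# Hypothetical rule table and data for the example.
SUPERCLASSES = {"vivo:AcademicDepartment": {"foaf:Organization"}}

def infer(triple):
    s, p, o = triple
    if p != "rdf:type":
        return set()
    return {(s, p, sup) for sup in SUPERCLASSES.get(o, ())}

assertions = {
    ("ex:chem", "rdf:type", "vivo:AcademicDepartment"),
    ("ex:phys", "rdf:type", "vivo:AcademicDepartment"),
}
stored = {
    ("ex:chem", "rdf:type", "foaf:Organization"),
    ("ex:old", "rdf:type", "foaf:Organization"),  # stale: no supporting assertion
}

rebuilt = full_reinference(assertions, infer)
to_add, to_remove = reconcile(stored, rebuilt)
```

Here only one triple needs to be written (for `ex:phys`) and one removed (for `ex:old`); the already-correct inference for `ex:chem` is left untouched.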

    • Are there different types of reinferencing? No -- seems to be a single form that relies on SimpleReasoner

    • When should you do inferencing and/or reinferencing?

    • How do you start it? from the Site Admin menu

    • Where are the inferences stored? kb-inf graph?

    • Curious why mysql from the beginning rather than a dedicated triple store?

    • Are there any special considerations for how a third party triple store could handle reinferencing?

      • Jon: Probably yes, but we don’t have a lot of experience

      • Brian: I’ve used Sesame with materialized triples generated from reinferencing.

      • OWLIM does a similar kind of materializing of inferred statements. For the most part, it works nicely.

      • One gotcha is that Vitro uses something called mostSpecificType but Sesame uses a different property for that concept -- so have to make that configurable

      • and other triple stores may not do the most specific type piece

      • you might need to have VIVO handle that bit of reasoning but have the triple store do 90% of the class subsumption reasoning

      • might get data repeated on the page inadvertently -- as has happened in the past when we have thrown a more capable reasoner at it

        • super-properties inferred as well -- e.g., hasPart when you are using sub-properties like hasSubOrganization

        • can be addressed through more sophisticated application logic, as is done with the “faux” properties in VIVO as a first step in an application ontology

      • we need to experiment with the capabilities of a number of external triple stores to see what advantages or disadvantages they offer

      • and we will need to modify VIVO to have its inferencing much more configurable
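The "most specific type" piece mentioned above is simple to state even if a given triple store does not provide it. The sketch below is a generic illustration, not Vitro's code: it assumes a precomputed table mapping each class to all of its superclasses (with no equivalent-class cycles), and the function name is made up for the example.

```python
def most_specific_types(types, superclasses_of):
    """Keep the types that are not a superclass of another type in the set.

    Assumes superclasses_of maps each class to ALL of its superclasses.
    """
    covered = set()
    for t in types:
        covered |= superclasses_of.get(t, set())
    return set(types) - covered


closure = {
    "vivo:AcademicDepartment": {"vivo:Department", "foaf:Organization"},
    "vivo:Department": {"foaf:Organization"},
}
all_types = {"vivo:AcademicDepartment", "vivo:Department", "foaf:Organization"}
specific = most_specific_types(all_types, closure)
```

For this individual the result is just vivo:AcademicDepartment. If an external store materializes the full set of types, the application could still compute this last step itself.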

    • what is a faux property?

      • they are properties never defined in your ontology or in your database, but a contextual configuration for a real property

      • e.g., the “bearer of” property that links a person to a role, so that we can break that out in VIVO with a context-specific label such as “has investigator role” or “has leadership role” depending on the object involved

      • a little complex

      • Jim -- a faux property is not represented by a triple in the triple store? Yes, but it’s stored differently in the triple store than you might expect looking at the application

      • Jon -- there is a triple there, but under a very generic property

      • Jim -- the triple looks to be there for display purposes, and we don’t want to see both it and the more generic one

      • Stephan - ‘faux’ properties are just a poor implementation of the qualified relation pattern http://patterns.dataincubator.org/book/qualified-relation.html 

        • (Jon, later) -- qualified relations look to be very much like the vivo:Relationship class, which is not the same thing as a faux property, which is about contextual labeling of a direct object property between two entities

      • Brian - Faux properties have nothing to do with how the data is actually modeled or the patterns used in the ontology.  It’s just a way of configuring the VIVO application to apply a different label (or other settings such as rank position, editability, etc.) to a predicate in a certain context.
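One way to picture a faux property is as a lookup keyed on a real property plus its context, returning display settings. The sketch below is purely illustrative and is not how Vitro stores this configuration: the `ex:` identifiers, the dictionary layout, and the `display_label` function are hypothetical, and only the label setting is modeled.

```python
# Hypothetical configuration: (real property, type of the object) -> label.
FAUX_PROPERTIES = {
    ("ex:bearerOf", "ex:InvestigatorRole"): "has investigator role",
    ("ex:bearerOf", "ex:LeaderRole"): "has leadership role",
}

# Default labels for the real properties themselves.
BASE_LABELS = {"ex:bearerOf": "bearer of"}

def display_label(prop, object_types):
    """Pick a context-specific label if one is configured, else the default."""
    for obj_type in object_types:
        label = FAUX_PROPERTIES.get((prop, obj_type))
        if label:
            return label
    return BASE_LABELS.get(prop, prop)


investigator = display_label("ex:bearerOf", ["ex:InvestigatorRole"])
generic = display_label("ex:bearerOf", ["ex:SomeOtherRole"])
```

The stored triple uses the generic property in both cases. Only the label shown in the application changes with the type of the object.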

    • Does all of this apply to both Vitro and VIVO, or is there inferencing/reinferencing specific source code/logic for VIVO (separate from anything specified in the VIVO-ISF ontology itself)?

      • not seeing any direct references to SimpleReasoner in VIVO source code project, so inferencing/reinferencing seems to be fully handled by core Vitro code

      • Could we bring up GitHub and do a quick drive-by of some of the directories or classes involved here?

    • Patrick -- in our Ruby ingest, we map faux properties to a human-readable string

  • Bugs and fixes

    • Performance

    • Brian Lowe: "I spent some time yesterday investigating the benefit of doing the inferred triple inserts in larger batches rather than one-by-one, and at least on my machine this looks like it will offer a very significant improvement in speed.  I opened a few issues for myself for 1.8.  As part of the batching change, I’ll also modify it so it uses the RDFService directly for getting access to the triple store, instead of going through the additional legacy Jena model layer.  This should avoid waits for model locks."

  • What are some efficiencies to be had?

 

Notable List Traffic

See the vivo-dev-all archive and vivo-imp-issues archive for complete email threads

Call-in Information

Calls are held every Thursday at 1 pm eastern time – convert to your time at http://www.thetimezoneconverter.com

  • Date: Every Thursday, no end date
  • Time: 1:00 pm, Eastern Daylight Time (New York, GMT-04:00)
  • Meeting Number: 641 825 891

To join the online meeting

1. Call in to the meeting:

   1-855-244-8681 (Call-in toll-free number (US/Canada))

   1-650-479-3207 (Call-in toll number (US/Canada))

2. Enter the access code:

   641 825 891 #

3. Enter your Attendee ID:

   8173 #
