Introduction

The LD4L ontology team formed very early in the project and has met on a weekly basis for discussions on a wide range of topics, from proposing possible use cases to reviewing the ontology aspects of use cases proposed by other teams to discussing the specifics of how best to represent the data coming from our three library catalogs and from other internal and external sources.

Team members are listed on the Working Groups page, and the team has benefited from the addition of new members including strong representation from the technical services and metadata departments of Cornell, Harvard, and Stanford.

Several principles have guided our discussions and influenced the projects selected by ontology team members.  The group early on confirmed the intention stated in the proposal to reuse appropriate parts of currently available ontologies rather than building a new, self-contained ontology for LD4L. While there are advantages to working from a blank slate, we believe it makes eminent sense for a project focused on linked data to draw as much as possible on existing ontologies that already have achieved significant adoption or show promise for doing so.

At the outset the ontology team recognized the existence of a great deal of prior art in the form of published ontologies and significant ongoing ontology initiatives addressing the representation of bibliographic information in RDF. Elements of the Bibliographic Ontology and FaBiO had already been incorporated into the VIVO-ISF ontology and were familiar to team members, and Paolo Ciccarese from Harvard was a principal FaBiO contributor. The BIBFRAME initiative at the Library of Congress addresses the representation of MARC metadata in RDF, while OCLC has worked to extend the Schema.org ontology as a bridge between the library community and the Web.

From the LD4L proposal

SRSIS Ontology

Because no existing ontology supports the range of entities and relationship that SRSIS will encompass, we will use the Protégé ontology editor to develop a SRSIS ontology framework that reuses appropriate parts of currently available ontologies while introducing extensions and additions where necessary.  The framework will be based on and remain compatible with the existing VIVO and emerging research dataset and research resource ontology work. It will be sufficiently expressive to encompass traditional catalog metadata from both Cornell and Harvard, the basic linked data elements described in the Stanford Linked Data Workshop Technology Plan, and the usage and other contextual elements from StackLife. The ontology will capture a series of basic concepts and be structured as modules that draw inspiration from and reuse existing ontology classes and properties where appropriate, such as the Semantic Publishing and Referencing ontologies, and that also support arbitrary system-wide refinement, including local extensions.

Ontology team activities to date

Strings to things

Connecting library metadata with linked data 'in the wild' is a central goal of the LD4L project.  To that end much of the ontology team's work has focused on identifying external authorities, stable identifiers (preferably URIs), and sources and services capable of linking the people, places, organizations, events, and subject headings in library metadata to real world entities. In some cases existing metadata, both MARC and non-MARC, includes references to local or external authorities, but the vast majority of potentially identifiable entities are represented only as strings of characters. Some of our catalog records have been linked to Library of Congress, OCLC (including the VIAF international authority file), or ISNI identifiers through contracts or internal record enhancement projects, and an unrelated project at Harvard has focused on entity recognition within Encoded Archival Description collections. The need to extend from authority file links or a registry of named entities to resolvable URIs compatible with linked data has motivated several LD4L investigations, some focusing on quality and others on the efficacy of existing services.
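As a rough illustration of the "strings to things" idea, the sketch below normalizes a heading string and looks it up in a small in-memory table of identifiers. It is a minimal sketch, not the team's method: the AUTHORITY table, the example.org URIs, and the function names normalize and link are all assumptions made for illustration, and a real workflow would query an external authority service rather than a hard-coded dictionary.

```python
import unicodedata

# Hypothetical authority table mapping normalized labels to URIs.
# The example.org URIs are placeholders, not real authority identifiers.
AUTHORITY = {
    "twain, mark": "http://example.org/authority/person/1",
    "cornell university": "http://example.org/authority/org/2",
}


def normalize(label):
    """Strip accents, lowercase, collapse whitespace, drop trailing punctuation."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    collapsed = " ".join(no_accents.lower().split())
    return collapsed.rstrip(" .,;:")


def link(label):
    """Return a URI for the string if the table knows it, otherwise None."""
    return AUTHORITY.get(normalize(label))


if __name__ == "__main__":
    for heading in ["Twain,  Mark.", "CORNELL UNIVERSITY", "Unknown Person"]:
        print(heading, "->", link(heading))
```

Strings with no match return None, which keeps the unlinked majority of entities visible for later review instead of guessing at an identifier.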

Converting MARC to RDF

For MARC metadata, the team has worked with the Library of Congress BIBFRAME converter as a central component in a workflow that may include pre-processing to address variations in local MARC cataloging practice and in most cases will also require post-processing to produce data ready for consumption and interoperability with other linked data on the Web. While the conversions to BIBFRAME of a range of some 30 record types have been explored in concert with technical services staff at our three libraries, the ontology team has focused primarily on the availability and representation of data pertinent to the LD4L use cases rather than analyzing converter output across the board to ascertain completeness and correctness.
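The three-stage workflow described above (pre-processing, conversion, post-processing) can be sketched as a simple function pipeline. This is only an illustrative outline under stated assumptions: records are simplified to dictionaries keyed by MARC tag, stand_in_convert is a placeholder and does not reflect the interface of the Library of Congress converter, and the example.org URIs are invented.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]          # simplified stand-in for a MARC record
Triple = Tuple[str, str, str]    # (subject, predicate, object)


def preprocess(record: Record) -> Record:
    """Smooth over local cataloging variation; here, just trim whitespace."""
    return {tag: value.strip() for tag, value in record.items()}


def stand_in_convert(record: Record) -> List[Triple]:
    """Placeholder for the external converter (hypothetical, not its real API)."""
    subject = "http://example.org/work/" + record["001"]
    return [(subject, "http://example.org/vocab/title", record["245"])]


def postprocess(triples: List[Triple]) -> List[Triple]:
    """Drop triples with empty values so consumers only see usable data."""
    return [t for t in triples if t[2]]


def run_pipeline(
    records: Iterable[Record],
    convert: Callable[[Record], List[Triple]],
) -> List[Triple]:
    """Apply pre-processing, conversion, and post-processing to each record."""
    output: List[Triple] = []
    for record in records:
        output.extend(postprocess(convert(preprocess(record))))
    return output


if __name__ == "__main__":
    sample = [{"001": "42", "245": "  Moby Dick "}, {"001": "43", "245": "   "}]
    for triple in run_pipeline(sample, stand_in_convert):
        print(triple)
```

Passing the converter in as a parameter keeps the local pre- and post-processing steps independent of whichever conversion tool is actually used.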

The classes and properties themselves in BIBFRAME, as well as some of their definitions, remain under active discussion on the BIBFRAME mailing list and in other venues.  With our project's strong focus on linking through to real world entities, we remain flexible in our interpretation and application of the BIBFRAME ontology, in some cases electing to use properties and/or classes in an LD4L namespace until consensus has been reached through later releases or community practice. Fundamental questions will persist about the distinction between information and real world entities, and about the conflict between a desire to retain all the information encoded in MARC records and the goal of allowing bibliographic metadata to interoperate more freely with other Web data.

Addressing complexity

Several levels of complexity may legitimately exist in parallel and be utilized based on the availability of data or the goals of an application.  This choice can be seen in the PROV-O ontology, where direct object properties have been paired with more complex options involving intermediate nodes that carry additional temporal or role information.  The related PAV (Provenance, Attribution, and Versioning) ontology offers a simpler set of classes and properties sufficient for many applications requiring only simple attribution.  Application software can also often mask a more complex underlying data model, and in many cases it may be preferable in production contexts to separate logging and provenance information from user-facing applications entirely.
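To make the two levels concrete, the sketch below represents triples as plain Python tuples and derives the direct PROV-O property prov:wasAttributedTo from the qualified pattern, in which an intermediate prov:Attribution node carries the role. The PROV-O terms are real, but the example.org identifiers and the helper name direct_attributions are assumptions for illustration only.

```python
# Namespaces for RDF and the W3C PROV-O ontology.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PROV = "http://www.w3.org/ns/prov#"

# Placeholder identifiers (example.org), not real LD4L data.
RECORD = "http://example.org/record/1"
AGENT = "http://example.org/agent/cataloger"
ATTRIBUTION = "http://example.org/attribution/1"
ROLE = "http://example.org/role/reviser"

# Qualified form: the intermediate node holds the agent and the role.
qualified = {
    (RECORD, PROV + "qualifiedAttribution", ATTRIBUTION),
    (ATTRIBUTION, RDF_TYPE, PROV + "Attribution"),
    (ATTRIBUTION, PROV + "agent", AGENT),
    (ATTRIBUTION, PROV + "hadRole", ROLE),
}


def direct_attributions(triples):
    """Collapse qualified attributions into direct prov:wasAttributedTo triples."""
    agent_of = {s: o for s, p, o in triples if p == PROV + "agent"}
    return {
        (s, PROV + "wasAttributedTo", agent_of[o])
        for s, p, o in triples
        if p == PROV + "qualifiedAttribution" and o in agent_of
    }


if __name__ == "__main__":
    for triple in sorted(direct_attributions(qualified)):
        print(triple)
```

An application that needs only simple attribution can consume the derived direct triples, while the qualified form remains available where role information matters.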

Working with non-MARC metadata

While our library catalogs are very likely the largest single sources of metadata, each partner university maintains a large number of digital collections that vary widely in subject domain, size, and complexity. Several of our use cases involve connecting catalog data with these non-MARC sources, not only to provide a more unified search interface, but to be able to interconnect and cross-reference sources that for now remain almost entirely separate.  The benefits go both ways, and the addition of sources outside the traditional library domain brings in yet more possibilities for value-added services enhanced by entity recognition and external links.  Prime examples of these non-library sources are Stanford's CAP, Harvard's Faculty Finder, and Cornell's VIVO.

Annotations and virtual collections

The first two use cases address user tagging and the ability of librarians or others to curate potentially very large collections of library resources through annotations external but linked to the bibliographic metadata.  Existing ontologies were identified that support annotations and the assembly and ordering of individual resources into collections.

Usage data

The fifth group of use cases explores including usage data to supplement library discovery interfaces and to inform collection review and additions. Here the team first explored a very granular model for capturing usage information from circulation-related events and other direct user interactions with library resources. On further investigation, however, this data proved not only difficult to come by but also fraught with privacy concerns, even when stripped of any directly identifying information.  Later discussions have focused on the compilation and use of a simple stack score as a measure potentially more comparable across institutions despite differences in size, discipline, population makeup, and other factors.

 

References

While this list is by no means exhaustive, the team has found these papers useful.

  • The Relationship between BIBFRAME and the OCLC's Linked-Data Model of Bibliographic Description: A Working Paper.  Carol Jean Godby, Senior Research Scientist, OCLC Research, September, 2013.  PDF