Background

Libraries worldwide rely upon Machine-Readable Cataloging (MARC)-based systems for the communication, storage, and expression of the majority of their bibliographic data. MARC, however, is an early communication format developed in the 1960s to enable machine manipulation of the bibliographic data previously recorded on catalog cards. Connections between data elements within a single catalog record, such as between a subject heading and a specific work or between a performer and the piece performed, are not easily expressed and therefore usually are not, because it is assumed that a human being will examine the record as a whole and make the associations between the elements. MARC itself was a great achievement, eliminating libraries’ dependence on card catalogs and moving them into a much-needed online environment. It allowed for the development of the Integrated Library System, or ILS, and great economy in the acquisition, cataloging, and discovery of library resources. But as libraries transition to a linked-data-based architecture that derives its power from extensive machine linking of individual data elements, this former reliance on human interpretation at the record level to make correct associations between individual data elements becomes a critical issue. And although MARC metadata can be converted to linked data, many human-inferred relationships are left unexpressed in the new environment. The converted data is functional, but incomplete. With each day of routine processing, libraries add to the backlog of MARC data that they will want to convert and enhance as linked data. In the last ten years, computer science has embraced the linked open data (LOD) pathway, which demands more semantic expression of data in support of machine inferencing. It has developed approaches and international standards that support the new environment: the use of identifiers to link data and the Resource Description Framework (RDF) for recording it.
Redevelopment of the platform for expressing and communicating bibliographic data is needed to move libraries more firmly into the internet and web environment.
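The difference between record-level description and explicitly linked statements can be sketched in a few lines of Python. This is an illustration only: the field names, the `http://example.org/` identifiers, and the `contains` and `performedBy` predicates are invented for the sketch and are not MARC tags, BIBFRAME terms, or any published vocabulary.

```python
# Hypothetical illustration only: the field names, URIs, and predicates
# below are invented for this sketch and are not MARC tags or BIBFRAME terms.

# Record-style description: performers and pieces sit side by side,
# and a human reader must infer who performed what.
flat_record = {
    "title": "Recital album",
    "performers": ["Performer A", "Performer B"],
    "pieces": ["Piece 1", "Piece 2"],
}

EX = "http://example.org/"  # placeholder namespace

# Linked-data-style description: each statement is a
# (subject, predicate, object) triple that names the relationship explicitly.
triples = [
    (EX + "album", EX + "contains", EX + "piece1"),
    (EX + "album", EX + "contains", EX + "piece2"),
    (EX + "piece1", EX + "performedBy", EX + "performerA"),
    (EX + "piece2", EX + "performedBy", EX + "performerB"),
]


def objects(subject, predicate, graph):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def performers_of(piece, graph):
    """Answer 'who performed this piece?' directly from the triples."""
    return objects(piece, EX + "performedBy", graph)
```

In the flat record nothing ties "Performer A" to "Piece 1"; in the triple form, `performers_of(EX + "piece1", triples)` can answer that question without human interpretation.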

The development of the digital library, often based upon a digital repository, has further complicated the library environment. In addition to their MARC data, libraries have become curators of rapidly expanding collections of digital objects, data sets, and metadata in other schemas such as the Metadata Object Description Schema (MODS). These resources and their metadata are typically stored in digital repositories and become a parallel, yet separate, database of record. This lack of integration has caused great difficulties in consistency and maintenance as the concept of a single database of record has broken down. And even beyond these two repositories (the ILS and the digital repository), as libraries look to the future, they will be asked to step outside these more traditional materials to become the curators of the vast knowledge the university creates, in all its richness and diversity. Interactive scholarly works, unpublished data sets, information about faculty contained in profiling systems, and metadata about learning objects, once integrated with more traditional library resources, will allow our faculty and students to explore our information resources and make associations that are impossible today.

In 2012, the Library of Congress (LC) began a project to end libraries’ isolation from the semantic web through the creation of a new communication format, called BIBFRAME, as a replacement for the MARC formats. The development of BIBFRAME has been complex, as its creators try to balance the need to capture the data encoded in MARC, the constraints of RDF, and input from the community it hopes to serve. In addition, there are other schemas available for libraries’ use, such as Schema.org, the CIDOC Conceptual Reference Model (CIDOC-CRM), and the Europeana Data Model (EDM). Although not designed as replacements for MARC, these other schemas are used by important information communities, such as Europeana or museums, with which libraries interact.
The resultant metadata ecosystem has created a very complex environment.

Schema.org itself deserves a special mention in this complex environment. Sponsored by Google, Microsoft, Yahoo, and Yandex, “Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.”10 It has been designed for the broadest possible use and focuses upon the semantic understanding of Web search engines. Because of this focus, it is of great interest to libraries and library-related organizations, such as OCLC, for embedding library data into the semantic web. It was never designed, however, to capture even the full richness of the data contained in MARC. Rather, its focus is on broad integration into the Web. BIBFRAME has been designed to fill that gap so that, as libraries move to the semantic web, the richness and detail of their metadata can be reflected there.

Likewise, the CIDOC-CRM has a special place in this project. Accepted as an ISO standard since 2006, CIDOC-CRM has been designed to encompass the full description of cultural heritage information: the objects themselves, their digital surrogates, and the metadata describing them, using either object-centric or event-centric modeling. The schema is extremely complex and tailored to the world of museums and cultural heritage organizations. Libraries often need to describe some of these materials, although such materials are not the focus of their collections. They do, however, need to encompass the description of these objects in their discovery systems. The LD4P projects focusing on these materials will experiment with expanding BIBFRAME to include necessary concepts from CIDOC-CRM, producing a simpler but functional extension to BIBFRAME that can meet the basic needs of describing these materials in a common discovery interface.

Libraries have survived in their current environment by adhering to structural and data quality standards that facilitate the easy exchange of metadata for commonly held resources. These standards also allowed metadata from various institutions to be quickly combined into large discovery interfaces. As libraries transition from their current environment to a much more complex one based in LOD, these standards must be rethought and re-envisioned. The need for them remains as strong, but their expression is unclear.

Since its inception, BIBFRAME has been used in a number of individual projects both within the United States and internationally. For instance, the University College London Department of Information Studies has been awarded a grant to develop a Linked Open Data bibliographic dataset based on BIBFRAME. The Library of Alexandria will focus on the conversion process for data in the Arabic language. The National Library of Medicine has developed a more modular approach to the BIBFRAME vocabulary by paring down the existing vocabulary to its core concepts (BIBFRAME-Lite). We have now arrived at the point where these individual efforts should be drawn together to create the common environment, standards, and protocols that have allowed libraries to interact so strongly in the past. And by expressing relationships in a standard way so that machines can understand the meaning inherent in them, which is the heart of the semantic web, libraries’ data will finally be able to be embedded into the Web.

Rationale

In order to address these issues, Stanford University proposed a planning grant to the Mellon Foundation in 2014 called Linked Data for Production (LD4P). The planning grant proposed two meetings to define and organize a series of projects that would begin the transition to the native creation of linked data in a library’s production environment. The core members of LD4P are Columbia, Cornell, Harvard, the Library of Congress, Princeton, and Stanford. The outcome of those meetings was a report submitted to the Mellon Foundation in July of 2015. The group had a final meeting recently at the Library of Congress to formalize its plans.

This group of six libraries is particularly well suited to pursue this transition in technical services. Cornell, Harvard, and Stanford are founding members of Linked Data for Libraries and will be building upon collaborative efforts already well underway. The Library of Congress is the originator of BIBFRAME and is engaged in a project to explore the use of BIBFRAME in its current workflows. Columbia and Cornell’s Technical Services Departments are already allied through another Mellon-supported project called 2CUL. And Princeton was one of the early BIBFRAME experimenters in the United States.

Beyond this, however, these institutions are deeply enmeshed in the current technical services ecosystem. The transition to LOD cannot be accomplished exclusively by libraries. Libraries have become dependent upon vendor services (cataloging, authority control), the ILS, standards organizations (the Program for Cooperative Cataloging (PCC)), and domain experts. As part of LD4P, these institutions can influence the vendor community as a group to encourage them to make the transition to LOD. They can work with their own ILS (SIRSI, Ex Libris, OLE) to incorporate LOD into future plans. SIRSI/Dynix has already expressed interest in working with Stanford on its linked data workflows through the use of their new product, Blue Cloud. OLE has already been actively engaged with UC Davis and the BIBFLOW project and plans to enhance its linked data capabilities. Cornell has recently announced that it will be moving to OLE for its ILS. If Cornell makes the transition early enough in this grant cycle, it may be able to take advantage of OLE’s capabilities as well.

This new communal, distributed model based on web architecture will change how we communicate and share our data. Centralized data stores, such as the OCLC database, will be joined by alternative data pools as the marketplace shifts in support of this new environment. Traditional authority control will be supplemented by identity management. Cataloging standards may have to evolve from their focus on transcription of data as it appears on an item, which a human can easily read and interpret at a computer screen, to data that a machine can understand and link semantically.

Institutional Projects

The institutional projects focus either on the processing of special, local collections or on the conversion of local workflows for more traditional materials. A library’s workflows are often particular to that institution. They develop organically from a complex mix of institutional policies, vendor services, choice of ILS (and its capabilities), and accepted standards (RDA, the Program for Cooperative Cataloging’s Bibliographic Standard Record (BSR), etc.). The institutional projects, then, have two goals. The first is more straightforward. Although identical workflows cannot be developed for all institutions, standards for the output of those workflows can be. This meets a library’s basic need to be able to ingest and use metadata created at other institutions. By focusing on different subject domains (e.g., cartographic, music, rare books), the group is trying to standardize the metadata output for the most common types of resources they will need to process. The second is more complex. Two institutions (the Library of Congress and Stanford) have chosen the conversion of their local workflows to linked data as their institutional project. Because each workflow is unique to its institution, these projects can be considered “institutional.” The benefits of these projects, however, are numerous. First, they will demonstrate that the conversion of a workflow from acquisition to discovery is possible. Second, they will identify the separate elements of the workflow that must be considered or converted. Third, they will produce solutions to the various elements of their workflows that can become models for how other institutions could approach similar issues. And as they work through these workflows, they can do so in consultation with the other LD4P members so that common standards and protocols can be developed even if explicit workflows cannot be copied from institution to institution.

Early on, the members of LD4P discussed how best to coordinate their projects. Of prime concern was whether more synergy could be gained from working in similar subject domains or from focusing on individual institutional interests. In the end, the group decided to favor institutional interests, as institutional buy-in and support would be key to local success. In addition, a major differentiator for LD4P as a project is learning how to work together in a networked, distributed environment. The development of this environment is independent of subject domain, which helped to reinforce our decision.

That being said, a tremendous number of intersections have appeared across the individual projects, binding us closely together. Columbia has chosen to investigate the extension of BIBFRAME for art objects as Stanford looks to ingest the metadata from its art museum, the Cantor, and both will have some intersection with the Library of Congress’s exploration of the use of BIBFRAME with its prints and photographs collection. Music appears as a theme in the project proposals from Cornell, the Library of Congress, and Stanford. Harvard has included a Stanford metadata expert in their exploration of geospatial metadata and BIBFRAME. The Library of Congress is working on the development of BIBFRAME 2.0 as Columbia, Cornell, and Stanford work on expanding its use into three new subject domains. The Library of Congress is also exploring the use of RDA and BIBFRAME, something that will be of use to all members’ catalog departments. Princeton’s project will build upon the annotation work developed for the first Linked Data for Libraries grant. And Stanford’s Tracer Bullet projects can help inform similar workflows at the other LD4P institutions.

Columbia

Overview

Over the years, the museum and library communities have developed separate descriptive cataloging practices even though many museums hold library objects and many library collections contain museum objects. Libraries have frequently used their ILS and the MARC 21/AACR2 cataloging tradition to describe these art objects along with traditional library materials. Now the library community is moving beyond the MARC record into the realm of linked open data.

This sub-project will focus on testing the BIBFRAME schema's suitability for the description of art objects, both two-dimensional (e.g. paintings, photographs) and three-dimensional (e.g. sculptures, ceramics). In addition, the Columbia group will evaluate other existing linked data ontologies (primarily from the art domain), not only to see how BIBFRAME compares to these specialized domain ontologies, but also as potential extensions to BIBFRAME where gaps have been identified. The test may result in the implementation of BIBFRAME as is, the implementation of BIBFRAME with extensions from other ontologies, or, potentially, the implementation of a different ontology (such as CIDOC CRM) if the test shows that BIBFRAME is not suitable at all for the art domain. The lessons learned will be shared with the community. Assuming that a suitable BIBFRAME art profile can be developed, the Columbia team aims to document the relationships, such as equivalent classes and properties, from BIBFRAME to CIDOC CRM.

The project will use metadata descriptions created for Columbia University’s art objects, which are overseen by Art Properties at the Avery Architectural & Fine Arts Library. The collection numbers about 10,000 objects in total, including public outdoor sculpture, paintings, photography, works on paper, and decorative works. At present, data describing the art objects is captured in a spreadsheet according to locally developed guidelines following conventions developed by the art community and using both Library of Congress and Getty vocabularies.

The group’s deliverables will include: a BIBFRAME profile for art objects, both for data transformation and native data creation; transformation and conversion of a representative selection of art object descriptions cataloged according to the Art Properties collection's local schema to the profile; and an evaluation of the project and publication of the group’s findings.
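As a rough sketch of what transforming spreadsheet descriptions to a profile could involve, the Python below turns rows of a CSV export into subject-predicate-object statements through a column-to-property mapping. Everything specific here is an assumption made for illustration: the column names, the sample rows, and the `http://example.org/` namespaces are hypothetical and do not reflect the Art Properties local schema, BIBFRAME, or the profile the group will develop.

```python
import csv
import io

# Hypothetical spreadsheet export; the column names and rows are assumptions,
# not the Art Properties collection's actual local schema or data.
SAMPLE_CSV = """object_id,title,creator,medium
AP-0001,Untitled landscape,Artist A,Oil on canvas
AP-0002,Seated figure,Artist B,Bronze
"""

BASE = "http://example.org/art/"      # placeholder namespace for objects
VOCAB = "http://example.org/vocab/"   # placeholder for profile properties

# Column-to-property mapping of the kind a profile would pin down.
COLUMN_TO_PROPERTY = {
    "title": VOCAB + "title",
    "creator": VOCAB + "creator",
    "medium": VOCAB + "medium",
}


def row_to_triples(row):
    """Turn one spreadsheet row into (subject, predicate, object) triples."""
    subject = BASE + row["object_id"]
    return [
        (subject, prop, row[column])
        for column, prop in COLUMN_TO_PROPERTY.items()
        if row.get(column)  # skip empty cells
    ]


def convert(csv_text):
    """Convert every row of a CSV export into a flat list of triples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [triple for row in reader for triple in row_to_triples(row)]
```

Keeping the mapping in a single table, separate from the conversion logic, means the same code could be rerun as the profile changes; a real conversion would also need to reconcile names and terms against the Library of Congress and Getty vocabularies, which this sketch does not attempt.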

The Cornell team will support our use of the Vitro linked data editing tool for this project.

Objectives

Evaluate the suitability of the BIBFRAME model and vocabulary for describing art objects, both two-dimensional (e.g. paintings, photographs) and three-dimensional (e.g. sculptures, ceramics).
Identify and document any descriptive needs of art objects that are currently not covered by BIBFRAME.
Evaluate other linked data ontologies, including CIDOC CRM, and initiatives in the art domain.
Develop a profile for the description of art objects.
Convert a representative selection of art resources cataloged according to the Art Properties collection's local schema to the profile.
Engage with related projects in the museum/art library domain, including the Library of Congress Prints and Photographs Division.
Participate in data exchange with other interested partners.
Develop a workflow to connect public-facing linked data to MARC circulation and other inventory data in spreadsheets or other sources.
Evaluate the project results and share recommendations.

Community

Columbia’s LD4P team (consisting of librarians from Columbia University Library’s Original & Special Materials Cataloging Division, the Avery Architectural & Fine Arts Library, and the curator of the Art Properties collection).
Colleagues from the Library of Congress Prints & Photographs Division.
LD4P partner institution members.
Interested members of the art and art library community.

Deliverables

A profile for the description of art objects.
Representative selection of BIBFRAME descriptions of art objects.
Workflow to connect public-facing linked data to MARC circulation and other inventory data in spreadsheets or other sources.
Evaluation and publication of project findings.

Project Proposal (January 2016)
