Content Models: Getting Started

The first difficulty with content models is pinning down a definition. As a first step in that direction, a content model can be seen as a combination of

  • a structured definition of a type (e.g. article, book, image, ...)
  • a set of constraints
  • a pattern of datastreams (number and type)
  • a pattern of datastreams and disseminators
  • a subclass of a digital object
  • a set of rules for creating a new object

Typically, a content model relates to one object type rather than to a whole repository. Fedora is very flexible when it comes to content models. Although this flexibility is one of its most prominent features, it also burdens repository designers with deciding how to set up a content model that suits their needs. Object models can be viewed on four levels:

  • Typing of objects (data formats)
  • Structural definition
  • Service definition (disseminators)
  • Logical model (aggregation of all content models within a repository)

Fedora currently doesn't support explicit, strongly typed content models, so a repository can be set up without any. While this may look appealing to an inexperienced user, it typically leads to a messy repository whose lack of systematic structure conceals the full value and richness of its digital artifacts.

Some informal typing can be achieved with the CMODEL property, but this relies on a 'best-practice' approach rather than on prototypes and automated validation. Many projects have implemented their own validation mechanism (one of the most advanced is certainly the work of Kostas Saidis and George Pyrounakis of the University of Athens on their Pergamos project).
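To make the idea of automated validation concrete, here is a minimal sketch in Python. It treats a content model as a pattern of datastreams and checks one object against it. The class names, the rule fields and the example datastream IDs are assumptions made for illustration; this is not a Fedora API and not how Pergamos does it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatastreamRule:
    """One entry in a content model's datastream pattern (hypothetical)."""
    ds_id: str             # datastream ID, e.g. "DC"
    mime_type: str         # MIME type the datastream is expected to have
    required: bool = True  # optional datastreams may be absent


@dataclass
class ContentModel:
    """A named set of datastream rules. Illustrative only, not a Fedora class."""
    name: str
    rules: list = field(default_factory=list)

    def validate(self, datastreams):
        """Return a list of problems; an empty list means the object conforms.

        `datastreams` maps datastream ID to MIME type for one object.
        """
        problems = []
        for rule in self.rules:
            actual = datastreams.get(rule.ds_id)
            if actual is None:
                if rule.required:
                    problems.append(f"missing datastream {rule.ds_id}")
            elif actual != rule.mime_type:
                problems.append(
                    f"{rule.ds_id}: expected {rule.mime_type}, got {actual}"
                )
        return problems


# Hypothetical model for a journal article.
article_model = ContentModel(
    name="journal-article",
    rules=[
        DatastreamRule("DC", "text/xml"),
        DatastreamRule("MARCXML", "text/xml"),
        DatastreamRule("FULLTEXT", "application/pdf", required=False),
    ],
)
```

Calling article_model.validate() with a mapping of the datastreams an object actually has returns the list of mismatches, so the same check could run before ingest or as a periodic audit.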

The ARROW project (http://arrow.edu.au) spent a great deal of time thinking about and documenting content models.

There are (at least) two dimensions to the thinking around content models.

The first is the difference between a compound object model vs an atomistic model. The second is the way in which metadata is used to describe different kinds of resource types.

Compound vs. atomistic model

Early in the project, ARROW committed to the compound object model rather than the atomistic model. What we mean by this is that there is a 'knowledge entity' which may have lots of different components. A book may have several chapters. A thesis may have a body, some tables, some images, a movie file, and an appendix. One knowledge entity = one object = many datastreams. (The atomistic model would separate these out into lots of different objects and link them together later.)

We started thinking in terms of contentstreams and metastreams, and how to link one contentstream to its corresponding metastream. (This is my terminology, which I hope is self-explanatory.)

One object might contain one pdf contentstream, and one image contentstream. Expanding this object to include its metastreams might look like this:

PID   Descriptive label   Format            Comments
DC    Dublin Core         text/xml          Descriptive metadata applying to entire object
DS1   MARCXML             text/xml          Descriptive metadata applying to entire object
DS2   Chapter 1           application/pdf   A pdf contentstream
DS3   DS2metadata         text/xml          A metastream describing ONLY DS2
DS4   Image               image/jpeg        A jpg image contentstream
DS5   DS4metadata         text/xml          A metastream describing ONLY DS4
DS6   Relationships       text/xml          A datastream describing the relationships between each of these datastreams

Being able to specify metadata for a specific contentstream has implications for things like controlling access at the datastream level, displaying file size and other attributes for specific contentstreams, and applying keywords or other descriptive metadata to specific contentstreams. It allows you to describe and manage individual components as well as the total knowledge entity.
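The example object can also be written down as plain data, which shows how a relationships datastream lets you go from a contentstream to the metastream that describes it. This is only a sketch: the DESCRIBES mapping is a hypothetical stand-in for the contents of DS6, whose real format is not described here, and the function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastream:
    ds_id: str  # datastream ID, e.g. "DS2"
    label: str  # descriptive label


# The datastreams of the example object, keyed by ID.
DATASTREAMS = {
    ds.ds_id: ds
    for ds in (
        Datastream("DC", "Dublin Core"),
        Datastream("DS1", "MARCXML"),
        Datastream("DS2", "Chapter 1"),
        Datastream("DS3", "DS2metadata"),
        Datastream("DS4", "Image"),
        Datastream("DS5", "DS4metadata"),
        Datastream("DS6", "Relationships"),
    )
}

# Hypothetical stand-in for DS6: metastream ID -> contentstream it describes.
# None means the metastream applies to the entire object.
DESCRIBES = {"DC": None, "DS1": None, "DS3": "DS2", "DS5": "DS4"}


def metastreams_for(content_id):
    """Metastreams that describe ONLY the given contentstream."""
    return [DATASTREAMS[m] for m, target in DESCRIBES.items() if target == content_id]


def object_level_metadata():
    """Metastreams that apply to the whole knowledge entity."""
    return [DATASTREAMS[m] for m, target in DESCRIBES.items() if target is None]
```

With this in place, metastreams_for("DS2") yields the DS3 metastream, which is the lookup needed to display or restrict one component without touching the rest of the object.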

Content models for different resource types

ARROW repositories are designed to manage the research outputs of universities in Australia. Each ARROW institution has its own individual and unique repository. (These are brought together in one interface through the National Library of Australia's Discovery Service, http://search.arrow.edu.au.) The original partners (Monash University, University of NSW, Swinburne University, and National Library of Australia) formed an Implementation Managers Group. This group worked on a series of standard resource types, and modelled those. We agreed that MARCXML was an appropriate schema for describing the mostly bibliographic objects that were to be ingested. From a MARCXML file, ARROW could extract standard Dublin Core for harvest.
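As a rough illustration of extracting Dublin Core from MARCXML, the sketch below walks one MARCXML record and collects a few Dublin Core values. The four mappings are a small subset of the commonly used MARC to Dublin Core correspondences and are an assumption here; they are not the mapping ARROW agreed on, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# (MARC tag, subfield code) -> Dublin Core element.
# Illustrative subset only; not ARROW's agreed mapping.
CROSSWALK = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
}

# Invented sample record.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Example, Author</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2007</subfield>
  </datafield>
</record>"""


def marcxml_to_dc(marcxml):
    """Collect Dublin Core values from a single MARCXML <record> element."""
    record = ET.fromstring(marcxml)
    dc = {}
    for datafield in record.findall("marc:datafield", MARC_NS):
        tag = datafield.get("tag")
        for sub in datafield.findall("marc:subfield", MARC_NS):
            element = CROSSWALK.get((tag, sub.get("code")))
            if element and sub.text:
                dc.setdefault(element, []).append(sub.text.strip())
    return dc
```

Each Dublin Core element maps to a list because MARC fields can repeat. A real extraction would cover many more fields and handle a collection of records rather than a single one.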

We worked on standard MARCXML for books, book chapters, conference papers, journal articles, theses, and working papers. Later we added images (although strictly speaking these aren't resource types per se).

We put all the agreed MARCXML metadata elements into one spreadsheet, which we call the Centralregister. We added the following information for each:

ContentModel, Contentmodel Name, Data element, Repeatable, Mandatory, Element Label, MARCXML, DC/OAI-PMH mapping, Comments or Rules, Harvest recommendation by the National Library of Australia (for the Resource Discovery Service), Display Y/N, and finally, whether the element was common to all resource types Y/N.

Using Excel's autofilter means that this spreadsheet could be used to see which elements were common to all the resource types (eleven elements are), which are unique to specific models (such as Conference Location), which of the elements appeared in what models, and so forth. It is a handy tool.
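The same questions can be asked of the register in code. The sketch below assumes the spreadsheet has been reduced to (content model name, data element) pairs. Apart from Conference Location, the rows are invented for illustration and are not taken from the real Centralregister.

```python
from collections import defaultdict

# Hypothetical register rows: (content model name, data element).
REGISTER = [
    ("Journal Article", "Title"),
    ("Journal Article", "Author"),
    ("Journal Article", "Journal Title"),
    ("Conference Paper", "Title"),
    ("Conference Paper", "Author"),
    ("Conference Paper", "Conference Location"),
    ("Thesis", "Title"),
    ("Thesis", "Author"),
    ("Thesis", "Degree"),
]


def elements_by_model(rows):
    """Group data elements by the content model they appear in."""
    by_model = defaultdict(set)
    for model, element in rows:
        by_model[model].add(element)
    return dict(by_model)


def common_elements(rows):
    """Elements that appear in every content model."""
    sets = list(elements_by_model(rows).values())
    return set.intersection(*sets) if sets else set()


def unique_elements(rows):
    """Elements that appear in exactly one model, keyed by that model."""
    models_for = defaultdict(set)
    for model, element in rows:
        models_for[element].add(model)
    result = defaultdict(set)
    for element, models in models_for.items():
        if len(models) == 1:
            result[next(iter(models))].add(element)
    return dict(result)
```

common_elements() gives the fields every input form needs, and unique_elements() gives the additions for each resource type.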

Knowing which of the metadata elements are common means that you can build a front end to elicit the appropriate elements for any resource type, then you can customise the remainder according to the resource type.

Once we had the elements pinned down, we built XML templates. Our users can fire up the Journal Article template in any XML editor and fill in the appropriate bits. Load the result into Fedora using the VITAL client from VTLS, extract the Dublin Core automatically, and you have a pretty standard object. It is a start. More resource types will be added as we need them.
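Our templates are filled in by hand in an XML editor, but the same idea can be sketched programmatically. The template below is an invented, much reduced stand-in for a real resource type template, and the placeholder names ($author, $title) are assumptions for illustration.

```python
from string import Template
from xml.sax.saxutils import escape

# Invented, minimal stand-in for a resource type template.
ARTICLE_TEMPLATE = Template("""<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">$author</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">$title</subfield>
  </datafield>
</record>""")


def fill_template(template, **values):
    """Substitute values into the template, escaping XML special characters.

    Raises KeyError if a placeholder in the template has no value.
    """
    return template.substitute({key: escape(value) for key, value in values.items()})


record_xml = fill_template(
    ARTICLE_TEMPLATE,
    author="Example, Author",
    title="Fish & chips: a study",
)
```

Escaping matters here because a title containing & or < would otherwise produce a record that is not well-formed XML.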

The Centralregister.xls spreadsheet, as well as the seven resource type XML templates, will be attached when I discover how to do so! We welcome your suggestions and comments: please email arrow@arrow.edu.au.