Thursday Notes

26-October-2006, Richard Jones

Outline of the day

Morning Session

Data Model

RT proposal for data model:

Discussion outline:

Summary:

We moved rapidly from the initial proposal to a more general approach. Diagram 2 shows the distilled approach, where metadata is attached at each level, the Item is equivalent to the FRBR Work, and the Representation is equivalent to the FRBR Manifestation. It was also articulated that we still need to represent relationships between things in this model; whether these sit in some sort of Item Manifest was asked and debated. The sorts of relationships that might be necessary were enumerated. Adopting something like the DCMI terms was suggested, but largely rejected as too specific. Further discussion of this topic continues below ...

RR proposal for data model:

Discussion outline:

Summary:

RR presented us with a much flatter data model, where structure is managed by metadata attached at the bitstream/file level. The concerns with this are: a) how you reconstruct the bundle-like structure of the item, and b) where you attach identifiers for each contained thing that we might want to reference. An item manifest was again suggested, but this only easily addresses issue (a). It was finally observed that RT's and RR's proposals are pretty much identical in terms of information architecture, but would lead to different sorts of implementations; in particular, RT's proposal leads to a more concrete interpretation of the structure. At least we now understand the abstract concepts that we need to express, and our requirements for them, which go beyond what the DSpace system can currently achieve.
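As an illustration of concern (a), here is a minimal sketch of rebuilding a bundle-like grouping from metadata held on each file in a flat model. It is not DSpace code: the FlatFile and BundleReconstructor classes and the "bundle" metadata key are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical flat-model file: an identifier plus file-level metadata.
final class FlatFile {
    final String id;
    final Map<String, String> metadata = new LinkedHashMap<>();

    FlatFile(String id) {
        this.id = id;
    }

    FlatFile with(String key, String value) {
        metadata.put(key, value);
        return this;
    }
}

final class BundleReconstructor {
    // Groups file identifiers by the value of one metadata key.
    // Files that lack the key are collected under "(ungrouped)".
    static Map<String, List<String>> groupBy(List<FlatFile> files, String key) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (FlatFile file : files) {
            String group = file.metadata.getOrDefault(key, "(ungrouped)");
            groups.computeIfAbsent(group, k -> new ArrayList<>()).add(file.id);
        }
        return groups;
    }
}
```

Note that the reconstructed groups have no identifier of their own beyond the metadata value, which is concern (b) as raised in the discussion.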

Use Cases:

RT takes us through the use cases from the wiki DataModelUseCases. Use cases brought up and updated have been added to that wiki page.

VRA (for visual resources)

  • Group
    • Work
      • Image
        • Surrogate

Aggregation Discussion summary:

Summary:

For the most part this consisted of the formulation of the primary recommended data model for DSpace. We looked into the balance between extremely general ways of managing hierarchical information in Nodes and Properties (exemplified in this case by JSR-170), and the extremely specific requirement of providing a useful DSpace information model which could be implemented easily, and quickly communicated and understood without too great a departure from the existing information model.
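To make the trade-off tangible, the following is a rough sketch of a fully general Nodes and Properties structure with a concrete model layered on top as a nesting rule. It does not use the JSR-170 API. The Node and ConcreteModel classes are hypothetical, and the Item > Manifestation > File nesting is an assumption based on the terms used in these notes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Fully general node: a type name, free-form properties, and child nodes.
final class Node {
    final String type;
    final Map<String, String> properties = new HashMap<>();
    final List<Node> children = new ArrayList<>();

    Node(String type) {
        this.type = type;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }
}

// The "concrete" layer: a rule about which type may sit under which.
final class ConcreteModel {
    // Assumed hierarchy for illustration: Item > Manifestation > File.
    private static final Map<String, String> ALLOWED_CHILD = new HashMap<>();
    static {
        ALLOWED_CHILD.put("Item", "Manifestation");
        ALLOWED_CHILD.put("Manifestation", "File");
    }

    // Checks nesting only: every child must have the type allowed under its
    // parent. Types with no entry (such as File) may not have children.
    static boolean isValid(Node node) {
        String allowed = ALLOWED_CHILD.get(node.type);
        for (Node child : node.children) {
            if (!child.type.equals(allowed) || !isValid(child)) {
                return false;
            }
        }
        return true;
    }
}
```

The general structure can hold anything; the small rule table is what makes it explainable as a DSpace-like model, and loosening that table later is one way to generalise.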

Making a decision:

Here we attempted to make the final decision on the recommended information model to be worked on. The main points raised are summarised below:

JE: must be possible to support the future options of DSpace implementers, irrespective of our current data model

MS: more turn-key than Fedora, less so than Greenstone. We can't be so abstract, but we can't always hand-hold

MS: we must make our recommendation on how to make the Nodes and Properties (N+P) approach concrete, so that we can write the application and implement inside the data model, and so that it can be explained to people and applied appropriately.

JMO: this is close enough to the current structure: easily explained, adopted, but has more flexibility so that people can

RT: we should go with this right now, and worry about generalising more later on.

MD: how will this model be abused?

JMO: should be possible to misuse if you know what you are doing

RT: versioning in the container model

MD: if versioning is at the item level, then for a work with many images the versioning must be at the image level; we don't want to version the entire work or item

JMO: we are converging on diagram 2. Metadata and content are managed together.

MS: do we switch to more FRBR speak? Representation -> Manifestation (agreed)

GT: File doesn't work, because you can't necessarily put every asset in the assetstore

GM: Can EPeople fit into this scheme?

JMO: renaming bitstreams implies a functional change, even if there is none. Can we just live with bitstream?

SP: everyone knows what file means

JMO: nominations 1) bitstream, 2) datastream, 3) file, 4) resource

We do a preliminary round of voting for Bitstream name change, where we can each vote as many times as we like:

Aggregate Votes: File: 6; Bitstream: 4; Resource: 4; Datastream: 1 (JE wants to be on the record for that being him!)

SP: nominate: Bits; gets 4 votes

Next we vote first on whether to switch from Bitstream to something else, and then on the remaining options, with just one vote per person:

Single Vote: Change vs Not Change: 9 to Change
Single Vote: File, Resource, Datastream, Bits: 6 for File

Concrete Data Model (making the data model more real):

Use cases:

1) A PDF (article), with deposit licence, extracted text

Submit: PDF + metadata, agree to a licence. Create a manifestation containing the single file. The deposit licence and reuse licence are metadata, but they may be represented as files ...

This discussion has been paused because of an argument over where metadata "files" live. It will be returned to later.

Lunch

Afternoon Session

Event Mechanism

RR introduces the general framework for an Event system

Content Model ---> Event ---> Dispatcher ---> Consumer ---> Browse/Search/JMS ...

LS arrives

Summary: general consensus that this is good, and should go into the source code as soon as possible. It will allow us to build a History system as a consumer, and so enables work on that system to advance.

  • Timeline:
    • Context created (includes db connection)
    • Operations make changes through the db connection
    • events are stored
    • context.complete() - transaction with database is completed
    • events are then fired to dispatcher
    • DSpace context goes out of scope
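The timeline above can be sketched roughly as follows. This is an illustration only, not the DSpace implementation: the Event, Dispatcher and SketchContext classes are hypothetical, and the database commit is represented by a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event: what happened, and to which object.
final class Event {
    final String type;
    final String subject;

    Event(String type, String subject) {
        this.type = type;
        this.subject = subject;
    }
}

// Hands each event to every registered consumer (browse, search, history ...).
final class Dispatcher {
    private final List<Consumer<Event>> consumers = new ArrayList<>();

    void register(Consumer<Event> consumer) {
        consumers.add(consumer);
    }

    void dispatch(List<Event> events) {
        for (Event event : events) {
            for (Consumer<Event> consumer : consumers) {
                consumer.accept(event);
            }
        }
    }
}

// Stand-in for the context: events are stored as operations happen and are
// only fired once complete() has finished the transaction.
final class SketchContext {
    private final Dispatcher dispatcher;
    private final List<Event> pending = new ArrayList<>();
    private boolean completed = false;

    SketchContext(Dispatcher dispatcher) {
        this.dispatcher = dispatcher;
    }

    void addEvent(Event event) {
        if (completed) {
            throw new IllegalStateException("context already completed");
        }
        pending.add(event);
    }

    void complete() {
        // Assumption: a real implementation commits the db transaction here.
        completed = true;
        List<Event> toFire = new ArrayList<>(pending);
        pending.clear();
        dispatcher.dispatch(toFire);
    }
}
```

Firing only after the commit means consumers never see events for changes that were rolled back. A History system would be just another consumer registered with the dispatcher.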

DECISION: yes! recommend that this will go into 1.5 initially, and become "core" for 2.0. Exact design for 2.0 will depend on framework review, and upcoming event prototype, and there will be a new logical event.

QUESTION: MS: Do we implement a new History system or not? No one is pressing for it, but it could be important. RR: is the History system a detachable, optional thing? We have an architecture that permits this for the first time, so pull it out of the core. General consensus that that is a good idea.

LS: every archivable object needs to expose a URI to be used successfully in any History system
MS: PREMIS is not flexible enough for us to use as an events framework

Question: do we ship with a history system, or leave it out? RR: should do so, and MIT will be on the hook to provide it. JD: provenance is so important these days.

Return of the Data Model

JMO summary: we have a storage layer, which may be in a database, but not necessarily, AND we have a database cache which is a view of that data as appropriate.

Two possible structures for storage are presented in the diagram below:

back to use cases

1) as before. see above

2) The PDF is converted to HTML with GIF

ISSUE: (GT): "derived from:" does not tell you whether the source (identified using the version identification schema) has changed since the derivation was made, given that the global version identifier is updated whenever an item update happens.

NB (GT): IMHO this is something that needs to be enshrined in the data model. While it's theoretically OK to call this curatorial, the system has to be able to efficiently identify when derived manifestations are potentially out of date - not doing so threatens the preservation model ('dumb' version links would make every existing derived manifestation for an item appear to be potentially out of date whenever any revision is made - such as adding a new derivation). However, we did appear to have a consensus on 'smart' version links that update with revisions provided the manifestation pointed to hasn't been altered - which would resolve this issue.

RESPONSE: This is not the burden of the data model, this is a curatorial operation. Perhaps the Event system can help here?
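One possible reading of the 'smart' version links described by GT is sketched below. The mechanics were not specified in the meeting, so everything here is an assumption: the DerivedLink and VersionedItem classes are hypothetical, the item carries a single global version counter, and adding a derivation counts as a revision.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical "derived from" link, pinned to a global item version.
final class DerivedLink {
    final String derivedId;
    final String sourceId;
    int sourceVersion;

    DerivedLink(String derivedId, String sourceId, int sourceVersion) {
        this.derivedId = derivedId;
        this.sourceId = sourceId;
        this.sourceVersion = sourceVersion;
    }
}

final class VersionedItem {
    int version = 1; // global version identifier for the whole item
    final List<DerivedLink> links = new ArrayList<>();

    // Adding a derivation is itself a revision that alters no source.
    DerivedLink derive(String derivedId, String sourceId) {
        revise(Collections.<String>emptySet());
        DerivedLink link = new DerivedLink(derivedId, sourceId, version);
        links.add(link);
        return link;
    }

    // Bumps the global version. An up-to-date link follows the new version
    // only if its source manifestation was not altered in this revision.
    void revise(Set<String> alteredIds) {
        int previous = version;
        version++;
        for (DerivedLink link : links) {
            if (link.sourceVersion == previous && !alteredIds.contains(link.sourceId)) {
                link.sourceVersion = version;
            }
        }
    }

    boolean isPotentiallyStale(DerivedLink link) {
        return link.sourceVersion != version;
    }
}
```

With 'dumb' links the revise step would never advance sourceVersion, so every link would look stale after any revision; here only links whose source actually changed are flagged.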

IMPORTANT: revisions are cheap; we have to have a way to hide them in the UI. Must also be able to distinguish major from minor versions.

DECISION: generalise containers, and investigate the long-term need for containers that contain different sorts of objects