• Title: Objects can be associated with a PREMIS event service
  • Primary Actor:
  • Scope:
  • Level:
  • Story: An object can be associated with a listener for PREMIS events. This listener should receive event messages rather than whole documents, but be able to export a log of the event history. Attempting to emulate this in Fedora 3.x requires constant replacement of an XML document, and has collision problems when integrated into a system of parallel, distributed repository processes.

 

12 Comments

  1. A. Soroka asks: "Ben-- under "Flexible types of stored entities", are you contemplating a meta-contract that would be exposed to the community and its developers for people to fulfill in order to create new providers for these kinds of entities? E.g. someone (J. Random Fedora Institution) who wants to add a "PREMIS module" would be able to fulfill a certain meta-contract with the repository architecture and interfaces and guarantee themselves of good operation? Or is this about work that we might do near-term in the core of the project to actually produce some of these new kinds of providers?"

  2. Does this story also include how these entities are stored (e.g. non-filesystem storage)? At Stanford, we store workflow data in a RDBMS (and poke it through the Fedora object as an externally managed datastream). It'd be (interesting, at least) if Fedora could provide read/write/(append?) access to the data, storing it in the database (and somehow make static copies for preservation purposes?)

    1. Adam: Yes, the first; exactly

      Chris: Yes, the particular use case I have in mind is this kind of event logging- something that non-locking replacement of bytestreams can't really serve effectively in a high-traffic environment.  But ideally, the meta-contract (borrowing Adam's term above) can be used to implement arbitrary "types"

  3. Extending the first point - arbitrary types are essential once we get into data. Use-case: in experimental biology they have over 200 different XML standards for different experiments. Each object may share some common elements but must essentially self-describe, as we can't and don't want to have to deal with all those schemas and versions. Tools like Elastic Search allow us schema-less indexing, and the responsibility for validation passes to the supplying application(s).

    For workflows I would view the repository version as the canonical one and the database as a "working cache" of the information which can be recreated from the repository (if it can't then it's not archived correctly!).

    For high throughput applications we use a message-queue driven architecture at Oxford (actually for most things). Modules listen for messages that indicate updates to objects of interest or the creation of objects of interest. The messages contain a reference to the object in question, and the modules then interact with it via the repository API. We can thus have multiple modules to process messages if we have throughput issues, or a single module shared among several stores if it scales well.
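A minimal sketch of the message-driven pattern described above, assuming an in-process queue in place of a real message broker. The `REPOSITORY` dictionary, `fetch_object`, and the message fields `event` and `object_id` are hypothetical stand-ins, not part of any Fedora API.

```python
import queue
import threading

# Hypothetical stand-in for the repository: object id -> properties.
REPOSITORY = {"obj:1": {"label": "first"}, "obj:2": {"label": "second"}}


def fetch_object(object_id):
    """Pretend repository call; a real module would use the repository API."""
    return REPOSITORY[object_id]


def worker(messages, results):
    """Consume update messages until a None sentinel arrives."""
    while True:
        message = messages.get()
        if message is None:
            break
        # The message carries only a reference; the object is fetched on demand.
        obj = fetch_object(message["object_id"])
        results.append((message["event"], message["object_id"], obj["label"]))


def run(messages_to_send, worker_count=2):
    """Start several identical modules on one queue and collect their output."""
    messages = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(messages, results))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for message in messages_to_send:
        messages.put(message)
    for _ in threads:
        messages.put(None)  # one sentinel per worker so each one stops
    for thread in threads:
        thread.join()
    return results
```

Raising `worker_count` corresponds to running more modules against the same queue when throughput is a problem. The order of `results` is not guaranteed once more than one worker is running.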

     

  4. Ben, can you re-frame your use case to make a clearer case for the problem that this use case is intended to solve/address? If indeed it's about "meta-contracts", then pick one and use that to illustrate (possibly cribbing from Neil, if that suits)

  5. For my purposes, all XML is one type of object. A better example would be an "Events stream" that, under terms of "meta-contract", should be able to serialize events to some export format (probably XML).

  6. I'm very interested in the topic of Fedora storing PREMIS data. I'd like to also comment that it would be useful to be able to prepend prior existing PREMIS event data. I'd like it to be developed so that at any given time I can make a restful request to Fedora and return the full PREMIS record for an object, as well as use a restful service to apply event data to an object. But most of all, I would be looking for Fedora to be the PREMIS authority. PREMIS records could become very large, so I tend to think this lends itself better to a SQL solution than a no-SQL solution. I don't like the model of having an XML doc stored and then replaced for each event. Checksums are events that would need to be tracked, so the creation/addition of an XML file for every object for this event seems like an unnecessary bottleneck in the amount of time it will take to re-run fixity checks, where adding a single line to a SQL table is not.
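A minimal sketch of the "single line in a SQL table" idea, assuming SQLite purely for illustration. The table name `events`, its columns, and the three helper functions are hypothetical; nothing here reflects an actual Fedora schema.

```python
import sqlite3


def open_event_store(path=":memory:"):
    """Create the (hypothetical) events table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " object_id TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " detail TEXT,"
        " recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_event(conn, object_id, event_type, detail=None):
    """Append one row per event; no existing row is rewritten."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO events (object_id, event_type, detail) VALUES (?, ?, ?)",
            (object_id, event_type, detail),
        )


def event_history(conn, object_id):
    """Return an object's events in the order they were recorded."""
    rows = conn.execute(
        "SELECT event_type, detail FROM events WHERE object_id = ? ORDER BY id",
        (object_id,),
    )
    return rows.fetchall()
```

Each fixity check then costs one insert rather than a read, modify, and replace of a whole XML document.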

    1. Michael, I'm a bit skeptical of Fedora storing and presenting PREMIS data per se, because the implementation of PREMIS requires establishing (or choosing from existing) vocabularies.  It seems to me that it's more important for Fedora to record "create", "modify", "delete", and "validation" events which can be converted into PREMIS (XML) via some other process.

      1. David,

          I have mixed feelings about this- On the one hand, I hate to tie a product down to a particular implementation.  On the other, a generic vocabulary is both idiosyncratic and can be, if it is not too harsh an assessment of Fedora 3.x, basically useless without a driving use case behind it. Given that the PREMIS vocabulary exists, and does what you describe, and can also be transformed into other formats as well as a local scheme could, I'm not sure I see the merits of idiosyncrasy.  I also remain sympathetic to Michael's stance re: events vs xml documents, for what it's worth.

        1. Ben,

          I suspect there's a significant area of agreement here. I suppose it's a matter of clarifying exactly what "a restful request to Fedora and return the full PREMIS record for an object" means.  PREMIS in many instances recommends using controlled vocabularies without prescribing them, which means Fedora has to define them.  Not that this would necessarily be a bad thing, but I worry a bit about differing opinions and institutional practices.  OTOH anything in this area would be a vast improvement over the FC3 audit trail. (wink)  And yes, events, not XML docs.

          1. I had to think about this more before replying. I think in the simplest form I'm looking for a restful service that will store key->value pairs for data streams and then on demand, list them back out in the order they went in. The problem I'd like to solve is to have an authoritative system to store PREMIS event data, not necessarily be capable of creating a full PREMIS record which could be derived from the data stored with the Fedora object. I think the simplest implementation could treat this much like EXIF data where the keys are up to the institution and the values can contain up to about 500 characters of data though the majority will be under 50. 

            But taking the simple approach and expanding it a little further, it would be nice if Fedora also contributed to this, so anytime a data stream is requested, that's an event that is stored. Going a little further, this is only needed for some data streams, so being able to flag which streams have this action applied would help with the default being no events recorded. An example would be when we ingest an AIP that contains a dozen files including a TIF, I only want the events for the TIF, not the other 11 files. 

            Some events take place prior to the object getting into Fedora. In actuality, they are the most important events, that's why it would help to be able to register events with an object after ingest. It might even be helpful to allow the date to be sent along with the key/value pair. 
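A minimal in-memory sketch of the request so far: key/value events stored per data stream, recorded only for streams that are flagged, listed back in insertion order, and optionally carrying a caller-supplied date for events that happened before ingest. The class `EventLog` and its method names are hypothetical, not an existing Fedora interface.

```python
from datetime import datetime, timezone


class EventLog:
    """Append-only key/value events, kept per data stream."""

    def __init__(self):
        self._tracked = set()  # streams flagged for event recording
        self._events = {}      # stream id -> list of (key, value, when)

    def track(self, stream_id):
        """Opt a stream in; by default no events are recorded."""
        self._tracked.add(stream_id)
        self._events.setdefault(stream_id, [])

    def record(self, stream_id, key, value=None, when=None):
        """Append an event; return False if the stream is not flagged."""
        if stream_id not in self._tracked:
            return False
        if when is None:
            # No date supplied, so use the time of registration.
            when = datetime.now(timezone.utc)
        self._events[stream_id].append((key, value, when))
        return True

    def history(self, stream_id):
        """List events in the order they went in, not in date order."""
        return list(self._events.get(stream_id, []))
```

In the AIP example above, only the TIF would be passed to `track`, so events for the other 11 files are dropped. A pre-ingest event registered later keeps its supplied date but still appears after earlier registrations.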

            The events I'm most interested in recording are:

            • a derivative file was created
            • a file was copied
            • a file was accessed
            • a file was moved
            • a checksum was calculated for a file
            • a new version replaces a file
            • a file is suppressed
            • a file is deleted
            • a file is added

            So the key/val pairs would simply be:

            • derivative=>JPG 1056*
            • copy=>source/destination
            • access=>[no value]
            • move=>source/destination
            • checksum=>MD5:[checksum]/SHA256:[checksum]
            • version=>old file name/new file name, old checksum, new checksum
            • suppress=>[no value]
            • delete=>[no value]
            • added=>[no value]
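As a small illustration of the notation in the list above, the sketch below turns one `key=>value` line into a pair, mapping the `[no value]` placeholder to `None`. The function name `parse_pair` is hypothetical, and the notation is treated as an informal convention rather than a defined format.

```python
def parse_pair(line):
    """Split 'key=>value' into (key, value); '[no value]' becomes None."""
    key, separator, value = line.partition("=>")
    if not separator:
        raise ValueError(f"expected 'key=>value', got {line!r}")
    key = key.strip()
    value = value.strip()
    if value in ("", "[no value]"):
        value = None
    return key, value
```

For example, `parse_pair("copy=>source/destination")` yields the key `copy` with the value `source/destination`, while `parse_pair("access=>[no value]")` yields `access` with no value.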

            It might be worthwhile to code a way to associate a userID with the events as well. For Yale, these are all system to system events so we may pass in the value "user X did this from system Y" where user X does not exist in Fedora and could be almost any value from a full name to just a network ID. 

            But the keys are arbitrary, some defaults could be used from the PREMIS spec on event data but if Fedora recorded the above by default for data streams it would get me 99% of the way to where I need to be and the last 1% is really to download the data and create the PREMIS record, so in reality this would be 100% of what I need Fedora to do. 
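A rough sketch of the "last 1%", assuming the stored events have been downloaded as (key, value, date string) tuples. The element names `eventType`, `eventDateTime`, and `eventDetail` come from PREMIS, but the namespace constant is an assumption to verify against the schema in use. The output omits `eventIdentifier` and is not validated, so it should not be read as a conformant PREMIS record. The function name `events_to_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed PREMIS 2.x namespace; check it against the schema actually in use.
PREMIS_NS = "info:lc/xml-premis-v2"


def events_to_xml(events):
    """Serialize (key, value, when) tuples into a simplified PREMIS-style log."""
    ET.register_namespace("premis", PREMIS_NS)
    root = ET.Element(f"{{{PREMIS_NS}}}premis")
    for key, value, when in events:
        event = ET.SubElement(root, f"{{{PREMIS_NS}}}event")
        ET.SubElement(event, f"{{{PREMIS_NS}}}eventType").text = key
        ET.SubElement(event, f"{{{PREMIS_NS}}}eventDateTime").text = when
        if value is not None:
            # Events recorded with no value get no detail element.
            ET.SubElement(event, f"{{{PREMIS_NS}}}eventDetail").text = value
    return ET.tostring(root, encoding="unicode")
```

Mapping institution-specific keys onto a controlled vocabulary for `eventType` is left out here; that is the part the earlier replies identify as a local decision.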

            I hope that helps to clarify my request a little more after giving it more thought. Right now I have a system that creates this data ahead of the objects going into Fedora. I don't want to send it as another XML data stream, I'd rather pass the info from the originating system to the system of record and from then on store it only in Fedora. I also recognize that this could amount to a lot of data. But when it comes to accounting systems, I'd rather sift through a thousand lines of logging for a single data stream and find what I need than to have to go through a billion lines of Fedora system logs trying to figure out when something happened to a specific stream. It also helps to identify blame. If there's no event for something but something clearly happened, then we know it was not the fault of the software and have to look at the hardware and systems around them. If there is an event and we can trace it back to a person that did something in a system that feeds Fedora, then that would help me sleep at night.

            1. Hello Michael Friscia,

              Thanks for the clear and scoped description. The collection and population of this information on objects or datastreams is eminently doable with the existing Fedora internal event monitoring machinery. It would be valuable to do some collaborative design around import of existing event properties and the REST API for retrieval as we get closer to wrapping up the 4.0 feature set.