MikeSimpsonThoughtsDuringConference

RANDOM CHRONOLOGICALLY-SORTED THOUGHTS OCCURRING DURING THE 2004 USER CONFERENCE
(as refactored at 20,000 feet, courtesy of Midwest Airlines)
Metadata should probably have a separate authorization from the
content. I.e. you should be able to set a switch that says, "Show the
metadata for this object, but not the object itself."
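A minimal sketch of that switch, assuming two independent visibility flags per object. `AccessPolicy` and `retrieve` are hypothetical names for illustration, not DSpace API.

```python
from dataclasses import dataclass


@dataclass
class AccessPolicy:
    """Hypothetical per-object policy: metadata and content are authorized separately."""
    metadata_visible: bool = False
    content_visible: bool = False


def retrieve(obj, policy):
    """Return only the parts of `obj` that the policy exposes."""
    result = {}
    if policy.metadata_visible:
        result["metadata"] = obj["metadata"]
    if policy.content_visible:
        result["content"] = obj["content"]
    return result


thesis = {"metadata": {"title": "A Thesis"}, "content": b"%PDF-..."}

# "Show the metadata for this object, but not the object itself."
print(retrieve(thesis, AccessPolicy(metadata_visible=True)))
```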
Another kind of record that should be attachable to a digital object
should be a statistical ("usage") record – this is different from
regular metadata in the sense that it should be composed of counters
that automatically update when triggered by API actions. So when the
interface calls Item.display() (or whatever) the "display count"
attribute gets autoincremented. These counters should be implemented
at the lowest level possible (like the UNIX kernel counters for I/O,
virtual memory, etc.) so that all sorts of statistical analysis can be
built on top of them. Should the statistical functionality wind up as
a separate module ("the Statistics API") or be part of the APIs of all
modules (everything has a getCounters() call or something like that)?
Actually, I think you could do all this with extensive logging plus a
log analysis toolkit (e.g. Apache+Analog). It's a question of where
you want to put the overhead.
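One way to picture the counter idea is a decorator that bumps a per-object counter every time an API method runs, so the interface layer never has to remember to do it. This is only a sketch: `counted`, `Item` and `get_counters` are hypothetical names, and it takes the "everything has a getCounters() call" option rather than a separate Statistics API.

```python
from collections import Counter
from functools import wraps


def counted(action):
    """Wrap a method so each call increments the object's counter for `action`."""
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            self._counters[action] += 1  # happens below the interface layer
            return method(self, *args, **kwargs)
        return wrapper
    return decorator


class Item:
    """Hypothetical digital object carrying its own usage record."""

    def __init__(self, name):
        self.name = name
        self._counters = Counter()

    @counted("display")
    def display(self):
        return f"<rendering of {self.name}>"

    def get_counters(self):
        # Return a copy so callers cannot tamper with the live counters.
        return dict(self._counters)


item = Item("thesis.pdf")
item.display()
item.display()
print(item.get_counters())  # {'display': 2}
```

The logging alternative would replace the increment with a log line and move the counting into the analysis toolkit.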
Here's a concept and a vocabulary term hanging on it: "viewpoints",
which are retrieval interfaces customized for specific data types
(Text, Image, Audio, Video, Export). This is sort of
Model-View-Controller, in that we split display from data. I suppose
you should be able to request the Image viewpoint on a textual
bitstream, but I'm not sure about what it should return. Should
viewpoints attach only to certain levels in the object scheme, i.e. do
communities need to have viewpoints?
All objects are container objects. Even a "Bitstream" is really an
abstraction of bytes plus extra information. All objects should be
loosely-coupled to their containers – maybe objects exist as
non-hierarchical "pools" (bitstream pool, item pool, collection pool,
community pool) and then we express hierarchy as arbitrary linkages
defined between pools – this is like a filesystem abstraction sitting
on top of a physical storage device, where branches (directories) and
leaves (files) are structure imposed on randomly-distributed disk
sectors.
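The pool idea can be sketched as flat dictionaries plus a separate link table, so that hierarchy is data rather than structure. All names and identifiers here are hypothetical.

```python
from collections import defaultdict

# Flat, non-hierarchical pools, one per object type.
pools = {
    "community": {"c1": "Engineering"},
    "collection": {"k1": "Theses", "k2": "Open Access"},
    "item": {"i1": "Simpson 2004"},
}

# Hierarchy lives only here: parent (pool, id) -> list of child (pool, id).
links = defaultdict(list)


def link(parent, child):
    """Record that `child` appears inside `parent`; neither object is modified."""
    links[parent].append(child)


def children(parent):
    """Return the children linked under `parent` (empty list if none)."""
    return list(links.get(parent, []))


link(("community", "c1"), ("collection", "k1"))
link(("collection", "k1"), ("item", "i1"))
link(("collection", "k2"), ("item", "i1"))  # same item, second container

print(children(("collection", "k1")))
print(children(("collection", "k2")))
```

Because the item only lives in its pool, putting it in a second collection adds a link and copies nothing.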
Retrieval interfaces should respond to (at least) two types of
identifiers: something a bit more handle-ish (the "canon path") and
something that reflects a (human-readable) pathway to the object
through a specific hierarchy of communities/collections/etc. (the
"label path"). There should be one unique "canon path" for each
object, but many possible "label paths" (compare to symlinks in a
filesystem).
Any object should be able to be aliased into the next-higher-level set
of containers: this action would create a new "label path" but not
another "canon path". Maybe the original label path (the one created
upon item submission?) should be privileged somehow – the owner of
that path can grant/create other paths, but no one else. Or, there's
an owner of the "canon path", who can then grant/create one or more
label paths with various authorization parameters upon request.
I.e. an interface to send a request: "I see you are the owner of this
item; I'd like to include it in my collection as well, and I'd like it
to be publicly-accessible along the new label path." And then one for
the reply: "The new label path has been created, and I've put this set
of authorizations on it."
Instead of "label paths", they could be called "alias paths".
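A rough sketch of the two identifier types, assuming a registry that stores each object once under its canon path and treats label (alias) paths as pointers to it. `PathRegistry` and the sample paths are hypothetical.

```python
class PathRegistry:
    """Hypothetical resolver: one canon path per object, many label paths."""

    def __init__(self):
        self._objects = {}  # canon path -> object
        self._labels = {}   # label path -> canon path

    def register(self, canon_path, obj):
        if canon_path in self._objects:
            raise ValueError(f"canon path already in use: {canon_path}")
        self._objects[canon_path] = obj

    def alias(self, label_path, canon_path):
        """Create a new label path; no new canon path is created."""
        if canon_path not in self._objects:
            raise KeyError(canon_path)
        self._labels[label_path] = canon_path

    def resolve(self, path):
        # A label path is followed once, like a symlink; a canon path resolves directly.
        canon = self._labels.get(path, path)
        return self._objects[canon]

    def label_paths(self, canon_path):
        return sorted(lp for lp, cp in self._labels.items() if cp == canon_path)


registry = PathRegistry()
registry.register("obj:0001", "item foo")
registry.alias("/bat/bar/foo", "obj:0001")
registry.alias("/other/shared/foo", "obj:0001")

print(registry.resolve("/bat/bar/foo"))  # item foo
print(registry.label_paths("obj:0001"))
```

The owner-grants-a-path exchange would sit in front of `alias`, deciding who may call it and with which authorizations.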
Any object should be able to have metadata records ("record objects?")
attached, of various types (i.e. DC record, METS record, MARC record;
but also "Collection" metadata, "Community" metadata, i.e. the
information that is currently pulled off into the collection and
community tables in PostgreSQL; this should be generalized and turned
into a metadata record just like any other metadata). Note that being
able to attach multiple records of the same schema type fixes the
language issue (English DC, French DC, etc.).
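As a sketch of attachable records, assume each record carries a schema name and a language tag, and an object simply holds a list of them. `Record` and `DigitalObject` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    schema: str      # e.g. "dc", "mets", "marc", "collection"
    fields: dict
    language: str = "en"


@dataclass
class DigitalObject:
    identifier: str
    records: list = field(default_factory=list)

    def attach(self, record):
        self.records.append(record)

    def find(self, schema, language=None):
        """Return attached records matching a schema and, optionally, a language."""
        return [
            r for r in self.records
            if r.schema == schema and (language is None or r.language == language)
        ]


obj = DigitalObject("obj:0001")
obj.attach(Record("dc", {"title": "Thoughts"}, language="en"))
obj.attach(Record("dc", {"title": "Pensées"}, language="fr"))
obj.attach(Record("collection", {"name": "Conference Notes"}))

print([r.fields["title"] for r in obj.find("dc")])
print(obj.find("dc", language="fr")[0].fields["title"])
```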
Metadata schemas themselves should be loadable/unloadable based on an
XMLish definition file; actually any kind of registry-type information
should exist in canon form in XML, which is then parsed by the loader
process and turned into the appropriate internal commands (i.e. ANSI
SQL) to create the necessary data structures in the metadata store.
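A small sketch of such a loader, assuming a made-up XML layout (`<schema>` with `<element>` children) and one table per schema. The element names, the `metadata_` table prefix and the `object_id` column are all assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical definition file contents; the format is invented for this sketch.
SCHEMA_XML = """
<schema name="dc">
  <element name="title" type="TEXT"/>
  <element name="creator" type="TEXT"/>
  <element name="issued" type="TEXT"/>
</schema>
"""

ALLOWED_TYPES = {"TEXT", "INTEGER"}


def schema_to_sql(xml_text):
    """Turn a schema definition into a CREATE TABLE statement."""
    root = ET.fromstring(xml_text)
    name = root.get("name")
    if not name or not name.isidentifier():
        raise ValueError(f"bad schema name: {name!r}")
    columns = ["object_id TEXT NOT NULL"]
    for element in root.findall("element"):
        col, col_type = element.get("name"), element.get("type", "TEXT")
        if not col or not col.isidentifier() or col_type not in ALLOWED_TYPES:
            raise ValueError(f"bad element definition: {col!r} {col_type!r}")
        columns.append(f"{col} {col_type}")
    return f"CREATE TABLE metadata_{name} ({', '.join(columns)})"


sql = schema_to_sql(SCHEMA_XML)
print(sql)

# Load it into a throwaway in-memory store to show the statement is usable.
store = sqlite3.connect(":memory:")
store.execute(sql)
```

Unloading a schema would be the mirror image: the same file drives a `DROP TABLE`.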
Authentication creates a session object with various "attributes"
attached to it; attributes are the keys that provide authorization
decisions during retrieval. Authorization should occur for retrieval
of any object (communities down to bitstreams). An "authorization
path" (what I called an "alias path", above) might define the sequence
of authorizations that must be passed successfully to do retrieval.
That does mean that different authorization paths could exist for a
single object. Maybe the final arbiter is the canon path,
i.e. authorization parameters set on the canon path are checked last,
and override all other parameters set for the other authorization
paths.
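One possible reading of this, sketched under stated assumptions: session attributes are a set of strings, each step on an authorization path names the attributes it accepts, and a rule on the canon path, when present, is checked last and has the final word. The function name and the attribute values are hypothetical.

```python
def authorize(session_attributes, path_rules, canon_rule=None):
    """Return True if retrieval is allowed.

    path_rules: list of attribute sets, one per step on the authorization
                path (community, collection, ...); the session must hold at
                least one attribute from every step.
    canon_rule: optional attribute set on the canon path; when present it is
                checked last and its verdict replaces the path verdict.
    """
    allowed = all(session_attributes & required for required in path_rules)
    if canon_rule is not None:
        allowed = bool(session_attributes & canon_rule)
    return allowed


session = {"staff", "dept:library"}

# Passes both steps of the path; no canon rule set.
print(authorize(session, [{"staff", "public"}, {"dept:library"}]))

# A public path exists, but the canon path restricts the object to staff.
print(authorize({"public"}, [{"public"}], canon_rule={"staff"}))
```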
I'm thinking that indexing/search functionality really has no business
in DSpace, which is about archiving, browsing and retrieval. Index
and search should really exist as a separate application that lives
above the DSpace layer. The browse/retrieve API could of course
extend useful functionality up to that service layer (i.e. OAI-PMH)
but indexing and searching directly inside DSpace will always be a
secondary function at best.
An absolute baseline definition of "digital object": "a stream of
bytes representing a discrete chunk of intellectual property."
It would be nice if each defined MetadataSchema object could
automagically imply an OAICAT crosswalk, if the appropriate XML
mapping descriptor is available. Which is to say, there could be a
"crosswalks" directory with XML files inside describing various
mappings, and DSpace on startup would populate the OAI-PMH interface
with the appropriate crosswalks based on the XML files that it found.
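The startup scan could look roughly like the sketch below. It does not use OAICat itself; the XML layout (`<crosswalk prefix="...">` with `<map from="..." to="..."/>` children) and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_crosswalks(directory):
    """Return {metadata prefix: {source field: target element}} for each XML file found."""
    crosswalks = {}
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(path).getroot()
        # Fall back to the file name when the descriptor does not name its prefix.
        prefix = root.get("prefix", path.stem)
        crosswalks[prefix] = {
            m.get("from"): m.get("to") for m in root.findall("map")
        }
    return crosswalks
```

Dropping a new descriptor into the directory and restarting would then be all it takes to expose another metadata format.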
Retrieval limits should be able to be specified both in terms of
number of records and/or size of delivered content (i.e. "give me ten
records, or 100 Mb, whichever is less.").
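The "whichever is less" rule is easy to sketch as a generator that stops at the first limit it would exceed. The function name is hypothetical, and records are assumed to be byte strings so that `len` gives their delivered size.

```python
def limited(records, max_records=None, max_bytes=None):
    """Yield records until either limit would be exceeded, whichever comes first."""
    total = 0
    for count, record in enumerate(records):
        if max_records is not None and count >= max_records:
            break
        if max_bytes is not None and total + len(record) > max_bytes:
            break
        total += len(record)
        yield record


records = [b"x" * 40 for _ in range(5)]

# The size limit bites first: two 40-byte records fit in 100 bytes, a third does not.
print(len(list(limited(records, max_records=10, max_bytes=100))))  # 2

# The record limit bites first.
print(len(list(limited(records, max_records=3, max_bytes=1000))))  # 3
```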
Persistent naming structures (handles, PURLs, ARKs) should be
conceived as plugin views for the exposure and retrieval of repository
content. DSpace should always maintain an internal canon identifier
that can be algorithmically transformed into any of the other
identifier types and exposed to harvesters et al. on demand.
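A sketch of identifiers as plugin views: each scheme registers a function that derives its external form from the internal canon identifier. The registry, the function names, and every prefix or host below are placeholders, not real assignments.

```python
IDENTIFIER_VIEWS = {}


def identifier_view(scheme):
    """Register a function that derives one external identifier from the canon id."""
    def register(func):
        IDENTIFIER_VIEWS[scheme] = func
        return func
    return register


@identifier_view("handle")
def as_handle(canon_id):
    return f"hdl:123456789/{canon_id}"  # placeholder handle prefix


@identifier_view("purl")
def as_purl(canon_id):
    return f"http://purl.example.org/dspace/{canon_id}"  # placeholder host


@identifier_view("ark")
def as_ark(canon_id):
    return f"ark:/12345/ds{canon_id}"  # placeholder name assigning authority


def expose(canon_id, scheme):
    """Transform the canon identifier into the requested scheme on demand."""
    return IDENTIFIER_VIEWS[scheme](canon_id)


for scheme in sorted(IDENTIFIER_VIEWS):
    print(scheme, expose(42, scheme))
```

Nothing but the canon identifier is stored; adding a new naming scheme means registering one more function.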
Random metaphor: DSpace is a hammer; we haven't even started building
the cabinets yet.
More vocabulary possibilities: the canon path could be called the
"identification path", vs. the alias path(s) which are
"authorizations path(s)" to the object in question.
Another type of record that it should be possible to attach: a
"licensing record". Our container objects (bitstreams up to
communities, or maybe even Instances of DSpace) are starting to look
like convenient abstractions composed of a persistent identifier that
gives us a hook upon which to hang metadata. Which matches perfectly
the DSpace fundamental questions: "Where is it?" and "What is it?"
Random metaphor: individual users produce archipelagoes of knowledge.
Interfaces extended to the service layer by various modules should
probably express themselves (or be able to be expressed) as XML. The
service layer could then search for appropriately named/identified
XSLT transforms and use them for display. I.e. when you do a
"retrieve" action on item "foo", using the authentication path of
collection "bar" and community "bat", the service layer grabs the
appropriate XML, and then applies transforms for "/bat/bar/foo", if
available.
Plugin modules (a la Apache) are just a fabulous design (Robert
Tansley's presentation). It would be EVEN BETTER if the modules were
stackable, i.e. for authorization, several different modules may all
be registered at startup as "interested parties" (providing the
requisite API(s)). Then a call to Item.authorize() is really a set of
calls falling through the stack of registered modules. Each module
has a chance to either handle the call itself or decline the call,
which then falls through to the next module in the stack. Apache does
something almost exactly like this, I believe.
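The fall-through behaviour can be sketched as a list of handlers, each of which either answers or declines. `ModuleStack`, the three sample modules, and the deny-by-default choice when every module declines are assumptions for illustration.

```python
DECLINED = object()  # sentinel: "not my call, try the next module"


class ModuleStack:
    """Hypothetical stack of authorization modules registered at startup."""

    def __init__(self):
        self._handlers = []

    def register(self, handler):
        self._handlers.append(handler)

    def authorize(self, user, item):
        # Each module may handle the call or decline; the first answer wins.
        for handler in self._handlers:
            verdict = handler(user, item)
            if verdict is not DECLINED:
                return verdict
        return False  # assumption: deny when every module declines


def admin_module(user, item):
    return True if user == "admin" else DECLINED


def embargo_module(user, item):
    return False if item.get("embargoed") else DECLINED


def public_module(user, item):
    return True if item.get("public") else DECLINED


stack = ModuleStack()
for module in (admin_module, embargo_module, public_module):
    stack.register(module)

print(stack.authorize("admin", {"embargoed": True}))                  # True
print(stack.authorize("guest", {"embargoed": True, "public": True}))  # False
print(stack.authorize("guest", {"public": True}))                     # True
```

Registration order is the policy here: moving `embargo_module` ahead of `admin_module` would embargo administrators too.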
Toolkits become services through the application of policies and
procedures for use. Don't get confused and start burying policies and
procedures inside your toolkit code.
The question is never, "What can I do with Apache?" The answer is
always, "Yes, I can use Apache to do that." DSpace should strive for
something similar.
It would be nice to consider writing "mod_dspace" for Apache, which
would provide a nice interface layer down into the DSpace services.
This ties Apache's down-growing roots into DSpace's up-welling
springs, and lets us stick a pin in the content and answer all three
questions: "Where is it? What is it? How do I get it?"
