See also:
I've "top-posted" this so that it gets read; the rest of the information below is still relevant, but it's pretty well established. I should note that code from patch #1690912 could be used in this work. --JR
Update (25/06/07): I've made considerable progress on the code, and both internal and external identifiers are supported in a very general fashion. I'm confident that with very little effort, it should be possible to plug in arbitrary identifiers for objects. I've stripped the code examples from the page, as they were out of date and misleading. The code is available as part of the DAO prototype branch of the dspace-sandbox repository on Google Code: http://code.google.com/p/dspace-sandbox/source --JR
There has been some discussion recently about how to abstract the persistent identifier mechanism used by DSpace to something more "pluggable". Core requirements seem to be:
DSpaceObject
Here is a very rough draft of how I think we could handle persistent identifiers.
I've had a go at defining exactly what is required of a persistent identifier, and how this might map to the API outlined below:
NS    VALUE    PROTOCOL  BASE
hdl   1234/56  http      hdl.handle.net
doi   1234/56  http      dx.doi.org
purl  1234/56  http      purl.oclc.org
ln-s  123456   http      ln-s.net
The ln-s.net example is a bit artificial (I don't think anyone would use a public URL shortening system as a basis for persistent identifiers), but it demonstrates that we won't always be able to break a VALUE up into prefix/suffix.
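As a rough illustration of the table above, the four columns can be carried in a small value object that assembles a resolvable URL. This is only a sketch: the class name IdentifierSketch and its methods are hypothetical and are not taken from the org.dspace.content.uri code, and joining the columns as protocol://base/value is an assumption read off the hdl.handle.net example. The value is appended unchanged, i.e. treated as an opaque string.

```java
/** Hypothetical value object mirroring the NS / VALUE / PROTOCOL / BASE columns. */
final class IdentifierSketch {
    private final String ns;
    private final String value;
    private final String protocol;
    private final String base;

    IdentifierSketch(String ns, String value, String protocol, String base) {
        this.ns = ns;
        this.value = value;
        this.protocol = protocol;
        this.base = base;
    }

    /** The namespace column, e.g. "hdl" or "ln-s". */
    String namespace() {
        return ns;
    }

    /** Assumed layout: protocol://base/value. The value is never split or interpreted. */
    String toResolverUrl() {
        return protocol + "://" + base + "/" + value;
    }

    public static void main(String[] args) {
        IdentifierSketch handle = new IdentifierSketch("hdl", "1234/56", "http", "hdl.handle.net");
        IdentifierSketch shortLink = new IdentifierSketch("ln-s", "123456", "http", "ln-s.net");
        System.out.println(handle.toResolverUrl());
        System.out.println(shortLink.toResolverUrl());
    }
}
```

Because the sketch never looks inside the value, the ln-s row (which has no prefix/suffix structure) is handled exactly like the hdl row.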
Ideally, value should be treated as an opaque string. Just URL-encode it when putting it into a URL. Unfortunately, many HTTP implementations are buggy when interpreting escaped "/" (slash) characters, and end-users will be sloppy about transcribing them, so any Web interface has to take slashes in the Persistent IDs into consideration. Maybe we should encode this somehow with a predicate method, PersistentIdentifier.hasSlashes(), so servlet code would know what to expect. LarryStone 22:24, 16 May 2007 (EDT)
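To make the comment above concrete, here is a minimal sketch of percent-encoding an opaque value for use as a single URL path segment, together with the suggested slash predicate. The class OpaqueValueSketch and the method encodeAsPathSegment are hypothetical names; hasSlashes is modelled on the predicate proposed in the comment, not on existing DSpace code. Note that java.net.URLEncoder implements form encoding, so the sketch rewrites "+" to "%20" to make the result usable in a path.

```java
/** Hypothetical helper: treats the identifier value as an opaque string. */
final class OpaqueValueSketch {

    /** Predicate along the lines of the proposed PersistentIdentifier.hasSlashes(). */
    static boolean hasSlashes(String value) {
        return value.indexOf('/') >= 0;
    }

    /**
     * Percent-encodes the value so it fits in one path segment.
     * URLEncoder targets application/x-www-form-urlencoded and turns spaces
     * into "+", so those are rewritten to "%20" afterwards.
     */
    static String encodeAsPathSegment(String value) {
        String formEncoded = java.net.URLEncoder.encode(
                value, java.nio.charset.StandardCharsets.UTF_8);
        return formEncoded.replace("+", "%20");
    }

    public static void main(String[] args) {
        String value = "1234/56";
        System.out.println(hasSlashes(value));          // true
        System.out.println(encodeAsPathSegment(value)); // 1234%2F56
    }
}
```

Servlet code could consult the predicate to decide whether it must expect either the escaped or the unescaped form of a slash in the incoming path.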
Note that Handles (and DOIs) are resolved through the Handle System protocol, which is distinct from HTTP. The hdl.handle.net prefix is just an HTTP proxy server. This is actually an advantage of Handles; anything that depends on HTTP is not very "persistent". Handles and DOIs will outlive the Domain Name system and Internet protocols. LarryStone 22:24, 16 May 2007 (EDT)
org.dspace.content.uri
New package containing all the classes and interfaces for managing identifiers.
org.dspace.content.uri.ExternalIdentifier
I'm not sure about getLocalURL(). I don't really think it should exist, since it isn't a persistent identifier (by definition), and as such has no place here.
Update (25/06/07): I've worked around this problem by making a clean separation between internal and external identifiers. Only internal identifiers may know the "local" URL of the object. External identifiers only know what the "global" URL is. --JR
The getMetadata() methods are there to support systems similar to the Handle System that allow arbitrary name / value pairs to be stored with the persistent identifier record. If such a mechanism isn't supported by a given implementation, we can always just return null.
For now, I have put a simple configuration option in dspace.cfg:
plugin.sequence.org.dspace.content.uri.ExternalIdentifier = \
    org.dspace.content.uri.Handle
This is where the list of supported external identifiers will be kept. Issues arise when objects have more than one option for what to use as an external identifier. A few simple scenarios are:
I propose a simple switch from (e.g.):
http://pomona.hpl.hp.com/dspace/handle/1234/56
to
http://pomona.hpl.hp.com/dspace/resource/hdl/1234/56
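A servlet handling the proposed URL scheme would need to split the path into a namespace and an opaque value. The following is a minimal sketch under stated assumptions: the class ResourcePathSketch is hypothetical, the "/resource/" prefix and the namespace-first layout are read off the example URL above, and the servlet context path ("/dspace") is assumed to have been stripped already.

```java
/** Hypothetical parser for paths of the form /resource/<ns>/<value>. */
final class ResourcePathSketch {
    static final String PREFIX = "/resource/";

    /**
     * Returns a two-element array {ns, value}, or null if the path does not match.
     * Everything after the namespace segment is kept as the value, slashes included.
     */
    static String[] parse(String path) {
        if (path == null || !path.startsWith(PREFIX)) {
            return null;
        }
        String rest = path.substring(PREFIX.length());
        int slash = rest.indexOf('/');
        // Require a non-empty namespace and a non-empty value.
        if (slash <= 0 || slash == rest.length() - 1) {
            return null;
        }
        return new String[] { rest.substring(0, slash), rest.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("/resource/hdl/1234/56");
        System.out.println(parts[0] + " -> " + parts[1]); // hdl -> 1234/56
    }
}
```

Only the first slash after the prefix is treated as a separator, so values that themselves contain slashes (such as Handles) pass through untouched.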
It has been suggested that maybe OpenURL-style URLs may be the way forward in this respect.
All versions of Tomcat before version 6 (I think) have issues with URLs containing escaped slashes ("%2F"). This will cause problems for a lot of people if we want to stick to the spec and escape slashes in our identifiers (though legal for use in the URL, they have semantics attached to them that make their unescaped use in identifiers technically incorrect). The fix was backported from version 6.0.x to 5.5.x in revision 507117 of the Tomcat SVN repo (details here). Looking at the tags in the Tomcat SVN repo, it looks like 5.5.22 was the first release after this fix was applied. I'm currently running 5.5.23 and it is still broken, so I think the default behaviour is still to throw the error, but you can set some configuration to allow the escaping.
Here is the relevant code from tomcat/connectors/trunk/util/java/org/apache/tomcat/util/buf/UDecoder.java:
Boolean.valueOf(System.getProperty("org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH", "false")).booleanValue();
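Since the expression above reads a JVM system property with a default of "false", the flag would presumably be switched on by starting the JVM that runs Tomcat with -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true. The sketch below only mirrors that lookup so the default and the override can be seen side by side; the class EncodedSlashFlagSketch is hypothetical and, unlike a value captured once in a static field, it re-reads the property on every call.

```java
/** Hypothetical mirror of the system property lookup quoted above. */
final class EncodedSlashFlagSketch {
    static final String PROPERTY =
            "org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH";

    /** Returns false unless the property is set to "true" (case-insensitive). */
    static boolean allowEncodedSlash() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "false"));
    }

    public static void main(String[] args) {
        // Prints the effective setting for this JVM.
        System.out.println(PROPERTY + " = " + allowEncodedSlash());
    }
}
```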
See also:
Bitstream
Currently, DSpace only assigns persistent identifiers to Communities, Collections, and Items. Among other things, this means that we currently have the rather bogus method of accessing the files associated with an Item:
http://pomona.hpl.hp.com/dspace/bitstream/1234/56/x/fileName.foo
where x is the (internal) sequence id of the file itself. It would be far better for Bitstreams to have their own URIs, so that they are resolvable in the same way as anything else. After a discussion with Richard Jones, Claudia Jürgen, and Stuart Lewis, it became apparent that this would require all sorts of shenanigans to implement (including versioning for Bitstreams, as well as a withdraw / delete feature similar to Items). This isn't necessarily a Bad Thing (in fact, it's definitely a Good Thing), but it means that the implementation would need to be thoroughly considered, and implementation on top of the current information model would be tricky.
Many issues here:
Some people don't want to use Handles.
Handles are rather deeply ingrained in the system:
Should bitstreams get persistent IDs? At the moment they just get semi-persistent URLs, which include the item Handle and a sequence number; e.g.: https://dspace.mit.edu/bitstream/1721.1/4852/6/toc.pdf. This is probably something that should be configurable.
Should be possible to play with Handle system "properly".
DSpace's integration with the Handle server is rather clunky. DSpace 'pretends' to be a storage backend implementation for the Handle server, and DSpace is actually managing / administering the Handles itself. This basically means a Handle server running this way can really only resolve Handles for things in that DSpace, and do nothing else. This approach was taken for reasons of expediency and efficiency, and to make setting the whole thing up simpler.
This would involve building a Handle client for DSpace that requests new Handles from a Handle service, and updates the Handles using that service with the appropriate URL (and maybe other metadata).
To understand the ID decisions in DSpace right now, one really needs to understand the object model and reasoning behind it.
The basic reason behind using Handles is that they identify an abstract intellectual object and will continue to do so over time, despite the state of the underlying 'physical' object changing – underlying formats, custodian, location, perhaps being stored in multiple locations, access/delivery mechanism and so forth.
Given this, we made the decision to give the following things Handles:
Communities – the original intent behind communities was to group together content and services pertaining to or of interest to a particular group of people; e.g. a department in a university, or perhaps a subject area. Although over time, such things are perhaps more likely to really change than individual intellectual works, it was felt that they were still worth preserving.
Collections – groups of related Items. Some may change a lot (in terms of the intellectual works they contain), many won't.
Items – these are intellectual objects or 'archival atoms'. In my opinion they correspond to 'expressions' in the FRBR model (http://www.ifla.org/VII/s13/frbr/frbr.pdf). The underlying manifestation of these might change over time (e.g. MS Word to PDF to PDF/A) and there might be many manifestations of a particular item available at one point in time. The intention is that if in 50 years' time I come across a Handle, I will want to get to a manifestation appropriate for consumption using contemporary technology rather than the particular version of Microsoft Word or media player etc. that was used to create the object in the first place.
Below that in the DSpace object model, we have Bundles, which were originally intended to correspond to manifestations (e.g. HTML w/GIF image manifestation, PDF manifestation etc), though in practice this isn't how they've been used (mainly because not many people have multiple manifestations in a single Item, since the software isn't particularly mature in dealing with that yet). In any case, the bundle structure of an item is likely to change over time, and individual bundles are only likely to be of use for a limited period of time. Hence the idea that a Handle for a Bundle/manifestation isn't appropriate.
Then we have the bitstream, which is a constituent part of a bundle/manifestation. While there is a possibility that one might want to refer to a specific bitstream, i.e. a set of 1's and 0's that will never ever change, the administrative overhead and costs in assigning every bitstream a Handle were deemed too much. If a single bitstream is likely to be thus cited, it can be 'promoted' so that an item contains nothing but that bitstream and then the Handle for that can be used.
A complication which has led to the situation where use of Handles is rather embedded in the DSpace code is the desire that these persistent identifiers be resolvable, i.e. that the identifiers can just be shoved into a browser's address bar, and bang, something useful for the end user happens. (This is as opposed to having to go to some search tool, say a DSpace instance or Google, and enter the ID there to find the identified object.) Handles in their native, URN-syntax form (e.g. hdl:1721.1/1234) are not understood by browsers. So the URL proxy form of the Handle is displayed, e.g. http://hdl.handle.net/1721.1/1234. In various parts of the system, either form is used, or may be received as part of some request, so some parsing must be done. This really needs to be untangled somehow if the persistent ID part of DSpace is to really be 'pluggable'.
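The parsing mentioned above could be confined to one place. Here is a minimal sketch, assuming only the two forms quoted in the text (the hdl: URN-style form and the http://hdl.handle.net/ proxy form); the class HandleFormsSketch and its method names are hypothetical and do not describe how DSpace currently does this.

```java
/** Hypothetical normaliser for the two Handle forms mentioned in the text. */
final class HandleFormsSketch {
    static final String URN_PREFIX = "hdl:";
    static final String PROXY_PREFIX = "http://hdl.handle.net/";

    /** Strips either known prefix; returns null if the input uses neither form. */
    static String toBareHandle(String input) {
        if (input == null) {
            return null;
        }
        String s = input.trim();
        if (s.startsWith(URN_PREFIX)) {
            return s.substring(URN_PREFIX.length());
        }
        if (s.startsWith(PROXY_PREFIX)) {
            return s.substring(PROXY_PREFIX.length());
        }
        return null;
    }

    /** Native form, e.g. hdl:1721.1/1234. */
    static String toUrnForm(String bareHandle) {
        return URN_PREFIX + bareHandle;
    }

    /** Browser-resolvable proxy form, e.g. http://hdl.handle.net/1721.1/1234. */
    static String toProxyUrl(String bareHandle) {
        return PROXY_PREFIX + bareHandle;
    }

    public static void main(String[] args) {
        String bare = toBareHandle("http://hdl.handle.net/1721.1/1234");
        System.out.println(bare);            // 1721.1/1234
        System.out.println(toUrnForm(bare)); // hdl:1721.1/1234
    }
}
```

With a single normalised form held internally, the rest of the code would not need to know which form arrived in a request, which is one step towards making the identifier scheme swappable.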
The above only really considers the use case of external references, e.g. in journal articles. It does not speak to what I consider separate issues:
Although in some cases it may be possible and desirable for the persistent identifiers used for a long-term preservation purpose to be the same as those used for 1/ and/or 2/, there is certainly no requirement for this, and it should not be assumed.
At the moment for 1/ DSpace uses database primary keys to manage the relationships between these objects in the system. These are rather volatile – any rebuilding of the database, importing/exporting, or software changes could well mean these database keys change and so they are totally inappropriate to use outside of the system itself. However they are of course very efficient for querying in the usual SQL SELECT/JOIN sort of way.
Given this, to facilitate 2/, DSpace currently assigns 'semi-persistent' IDs to bitstreams, of the form: http://host.name/bitstream/(item_handle)/(sequence_number)/filename.ext
I say semi-persistent, since although this is more resilient than database primary keys, it is still tied to the host name, and thus if the item changes custody, is moved to another server etc., the above URL would no longer work.
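To show what the semi-persistent form encodes, here is a minimal sketch that takes the path part of such a URL apart. It assumes the layout given above, that the filename contains no slash, and that the host name and any context path have already been removed; the class BitstreamPathSketch is hypothetical and is not DSpace code.

```java
/** Hypothetical parser for /bitstream/<item_handle>/<sequence_number>/<filename>. */
final class BitstreamPathSketch {
    static final String PREFIX = "/bitstream/";

    final String itemHandle;
    final int sequenceNumber;
    final String filename;

    private BitstreamPathSketch(String itemHandle, int sequenceNumber, String filename) {
        this.itemHandle = itemHandle;
        this.sequenceNumber = sequenceNumber;
        this.filename = filename;
    }

    /** Returns the parsed parts, or null if the path does not fit the layout. */
    static BitstreamPathSketch parse(String path) {
        if (path == null || !path.startsWith(PREFIX)) {
            return null;
        }
        String rest = path.substring(PREFIX.length());
        // Work from the right: the last segment is the filename, the one before
        // it the sequence number, and whatever remains is the item Handle.
        int fileSlash = rest.lastIndexOf('/');
        if (fileSlash <= 0) {
            return null;
        }
        int seqSlash = rest.lastIndexOf('/', fileSlash - 1);
        if (seqSlash <= 0) {
            return null;
        }
        try {
            int sequence = Integer.parseInt(rest.substring(seqSlash + 1, fileSlash));
            return new BitstreamPathSketch(
                    rest.substring(0, seqSlash), sequence, rest.substring(fileSlash + 1));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        BitstreamPathSketch p = parse("/bitstream/1721.1/4852/6/toc.pdf");
        System.out.println(p.itemHandle + " #" + p.sequenceNumber + " " + p.filename);
    }
}
```

Nothing in the parsed result identifies the bitstream independently of the server that issued the URL, which is the sense in which the ID is only semi-persistent.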