...
The org.dspace.core package provides some basic classes that are used throughout the DSpace code.
...
The configuration manager service is responsible for reading the main dspace.cfg properties file, managing the 'template' configuration files for other applications such as Apache, and for obtaining the text for e-mail messages.
...
When editing configuration files for applications that DSpace uses, such as Apache Tomcat, you may want to edit the copy in [dspace-source] and then run ant update or ant overwrite_configs rather than editing the 'live' version directly! This will ensure you have a backup copy of your modified configuration files, so that they are not accidentally overwritten in the future.
The ConfigurationService class can also be invoked as a command line tool:
[dspace]/bin/dspace dsprop property.name
This writes the value of property.name from dspace.cfg to the standard output, so that shell scripts can access the DSpace configuration. If the property has no value, nothing is written. For many more details on configuration in DSpace, see Configuration Reference.
This class contains constants that are used to represent types of object and actions in the database. For example, authorization policies can relate to objects of different types, so the resourcepolicy table has columns resource_id, which is the internal ID of the object, and resource_type_id, which indicates whether the object is an item, collection, bitstream etc. The value of resource_type_id is taken from the Constants class, for example Constants.ITEM.
...
The primary reason for this is for determining authorization. In order to know whether an e-person may create an object, the system must know which container the object is to be added to. It makes no sense to create a collection outside of a community, and the authorization system does not have a policy for that.
Items are first created in the form of an implementation of InProgressSubmission. An InProgressSubmission represents an item under construction; once it is complete, it is installed into the main archive and added to the relevant collection by the InstallItem class. The org.dspace.content package provides an implementation of InProgressSubmission called WorkspaceItem; this is a simple implementation that contains some fields used by the Web submission UI. The org.dspace.workflow package also contains an implementation called WorkflowItem, which represents a submission undergoing a workflow process.
...
Instantiating a Bundle object causes the appropriate Bitstream objects (and hence BitstreamFormats) to be instantiated.
Instantiating an Item object causes the appropriate Bundle objects (etc.), and hence BitstreamFormats, to be instantiated. All the Dublin Core metadata associated with that item are also loaded into memory.
...
Code Block |
---|
OutputStream destination = ...;
PackageParameters params = ...;
DSpaceObject dso = ...;
// the plugin is requested by interface and name, then cast to that interface
PackageDisseminator dip = (PackageDisseminator) PluginManager
    .getNamedPlugin(PackageDisseminator.class, packageType);
dip.disseminate(context, dso, params, destination); |
...
Note |
---|
In DSpace 6, the old "PluginManager" was replaced by the PluginService. |
The PluginService is a very simple component container. It creates and organizes components (plugins), and helps select a plugin in the cases where there are many possible choices. It also gives some limited control over the life cycle of a plugin.
...
...
The Plugin Service supports three different patterns of usage:
- Singleton plugins: see the getSinglePlugin() method.
- Sequence plugins: see the getPluginSequence() method.
- Named plugins: see the getNamedPlugin() method and the getAllPluginNames() method.
...
This XSLT-crosswalk plugin has its own configuration that maps a Plugin Name to a stylesheet – it has to, since of course the Plugin Manager doesn't know anything about stylesheets. It becomes a self-named plugin, so that it reads its configuration data, gets the list of names to which it can respond, and passes those on to the Plugin Manager.
When the Plugin Service creates an instance of the XSLT-crosswalk, it records the Plugin Name that was responsible for that instance. The plugin can look at that Name later in order to configure itself correctly for the Name that created it. This mechanism is all part of the SelfNamedPlugin class, which every self-named plugin extends.
...
The most common thing you will do with the Plugin Service is obtain an instance of a plugin. To request a plugin, you must always specify the plugin interface you want. You will also supply a name when asking for a named plugin.
...
See the getSinglePlugin(), getPluginSequence(), getNamedPlugin() methods.
When PluginService fulfills a request for a plugin, it checks whether the implementation class is reusable; if so, it creates one instance of that class and returns it for every subsequent request for that interface and name. If it is not reusable, a new instance is always created.
For reasons that will become clear later, the manager actually caches a separate instance of an implementation class for each name under which it can be requested.
You can ask the PluginManager to forget about (decache) a plugin instance, by releasing it. See the PluginManager.releasePlugin() method. The manager will drop its reference to the plugin so the garbage collector can reclaim it. The next time that plugin/name combination is requested, it will create a new instance.
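The reuse and release behaviour described above can be pictured with a small cache keyed by interface and name. This is only a minimal sketch under stated assumptions: the class `PluginCacheSketch` and its methods are hypothetical names invented for illustration, and the real service instantiates classes from the configuration rather than from a supplied factory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical illustration of per-name plugin caching; not the DSpace implementation. */
class PluginCacheSketch {
    // One cached instance per (interface, name) pair.
    private final Map<String, Object> cache = new HashMap<>();

    private static String key(Class<?> iface, String name) {
        return iface.getName() + "|" + name;
    }

    /** Returns the cached instance for a reusable plugin, otherwise a new instance every time. */
    Object get(Class<?> iface, String name, boolean reusable, Supplier<Object> factory) {
        if (!reusable) {
            return factory.get();
        }
        return cache.computeIfAbsent(key(iface, name), k -> factory.get());
    }

    /** Forgets every cache entry that refers to this exact instance. */
    void release(Object plugin) {
        cache.values().removeIf(cached -> cached == plugin);
    }

    public static void main(String[] args) {
        PluginCacheSketch cache = new PluginCacheSketch();
        Object first = cache.get(Runnable.class, "PDF", true, Object::new);
        Object again = cache.get(Runnable.class, "PDF", true, Object::new);
        Object other = cache.get(Runnable.class, "Adobe PDF", true, Object::new);
        System.out.println("same name reused: " + (first == again));
        System.out.println("other name shares instance: " + (first == other));
        cache.release(first);
        Object fresh = cache.get(Runnable.class, "PDF", true, Object::new);
        System.out.println("reused after release: " + (first == fresh));
    }
}
```

Keying the cache by name as well as interface is what lets an instance configure itself for the name under which it was requested.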
The PluginService can list all the names of the Named Plugins which implement an interface. You may need this, for example, to implement a menu in a user interface that presents a choice among all possible plugins. See the getAllPluginNames() method.
Note that it only returns the plugin name, so if you need a more sophisticated or meaningful "label" (i.e. a key into the I18N message catalog) then you should add a method to the plugin itself to return that.
Note: The PluginService refers to interfaces and classes internally only by their names whenever possible, to avoid loading classes until absolutely necessary (i.e. to create an instance). As you'll see below, self-named classes still have to be loaded to query them for names, but for the most part it can avoid loading classes. This saves a lot of time at start-up and keeps the JVM memory footprint down, too. As the Plugin Service gets used for more classes, this will become a greater concern.
The only downside of "on-demand" loading is that errors in the configuration don't get discovered right away. The solution is to call the checkConfiguration() method after making any changes to the configuration.
...
The LegacyPluginServiceImpl class is the default PluginService implementation. While it is possible to implement your own version of PluginService, no other implementations are provided with DSpace.
Here are the public methods, followed by explanations:
...
Object getSinglePlugin(Class interfaceClass)
- Returns an instance of the singleton (single) plugin implementing the given interface. There must be exactly one single plugin configured for this interface, otherwise a PluginConfigurationError is thrown. Note that this is the only "get plugin" method which throws an exception. It is typically used at initialization time to set up a permanent part of the system so any failure is fatal. See the plugin.single configuration key for configuration details.
Object[] getPluginSequence(Class interfaceClass)
- Returns instances of all plugins that implement the interface interfaceClass, in an array. Returns an empty array if there are no matching plugins. The order of the plugins in the array is the same as the order of their class names in the configuration's value field. See the plugin.sequence configuration key for configuration details.
Object getNamedPlugin(Class interfaceClass, String name)
- Returns an instance of a plugin that implements the interface interfaceClass and is bound to a name matching name. If there is no matching plugin, it returns null. The names are matched by String.equals(). See the plugin.named and plugin.selfnamed configuration keys for configuration details.
Code Block |
---|
static void releasePlugin(Object plugin); |
Tells the Plugin Manager to let go of any references to a reusable plugin, to prevent it from being given out again and to allow the object to be garbage-collected. Call this when a plugin instance must be taken out of circulation.
String[] getAllPluginNames(Class interfaceClass)
- Returns all of the names under which a named plugin implementing the interface interfaceClass can be requested (with getNamedPlugin()). The array is empty if there are no matches. Use this to populate a menu of plugins for interactive selection, or to document what the possible choices are. The names are NOT returned in any predictable order, so you may wish to sort them first. Note: Since a plugin may be bound to more than one name, the list of names this returns does not represent the list of plugins. To get the list of unique implementation classes corresponding to the names, you might have to eliminate duplicates (i.e. create a Set of classes).
Code Block |
---|
static void checkConfiguration(); |
Validates the keys in the DSpace ConfigurationManager pertaining to the Plugin Manager and reports any errors by logging them. This is intended to be used interactively by a DSpace administrator, to check the configuration file after modifying it. See the section about validating configuration for details.
A named plugin implementation must extend this class if it wants to supply its own Plugin Name(s). See Self-Named Plugins for why this is sometimes necessary.
...
All of the Plugin Service's configuration comes from the DSpace Configuration Service (see Configuration Reference). You can configure these characteristics of each plugin:
...
Plugins Named in the Configuration
A named plugin which gets its name(s) from the configuration is listed in this kind of entry:
plugin.named.interface = classname = name [ , name.. ] [ classname = name.. ]
The syntax of the configuration value is: classname, followed by an equal-sign and then at least one plugin name. Bind more names to the same implementation class by adding them here, separated by commas. Names may include any character other than comma (,) and equal-sign (=).
For example, this entry creates one plugin with the names GIF, JPEG, and image/png, and another with the name TeX:
Code Block |
---|
plugin.named.org.dspace.app.mediafilter.MediaFilter = \
    org.dspace.app.mediafilter.JPEGFilter = GIF, JPEG, image/png \
    org.dspace.app.mediafilter.TeXFilter = TeX |
This example shows a plugin name with an embedded whitespace character. Since comma (,) is the separator character between plugin names, spaces are legal (between words of a name; leading and trailing spaces are ignored). This plugin is bound to the names "Adobe PDF", "PDF", and "Portable Document Format".
Code Block |
---|
plugin.named.org.dspace.app.mediafilter.MediaFilter = \
    org.dspace.app.mediafilter.TeXFilter = TeX \
    org.dspace.app.mediafilter.PDFFilter = Adobe PDF, PDF, Portable Document Format |
NOTE: Since there can only be one key with plugin.named. followed by the interface name in the configuration, all of the plugin implementations must be configured in that entry.
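To make the value syntax concrete, the following sketch splits a plugin.named value into name-to-class bindings. It is an illustration under stated assumptions, not the parser DSpace itself uses: the class `NamedPluginConfigSketch`, its `parse` method, and the regular expression are hypothetical, and the input is assumed to be the value after the properties loader has already joined the backslash-continued lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for a plugin.named value; not the DSpace implementation. */
class NamedPluginConfigSketch {
    // Assumption: a class name is a run of word characters, '$' and '.' directly followed by '='.
    private static final Pattern CLASS_EQUALS = Pattern.compile("([\\w$.]+)\\s*=");

    /** Maps each plugin name to the implementation class it is bound to. */
    static Map<String, String> parse(String value) {
        Map<String, String> nameToClass = new LinkedHashMap<>();
        Matcher m = CLASS_EQUALS.matcher(value);
        String currentClass = null;
        int namesStart = 0;
        while (m.find()) {
            if (currentClass != null) {
                // Everything between two "classname =" markers is the name list of the first class.
                bind(nameToClass, currentClass, value.substring(namesStart, m.start()));
            }
            currentClass = m.group(1);
            namesStart = m.end();
        }
        if (currentClass != null) {
            bind(nameToClass, currentClass, value.substring(namesStart));
        }
        return nameToClass;
    }

    private static void bind(Map<String, String> map, String className, String names) {
        for (String name : names.split(",")) {
            String trimmed = name.trim(); // leading and trailing spaces are ignored
            if (!trimmed.isEmpty()) {
                map.put(trimmed, className);
            }
        }
    }

    public static void main(String[] args) {
        String value = "org.dspace.app.mediafilter.TeXFilter = TeX "
                + "org.dspace.app.mediafilter.PDFFilter = Adobe PDF, PDF, Portable Document Format";
        parse(value).forEach((name, cls) -> System.out.println(name + " -> " + cls));
    }
}
```

Because names are separated only by commas, a name such as "Adobe PDF" keeps its embedded space, while the surrounding whitespace is trimmed.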
Self-Named Plugins
Since a self-named plugin supplies its own names through a static method call, the configuration only has to include its interface and classname:
plugin.selfnamed.interface = classname [ , classname.. ]
The following example first demonstrates how the plugin class, XsltDisseminationCrosswalk, is configured to implement its own names "MODS" and "DublinCore". These come from the keys starting with the prefix crosswalk.dissemination.stylesheet. (including the trailing dot). The value is a stylesheet file. The class is then configured as a self-named plugin:
Code Block |
---|
crosswalk.dissemination.stylesheet.DublinCore = xwalk/TESTDIM-2-DC_copy.xsl
crosswalk.dissemination.stylesheet.MODS = xwalk/mods.xsl
plugin.selfnamed.org.dspace.content.metadata.DisseminationCrosswalk = \
    org.dspace.content.metadata.MODSDisseminationCrosswalk, \
    org.dspace.content.metadata.XsltDisseminationCrosswalk |
NOTE: Since there can only be one key with plugin.selfnamed. followed by the interface name in the configuration, all of the plugin implementations must be configured in that entry. The MODSDisseminationCrosswalk class is only shown to illustrate this point.
Plugins are assumed to be reusable by default, so you only need to configure the ones which you would prefer not to be reusable. The format is as follows:
Code Block |
---|
plugin.reusable.classname = ( true | false ) |
For example, this marks the PDF plugin from the example above as non-reusable:
Code Block |
---|
plugin.reusable.org.dspace.app.mediafilter.PDFFilter = false |
The Plugin Manager is very sensitive to mistakes in the DSpace configuration. Subtle errors can have unexpected consequences that are hard to detect: for example, if there are two "plugin.single" entries for the same interface, one of them will be silently ignored.
To validate the Plugin Manager configuration, call the PluginManager.checkConfiguration() method. It looks for the following mistakes:
Eventually, someone should develop a general configuration-file sanity checker for DSpace, which would just call PluginManager.checkConfiguration().
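The "silently ignored" duplicate described above is the kind of mistake a simple pre-check can catch before the configuration is loaded. The sketch below scans configuration text for plugin keys that are defined more than once. It is a hedged illustration only: the class `DuplicatePluginKeySketch` is hypothetical, it is not what checkConfiguration() does internally, and it handles only the common `key = value` form with backslash continuations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical duplicate-key scan for plugin.* entries; a simplification of the properties format. */
class DuplicatePluginKeySketch {
    /** Returns the plugin.* keys that appear more than once, in the order they are first repeated. */
    static List<String> duplicateKeys(String configText) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        boolean continued = false; // true while inside a backslash-continued value
        for (String raw : configText.split("\\R")) {
            String line = raw.trim();
            if (continued) {
                // This line belongs to the previous key's value, so it cannot define a key.
                continued = line.endsWith("\\");
                continue;
            }
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("!")) {
                continue; // blank line or comment
            }
            continued = line.endsWith("\\");
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String key = line.substring(0, eq).trim();
            if (key.startsWith("plugin.") && !seen.add(key) && !duplicates.contains(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String config = String.join("\n",
            "plugin.single.org.dspace.checker.BitstreamDispatcher = org.dspace.checker.SimpleDispatcher",
            "# the second entry below uses a made-up class name",
            "plugin.single.org.dspace.checker.BitstreamDispatcher = example.OtherDispatcher");
        System.out.println(duplicateKeys(config));
    }
}
```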
Here are some usage examples to illustrate how the Plugin Service works.
The MediaFilterService implementation relies heavily on the Plugin Service. The MediaFilter classes become plugins named in the configuration. Refer to the Configuration Reference for further details.
This shows how to configure and access a single anonymous plugin, such as the BitstreamDispatcher plugin:
Configuration:
plugin.single.org.dspace.checker.BitstreamDispatcher=org.dspace.checker.SimpleDispatcher
The following code fragment shows how dispatcher, the service object, is initialized and used:
Code Block |
---|
BitstreamDispatcher dispatcher =
    (BitstreamDispatcher) PluginManager.getSinglePlugin(BitstreamDispatcher.class);
int id = dispatcher.next();
while (id != BitstreamDispatcher.SENTINEL)
{
/*
do some processing here
*/
id = dispatcher.next();
} |
This crosswalk plugin acts like many different plugins since it is configured with different XSL translation stylesheets. Since it already gets each of its stylesheets out of the DSpace configuration, it makes sense to have the plugin give PluginService the names to which it answers instead of forcing someone to configure those names in two places (and try to keep them synchronized).
NOTE: Remember how getPlugin() caches a separate instance of an implementation class for every name bound to it? This is why: the instance can look at the name under which it was invoked and configure itself specifically for that name. Since the instance for each name might be different, the Plugin Manager has to cache a separate instance for each name.
Here is the configuration file listing both the plugin's own configuration and the PluginService config line:
Code Block |
---|
crosswalk.dissemination.stylesheet.DublinCore = xwalk/TESTDIM-2-DC_copy.xsl
crosswalk.dissemination.stylesheet.MODS = xwalk/mods.xsl
plugin.selfnamed.org.dspace.content.metadata.DisseminationCrosswalk = \
org.dspace.content.metadata.XsltDisseminationCrosswalk |
This look into the implementation shows how it finds configuration entries to populate the array of plugin names returned by the getPluginNames() method. Also note, in the getStylesheet() method, how it uses the plugin name that created the current instance (returned by getPluginInstanceName()) to find the correct stylesheet.
Code Block |
---|
public class XsltDisseminationCrosswalk extends SelfNamedPlugin
{
    ....
    // static, because the static getPluginNames() method below reads it
    private static final String prefix =
        "crosswalk.dissemination.stylesheet.";
    ....
    // collect every configured name that follows the stylesheet prefix
    public static String[] getPluginNames()
    {
        List<String> aliasList = new ArrayList<String>();
        Enumeration<?> pe = ConfigurationManager.propertyNames();
        while (pe.hasMoreElements())
        {
            String key = (String)pe.nextElement();
            if (key.startsWith(prefix))
                aliasList.add(key.substring(prefix.length()));
        }
        return aliasList.toArray(new String[aliasList.size()]);
    }
    // get the crosswalk stylesheet for an instance of the plugin:
    private String getStylesheet()
    {
        return ConfigurationManager.getProperty(prefix +
            getPluginInstanceName());
    }
} |
The Stackable Authentication mechanism needs to know all of the plugins configured for the interface, in the order of configuration, since order is significant. It gets a Sequence Plugin from the Plugin Manager. Refer to the Configuration Section on Stackable Authentication for further details.
The primary classes are:
org.dspace.content.WorkspaceItem | contains an Item before it enters a workflow |
org.dspace.workflow.WorkflowItem | contains an Item while in a workflow |
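org.dspace.workflow.WorkflowService | responds to events, manages the WorkflowItem states. There are two implementations, the traditional, default workflow (described below) and Configurable Workflow. |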
org.dspace.content.Collection | contains List of defined workflow steps |
org.dspace.eperson.Group | people who can perform workflow tasks are defined in EPerson Groups |
org.dspace.core.Email | used to email messages to Group members and submitters |
The default workflow system models the states of an Item in a state machine with 5 states (SUBMIT, STEP_1, STEP_2, STEP_3, ARCHIVE). STEP_1, STEP_2 and STEP_3 are the three optional steps where the item can be viewed and corrected by different groups of people. Actually, it's more like 8 states, with STEP_1_POOL, STEP_2_POOL, and STEP_3_POOL. These pooled states are when items are waiting to enter the primary states. Optionally, you can also choose to enable the enhanced, Configurable Workflow, if you wish to have more control over your workflow steps/states. (Note: the remainder of this description relates to the traditional, default workflow. For more information on the Configurable Workflow option, visit Configurable Workflow.)
The WorkflowService is invoked by events. While an Item is being submitted, it is held by a WorkspaceItem. Calling the start() method in the WorkflowService converts a WorkspaceItem to a WorkflowItem, and begins processing the WorkflowItem's state. Since all three steps of the workflow are optional, if no steps are defined, then the Item is simply archived.
...
If a step is defined in a Collection's workflow, then the WorkflowItem's state is set to that step_POOL. This pooled state is the WorkflowItem waiting for an EPerson in that group to claim the step's task for that WorkflowItem. The WorkflowService emails the members of that Group notifying them that there is a task to be performed (the text is defined in config/emails), and when an EPerson goes to their 'My DSpace' page to claim the task, the WorkflowService is invoked with a claim event, and the WorkflowItem's state advances from STEP_x_POOL to STEP_x (where x is the corresponding step). The EPerson can also generate an 'unclaim' event, returning the WorkflowItem to the STEP_x_POOL.
Another event the WorkflowService handles is advance(), which advances the WorkflowItem to the next state. If there are no further states, then the WorkflowItem is removed, and the Item is then archived. An EPerson performing one of the tasks can reject the Item, which stops the workflow, rebuilds the WorkspaceItem for it and sends a rejection note to the submitter. More drastically, an abort() event is generated by the admin tools to cancel a workflow outright.
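The pooled and claimed states described above can be pictured as a small state machine. The sketch below is only an illustration under stated assumptions: the class `WorkflowSketch` and its fields are hypothetical, the state names follow the text, and reject and abort handling, e-mail notification and persistence are left out.

```java
/** Hypothetical model of the default workflow states; not the DSpace WorkflowService. */
class WorkflowSketch {
    enum State { SUBMIT, STEP_1_POOL, STEP_1, STEP_2_POOL, STEP_2, STEP_3_POOL, STEP_3, ARCHIVE }

    private static final State[] POOLS = { State.STEP_1_POOL, State.STEP_2_POOL, State.STEP_3_POOL };
    private static final State[] STEPS = { State.STEP_1, State.STEP_2, State.STEP_3 };

    private final boolean[] defined; // defined[i] is true if step i+1 is defined for the collection
    private int current = -1;        // index of the step being processed, -1 before start()
    private State state = State.SUBMIT;

    WorkflowSketch(boolean step1, boolean step2, boolean step3) {
        this.defined = new boolean[] { step1, step2, step3 };
    }

    State state() { return state; }

    /** start(): enter the pool of the first defined step, or archive if none is defined. */
    void start() {
        require(state == State.SUBMIT, "start");
        enterNextDefinedStep();
    }

    /** claim: STEP_x_POOL becomes STEP_x. */
    void claim() { require(isAt(POOLS), "claim"); state = STEPS[current]; }

    /** unclaim: STEP_x goes back to STEP_x_POOL. */
    void unclaim() { require(isAt(STEPS), "unclaim"); state = POOLS[current]; }

    /** advance(): move to the next defined step's pool, or archive when no steps remain. */
    void advance() { require(isAt(STEPS), "advance"); enterNextDefinedStep(); }

    private boolean isAt(State[] states) {
        return current >= 0 && current < states.length && state == states[current];
    }

    private void enterNextDefinedStep() {
        for (int i = current + 1; i < defined.length; i++) {
            if (defined[i]) {
                current = i;
                state = POOLS[i];
                return;
            }
        }
        current = defined.length;
        state = State.ARCHIVE;
    }

    private void require(boolean ok, String event) {
        if (!ok) {
            throw new IllegalStateException(event + " is not valid in state " + state);
        }
    }
}
```

With only step 2 defined, start() leads to STEP_2_POOL, claim() to STEP_2, and advance() to ARCHIVE; with no steps defined, start() archives immediately.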
...
The primary classes are:
org.dspace.authorize.AuthorizeService | does all authorization, checking policies against Groups |
org.dspace.authorize.ResourcePolicy | defines all allowable actions for an object |
org.dspace.eperson.Group | all policies are defined in terms of EPerson Groups |
...
Three new attributes have been introduced in the ResourcePolicy class as part of the DSpace 3.0 Embargo Contribution:
...
Code Block |
---|
policy_id: 4847
resource_type_id: 2
resource_id: 89
action_id: 0
eperson_id:
epersongroup_id: 0
start_date: 2013-01-01
end_date:
rpname: Embargo Policy
rpdescription: Embargoed through 2012
rptype: TYPE_CUSTOM |
The AuthorizeService class's authorizeAction(Context, object, action) method is the primary source of all authorization in the system. It gets a list of all of the ResourcePolicies in the system that match the object and action. It then iterates through the policies, extracting the EPerson Group from each policy, and checks to see if the EPersonID from the Context is a member of any of those groups. If all of the policies are queried and no permission is found, then an AuthorizeException is thrown. An authorizeActionBoolean() method is also supplied that returns a boolean for applications that require higher performance.
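The policy check described above can be sketched in a few lines. Everything here is a simplified stand-in: the classes `AuthorizeSketch`, `Policy` and the nested `AuthorizeException`, as well as the method names, are hypothetical, policies are held in a plain list instead of the database, and group membership is passed in as a set of group IDs.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of checking policies against group membership. */
class AuthorizeSketch {
    /** Simplified policy row: a group may perform an action on one object. */
    static final class Policy {
        final int resourceTypeId;
        final int resourceId;
        final int actionId;
        final int groupId;

        Policy(int resourceTypeId, int resourceId, int actionId, int groupId) {
            this.resourceTypeId = resourceTypeId;
            this.resourceId = resourceId;
            this.actionId = actionId;
            this.groupId = groupId;
        }
    }

    /** Local stand-in for the exception named in the text. */
    static final class AuthorizeException extends Exception {
        AuthorizeException(String message) {
            super(message);
        }
    }

    /** Boolean variant: true if any matching policy names a group the user belongs to. */
    static boolean isAuthorized(List<Policy> policies, Set<Integer> userGroups,
                                int resourceTypeId, int resourceId, int actionId) {
        for (Policy p : policies) {
            boolean matchesObjectAndAction = p.resourceTypeId == resourceTypeId
                    && p.resourceId == resourceId
                    && p.actionId == actionId;
            if (matchesObjectAndAction && userGroups.contains(p.groupId)) {
                return true;
            }
        }
        return false;
    }

    /** Throwing variant: fails when no policy grants the action. */
    static void authorize(List<Policy> policies, Set<Integer> userGroups,
                          int resourceTypeId, int resourceId, int actionId)
            throws AuthorizeException {
        if (!isAuthorized(policies, userGroups, resourceTypeId, resourceId, actionId)) {
            throw new AuthorizeException("action " + actionId + " denied on object "
                    + resourceTypeId + "/" + resourceId);
        }
    }
}
```

Callers that only need a yes/no answer can use the boolean variant and avoid constructing an exception on the failure path.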
...
The org.dspace.handle package contains two classes: HandleService is used to create and look up Handles, and HandlePlugin is used to expose and resolve DSpace Handles for the outside world via the CNRI Handle Server code.
...
The handle table maps these Handles to resource type/resource ID pairs, where resource type is a value from org.dspace.core.Constants and resource ID is the internal identifier (database primary key) of the object. This allows Handles to be assigned to any type of object in the system, though as explained in the functional overview, only communities, collections and items are presently assigned Handles.
HandleService contains methods for:
...
Note that since the Handle server runs as a separate JVM to the DSpace Web applications, it uses a separate 'Log4J' configuration, since Log4J does not support multiple JVMs using the same daily rolling logs. This alternative configuration is located at [dspace]/config/log4j-handle-plugin.properties. The [dspace]/bin/start-handle-server script passes in the appropriate command line parameters so that the Handle server uses this configuration.
Info |
---|
In addition to Handles, DSpace also provides basic support for DOIs (Digital Object Identifiers). For more information visit DOI Digital Object Identifier. |
DSpace's search code is a simple, configurable API which currently wraps the Lucene search engine and Apache Solr. See Discovery for more information on how to customize the default search settings, etc.
org.dspace.search.DSIndexer is the indexing class. It contains indexContent() which, if passed an Item, Community, or Collection, will add that content's fields to the index. The methods unIndexContent() and reIndexContent() remove and update content's index information. The DSIndexer class also has a main() method which will rebuild the index completely. This can be invoked by the dspace/bin/index-init (complete rebuild) or dspace/bin/index-update (update) script. The intent was for the main() method to be invoked on a regular basis to avoid index corruption, but we have had no problem with that so far.
Which fields are indexed by DSIndexer? These fields are defined in dspace.cfg in the section "Fields to index for search" as name-value pairs. The name must be unique and of the form search.index.i (i is an arbitrary positive number). The value on the right side starts with an index name, which must also be unique and can be referenced in the search form (e.g. title, author). After the colon comes the metadata element which is indexed. '*' is a wildcard which includes all sub elements. For example:
Code Block |
---|
search.index.4 = keyword:dc.subject.* |
tells the indexer to create a keyword index containing all dc.subject element values. Since the wildcard ('*') character was used in place of a qualifier, all subject metadata fields will be indexed (e.g. dc.subject.other, dc.subject.lcsh, etc)
By default, the fields shown in the Indexed Fields section below are indexed. These are hardcoded in the DSIndexer class. If any search.index.i items are specified in dspace.cfg these are used rather than these hardcoded fields.
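The value format of a search.index.i entry can be illustrated with a short parser. This is a sketch under stated assumptions rather than the DSIndexer code: the class `SearchIndexConfigSketch` and its members are hypothetical, and the wildcard is assumed to match any qualifier, including an absent one.

```java
import java.util.Objects;

/** Hypothetical parser for a value such as "keyword:dc.subject.*". */
class SearchIndexConfigSketch {
    final String indexName;
    final String schema;
    final String element;
    final String qualifier; // "*" for the wildcard, null when no qualifier is given

    private SearchIndexConfigSketch(String indexName, String schema, String element, String qualifier) {
        this.indexName = indexName;
        this.schema = schema;
        this.element = element;
        this.qualifier = qualifier;
    }

    /** Splits "indexName:schema.element[.qualifier]" into its parts. */
    static SearchIndexConfigSketch parse(String value) {
        int colon = value.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in " + value);
        }
        String[] field = value.substring(colon + 1).trim().split("\\.");
        if (field.length < 2 || field.length > 3) {
            throw new IllegalArgumentException("expected schema.element[.qualifier] in " + value);
        }
        String qualifier = field.length == 3 ? field[2] : null;
        return new SearchIndexConfigSketch(value.substring(0, colon).trim(), field[0], field[1], qualifier);
    }

    /** True if a metadata field belongs in this index. */
    boolean matches(String fieldSchema, String fieldElement, String fieldQualifier) {
        if (!schema.equals(fieldSchema) || !element.equals(fieldElement)) {
            return false;
        }
        // Assumption: the wildcard covers every qualifier as well as the unqualified element.
        return "*".equals(qualifier) || Objects.equals(qualifier, fieldQualifier);
    }

    public static void main(String[] args) {
        SearchIndexConfigSketch keyword = parse("keyword:dc.subject.*");
        System.out.println(keyword.indexName + " indexes dc.subject.lcsh: "
                + keyword.matches("dc", "subject", "lcsh"));
    }
}
```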
The query class DSQuery contains three flavors of doQuery() methods: one searches the DSpace site, and the other two restrict searches to Collections and Communities. The results from a query are returned as three lists of handles; each list represents a type of result. One list is a list of Items with matches, and the other two are Collections and Communities that match. This separation allows the UI to handle the types of results gracefully without resolving all of the handles first to see what kind of content the handle points to. The DSQuery class also has a main() method for debugging via command-line searches.
Currently we have our own Analyzer and Tokenizer classes (DSAnalyzer and DSTokenizer) to customize our indexing. They invoke the stemming and stop word features within Lucene. We create an IndexReader for each query, which we now realize isn't the most efficient use of resources - we seem to run out of filehandles on really heavy loads. (A wildcard query can open many filehandles!) Since Lucene is thread-safe, a better future implementation would be to have a single Lucene IndexReader shared by all queries, which is invalidated and re-opened when the index changes. Future API growth could include relevance scores (Lucene generates them, but we ignore them), and abstractions for more advanced search concepts such as booleans.
The DSIndexer class shipped with DSpace indexes the Dublin Core metadata in the following way:
Search Field | Taken from Dublin Core Fields |
Authors | contributor.*, creator.*, description.statementofresponsibility |
Titles | title.* |
Keywords | subject.* |
Abstracts | description.abstract, description.tableofcontents |
Series | relation.ispartofseries |
MIME types | format.mimetype |
Sponsors | description.sponsorship |
Identifiers | identifier.* |
The org.dspace.search package also provides a 'harvesting' API. This allows callers to extract information about items modified within a particular timeframe, and within a particular scope (all of DSpace, or a community or collection.) Currently this is used by the Open Archives Initiative metadata harvesting protocol application, and the e-mail subscription code.
Harvest.harvest is invoked with the required scope and start and end dates. Either date can be omitted. The dates should be in the ISO8601, UTC time zone format used elsewhere in the DSpace system.
HarvestedItemInfo objects are returned. These objects are simple containers with basic information about the items falling within the given scope and date range. Depending on parameters passed to the harvest method, the containers and item fields may have been filled out with the IDs of communities and collections containing an item, and the corresponding Item object respectively. Electing not to have these fields filled out means the harvest operation executes considerably faster.
In case it is required, Harvest also offers a method for creating a single HarvestedItemInfo object, which might make things easier for the caller.
The browse API maintains indexes of dates, authors, titles and subjects, and allows callers to extract parts of these:
Ideally, a name that appears as an author for more than one item would appear in the author index only once. For example, 'Doe, John' may be the author of tens of items. However, in practice, authors' names often appear in slightly different forms, for example:
Code Block |
---|
Doe, John
Doe, John Stewart
Doe, John S. |
Currently, the above three names would all appear as separate entries in the author index even though they may refer to the same author. In order for an author of several papers to correctly appear only once in the index, each item must specify exactly the same form of their name, which doesn't always happen in practice.
Date of Issue: Items are indexed by date of issue. This may be different from the date that an item appeared in DSpace; many items may have been originally published elsewhere beforehand. The Dublin Core field used is date.issued. The ordering of this index may be reversed so 'earliest first' and 'most recent first' orderings are possible. Note that the index is of items by date, as opposed to an index of dates. If 30 items have the same issue date (say 2002), then those 30 items all appear in the index adjacent to each other, as opposed to a single 2002 entry. Since dates in DSpace Dublin Core are in ISO8601, all in the UTC time zone, a simple alphanumeric sort is sufficient to sort by date, including dealing with varying granularities of date reasonably. For example:
Code Block |
---|
2001-12-10
2002
2002-04
2002-04-05
2002-04-09T15:34:12Z
2002-04-09T19:21:12Z
2002-04-10 |
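The claim that a plain alphanumeric sort is enough can be checked with a short sketch. It assumes the dates are held as ISO 8601 strings in UTC, as described above; the class name IsoDateSortSketch is made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of the DSpace browse API.
final class IsoDateSortSketch {

    // 'Earliest first': natural String ordering is sufficient for
    // ISO 8601 values, even when their granularity differs.
    static List<String> earliestFirst(List<String> isoDates) {
        List<String> sorted = new ArrayList<>(isoDates);
        Collections.sort(sorted);
        return sorted;
    }

    // 'Most recent first': the same ordering, reversed.
    static List<String> mostRecentFirst(List<String> isoDates) {
        List<String> sorted = earliestFirst(isoDates);
        Collections.reverse(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> dates = List.of(
                "2002-04-10", "2002", "2002-04-09T19:21:12Z", "2001-12-10",
                "2002-04-05", "2002-04", "2002-04-09T15:34:12Z");
        earliestFirst(dates).forEach(System.out::println);
    }
}
```

A coarser value such as 2002 sorts before every finer value that starts with it, which is what places the year-only entry ahead of the 2002-04 entries in the listing above.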
The API is generally invoked by creating a BrowseScope object, and setting the parameters for which particular part of an index you want to extract. This is then passed to the relevant Browse method call, which returns a BrowseInfo object which contains the results of the operation. The parameters set in the BrowseScope object are:
To illustrate, here is an example:
The results of invoking Browse.getItemsByTitle with the above parameters might look like this:
Code Block |
---|
Rabble-Rousing Rabbis From Sardinia
Reality TV: Love It or Hate It?
FOCUS> The Really Exciting Research Video
Recreational Housework Addicts: Please Visit My House
Regional Television Variation Studies
Revenue Streams
Ridiculous Example Titles: I'm Out of Ideas |
Note that in the case of title and date browses, Item objects are returned as opposed to actual titles. In these cases, you can specify the 'focus' to be a specific item, or a partial or full literal value. In the case of a literal value, if no entry in the index matches exactly, the closest match is used as the focus. It's quite reasonable to specify a focus of a single letter, for example.
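One way to picture focus resolution with a literal value is a binary search over the sorted index. The sketch below is an assumption-laden illustration, not the DSpace implementation: it treats "closest match" as the first entry that sorts at or after the literal, and the class FocusWindowSketch and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of the DSpace browse API.
final class FocusWindowSketch {

    // Returns the position of the entry equal to the focus, or, if there is
    // no exact match, the first entry that sorts after it (assumed meaning
    // of "closest match"). The index must be sorted and non-empty.
    static int focusIndex(List<String> sortedIndex, String focus) {
        int pos = Collections.binarySearch(sortedIndex, focus);
        if (pos < 0) {
            pos = -pos - 1; // convert "not found" into the insertion point
        }
        return Math.min(pos, sortedIndex.size() - 1);
    }

    // Returns up to 'total' entries, with 'before' entries ahead of the focus.
    static List<String> window(List<String> sortedIndex, String focus,
                               int before, int total) {
        if (sortedIndex.isEmpty()) {
            return Collections.emptyList();
        }
        int from = Math.max(0, focusIndex(sortedIndex, focus) - before);
        int to = Math.min(sortedIndex.size(), from + total);
        return sortedIndex.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> index = List.of(
                "rabble-rousing rabbis from sardinia",
                "reality tv: love it or hate it?",
                "really exciting research video",
                "recreational housework addicts: please visit my house",
                "regional television variation studies");
        // A partial literal focus such as "really" resolves to the third entry.
        window(index, "really", 2, 5).forEach(System.out::println);
    }
}
```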
Being able to specify a particular item to start at is especially important with dates, since many items may have the same issue date. Say 30 items in a collection have the issue date 2002. To page through the index 20 items at a time, you need to be able to specify exactly which item's 2002 is the focus of the browse. Otherwise, each time you invoked the browse code, the results would start at the first item with the issue date 2002.
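The paging problem can be sketched with a sort key that combines the issue date with an item identifier, so that a specific item can act as the focus. Everything here is hypothetical: the Entry class, the integer item IDs and the tie-breaking rule are assumptions made for illustration, not the DSpace implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the DSpace browse API.
final class DatePagingSketch {

    // One index row: an issue date plus the item it belongs to.
    static final class Entry {
        final String issueDate;
        final int itemId;

        Entry(String issueDate, int itemId) {
            this.issueDate = issueDate;
            this.itemId = itemId;
        }
    }

    // Order by date first, then by item ID to break ties between equal dates.
    static final Comparator<Entry> ORDER =
            Comparator.comparing((Entry e) -> e.issueDate)
                      .thenComparingInt(e -> e.itemId);

    // Returns up to pageSize entries, starting at the focus item itself.
    static List<Entry> pageFrom(List<Entry> sortedEntries, Entry focus, int pageSize) {
        List<Entry> page = new ArrayList<>();
        for (Entry e : sortedEntries) {
            if (ORDER.compare(e, focus) >= 0 && page.size() < pageSize) {
                page.add(e);
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        for (int id = 1; id <= 30; id++) {
            entries.add(new Entry("2002", id)); // 30 items sharing one date
        }
        // The second page starts at item 21 rather than back at the first 2002 item.
        List<Entry> secondPage = pageFrom(entries, new Entry("2002", 21), 20);
        System.out.println(secondPage.size() + " items on the second page");
    }
}
```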
Author browses return String objects with the actual author names. You can only specify the focus as a full or partial literal String.
Another important point to note is that presently, the browse indexes contain metadata for all items in the main archive, regardless of authorization policies. This means that all items in the archive will appear to all users when browsing. Of course, should the user attempt to access a non-public item, the usual authorization mechanism will apply. Whether this approach is ideal is under review; implementing the browse API such that the results retrieved reflect a user's level of authorization may be possible, but rather tricky.
The browse API contains calls to add and remove items from the index, and to regenerate the indexes from scratch. In general the content management API invokes the necessary browse API calls to keep the browse indexes in sync with what is in the archive, so most applications will not need to invoke those methods.
If the browse index becomes inconsistent for some reason, the InitializeBrowse class is a command line tool (generally invoked using the [dspace]/bin/dspace index-init command) that causes the indexes to be regenerated from scratch.
Presently, the browse API is not tremendously efficient. 'Indexing' takes the form of simply extracting the relevant Dublin Core value, normalizing it (lower-casing and removing any leading article in the case of titles), and inserting that normalized value with the corresponding item ID in the appropriate browse database table. Database views of this table include collection and community IDs for browse operations with a limited scope. When a browse operation is performed, a simple SELECT query is performed, along the lines of:
Code Block |
---|
SELECT item_id FROM ItemsByTitle ORDER BY sort_title OFFSET 40 LIMIT 20 |
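The normalization and paging steps described above can be mimicked in memory as a rough sketch. The list of leading articles is an assumption (English only), and the class TitleBrowseSketch is hypothetical; DSpace performs the equivalent work through its browse tables rather than with this code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of the DSpace browse API.
final class TitleBrowseSketch {

    // Assumed set of leading articles; a real deployment may use another list.
    private static final String[] ARTICLES = {"the ", "a ", "an "};

    // Lower-cases the title and strips one leading article, if present.
    static String normalize(String title) {
        String value = title.trim().toLowerCase(Locale.ROOT);
        for (String article : ARTICLES) {
            if (value.startsWith(article)) {
                return value.substring(article.length()).trim();
            }
        }
        return value;
    }

    // In-memory equivalent of ORDER BY sort_title OFFSET offset LIMIT limit.
    static List<String> page(List<String> titles, int offset, int limit) {
        return titles.stream()
                .sorted(Comparator.comparing(TitleBrowseSketch::normalize))
                .skip(offset)
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
                "Revenue Streams",
                "The Really Exciting Research Video",
                "Reality TV: Love It or Hate It?");
        page(titles, 0, 2).forEach(System.out::println);
    }
}
```

Stripping the leading article is what lets a title beginning with "The" sort under its first significant word.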
This package also provides a 'harvesting' API. This allows callers to extract information about items modified within a particular timeframe and within a particular scope (all of DSpace, or a single community or collection). Currently this is used by the Open Archives Initiative metadata harvesting protocol application and by the e-mail subscription code.
The Harvest.harvest method is invoked with the required scope and start and end dates. Either date can be omitted. The dates should be in the ISO 8601 format, in the UTC time zone, as used elsewhere in the DSpace system.
HarvestedItemInfo objects are returned. These objects are simple containers with basic information about the items falling within the given scope and date range. Depending on the parameters passed to the harvest method, the containers and item fields may have been filled out with the IDs of the communities and collections containing an item, and with the corresponding Item object, respectively. Electing not to have these fields filled out means the harvest operation executes considerably faster.
In case it is required, Harvest also offers a method for creating a single HarvestedItemInfo object, which might make things easier for the caller.
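The date-range selection with optional bounds can be sketched as below. This is not the Harvest.harvest signature: the class HarvestSketch, the map of handles to last-modified times and the inclusive treatment of both bounds are all assumptions made for illustration.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the DSpace Harvest API.
final class HarvestSketch {

    // Returns the handles of items modified within [start, end].
    // Either bound may be null, meaning "no limit on that side".
    static List<String> harvest(Map<String, Instant> lastModifiedByHandle,
                                Instant start, Instant end) {
        return lastModifiedByHandle.entrySet().stream()
                .filter(e -> start == null || !e.getValue().isBefore(start))
                .filter(e -> end == null || !e.getValue().isAfter(end))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Instant> items = new LinkedHashMap<>();
        items.put("123/1", Instant.parse("2002-04-09T15:34:12Z"));
        items.put("123/2", Instant.parse("2002-04-10T00:00:00Z"));
        items.put("123/3", Instant.parse("2001-12-10T00:00:00Z"));
        // Only a start date is given; the end date is omitted.
        harvest(items, Instant.parse("2002-01-01T00:00:00Z"), null)
                .forEach(System.out::println);
    }
}
```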
The browse API uses the same underlying technology as the Search API (Apache Solr, see also Discovery).
The SELECT-based browse query described above has two main drawbacks. Firstly, the LIMIT and OFFSET clauses follow PostgreSQL syntax and are not portable to every database. Secondly, the database is still performing dynamic sorting of the titles, so the browse code as it stands will not scale particularly well. The code does cache BrowseInfo objects, so that common browse operations are performed quickly, but this is not an ideal solution.
The checksum checker is used to verify every item within DSpace. DSpace calculates and records the checksum of every file submitted to it, so the checker can later determine whether a file has been changed. The idea is that the earlier you can identify that a file has changed, the more likely you are to be able to record it (assuming the change was not intended).
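The comparison at the heart of such a check can be sketched as follows. The use of MD5 and the class ChecksumCheckSketch are assumptions made for illustration; this is not the DSpace checker code, which also handles reporting and scheduling.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the DSpace checksum checker.
final class ChecksumCheckSketch {

    // Computes the MD5 digest of the content as a lower-case hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    // True if the current content still matches the checksum recorded earlier.
    static boolean isUnchanged(byte[] currentContent, String recordedChecksum) {
        return md5Hex(currentContent).equalsIgnoreCase(recordedChecksum);
    }

    public static void main(String[] args) {
        byte[] original = "hello".getBytes(StandardCharsets.UTF_8);
        String recorded = md5Hex(original); // stored at submission time
        byte[] later = "hello!".getBytes(StandardCharsets.UTF_8);
        System.out.println("unchanged: " + isUnchanged(later, recorded));
    }
}
```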
...