...

The functionality of batch importing items in DSpace using the BTE has been incorporated into the "import" script already used in DSpace for years.

In the import script, there is a new option (option "-b") to import using the BTE, and an option "-i" to declare the type of the input format. All the other options are the same, apart from option "-s", which in this case points to a file (not a directory, as it used to) that contains the input data. However, in the case of batch BTE import, the option "-s" is not obligatory, since you can configure the input in the Spring XML configuration file discussed later on. Keep in mind that if option "-s" is defined, the import will take that option into consideration instead of the one defined in the Spring XML configuration.
 
Thus, to import metadata from the various input formats, use the following commands:

...

Keep in mind that the value of option "-e" must be a valid email address of a DSpace user, and the value of option "-c" must be the handle of the collection the items will be imported to.
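For example, a BibTeX import could look like the following (the email address, collection handle, and file paths are placeholders):

    [dspace]/bin/dspace import -b -m mapFile -e admin@myuniversity.edu -c 123456789/1 -s path-to-my-bibtex-file -i bibtex

The "-m" option names the map file that the import script has always used to record the imported items.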


BTE Configuration

Since DSpace administrators may have incorporated their own metadata schemes within DSpace (apart from the default Dublin Core), someone may need to configure BTE to match their custom schemes. The basic idea behind BTE is that the system holds the metadata in an internal format, using a specific key for each metadata field. Data loaders load the records using the aforementioned keys, while the output generator needs to map these keys to DSpace metadata fields.

The BTE configuration file is located at [dspace]/config/spring/api/bte.xml. It is a Spring XML configuration file that consists of beans. (If these terms are unfamiliar to you, please refer to the Spring Dependency Injection website for more information.)

Explanation of beans:

bean   id= "gr.ekt.bte.core.TransformationEngine"

This bean is instantiated when the import takes place. It deploys a new BTE transformation engine that will perform the transformation from one format to the other. It needs one input argument: the workflow (the processing steps mentioned before) that will run when the transformation takes place. Normally, you don't need to modify this bean.

bean   id= "org.dspace.app.itemimport.DataLoaderService"

Within this bean we declare all the possible data loaders that we need to support. Keep in mind that for each data loader we specify a key value that can be used as the value of option "-i" in the import script mentioned earlier. This is the point where you would add a new custom DataLoader in case the default ones don't match your needs.

Moreover, this bean holds the "outputMap", which is a map between the internal keys that BTE uses to hold metadata and the DSpace metadata fields. (See later on how the data loaders specify the keys that BTE uses to hold the metadata.)
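For orientation, a trimmed-down sketch of this bean follows. The property names match what is described above, but the internal keys ("Author", "Title", ...) and the metadata mappings are illustrative; check your own bte.xml for the authoritative entries:

    <bean id="org.dspace.app.itemimport.DataLoaderService" class="org.dspace.app.itemimport.DataLoaderService">
        <!-- The keys of this map are the valid values of the "-i" option -->
        <property name="dataLoaders">
            <map>
                <entry key="bibtex" value-ref="gr.ekt.bteio.loaders.BibTeXDataLoader" />
                <entry key="csv" value-ref="gr.ekt.bteio.loaders.CSVDataLoader" />
                <entry key="tsv" value-ref="gr.ekt.bteio.loaders.TSVDataLoader" />
                <entry key="ris" value-ref="gr.ekt.bteio.loaders.RISDataLoader" />
                <entry key="endnote" value-ref="gr.ekt.bteio.loaders.EndnoteDataLoader" />
            </map>
        </property>
        <!-- Internal BTE keys (left) mapped to DSpace metadata fields (right) -->
        <property name="outputMap">
            <map>
                <entry key="Author" value="dc.contributor.author" />
                <entry key="Title" value="dc.title" />
                <entry key="Journal" value="dc.source" />
            </map>
        </property>
    </bean>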

bean   id= "linearWorkflow"

This bean describes the processing steps. Currently there are no processing steps, meaning that all records loaded by the data loader will pass to the output generator unfiltered and unmodified. (See the next section, "Case Studies", for info about how to add a filter or a modifier.)

bean   id= "gr.ekt.bteio.loaders.BibTeXDataLoader"

bean   id= "gr.ekt.bteio.loaders.CSVDataLoader"

bean   id= "gr.ekt.bteio.loaders.TSVDataLoader"

bean   id= "gr.ekt.bteio.loaders.RISDataLoader"

bean   id= "gr.ekt.bteio.loaders.EndnoteDataLoader"

These beans declare the data loaders that ship with BTE, one per supported input format. By default, each data loader reads its records using the keys that are specified in the input data file. For example, in the case of a BibTeX input:
 
================
@article{small,
author = {Freely, I.P.},
title = {A small paper},
journal = {The journal of small papers},
year = 1997,
volume = {-1},
note = {to appear},
}

================
 
the record keys will be "author", "title", "journal", "year" and so on.
 
In the case of a RIS format input file:
 
================
TY  - JOUR
AU  - Shannon,Claude E.
PY  - 1948/07//
TI  - A Mathematical Theory of Communication
JO  - Bell System Technical Journal
SP  - 379
EP  - 423
VL  - 27
ER  -

=================
 
the record keys will be "TY", "AU" and so on.
 
Thus, the input keys depend on the input format (the value of the "-i" option), and the user needs to configure the "fieldMap" of the corresponding data loader accordingly, as described below.


 
Each of these data loader beans has the following properties:

a) filename: a String that specifies the path to the file that the loader will read data from. If you specify this property, you do not need to give the "-s" option to the import script on the command line. If, however, you specify this property and also provide the "-s" option on the command line, the "-s" option will be the one taken into consideration by the data loader.

b) fieldMap: a map that specifies the mapping between the keys that hold the metadata in the input file and the internal keys that BTE uses. This mapping is very important, because the internal keys need to be declared in the "outputMap" of the "DataLoaderService" bean. Be aware that each data loader has its own input file keys. For example, the RIS loader uses keys such as "T1, AU, SO ...", while the TSV or CSV loaders use the index number of the column in which the value resides.
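For example, the fieldMap of a BibTeX loader could look like the sketch below; the internal keys on the right ("Author", "Title", ...) are illustrative, and must simply agree with the keys used in the "outputMap":

    <property name="fieldMap">
        <map>
            <entry key="author" value="Author" />
            <entry key="title" value="Title" />
            <entry key="journal" value="Journal" />
        </map>
    </property>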

The CSV and TSV loaders (the TSV loader is actually a CSV loader, as a careful look at the class value of the bean reveals) have some more properties:

a) skipLines: a number that specifies the first line of the file from which the loader will start reading data. For example, if you have a CSV file whose first row contains the column names and whose second row is empty, the value of this property must be 2, so that the loader starts reading from row 2 (rows are numbered starting from 0). The default value is 0.

b) separator: the character that separates the values in a row into columns. For example, in the TSV data loader this value is "\u0009", which is the Tab character. The default value is ",", which is why the CSV data loader doesn't need to specify this property.

c) quoteChar: this property specifies the quote character of the CSV file. The default value is the double quote character (").
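Putting these together, a CSV loader bean could be configured as follows (the column indices and internal keys are illustrative):

    <bean id="gr.ekt.bteio.loaders.CSVDataLoader" class="gr.ekt.bteio.loaders.CSVDataLoader">
        <!-- Skip the header row -->
        <property name="skipLines" value="1" />
        <property name="separator" value="," />
        <property name="quoteChar" value="&quot;" />
        <property name="fieldMap">
            <map>
                <!-- Column 0 holds the title, column 1 the author -->
                <entry key="0" value="Title" />
                <entry key="1" value="Author" />
            </map>
        </property>
    </bean>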

 

So, in case you need to pass more metadata fields through the system than the ones specified by default, you need to change both the data loader configuration (the "fieldMap") and the "outputMap".


Case Studies


1) I have my data in a format different from the ones that are supported by this functionality. What can I do?
 
In this case, you can either transform your data into one of the supported formats, or create a new data loader. To do the latter, create a new Java class that implements the following Java interface of BTE:
 

        gr.ekt.bte.core.DataLoader

 
You will need to implement the method:
 

    public RecordSet getRecords() throws MalformedSourceException

in which you have to create records from the input file - most probably you will need to create your own Record class (by implementing the gr.ekt.bte.core.Record interface) and fill a RecordSet.
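A minimal sketch of such a loader follows. It assumes the BTE helper classes gr.ekt.bte.record.MapRecord and gr.ekt.bte.core.StringValue for building records, a two-method form of the DataLoader interface, and a made-up "title|author" line format; adjust to the API of your BTE version if it differs:

    package org.mypackage;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import gr.ekt.bte.core.DataLoader;
    import gr.ekt.bte.core.DataLoadingSpec;
    import gr.ekt.bte.core.RecordSet;
    import gr.ekt.bte.core.StringValue;
    import gr.ekt.bte.exceptions.MalformedSourceException;
    import gr.ekt.bte.record.MapRecord;

    public class MyFormatDataLoader implements DataLoader {

        private String filename;

        @Override
        public RecordSet getRecords() throws MalformedSourceException {
            RecordSet records = new RecordSet();
            try (BufferedReader in = new BufferedReader(new FileReader(filename))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // One record per line, "title|author" (an assumption of this sketch)
                    String[] parts = line.split("\\|");
                    MapRecord record = new MapRecord();
                    record.addValue("Title", new StringValue(parts[0]));
                    if (parts.length > 1) {
                        record.addValue("Author", new StringValue(parts[1]));
                    }
                    records.addRecord(record);
                }
            } catch (IOException e) {
                throw new MalformedSourceException("Error reading " + filename);
            }
            return records;
        }

        @Override
        public RecordSet getRecords(DataLoadingSpec spec) throws MalformedSourceException {
            // This sketch ignores the loading spec and loads everything
            return getRecords();
        }

        public void setFilename(String filename) {
            this.filename = filename;
        }
    }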

 
After that, you will need to declare the new DataLoader in the Spring XML configuration file (in the bean with id "org.dspace.app.itemimport.DataLoaderService") using your own key. Use this key as the value of option "-i" in the import script so as to specify that your data loader must run.
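For example (the key "myformat" is arbitrary, and the class is the hypothetical loader sketched above):

    <bean id="org.mypackage.MyFormatDataLoader" class="org.mypackage.MyFormatDataLoader">
        <property name="filename" value="/path/to/my/datafile" />
    </bean>

and, inside the "dataLoaders" map of the DataLoaderService bean:

    <entry key="myformat" value-ref="org.mypackage.MyFormatDataLoader" />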

 
 
2) I need to filter some of the input records or modify some values of the records before outputting them
 
In this case you will need to create your own filters and modifiers.
 
To create a new filter, you need to extend the following BTE abstract class:

    gr.ekt.bte.core.AbstractFilter

You will need to implement the method:

    public abstract boolean isIncluded(Record record)

Return false if the specified record needs to be filtered out; otherwise return true.
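Here is a minimal filter sketch. It assumes AbstractFilter exposes a single-argument constructor taking the step's name and that Record.getValues(key) returns the list of values for a key; the filtering rule itself is just an illustration:

    package org.mypackage;

    import java.util.List;

    import gr.ekt.bte.core.AbstractFilter;
    import gr.ekt.bte.core.Record;
    import gr.ekt.bte.core.Value;

    public class MyFilter extends AbstractFilter {

        public MyFilter() {
            super("MyFilter");
        }

        @Override
        public boolean isIncluded(Record record) {
            // Keep only records that carry a "Title" key
            List<Value> titles = record.getValues("Title");
            return titles != null && !titles.isEmpty();
        }
    }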
 
To create a new modifier, you need to extend the following BTE abstract class:

    gr.ekt.bte.core.AbstractModifier

You will need to implement the method

    public abstract Record modify(Record record)

within which you can make any changes you like to the record. You can use the Record methods to get the values for a specific key and to load new ones (for the latter, you need to make the Record mutable).
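A minimal modifier sketch follows; it assumes that makeMutable() returns a MutableRecord that accepts new values, as described above, and the added field is only an illustration:

    package org.mypackage;

    import gr.ekt.bte.core.AbstractModifier;
    import gr.ekt.bte.core.MutableRecord;
    import gr.ekt.bte.core.Record;
    import gr.ekt.bte.core.StringValue;

    public class MyModifier extends AbstractModifier {

        public MyModifier() {
            super("MyModifier");
        }

        @Override
        public Record modify(Record record) {
            // Add a fixed language value to every record
            MutableRecord mutable = record.makeMutable();
            mutable.addValue("Language", new StringValue("en"));
            return mutable;
        }
    }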

After you create your own filters or modifiers, you need to add them to the Spring XML configuration file, as in the following example:

 
<bean   id= "customfilter"   class= "org.mypackage.MyFilter" />

<bean   id= "linearWorkflow"   class= "gr.ekt.bte.core.LinearWorkflow" >
  <property   name= "steps" >
  <list>
       <ref bean= "customfilter"/>  
  </list>
  </property>
</bean>

You can add as many filters and modifiers as you like to the linearWorkflow; they will run one after the other, in the specified order.