Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The functionality of batch importing items in DSpace using the BTE has been incorporated in the "import" script already used in DSpace for years.

In the import script, there is a new option (option "-b") to import using the BTE and an option -i to declare the type of the input format. All the other options are the same apart from option "-s" that in this case points to a file (and not a directory as it used to) that is the file of the input data. However, in the case of batch BTE import, the option "-s" is not obligatory since you can configure the input from the Spring XML configuration file discussed later on. Keep in mind, that if option "-s" is defined, import will take that option into consideration instead of the one defined in the Spring XML configuration.
 
Thus, to import metadata from the various input formats use the following commands:

InputCommand
BibTex[dspace]/bin/dspace import -b -m mapFile -e  example@email.com -c 123456789/1 -s path-to-my-bibtex-file -i bibtex
CSV[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-csv-file -i csv
TSV[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-tsv-file -i tsv
RIS[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-ris-file -i ris
EndNote[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-endnote-file -i endnote
OAI-PMH[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-ris-file -i ris
arXiv[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-arxiv-file -i arxivXML
PubMed[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-pubmed-file -i pubmedXML
CrossRef[dspace]/bin/dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s path-to-my-crossref-file -i crossrefXML

...

This is the top level bean that describes the service of the batch import from the various external metadata formats. It accepts three properties:

a) dataLoaders: a list of all the possible data loaders that are supported. Keep in mind that for each data loader we specify a key that can be used as the value of option "-i" in the import script that we mentioned earlier. Here is the point where you would add a new custom DataLoader in case the default ones doesn't match your needs.

b) outputMap: a Map between the internal keys that BTE service uses to hold metadata and the DSpace metadata fields. (See later on, how data loaders specify the keys that BTE uses to hold the metadata)

c) transformationEngine: the BTE transformation engine that actually consisits of the processing steps that will be applied to metadata during their import to DSpace

...

The file data loaders have the following properties:

a) filename: it is a String that specifies the filepath to the file that the loader will read data from. If you specify this property, you do not need to give the option "-s" to the import script in the command prompt. If you, however, specify this property and you also provide a "-s" option in the command line, the option "-s" will be taken into consideration by the data loader.

b) fieldMap: it is a map that specifies the mapping between the keys that hold the metadata in the input file and the ones that we want to have internal in the BTE. This mapping is very important because the internal keys need to be declared in the "outputMap" of the "DataLoadeService" bean. Be aware that each data loader has each own input file keys. For example, RIS loader uses the keys "T1, AU, SO ... " while the TSV or CSV use the index number of the column that the value resides.

...

CSV and TSV (which is actually a CSV loader if you look carefully the class value of the bean) loaders have some more properties:

a) skipLines: A number that specifies the first line of the file that loader will start reading data. For example, if you have a csv file that the first row contains the column names, and the second row is empty, the the value of this property must be 2 so as the loader starts reading from row 2 (starting from 0 row). The default value for this property is 0.

b) separator: A value to specify the separator between the values in the same row in order to make the columns. For example, in a TSV data loader this value is "\u0009" which is the "Tab" character. The default value is "," and that is why the CSV data loader doesn't need to specify this property.

c) quoteChar: This property specifies the quote character used in the CSV file. The default value is the double quote character (").

...

The OAIPMHDataLoader has the following properties:

a) fieldMap: Same as above, the mapping between the input keys holding the metadata and the ones that we want to have internal in BTE.

b) serverAddress: The base address of the OAI provider (server). Base address can be specified also in the "-s" option of the command prompt. If is specified in both places, the one specified from the command line is preferred.

c) prefix: The metadata prefix to be used in OAI requests.

...

Code Block
languagehtml/xml
<bean id="customfilter"   class="org.mypackage.MyFilter" />

<bean id="linearWorkflowbatchImportLinearWorkflow" class="gr.ekt.bte.core.LinearWorkflow">
    <property name="steps">
    <list>
         <ref bean="customfilter" />
    </list>
    </property>
</bean>

You can add as many filters and modifiers you like to linearWorkflowto batchImportLinearWorkflow, they will run the one after the other in the specified order.