Title: An Add-On to facilitate the existing DSpace Batch Import Procedure

Student: Blooma Mohan John

Mentor: Jayan C Kurian

Co-Mentors: Stuart Lewis, Richard Jones, ???

Abstract:

Efficient content acquisition strategies make it easier to import scholarly information into repositories. DSpace supports batch content acquisition through the ItemImport procedure, which requires digital resources to be packaged as Submission Information Packages (SIPs). The lead time needed to prepare this format can be reduced by encoding document metadata and digital resource locations in a spreadsheet. This approach has been implemented at Nanyang Technological University (Singapore), the Institute of Scientific and Technical Information of CNRS (INIST-CNRS, France), the University of Calgary Library, the National Informatics Centre (India), and the Lanzhou Branch of the Chinese Academy of Sciences (China). Recent requests have come from the University of Waikato Library, the University of Sydney Library and NITLE (U.S.A.). Although the current implementation for the Windows environment looks promising for the user community, there have been a considerable number of requests (from New York University Library and Raman Research Institute Library (India), among others) to make this development compatible with the UNIX environment. It is anticipated that this add-on will facilitate content acquisition in DSpace installations.

Project Plan:

  1. Contact DSpace administrators from a geographically spread pool of DSpace instances and gather information on the most widely used modes of importing items into their repositories.
  2. Implement an initial standalone application in Java.
  3. Implement and test the prototype on Windows and Linux platforms.
  4. In the next stage, integrate the program as a standalone web application using JSP technology. The user interface will retrieve detailed metadata descriptions and item resource locations and automatically generate Submission Information Packages.
  5. Finally, explore implementing this development as an Eclipse RCP application (suggested by Mark R. Diggory, DSpace Systems Manager) to ensure OS portability.

Development Progress:

  • Survey Result & Analysis
    • Status : Completed
    • No. of participants who responded : 43
    • Deliverables: Survey Report
      A survey was posted to the DSpace mailing list to gather users' feedback on importing documents into DSpace repositories. 43 participants responded to the online survey (managed through www.surveymonkey.com). 65% of participants had been using DSpace for less than 3 years, while 9.3% had used it for more than 5 years. The majority of participants run their installations on Unix/Linux (74%); only 20% run on Windows. 30% of participants' instances have collections of fewer than 500 documents. However, 25% had collections in the range 1,000 – 10,000 items and above 10,000 items. 55% of users prefer the DSpace submission interface for importing items, whereas 44% use the built-in batch import feature. Further, 36% of users had one fourth of their archived items imported using batch import, and 26% had three fourths of their collection populated using DSpace batch import. The batch import facility therefore has a significant impact on ingesting items into digital repositories. The survey also shows that the majority of users (62%) developed their own customized program for preparing submission information packages; only 29% prepared them manually, while 8% outsourced the work to a vendor. 54% of users preferred a customized program (e.g. extracting metadata into an intermediate stage and generating submission information packages automatically), while just 21% preferred manual methods. The survey included two open-ended questions. The first asked participants to comment on the features of the programs they had developed. The second asked which features potential users would prefer in a tool that automatically generates submission information packages to facilitate batch ingestion. The suggestions given by participants offered very useful insights. The survey also showed that DSpace instances are geographically spread, with 30% from North America, 27% from Asia, 25% from Europe, and 7% each from Africa and Australia.
      (Note: Please email me at BL0002HN@NTU.EDU.SG for a complete survey report in PDF format)
  • Automatic Metadata Generation (Windows environment/Microsoft Excel Driver)
    • Status : Completed
    • Software : Java SE 6
    • Editor : JCreator 4
    • Connectivity Components : ODBC/JDBC, Microsoft Excel Driver, ResultSet
    • Operating System : Windows Vista, Win XP, Win 2003 Standard, Win 2003 Enterprise
    • Test case 1
      • Collection: School of CEE Theses
      • Total no of files: 443
      • Total file size: 4.42 GB
      • Max file size: 44 MB
      • Min file size: 0.5 MB
      • Average file size: 10 MB
      • File type: PDF
      • System properties
        • Processor: Intel® Xeon™ CPU 3.6 GHz, 3.6 GHz, 7.93 GB RAM
        • Operating System: MS Windows Server 2003 Enterprise Edition
        • Hardware: HP ProLiant DL380
      • Performance:
        • Total time taken for generating Submission Information Packages for 443 PDF documents: 12 minutes and 47 seconds
        • Average time taken for a single Submission Information Package generation: 1.73 seconds
    • Test case 2
      • Collection: School of CEE Theses
      • Total no of files: 14,176 (duplicated from Test case 1)
      • Max file size: 44 MB
      • Min file size: 0.5 MB
      • Average file size: 10 MB
      • File type: PDF
      • System properties (Same as Test case 1)
        • Processor: Intel® Xeon™ CPU 3.6 GHz, 3.6 GHz, 7.93 GB RAM
        • Operating System: MS Windows Server 2003 Enterprise Edition
        • Hardware: HP ProLiant DL380
      • Performance:
        • Total time taken for generating Submission Information Packages for 14,176 PDF documents: 8 hours and 3 minutes
        • Average time taken for a single Submission Information Package generation: 2 seconds
  • Automatic Metadata Generation (Windows environment/OpenOffice.org 2.4 Calc)
    • Status : Completed
    • Software : Java SE 6
    • Editor : JCreator 4, Eclipse, NetBeans 6.1 (JDK6 Update 6 with NetBeans IDE Java SE bundle)
    • Connectivity Components : OpenOffice.org 2.4
    • Operating System: Windows Vista, Win XP
    • Test case 3
      • Collection: School of CEE Theses
      • Total no of files: 443
      • Total file size: 4.42 GB
      • Max file size: 44 MB
      • Min file size: 0.5 MB
      • Average file size: 10 MB
      • File type: PDF
      • System properties
        • Processor: Intel® Pentium® 4 CPU 1400MHz, 1.40 GHz, 640 MB RAM
        • Operating System: MS Windows XP Professional version 2002
        • Hardware: Dell Dimension 8100
      • Performance:
        • Total time taken for generating Submission Information Packages for 443 PDF documents: 12 minutes and 35 seconds.
        • Average time taken for a single Submission Information Package generation: 1.71 seconds
  • Automatic Metadata Generation (Linux environment/OpenOffice.org 2.4 Calc)
    • Status : Completed
    • Software : Java SE 6
    • Editor : NetBeans 6.1 (JDK6 Update 6 with NetBeans IDE Java SE bundle)
    • Connectivity Components : OpenOffice.org 2.4
    • Operating System: Fedora Core Release 4
      • System properties
        • Processor: Intel® Pentium® 4 CPU 1400MHz, 1.40 GHz, 640 MB RAM
        • Hardware: Dell Dimension 8100
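
The per-package averages reported in the test cases above can be cross-checked from the total run times. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical and are not part of the project code.

```java
/** Hypothetical helper: recomputes the average time per SIP from a total run time. */
final class SipTiming {

    /** Returns the average number of seconds per SIP for a run of the given duration. */
    static double averageSeconds(int hours, int minutes, int seconds, int sipCount) {
        long totalSeconds = hours * 3600L + minutes * 60L + seconds;
        return (double) totalSeconds / sipCount;
    }

    public static void main(String[] args) {
        // Totals and file counts are taken from Test cases 1, 2 and 3.
        System.out.printf("Test case 1: %.2f s per SIP%n", averageSeconds(0, 12, 47, 443));
        System.out.printf("Test case 2: %.2f s per SIP%n", averageSeconds(8, 3, 0, 14176));
        System.out.printf("Test case 3: %.2f s per SIP%n", averageSeconds(0, 12, 35, 443));
    }
}
```

The recomputed values agree with the reported averages to within a few hundredths of a second, which suggests that generation time grows roughly linearly with the number of items on the tested hardware.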

Project Deliverables:

  • A stand-alone application to generate the Submission Information Package (SIP) automatically for facilitating DSpace batch import

Algorithm

  1. Start
  2. Input the Main Submission Information Package (SIP) folder name
  3. Create SIP folder with the name mentioned in Step 2
  4. Store digital object metadata and location details in a ResultSet
  5. For each record in the ResultSet do Step 6 to Step 13
  6. Create individual SIP folder
  7. Create an XML file named dublin_core.xml inside the SIP folder
  8. Create a file named contents inside the SIP folder
  9. Check the type of digital object
  10. Add comments about the digital object to the dublin_core.xml file
  11. Add the digital object file name to the contents file
  12. Copy the digital object from an external location to the individual SIP folder
  13. Add metadata details to the dublin_core.xml file
  14. Stop
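
The steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the project's actual source: the class SipGenerator, the ItemRow type, the three metadata columns (title, author, date issued), the item_N folder names and the extension-based type check are all hypothetical. It also uses java.nio.file, which requires Java 7 or later rather than the Java SE 6 used for the prototype, and it takes the rows as an in-memory list instead of a JDBC ResultSet.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch of Steps 2 to 13; every name here is hypothetical. */
final class SipGenerator {

    /** One spreadsheet row: assumed metadata columns plus the object's location. */
    static final class ItemRow {
        final String title;
        final String author;
        final String dateIssued;
        final Path objectLocation;

        ItemRow(String title, String author, String dateIssued, Path objectLocation) {
            this.title = title;
            this.author = author;
            this.dateIssued = dateIssued;
            this.objectLocation = objectLocation;
        }
    }

    /** Step 9: classify the digital object by file extension (an assumed heuristic). */
    static String fileType(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        ext = ext.replaceAll("[^a-z0-9]", ""); // keeps the XML comment well-formed
        return ext.isEmpty() ? "unknown" : ext;
    }

    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String dcValue(String element, String qualifier, String value) {
        return "  <dcvalue element=\"" + element + "\" qualifier=\"" + qualifier + "\">"
                + escapeXml(value) + "</dcvalue>\n";
    }

    /** Steps 10 and 13: a comment about the object, then the metadata values. */
    static String dublinCoreXml(ItemRow row, String comment) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<dublin_core>\n"
                + "  <!-- " + comment + " -->\n"
                + dcValue("title", "none", row.title)
                + dcValue("contributor", "author", row.author)
                + dcValue("date", "issued", row.dateIssued)
                + "</dublin_core>\n";
    }

    /** Steps 3 to 13: creates one SIP folder per row and returns the folders created. */
    static List<Path> generate(Path sipRoot, List<ItemRow> rows) {
        List<Path> created = new ArrayList<>();
        try {
            Files.createDirectories(sipRoot);                                  // Step 3
            int n = 0;
            for (ItemRow row : rows) {                                         // Step 5
                n++;
                Path item = Files.createDirectories(sipRoot.resolve("item_" + n)); // Step 6
                String fileName = row.objectLocation.getFileName().toString();
                String comment = "Digital object type: " + fileType(fileName); // Step 9
                Files.write(item.resolve("dublin_core.xml"),                   // Steps 7, 10, 13
                        dublinCoreXml(row, comment).getBytes(StandardCharsets.UTF_8));
                Files.write(item.resolve("contents"),                          // Steps 8, 11
                        (fileName + "\n").getBytes(StandardCharsets.UTF_8));
                Files.copy(row.objectLocation, item.resolve(fileName),         // Step 12
                        StandardCopyOption.REPLACE_EXISTING);
                created.add(item);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return created;
    }

    public static void main(String[] args) throws IOException {
        // Demo with a placeholder file standing in for a real thesis PDF.
        Path work = Files.createTempDirectory("sip-demo");
        Path pdf = Files.write(work.resolve("thesis1.pdf"),
                "%PDF".getBytes(StandardCharsets.US_ASCII));
        ItemRow row = new ItemRow("Sample thesis", "Doe, Jane", "2008", pdf);
        List<Path> items = generate(work.resolve("sip_main"), Arrays.asList(row));
        System.out.println("Created " + items.size() + " SIP folder(s) under " + work);
    }
}
```

Each generated folder then holds the three pieces the batch importer expects for an item: a dublin_core.xml file, a contents file listing the bitstream file name, and the digital object itself. Building the XML as a string keeps the sketch short; a production version would more likely use an XML writer and read its rows from the spreadsheet.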

Future Work:

In the long run, this project could be extended to extract metadata descriptions automatically from digital resources (e.g. university theses that have a standard format) using template-based extraction techniques, populating an intermediate collection template from which SIPs are generated for automatic batch import.

My University, School, and Me:

Nanyang Technological University

Wee Kim Wee School of Communication and Information

Blooma Mohan John
