Introduction

The Sync Tool is a utility that provides a simple way to move files from a local file system to DuraCloud and then keep the files in DuraCloud synchronized with those on the local system.

Download

Download the Sync Tool from the Downloads page.

Getting Started

The Sync Tool can be installed using one of the installers on the downloads page linked above. Once installed, the Sync Tool will default to running in GUI mode. To run in command line mode, open a terminal window (or command prompt) and navigate to the Sync Tool installation directory. Once there, execute the Sync Tool JAR file using: "java -jar duracloudsync.jar --help". This will print the usage information for the tool.

How the Sync Tool Works

Operational notes

Large Datasets and Out of Memory Errors

When using the SyncTool to transmit data sets with a large number of files (hundreds of thousands of files or more), users occasionally run into out of memory errors. Users with sufficient memory on their machines can usually remedy this problem by increasing the maximum heap space available to the Java VM. We recommend starting with a setting of at least 1 GB when working with sets of over 100,000 files. If the problem persists, increase the memory value until the errors stop. To increase the heap space, use the -Xmx Java option.

An alternative solution is to upload files in smaller sets. The prefix option can be used to ensure that files are added to DuraCloud with the preferred ID values.

To run the SyncTool in UI mode with 1 GB of heap memory space, download the Jar version of the SyncTool and execute the following on the command line:

java -Xmx1g -jar duracloudsync-{version}.jar

Alternatively, you can set the system environment variable JAVA_TOOL_OPTIONS to a value like "-Xmx1g", which will be picked up by the SyncTool on startup, meaning that you can start up the SyncTool UI as usual.

To run the SyncTool in command-line mode with 1 GB of heap memory space, download the Jar version of the SyncTool and execute the above command followed by the command line parameter values.
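
As an illustration of the JAVA_TOOL_OPTIONS approach mentioned above, a minimal sketch for a bash shell follows. The 1 GB value is only an example, and the setting shown lasts only for the current shell session; on other platforms the variable would be set through the system's own environment settings.

```shell
# Set the heap option for every JVM started from this shell session.
# "-Xmx1g" is an example value; raise it if out of memory errors persist.
export JAVA_TOOL_OPTIONS="-Xmx1g"

# Confirm the value that a JVM started from this shell would pick up.
echo "JAVA_TOOL_OPTIONS=${JAVA_TOOL_OPTIONS}"
```

When a JVM starts with this variable set, it reports the options it picked up on startup, which is a quick way to confirm that the setting took effect.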

Large Files

When the SyncTool encounters a large file (by default, 1 GB or larger; this threshold can be set as high as 5 GB via the --max-file-size parameter), it will "chunk" that file prior to transfer to DuraCloud. This means that the file will land in DuraCloud as multiple components with an associated manifest file to indicate the set of component files and the checksum of each. As part of this process, the SyncTool will create a local temporary file for each chunk prior to transfer, as this allows the tool to generate the checksum for that chunk, and also allows retries on failure. These temporary files are stored in the default Java temp directory.

The number and size of temp files that may be created depends on the number of threads and the max chunk size settings. Each thread can create one temp file at a time, and each temp file can be as large as the max chunk size. Multiplying the number of threads by the max chunk size therefore gives the maximum number of GBs that may be consumed on local storage at one time. The SyncTool removes temp files as transfers complete, but if the tool is terminated abruptly, some of those temp files can be orphaned and may require manual cleanup.
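
The multiplication described above can be sketched in bash as follows. The thread count and chunk size used here are hypothetical example values, not the tool's defaults; substitute the settings you actually run with.

```shell
#!/bin/bash
# Hypothetical example values; replace with your own settings.
threads=3        # number of transfer threads
max_chunk_gb=1   # max chunk size, in GB

# Each thread can hold one temp file of up to the max chunk size at a time.
max_temp_gb=$((threads * max_chunk_gb))
echo "Up to ${max_temp_gb} GB of temporary storage may be in use at once"
```

Comparing this figure against the free space in the temp directory before a large transfer helps avoid running out of local storage partway through.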

To change the location where temp files are stored, set the "java.io.tmpdir" system property. This can be done on the command line by adding "-Djava.io.tmpdir=/yourpath" after "java". Alternatively, you can set the system environment variable JAVA_TOOL_OPTIONS to this value ("-Djava.io.tmpdir=/yourpath") and it will be picked up as the tool starts.
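
A minimal bash sketch of the environment variable approach is shown below. The directory name "synctool-tmp" is a hypothetical choice made for this example; any location with enough free space would do.

```shell
# Hypothetical temp directory; choose a location with enough free space.
temp_dir="${HOME:-/tmp}/synctool-tmp"
mkdir -p "${temp_dir}"

# Picked up by any JVM started from this shell session.
export JAVA_TOOL_OPTIONS="-Djava.io.tmpdir=${temp_dir}"
echo "JAVA_TOOL_OPTIONS=${JAVA_TOOL_OPTIONS}"
```

JAVA_TOOL_OPTIONS can carry several options separated by spaces, so a heap setting such as "-Xmx1g" can be combined with the temp directory setting in the same value.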

Prerequisites

As of DuraCloud version 7.0.0, the Sync Tool requires Java 11 to run. The latest version of Java can be downloaded from here.

Using the Sync Tool

Runtime commands

Running the Sync Tool in a server shell environment

As noted above, the Sync Tool can be run in one of two modes, one which allows it to run continually, and the other which allows it to exit once it completes transferring all current files. The mode you choose will determine the way in which you deploy the Sync Tool on a server. The following examples assume the use of the bash shell.

To start the Sync Tool in continually running mode, you would use a command like this:

nohup java -jar duracloudsync-{version}.jar {parameters} > ~/synctool-output.log 2>&1 &

In this case, the & at the end instructs the shell to run the command in the background, and the "nohup" at the beginning tells the command to continue running even when the terminal being used is closed or when you disconnect from the server machine. The output of the Sync Tool would be placed in a file called "synctool-output.log" in the user's home directory.

In order for the Sync Tool to be run on startup when the server machine boots, additional settings will need to be added which depend on the operating system being used. In Ubuntu, for example, an Upstart script could be used for this purpose.

Running the Sync Tool in exit on completion mode works best when the tool is run on a scheduled basis. A popular choice for handling this type of task is the cron utility. To run daily using cron, a script should be placed in /etc/cron.daily. The script would look something like:


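#!/bin/bash
# Start the Sync Tool unless an instance is already running.

# "grep -v grep" drops the grep process itself from the match;
# "grep -q" suppresses the matched line so only the message is printed.
if ps -ef | grep -v grep | grep -q duracloudsync ; then
  echo 'DuraCloud Sync is Running'
else
  echo 'Starting DuraCloud Sync'
  # -x makes the Sync Tool exit once all current files are transferred.
  java -jar duracloudsync-{version}.jar -x {parameters} >> ~/synctool-output.log 2>&1 &
fi

exit 0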

The -x parameter is included here to ensure the Sync Tool exits after completing its run. This script also includes a check to ensure that the tool is not already running before starting.