
Large files can be uploaded via the REST API, or projected into the repository using filesystem federation.  Transfer times for uploading to the repository via the REST API are about the same as copying using NFS, and moderately faster than using SCP.  Uploading via the REST API to a federated filesystem is significantly slower and requires a large temp directory capacity.

The following Java command-line options and system properties can be used to configure the repository:

  • -Xmx2048m – sets the maximum heap size available to Java
  • -Dfcrepo.home=/path/to/data – set the directory for permanent data
  • -Djava.io.tmpdir=/path/to/tmpdir – set the directory for temp files.  Data uploaded to a federated filesystem via the REST API is written to a temp file in this directory before being moved to the federated filesystem, so this directory should have enough free space for the largest files you will upload.
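For example, the options above can be combined when starting a standalone repository. The jar name and directory paths below are illustrative only; adjust them for your installation:

```shell
# Illustrative startup of a standalone Fedora 4 instance.
# The jar name and the /data paths are placeholders, not prescribed values.
java -Xmx2048m \
     -Dfcrepo.home=/data/fcrepo \
     -Djava.io.tmpdir=/data/fcrepo-tmp \
     -jar fcrepo-webapp-jetty-console.jar
```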

Ingesting Large Files via the REST API

Based on the tests below, we believe arbitrarily-large files can be ingested and downloaded via the REST API (tested up to 1TB).  The only apparent limitations are disk space available to store the files, and a sufficiently large Java heap size.
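As a sketch, a large file can be pushed over HTTP with curl; the host, port, and resource path below are placeholders and depend on your deployment:

```shell
# Hypothetical upload of a large file to the repository via the REST API.
# Using -T streams the file from disk instead of buffering it in memory.
curl -i -X PUT -T /path/to/bigfile.bin \
     -H "Content-Type: application/octet-stream" \
     http://localhost:8080/rest/objects/bigfile/fcr:content
```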

Note
To enable fast access to large files, it is necessary to set "contentBasedSha1" : "false".  Otherwise the repository will compute a SHA-1 of the content for identification, which can take hours for files larger than 50 GB.  For more on this benchmarking see: Design - LargeFiles.
Warning
This is work in progress!
Warning
OutOfMemoryException when ingesting large files

There is currently a bug that causes OutOfMemoryExceptions when ingesting files larger than the available heap space with certain Infinispan configurations (e.g. LevelDB). This appears to be an issue in the ModeShape project and has been reported at: https://issues.jboss.org/browse/MODE-2103

The following test case can be used to reproduce the issue: https://github.com/futures/large-files-test

Workaround

You will need a large heap size for this to work (e.g. -Xmx2048g)

Currently the only known workaround is to use a _file_ configuration for the Infinispan caches, e.g.: https://github.com/futures/fcrepo4/blob/34aab66bc26edfca3a4cbabecc4870bfd81f05da/fcrepo-http-commons/src/main/resources/config/single-file/repository.json.

This can be done by setting the following property:

-Dfcrepo.modeshape.configuration=config/single-file/repository.json

 

Large Files on a Single Node Fedora 4 Installation

Use the following configuration:

CATALINA_OPTS="-Dfcrepo.modeshape.configuration=classpath:/config/single-file/repository.json" bin/catalina.sh run

 

 

Tip

Using the single-file configuration, ingest and retrieval of files up to 300 GB via Fedora 4's REST API were tested successfully. The files were ingested sequentially, retrieved, and compared bitwise against the original data. Larger sizes have not been tested due to HDD size limitations.
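The bitwise verification step in these tests can be reproduced with standard tools; a minimal sketch using a small generated file standing in for the ingest/retrieve round trip:

```shell
# Create a test file, copy it (standing in for the ingest/retrieve
# round trip), and compare the two bitwise with cmp.
dd if=/dev/urandom of=original.bin bs=1M count=1 2>/dev/null
cp original.bin downloaded.bin
cmp original.bin downloaded.bin && echo "files identical"
```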

 

Large File Upload/Download Roundtrip Tests

File Size    Upload                     Download
256GB        3h51m34s (18.87MB/sec)     43m09s (101.25MB/sec)
512GB        7h49m43s (18.60MB/sec)     1h29m15s (97.90MB/sec)
1TB          15h41m21s (18.57MB/sec)    3h21m44s (86.63MB/sec)

Federated Content Large File Download Roundtrip Tests

...

Serving Large Files via Filesystem Federation

Based on the tests below, we believe arbitrarily-large files can be projected into the repository via filesystem federation and downloaded via the REST API (tested up to 1TB).  The only apparent limitations are disk space available to store the files, and a sufficiently large Java heap size.

The federation is configured by adding an external source to repository.json:
"externalSources" : {
    "filesystem" : {
        "classname" : "org.fcrepo.connector.file.FedoraFileSystemConnector",
        "directoryPath" : "/mnt/isilon/fedora-dev/federated",
        "projections" : [ "default:/projection => /" ],
        "readonly" : true,
        "addMimeTypeMixin" : true,
        "contentBasedSha1" : "false"
    }
}
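With a projection like the one above in place, the federated tree should be visible over the REST API. A hypothetical check follows; the host, port, and projection path are assumptions based on the "projections" entry above:

```shell
# Request the root of the federated projection to confirm the
# federation is active; the /projection path mirrors the
# "default:/projection => /" mapping in the configuration.
curl -i http://localhost:8080/rest/projection
```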
File Size    Download
256 GB       1h09m26s (62.92MB/sec)
512 GB       2h00m15s (72.67MB/sec)
1 TB         3h57m25s (73.61MB/sec)

 

Direct Comparison of Different Transfer Methods

Based on the tests below, we believe arbitrarily-large files can be uploaded and downloaded via the REST API, using either repository storage or a federated filesystem (tested up to 1TB).  The only apparent limitations are disk space available to store the files, temp directory capacity, and a sufficiently large Java heap size.

Comparison of Upload and Download Times for Different Transfer Methods

Transfer Method          File Size    Upload                    Download
REST API (Federated)     1TB          732 min (84 GB/hour)      246 min (250 GB/hour)
REST API (Repository)    1TB          339 min (181 GB/hour)     250 min (246 GB/hour)
SCP                      1TB          383 min (160 GB/hour)
NFS                      1TB          336 min (183 GB/hour)
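The throughput figures follow directly from the file size and transfer time. For instance, the 1TB repository upload rate can be checked as follows (a quick sanity check, not part of the original test harness):

```shell
# 1 TB = 1024 GB uploaded in 339 minutes => throughput in GB per hour
awk 'BEGIN { printf "%.0f GB/hour\n", 1024 / (339 / 60) }'
# prints "181 GB/hour"
```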

Copying Files Between Federated Filesystem and Repository Storage

Source                  Destination             File Size    Copy Time
Repository storage      Federated filesystem    1TB          402 min (153 GB/hour)
Federated filesystem    Repository storage      1TB          345 min (178 GB/hour)

Range Retrieval

Retrieving a byte range is supported and has been tested with 1TB files for both repository storage and federated filesystem.  There is an integration test in the standard test suite for verifying that range retrieval works.  By default, this test uses a small binary size to avoid slowing down the test suite, but the size is configurable so it is easy for a developer to test files as large as local disk space allows.
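A byte range is requested with a standard HTTP Range header. A hypothetical request fetching the first mebibyte of a large datastream (the URL is a placeholder for your deployment):

```shell
# Hypothetical: retrieve bytes 0..1048575 of a large file's content.
# A server that honors the Range header responds with 206 Partial Content.
curl -H "Range: bytes=0-1048575" -o first-mebibyte.bin \
     http://localhost:8080/rest/objects/bigfile/fcr:content
```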

 

Repository Profile: Single-File with an additional external Resource

...

File Size    Projection Directory Request Duration    First Projected Node Request Duration    Download Duration    Throughput
2 GB         0m35.117s                                0m34.572s                                0m8.236s             248.66 MB/sec
10 GB
100 GB
300 GB
10*10 GB
