Data Overview
Stanford has a collection of publications (Saltworks) consisting of page images, metadata, and arrangement information. It contains 16,712 objects, 655,237 items, and 273 GB of data, with the following distribution:
...
- 4.0K DC
- 20K Feigenbaum_00013946-METS.xml
- 4.0K Feigenbaum_00013946-TEXT.xml
- 4.0K RELS-EXT
- 44K bd826tf2716.pdf
- 4.0K bd826tf2716.txt
- 72K bd826tf2716_00001.jp2
- 8.0K bd826tf2716_00001.xml
- 64K bd826tf2716_00002.jp2
- 8.0K bd826tf2716_00002.xml
- 68K bd826tf2716_000BW.jp2
- 4.0K checksum
- 4.0K descMetadata
- 4.0K extracted_entities.xml
- 4.0K flipbook.json
- 4.0K flipbook.old
- 0 location
- 0 properties
- 0 stories
- 4.0K thumb.jpg
- 4.0K zotero.xml
Test 1: Simple Ingest into Fedora 3
For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.
...
Code Block |
---|
    #!/bin/bash
    # Ingest each object directory into Fedora 3 and record per-object timings.
    base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora"

    # Print and log: object id, datastream count, total bytes, elapsed seconds.
    RuntimePrint() {
      duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)" | bc | sed 's/^\./0./')
      echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec"
      echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time
    }

    # Create the object, then add every file in its directory as a managed
    # datastream named after the file.
    CreateObject() {
      pid="druid:$1"
      datastreams=0
      size=0
      curl -X POST "$base_url/objects/$pid" &> /dev/null
      cd "/data-ro/assets/$1" || return
      for f in *; do
        [ -e "$f" ] || continue   # skip the literal "*" of an empty directory
        datastreams=$((datastreams + 1))
        size=$((size + $(stat -c "%s" "$f")))
        curl -X POST --data-binary "@$f" "$base_url/objects/$pid/datastreams/$f?controlGroup=M" &> /dev/null
      done
      cd /data
    }

    # Time the ingest of one object (nanosecond timestamps before and after).
    BenchmarkObject() {
      objectId=$1
      if [ -d "/data-ro/assets/$objectId" ]; then
        m1t=$(date +%s%N)
        CreateObject "$objectId"
        m2t=$(date +%s%N)
        RuntimePrint
      fi
    }

    export -f BenchmarkObject
    export -f CreateObject
    export -f RuntimePrint
    export base_url

    # Object ids are read from stdin, one per line.
    parallel -P "$THREADS" --env _ BenchmarkObject
Test 1a: Single-threaded ingest
Code Block |
---|
    > quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
           0%       50%       70%       90%       95%       99%      100%
      0.39300   1.46300   2.31500   6.64700  11.65780  36.77232 353.48700
0.2597 objects/s (objects per second)
Test 1b: Single-threaded iteration
Retrieve object profile
Code Block |
---|
    > quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
       0%   50%   70%   90%   95%   99%  100%
    0.002 0.054 0.062 0.077 0.089 0.123 3.580
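The script behind this iteration test is not shown. As a minimal sketch of how such a measurement could be taken, the following assumes the Fedora 3 REST API object profile request (`GET /objects/{pid}?format=xml`), the same `druid:` PID prefix as the ingest script, and curl's built-in timer. The function names `ProfileUrl` and `TimeProfile` and the two-column output (object id, seconds) are hypothetical, chosen so that the second column could be read as `V2` in R.

```shell
#!/bin/bash
# Hypothetical sketch; not the script that produced the numbers above.
base_url="${base_url:-http://fedoraAdmin:fedoraAdmin@localhost/fedora}"

# Build the object profile URL for one object id (assumes the "druid:" prefix).
ProfileUrl() {
  printf '%s/objects/druid:%s?format=xml\n' "$base_url" "$1"
}

# Fetch one profile, discard the body, and print "<id> <seconds>".
TimeProfile() {
  local seconds
  seconds=$(curl -s -o /dev/null -w '%{time_total}' "$(ProfileUrl "$1")")
  echo "$1 $seconds"
}
```

With a running repository, something like `ls /data-ro/assets | while read -r id; do TimeProfile "$id"; done` would emit one timing line per object.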
Test 1c: 8-thread ingest test
Code Block |
---|
    > quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
           0%       50%       70%       90%       95%       99%      100%
      0.71400   3.46600   5.09440  14.32520  23.55560  72.11128 632.71200

    2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k
    584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps
    Tue Nov 19 19:09:08 PST 2013 : 16693 objects

    1.031 objects/s (objects per second)
Test 1d: Multi-threaded iteration test
Code Block |
---|
    4 threads:
    > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
       0%   50%   70%   90%   95%   99%  100%
    0.011 0.013 0.014 0.017 0.020 0.031 0.073
    Tue Nov 19 14:08:44 PST 2013 : retrieving all objects
    160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k
    0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps
    Tue Nov 19 14:11:26 PST 2013 : 16693 objects
    Tue Nov 19 14:11:26 PST 2013 : done
    103 objects/s (objects per second)

    8 threads:
    > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
       0%   50%   70%   90%   95%   99%  100%
    0.011 0.022 0.025 0.031 0.034 0.045 0.093
    Tue Nov 19 14:05:10 PST 2013 : retrieving all objects
    159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k
    0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps
    Tue Nov 19 14:07:39 PST 2013 : 16693 objects
    Tue Nov 19 14:07:39 PST 2013 : done
    112.1 objects/s (objects per second)

    16 threads:
    > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
       0%   50%   70%   90%   95%   99%  100%
    0.012 0.024 0.028 0.035 0.040 0.052 0.149
    Tue Nov 19 14:11:50 PST 2013 : retrieving all objects
    161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k
    0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps
    Tue Nov 19 14:14:20 PST 2013 : 16693 objects
    Tue Nov 19 14:14:20 PST 2013 : done
    110.9 objects/s (objects per second)
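The objects/s figures in the multi-threaded runs are consistent with dividing the object count by the `elapsed` wall-clock time reported by `time`. A small sketch of that arithmetic follows; the helper names `ElapsedSeconds` and `Throughput` are hypothetical, and the sketch assumes the elapsed value is in `h:mm:ss` or `m:ss.ss` form as in the output above.

```shell
#!/bin/bash
# Convert an "elapsed" value from time(1), e.g. 4:29:58 or 2:42.00, to seconds.
ElapsedSeconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Objects per second, rounded to the given number of decimals (default 1).
Throughput() {
  local objects=$1 elapsed=$2 decimals=${3:-1}
  awk -v n="$objects" -v s="$(ElapsedSeconds "$elapsed")" -v d="$decimals" \
    'BEGIN { fmt = "%." d "f\n"; printf fmt, n / s }'
}

Throughput 16693 4:29:58 3   # 8-thread ingest: prints 1.031
Throughput 16693 2:28.86     # 8-thread iteration: prints 112.1
```

For example, the 8-thread ingest took 4:29:58, or 16198 seconds, for 16693 objects, which gives about 1.031 objects/s.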
Test 2: Simple Ingest into Fedora 4
Ingest all the data into fcrepo4 as datastreams on objects.
Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071
Using curl:
Test 2a: Ingest all the data as objects and datastreams, one at a time
Test 2b: Ingest all the data as objects arranged in a druid tree
Code Block |
---|
    > quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
           0%       50%       70%       90%       95%       99%      100%
       0.6590    6.1800    9.1844   24.7972   44.7202  226.7644 1094.8120
Ingest speed over time
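The exact path layout used for the druid tree in Test 2b is not recorded here. As a minimal sketch, assuming the conventional druid pattern (two letters, three digits, two letters, four digits) and a tree whose last segment repeats the full druid, an object id such as `bd826tf2716` could be mapped to a hierarchical path like this. The function name `DruidTreePath` is hypothetical.

```shell
#!/bin/bash
# Split a druid such as bd826tf2716 into tree segments.
# Assumption: pattern is 2 letters, 3 digits, 2 letters, 4 digits, and the
# full druid is repeated as the final path segment.
DruidTreePath() {
  local druid
  druid=${1#druid:}   # tolerate an optional "druid:" prefix
  case $druid in
    [a-z][a-z][0-9][0-9][0-9][a-z][a-z][0-9][0-9][0-9][0-9])
      echo "$druid" | sed -E 's|^(.{2})(.{3})(.{2})(.{4})$|\1/\2/\3/\4/&|'
      ;;
    *)
      echo "not a druid: $1" >&2
      return 1
      ;;
  esac
}

DruidTreePath bd826tf2716   # prints bd/826/tf/2716/bd826tf2716
```

Spreading objects across nested containers this way keeps any single container from holding all 16693 children, which is the usual motivation for a tree layout.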
Test 2c: Ingest all the data as objects in a druid tree AND use fcr:batch
Code Block |
---|
    > quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
          0%      50%      70%      90%      95%      99%     100%
      0.4670   5.2440   7.8554  20.3558  33.8186 101.2618 711.0130
Test 2d: Use a 4-node cluster to do a druid-tree ingest
Code Block |
---|
    > quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
           0%       50%       70%       90%       95%       99%      100%
       0.9050    9.7360   13.9442   29.6692   48.0332  146.0208 1109.1760
Test 3: Realistic Ingest into Fedora 3
Ingest all the data into fcrepo3 making reasonable content modeling assumptions:
- each page as an object
- ?
Using ActiveFedora:
Test 4: Realistic Ingest into Fedora 4
- add RDF as properties on objects
- each page as an ordered same-name sibling on an object
...