...

Stanford has a collection of publications (Saltworks) consisting of page images, metadata, and arrangement information. It contains 16,712 objects, 655,237 items, and 273 GB of data, with the following distribution of file sizes, object sizes, and per-object file counts:

 

Code Block
> quantile(file_sizes$size, c(0, .5, .7, .9, 1))
0% 50% 70% 90% 100%
0 43447 195719 1835010 288032768
 
> quantile(object_size$V3, c(0, .5, .7, .9, .95, .99, 1))
        0%        50%        70%        90%        95%        99%       100% 
     51302    4680240   10743517   39806094   72375144  221829327 1230705287 

Code Block
> quantile(file_counts$X1,  c(0, .5, .7, .9, .95, .99, 1))
     0%     50%     70%     90%     95%     99%    100% 
   7.00   22.00   29.00   62.00   99.40  280.08 1478.00
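The R vectors summarized above are not defined on this page. As a rough sketch, assuming each object lives in its own directory under a single root (such as the /data-ro/assets path used by the ingest script further down), per-object file counts and total sizes could be collected as follows. The name object_stats is a hypothetical helper for illustration, not part of the original tooling.

```shell
#!/bin/bash
# Hypothetical helper: print "object_id file_count total_bytes" for every
# object directory directly under the given root.
object_stats() {
  root=$1
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    count=0
    bytes=0
    for f in "$dir"*; do
      [ -f "$f" ] || continue          # skip subdirectories and empty globs
      count=$((count + 1))
      bytes=$((bytes + $(wc -c < "$f")))
    done
    echo "$(basename "$dir") $count $bytes"
  done
}
```

Redirecting the output to a file (for example `object_stats /data-ro/assets > object_size.txt`) would give a whitespace-separated table that R can load with read.table.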

 

 

 

In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system; how they are associated back to the object is still to be determined (TBD).

...

The tests use a clean install of Fedora 3.7.1 with these install properties:

Code Block
database=mysql
database.driver=included
database.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.driver=included
database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.username=fedora
database.password=redacted
install.type=custom
deploy.local.services=false
install.tomcat=false
servlet.engine=existingTomcat
fedora.home=/home/lyberadmin/apps/fedora/home
fedora.serverHost=sul-fedora-dev-a.stanford.edu
fedora.serverContext=fedora
tomcat.http.port=8080
tomcat.shutdown.port=8005
ssl.available=true
tomcat.ssl.port=8443
tomcat.home=/usr/share/tomcat6
ri.enabled=true
messaging.enabled=false
messaging.uri=
apim.ssl.required=false
apia.ssl.required=false
apia.auth.required=false
fesl.authz.enabled=false
fesl.authn.enabled=true
xacml.enabled=false
keystore.file=included

 

Tomcat is proxied through an Apache HTTPD server.

 

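The ingest is driven by the following bash script, which reads object identifiers from standard input and runs THREADS ingests in parallel: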

Code Block
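#!/bin/bash
# Ingest each object named on stdin into Fedora 3 and record how long it took.
# THREADS (the number of parallel jobs) must be set in the environment.
base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora"

# Print and log: object id, datastream count, total bytes, elapsed seconds.
RuntimePrint()
{
 # date +%s%N yields nanoseconds; convert the difference to seconds.
 duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./')
 echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec"
 echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time
}

# Create the object, then add every file in its asset directory as a
# managed (controlGroup=M) datastream.
CreateObject() {
    pid="druid:$1"
    datastreams=0
    size=0
    curl -X POST "$base_url/objects/$pid" &> /dev/null
    cd "/data-ro/assets/$1" || return

    for f in *; do
      [ -e "$f" ] || continue
      datastreams=$((datastreams + 1))
      size=$((size + $(stat -c "%s" "$f")))
      curl -X POST --data-binary "@$f" "$base_url/objects/$pid/datastreams/$f?controlGroup=M"  &> /dev/null
    done
    cd /data
}

# Time the ingest of one object; skip ids that have no asset directory.
BenchmarkObject() {
  objectId=$1
  if [ -d "/data-ro/assets/$objectId" ]; then
    m1t=$(date +%s%N)
    CreateObject "$objectId"
    m2t=$(date +%s%N)
    RuntimePrint
  fi
}

# Export the functions and base_url so the shells spawned by GNU parallel see them.
export -f BenchmarkObject
export -f CreateObject
export -f RuntimePrint
export base_url

# Read object ids from stdin, one per line, and run $THREADS ingests at a time.
parallel -P "$THREADS" --env _ BenchmarkObject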
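The duration computed in RuntimePrint is the difference between two `date +%s%N` nanosecond timestamps, divided by 10^9 and truncated to three decimals. The same arithmetic can be sketched with shell integer math alone, which avoids the bc and sed round trip. The function name elapsed_seconds is hypothetical and is not used by the script above.

```shell
#!/bin/bash
# Hypothetical helper: convert two nanosecond timestamps (start, end)
# into elapsed seconds with millisecond precision, truncating like bc scale=3.
elapsed_seconds() {
  start_ns=$1
  end_ns=$2
  diff_ms=$(( (end_ns - start_ns) / 1000000 ))
  printf '%d.%03d\n' $((diff_ms / 1000)) $((diff_ms % 1000))
}
```

For instance, `elapsed_seconds 1000000000 2463000000` covers a 1,463,000,000 ns interval and prints 1.463.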

 

Test 1a: Single-threaded ingest

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
  0.39300   1.46300   2.31500   6.64700  11.65780  36.77232 353.48700 
0.2597 objects/s  (objects per second)

...

Retrieve object profile

Code Block
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.002 0.054 0.062 0.077 0.089 0.123 3.580 
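How the retrieval times were collected is not shown here. One plausible approach, sketched under the assumption that the object profile is fetched through the Fedora 3 REST endpoint `GET /objects/{pid}` and that base_url is set as in the ingest script, is to let curl report the total request time. The function name time_profile and the fallback URL are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: fetch one object profile and print "pid seconds".
# Falls back to http://localhost/fedora when base_url is unset (an assumption).
time_profile() {
  pid=$1
  url="${base_url:-http://localhost/fedora}/objects/$pid?format=xml"
  # -s: quiet, -o /dev/null: discard the body, -w: print total request time
  t=$(curl -s -o /dev/null -w '%{time_total}' "$url")
  echo "$pid $t"
}
```

Feeding a list of identifiers through this function, one per line, would yield a two-column file whose second column corresponds to the kind of vector summarized by the quantile call above.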

 

Test 1c: 8-thread ingest test

 

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
  0.71400   3.46600   5.09440  14.32520  23.55560  72.11128 632.71200 
2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k
584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps
Tue Nov 19 19:09:08 PST 2013 : 16693 objects
 
1.031 objects/s  (objects per second)

 

 

Test 1d: Multi-threaded iteration test

Code Block
4 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.011 0.013 0.014 0.017 0.020 0.031 0.073 
Tue Nov 19 14:08:44 PST 2013 : retrieving all objects
160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k
0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps
Tue Nov 19 14:11:26 PST 2013 : 16693 objects
Tue Nov 19 14:11:26 PST 2013 : done
103 objects/s  (objects per second)
 
8 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.011 0.022 0.025 0.031 0.034 0.045 0.093
Tue Nov 19 14:05:10 PST 2013 : retrieving all objects
159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k
0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps
Tue Nov 19 14:07:39 PST 2013 : 16693 objects
Tue Nov 19 14:07:39 PST 2013 : done
112.1 objects/s  (objects per second)
 
16 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.012 0.024 0.028 0.035 0.040 0.052 0.149 
Tue Nov 19 14:11:50 PST 2013 : retrieving all objects
161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k
0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps
Tue Nov 19 14:14:20 PST 2013 : 16693 objects
Tue Nov 19 14:14:20 PST 2013 : done
110.9 objects/s  (objects per second)
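The objects/s figures follow from dividing the object count by the elapsed wall-clock time reported by time. A small sketch of that arithmetic, with hypothetical function names, assuming the elapsed field uses the h:mm:ss or m:ss.ss layout seen in the output above:

```shell
#!/bin/bash
# Convert an elapsed time such as "4:29:58" or "2:42.00" into seconds.
elapsed_to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Print throughput in objects per second.
# $1 = number of objects, $2 = elapsed time in the format above
objects_per_second() {
  awk -v n="$1" -v s="$(elapsed_to_seconds "$2")" 'BEGIN { printf "%.3f\n", n / s }'
}
```

With the numbers reported on this page, `objects_per_second 16693 4:29:58` gives 1.031 for the 8-thread ingest (16,693 objects in 16,198 s), and `objects_per_second 16693 2:42.00` gives 103.043 for the 4-thread iteration, in line with the figures quoted in the test output.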

 

Test 2: Simple Ingest into Fedora 4

Ingest all the data into fcrepo4 as binaries on containers.

Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071

...

Test 2b: Ingest all the data as containers arranged in a druid tree

 

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
   0.6590    6.1800    9.1844   24.7972   44.7202  226.7644 1094.8120 
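A druid tree nests each object under path segments derived from its identifier, presumably so that no single container has to hold every object as a direct child. As a minimal sketch, assuming the usual Stanford druid shape of two letters, three digits, two letters, and four digits, the container path could be derived like this. The function name druid_tree_path is hypothetical, and the letter check is looser than the real druid alphabet.

```shell
#!/bin/bash
# Hypothetical helper: map "druid:bb123cd4567" (or "bb123cd4567")
# to the nested path "bb/123/cd/4567".
druid_tree_path() {
  druid=${1#druid:}
  case $druid in
    [a-z][a-z][0-9][0-9][0-9][a-z][a-z][0-9][0-9][0-9][0-9]) ;;
    *) echo "not a druid: $1" >&2; return 1 ;;
  esac
  # Split into 2 + 3 + 2 + 4 characters and join with slashes.
  printf '%s\n' "$druid" | sed 's|^\(..\)\(...\)\(..\)\(....\)$|\1/\2/\3/\4|'
}
```

Under this layout an object with the placeholder identifier bb123cd4567 would sit four levels deep, at bb/123/cd/4567, beneath whatever root path the ingest uses.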

 

Ingest speed over time

Test 2c: Ingest all the data as containers in a druid tree AND use fcr:batch

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
      0%      50%      70%      90%      95%      99%     100% 
  0.4670   5.2440   7.8554  20.3558  33.8186 101.2618 711.0130 

 

Test 2d: Use a 4-node cluster to do a druid-tree ingest

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
   0.9050    9.7360   13.9442   29.6692   48.0332  146.0208 1109.1760 


Test 3: Realistic Ingest into Fedora 3

...

Test 4: Realistic Ingest into Fedora 4

  • Add RDF as properties on resources
  • Add each page as an ordered same-name sibling on a container

...