Test Data

 

Set 1: Digital Corpora govdocs1

 

Set 2: OpenPlanets

 

Set 3: Random binary data created from a stable set of file sizes

 

 

 

The govdocs dataset includes (…), (…some characteristics, e.g. N PDF documents, varying in size from X to Y)

 

 

 

The OpenPlanets dataset …

 

 

 

(Description of fixture processing, generation of BagIt bags)

 

The generated binary data set

 

The set is created by the script https://github.com/futures/ff-fixtures/blob/master/create_random_files.sh, which writes the files to objects/random and creates the manifest.txt file used by the JMeter tests at https://github.com/futures/ff-jmeter-madness.

 

It uses standard GNU commands such as dd and rm and iterates over the list of integer file sizes in https://github.com/futures/ff-fixtures/blob/master/random_sizes.data, creating one file of the given size in megabytes per iteration. To a certain extent this ensures the comparability of the measurements, since exactly the same number of files with exactly the same number of bytes is created each time the data set is generated from the same input file.
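
The following is only a minimal sketch of that approach, not the actual script (which additionally writes the manifest.txt used by the JMeter tests); the file and directory names follow the description above:

mkdir -p objects/random
i=0
while read -r size; do
    i=$((i + 1))
    # one file of $size megabytes of random data per entry in random_sizes.data
    dd if=/dev/urandom of="objects/random/random_${i}.data" bs=1M count="$size"
done < random_sizes.data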

 

In order to create the binary test data set, check out the project https://github.com/futures/ff-jmeter-madness first:

 

git clone https://github.com/futures/ff-jmeter-madness

 

then change into the cloned directory and initialise and update the submodules:

 

cd ff-jmeter-madness
git submodule init && git submodule update

 

This will check out the fixtures submodule, which contains the script create_random_files.sh.

 

Switch to the fixtures subdirectory:

 

cd fixtures

 

and run the script:

 

./create_random_files.sh

 

This will create the directory objects/random and, using dd, write the random binaries to objects/random/random_N.data.

 

Additionally, a file manifest.txt is generated, which the JMeter tests use to find the random binaries and upload them via HTTP requests.

 

Now you can fire up JMeter and open the JMX file containing the test plan.
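
For example (the test plan and result file names below are placeholders; use the JMX file shipped with ff-jmeter-madness):

jmeter -t path/to/testplan.jmx
jmeter -n -t path/to/testplan.jmx -l results.jtl

The first command opens the test plan in the JMeter GUI; the second runs it in non-GUI mode and writes the results to a log file.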

 

Generating random file sizes with a Gaussian distribution

 

If you want to create a different input file for the file sizes, this can be done in Octave/MATLAB:

 

octave:1> x = round((stdnormal_rnd(1,100) * 128) + 256);

 

Here stdnormal_rnd(1,100) draws a row vector of 100 samples from the standard normal distribution (mean 0, standard deviation 1), 128 is the scaling factor (the standard deviation of the resulting sizes), and adding 256 (twice the scaling factor) shifts the distribution to the right so that the values x lie roughly between 1 and 512 (about 95 % of the samples fall within two standard deviations of the mean).
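
For example, a standard-normal sample of z = 1.5 yields a file size of round(1.5 * 128 + 256) = 448, i.e. a 448 MB file.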

 

In order to create a larger set of file sizes with a larger median file size you could use round((stdnormal_rnd(1,500) * 256) + 512). This will create 500 file size entries, most of them between 1 and 1024.

 

Then save the generated vector to a file:

 

octave:2> save "file_sizes.txt" x;

 

The data file needs some postprocessing (remove comments, insert line breaks). This can be achieved with some standard GNU tools:

 

sed '/^\#/d' file_sizes.txt | tr " " "\n" | sed '/^$/d' > tmp.txt 

 

mv tmp.txt file_sizes.txt
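
To sanity-check the post-processed file you can count the entries and look at the smallest and largest values (an optional check, not part of the original workflow):

wc -l file_sizes.txt
sort -n file_sizes.txt | head -n 1
sort -n file_sizes.txt | tail -n 1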

 

Fig. 1: Histogram of the current data set
Fig. 2: Histogram of the reduced random data set

 

EDRM File Format Data Set

“data-set” folder containing 158 folders with 381 files

http://www.edrm.net/resources/data-sets/edrm-file-format-data-set

 

Apache Mahout's meta collection

https://cwiki.apache.org/confluence/display/MAHOUT/Collections
