Test Data

 

Set 1: Digital Corpora govdocs1

 

Set 2: OpenPlanets

 

Set 3: Random binary data created from a stable set of file sizes

 

 

 

The govdocs dataset includes (…), (…some characteristics, e.g. N PDF documents, varying in size from X to Y)

 

 

 

The OpenPlanets dataset …

 

 

 

(Description of fixture processing, generation of BagIt bags)

 

The generated binary data set

 

The set is created by the script https://github.com/futures/ff-fixtures/blob/master/create_random_files.sh, which writes the files to objects/random and creates the manifest.txt file used by the JMeter tests at https://github.com/futures/ff-jmeter-madness.

 

It uses standard GNU commands such as dd and rm and iterates over the list of integer file sizes in https://github.com/futures/ff-fixtures/blob/master/random_sizes.data, creating one file of the given size in megabytes per iteration. To a certain extent this ensures the comparability of the measurements, since exactly the same number of files with exactly the same number of bytes is created each time the data set is generated from the same input file.
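
The following is only a minimal sketch of that approach, not the actual script (which additionally writes the manifest.txt used by the JMeter tests); the file and directory names follow the description above:

mkdir -p objects/random
i=0
while read -r size; do
    i=$((i + 1))
    # one file of $size megabytes of random data per entry in random_sizes.data
    dd if=/dev/urandom of="objects/random/random_${i}.data" bs=1M count="$size"
done < random_sizes.data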

 

In order to create the binary test data set, check out the project https://github.com/futures/ff-jmeter-madness first:

 

git clone https://github.com/futures/ff-jmeter-madness

 

then change into the cloned directory and initialise and update the submodules:

 

cd ff-jmeter-madness
git submodule init && git submodule update

 

This will check out the fixtures submodule, which contains the script create_random_files.sh.

 

Switch to the fixtures subdirectory:

 

cd fixtures

 

and run the script:

 

./create_random_files.sh

 

This will create the directory objects/random and, using dd, write the random binaries to objects/random/random_N.data.

 

Additionally, a file manifest.txt is generated, which the JMeter tests use to find the random binaries and upload them via HTTP requests.

 

Now you can fire up JMeter and open the JMX file containing the test plan.
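
For example (the test plan and result file names below are placeholders; use the JMX file shipped with ff-jmeter-madness):

jmeter -t path/to/testplan.jmx
jmeter -n -t path/to/testplan.jmx -l results.jtl

The first command opens the test plan in the JMeter GUI; the second runs it in non-GUI mode and writes the results to a log file.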

 

Generating random file sizes with a Gaussian distribution

 

If you want to create a different input file for the file sizes, this can be done in Octave/MATLAB:

 

octave:1> x = round((stdnormal_rnd(1,100) * 128) + 256);

 

Here stdnormal_rnd(1,100) draws a row vector of 100 samples from the standard normal distribution (mean 0, standard deviation 1), 128 is the scaling factor (the standard deviation of the resulting sizes), and adding 256 (twice the scaling factor) shifts the distribution to the right so that the values x lie roughly between 1 and 512 (about 95 % of the samples fall within two standard deviations of the mean).
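
For example, a standard-normal sample of z = 1.5 yields a file size of round(1.5 * 128 + 256) = 448, i.e. a 448 MB file.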

 

In order to create a larger set of file sizes with a larger median file size you could use round((stdnormal_rnd(1,500) * 256) + 512). This will create 500 file size entries, most of them between 1 and 1024.

 

Then save the generated vector to a file:

 

octave:2> save "file_sizes.txt" x;

 

The data file needs some postprocessing (remove comments, insert line breaks). This can be achieved with some standard GNU tools:

 

sed '/^\#/d' file_sizes.txt | tr " " "\n" | sed '/^$/d' > tmp.txt 

 

mv tmp.txt file_sizes.txt
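
To sanity-check the post-processed file you can count the entries and look at the smallest and largest values (an optional check, not part of the original workflow):

wc -l file_sizes.txt
sort -n file_sizes.txt | head -n 1
sort -n file_sizes.txt | tail -n 1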

 

Fig. 1: Histogram of the current data set
Fig. 2: Histogram of the reduced random data set

 

EDRM File Format Data Set

“data-set” folder containing 158 folders with 381 files

http://www.edrm.net/resources/data-sets/edrm-file-format-data-set

 

Apache Mahout's meta collection

https://cwiki.apache.org/confluence/display/MAHOUT/Collections
