Description

DataStaR : Data Staging Repository

"The purpose of DataStaR is to support collaboration and data sharing among researchers during the research process, and to promote publishing or archiving data and high-quality metadata to discipline-specific data centers, and/or to Cornell's own digital repository." (see DataStaR: An Institutional Approach to Research Data Curation)

Requirements

Accessioner's Workbench Requirements

  • Observation data in tabular form – CSV files for initial implementation.
  • Small- to medium-scale datasets.
  • Original dataset may or may not be stored in Fedora.
  • Data normalization and cleansing operations to be applied to data.
  • Normalization and cleansing operations should be reusable.
  • Processed datasets will be stored in Fedora.
  • Results must be repeatable.

Implementation Assumptions

  • Selecting data operations and execution parameters requires human intervention.
  • The processing rate is unimportant.
  • Ingest into Fedora is controlled by content models.
  • Simple workflow model – see Visio Flow Diagram

Other Requirements

  • Rectangularity - checking that the data matrix is rectangular is a highly interactive task, which makes it a poor fit for Kepler.
  • Column headings - listing problem headings is not an issue, but "allow edits" is too interactive.
  • Data quality control - Kepler can certainly create the histograms or scatter plots of the data, but it has no capability for selecting data values and correcting them interactively.

Kepler Workflows

Kepler workflows were developed to illustrate how Kepler might be used as an accessioner's workbench.

FCRepoDateNormalizer

Retrieves a CSV datastream from an object in a Fedora Repository, processes date columns to standardize their format and saves the results to a local file in CSV format.

Screenshots :

Source file :

FCRepoDateNormalizer.xml

FCRepoLatLongSplitter

Retrieves a CSV datastream from an object in a Fedora Repository, splits columns containing both latitude and longitude coordinates into two separate columns and saves the results to a local file in CSV format.

Screenshots :

Source file :

FCRepoLatLongSplitter.xml

FCRepoLatitudeNormalizer

Retrieves a CSV datastream from an object in a Fedora Repository, processes latitude columns to standardize their format and saves the results to a local file in CSV format.

Screenshots :

Source file :

FCRepoLatNormalizer.xml

FCRepoLongitudeNormalizer

Retrieves a CSV datastream from an object in a Fedora Repository, processes longitude columns to standardize their format and saves the results to a local file in CSV format.

Screenshots :

Source file :

FCRepoLongNormalizer.xml

Kepler Actors

A number of new actors were created that provide data access and accessioning functionality.

The PythonActor was used extensively in this project. It is based on the Jython interpreter which provides standard Python functionality within a Java application. Because Jython is implemented in Java, it also provides access to any Java class or class library available to the JVM. In this way, it provides a rapid prototyping tool that supports coding in both Java and Python.

Change Log Writer Actor

This actor writes a log file with a summary of changes made during the latest run of the workflow.

Source file :

ChangeLogWriterActor.py

Input ports :
  • changes : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. a list of changes made.
  • filename : StringToken containing the fully qualified path for the change log file.
Other inputs :

The "Change Log Writer" script also needs the fully qualified path for the change log file. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'filename'
    OR
    A port on the PythonActor named 'filename' containing a StringToken.

Output ports :
  • None
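
The behavior described above can be sketched in plain Python. This is an illustration only, not the contents of ChangeLogWriterActor.py; the function name and log-line format are assumptions.

```python
# Illustrative sketch only: append one line per change, tagged with the
# row number, to the change log file. The real format is defined by
# ChangeLogWriterActor.py.
def write_change_log(path, row_number, changes):
    with open(path, 'a') as log:
        for change in changes:
            log.write('row %d: %s\n' % (row_number, change))
```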

The "Change Log Writer" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer
  • FCRepoLatitudeNormalizer
  • FCRepoLongitudeNormalizer
  • FCRepoLatLongSplitter

CSV Datastream Dissemination Actor

This actor performs several functions.

  1. It displays a form dialog that allows the user to enter parameters required to connect to a Fedora Repository.
  2. It connects to a Fedora Repository and extracts a datastream dissemination from the object specified in the form.
  3. It breaks the datastream into rows and sends them, one at a time, to the next actor in the workflow.
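
The actual actor uses the JyFedoREST package and a form dialog, but the two core steps can be sketched in plain Python. The base URL, PID, and datastream ID below are placeholder examples; the URL pattern is the standard Fedora 3.x REST pattern for a datastream dissemination.

```python
# Illustrative sketch only: build the Fedora 3.x REST URL for a datastream
# dissemination, and emit the CSV content one row at a time, as the actor
# does on its 'dissemination' output port.
def datastream_url(base_url, pid, dsid):
    return '%s/objects/%s/datastreams/%s/content' % (base_url, pid, dsid)

def iter_rows(csv_text):
    # Skip blank lines; each remaining line becomes one output token.
    for line in csv_text.splitlines():
        if line.strip():
            yield line
```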
Screenshot :

Required Packages :
  • JyFedoREST
  • FCRepoKepler - Uses SimpleHTMLFormDialog to display and manage the form.
Source file :

CSVDatastreamDisseminationActor.py

Input ports :
  • None
Output port :
  • dissemination : StringToken containing a single row from the CSV datastream.

The "CSV Datastream Dissemination" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer
  • FCRepoLatitudeNormalizer
  • FCRepoLongitudeNormalizer
  • FCRepoLatLongSplitter

Error Log Writer Actor

This actor writes a log file with a summary of errors encountered during the latest run of the workflow.

Source file :

ErrorLogWriterActor.py

Input port :
  • error : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. a list of errors encountered.
Other inputs :

The "Error Log Writer" script also needs the fully qualified path for the error log file. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'filename'
    OR
    A port on the PythonActor named 'filename' containing a StringToken.

Output ports :
  • None

The "Error Log Writer" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer
  • FCRepoLatitudeNormalizer
  • FCRepoLongitudeNormalizer
  • FCRepoLatLongSplitter

Normalize Date Actor

This actor looks at each column that contains a date value and removes extraneous time data when present.
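
The actual cleansing rules live in the FCRepoKepler RowAnalyzer class; the kind of transformation described above can be sketched as follows (the helper name and the exact time pattern are illustrative assumptions, not the actual implementation).

```python
import re

# Illustrative sketch only: strip a trailing time component such as
# ' 00:00:00' or ' 12:30' from a date string, leaving the date untouched.
def strip_time(value):
    return re.sub(r'\s+\d{1,2}:\d{2}(?::\d{2})?\s*$', '', value.strip())
```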

Required Packages :
  • FCRepoKepler - script is based on the RowAnalyzer class in fcrepo.kepler.RowAnalyzer.
Source file :

NormalizeDateActor.py

Input port :
  • input : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. an ordered list of columns in the row.
Other inputs :

The "Normalize Date" script also needs a list of indexes for the columns that contain dates. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'indexes'
    OR
    A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that dates occur every 4 columns, starting with column 7, for a total of 10 columns. This would generate the list 7,11,15,19,23,27,31,35,39,43.
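
The index-list convention above (shared by the other normalizer actors) can be sketched as a small parser; the function name is illustrative, and the real parsing lives inside the FCRepoKepler scripts.

```python
# Illustrative sketch only: parse '7,11,15' or a 'START+INCREMENT*COUNT'
# formula into the list of column indexes it describes.
def parse_indexes(spec):
    spec = spec.strip()
    if '+' in spec and '*' in spec:
        # Formula form: START + INCREMENT * COUNT
        start, rest = spec.split('+', 1)
        increment, count = rest.split('*', 1)
        start, increment, count = int(start), int(increment), int(count)
        return [start + increment * i for i in range(count)]
    # Comma-separated list form
    return [int(n) for n in spec.split(',')]
```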

Output port :
  • output : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.

The "Normalize Date" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer

Normalize Latitude Actor

This actor looks at each column that contains a latitude value and assures that all values are valid and in the same format.
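
The actual validation lives in the FCRepoKepler RowAnalyzer class; a minimal sketch of the idea, assuming decimal degrees with an optional N/S suffix (the function name and output format are illustrative assumptions):

```python
# Illustrative sketch only: convert a latitude such as '42.45 N' or
# '-42.45' to a signed decimal-degree string, rejecting values outside
# the valid -90..90 range.
def normalize_latitude(value):
    v = value.strip().upper()
    sign = 1.0
    if v and v[-1] in 'NS':
        if v[-1] == 'S':
            sign = -1.0
        v = v[:-1].strip()
    lat = sign * float(v)
    if not -90.0 <= lat <= 90.0:
        raise ValueError('latitude out of range: %r' % value)
    return '%.4f' % lat
```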

Required Packages :
  • FCRepoKepler - script is based on the RowAnalyzer class in fcrepo.kepler.RowAnalyzer.
Source file :

NormalizeLatitudeActor.py

Input port :
  • input : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. an ordered list of columns in the row.
Other inputs :

The "Normalize Latitude" script also needs a list of indexes for the columns that contain latitudes. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'indexes'
    OR
    A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that latitudes occur every 4 columns, starting with column 7, for a total of 10 columns. This would generate the list 7,11,15,19,23,27,31,35,39,43.

Output port :
  • output : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.

The "Normalize Latitude" PythonActor is used in the following workflows:

  • FCRepoLatitudeNormalizer

Normalize Longitude Actor

This actor looks at each column that contains a longitude value and assures that all values are valid and in the same format.

Required Packages :
  • FCRepoKepler - script is based on the RowAnalyzer class in fcrepo.kepler.RowAnalyzer.
Source file :

NormalizeLongitudeActor.py

Input port :
  • input : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. an ordered list of columns in the row.
Other inputs :

The "Normalize Longitude" script also needs a list of indexes for the columns that contain longitudes. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'indexes'
    OR
    A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that longitudes occur every 4 columns, starting with column 7, for a total of 10 columns. This would generate the list 7,11,15,19,23,27,31,35,39,43.

Output port :
  • output : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.

The "Normalize Longitude" PythonActor is used in the following workflows:

  • FCRepoLongitudeNormalizer

Output Prep Actor

This actor sorts through the output created by a RowAnalyzer-based script and routes the data to the proper output writer.

Source file :

OutputPrepActor.py

Input port :
  • input : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.
Other inputs :

The "Output Prep" script also needs the character to be used as a separator between columns in the output text string. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'separator'
    OR
    A port named 'separator' containing a StringToken.

Output ports :
  • output : StringToken containing a string representing the output row in a CSV file. It is constructed by concatenating the values in the columns array using a separator character.
  • changes : ObjectToken containing a tuple with 2 items :
    1. the current row number as an integer.
    2. the tuple/array of changes made received on the input port.
  • errors : ObjectToken containing a tuple with 2 items :
    1. the current row number as an integer.
    2. the tuple/array of errors encountered received on the input port.
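
The routing described by the three output ports can be sketched in plain Python (the function name is illustrative; the real logic is in OutputPrepActor.py):

```python
# Illustrative sketch only: rebuild the CSV output row and pair the
# changes and errors with their row number for the log-writer actors.
def prep_output(row_number, columns, changes, errors, separator=','):
    output_row = separator.join(str(c) for c in columns)
    return output_row, (row_number, changes), (row_number, errors)
```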

The "Output Prep" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer
  • FCRepoLatitudeNormalizer
  • FCRepoLongitudeNormalizer
  • FCRepoLatLongSplitter

Row To Columns Actor

This actor splits a text string representing a 'row' into 'columns' using a separator character such as ','.

Source file :

RowToColumnsActor.py

Input port :
  • row : StringToken containing a string representation of a single row in a spreadsheet or other data matrix.
Output port :
  • columns : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. a tuple containing an ordered list of values for each column in the row.

The "Row To Columns" script also requires the PythonActor to have a parameter named 'separator' that contains the character that was used as a separator between columns in the input text string.
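
A naive `row.split(separator)` breaks on quoted fields that contain the separator; Python's csv module handles that case, as this sketch shows (the function name is illustrative, and how the actual actor handles quoting is not documented here):

```python
import csv

# Illustrative sketch only: split one CSV row into its columns, honoring
# quoted fields that may themselves contain the separator character.
def row_to_columns(row_number, row, separator=','):
    columns = next(csv.reader([row], delimiter=separator))
    return (row_number, tuple(columns))
```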

The "Row To Columns" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer
  • FCRepoLatitudeNormalizer
  • FCRepoLongitudeNormalizer
  • FCRepoLatLongSplitter

Split Lat/Long Actor

This actor looks for columns that contain both latitude and longitude coordinates and splits them into two separate columns. It leaves the latitude in the original column and moves the longitude to a new column immediately next to the latitude.
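
The actual splitting logic lives in the FCRepoKepler RowAnalyzer class; a sketch of the idea, assuming pairs separated by whitespace, commas, or semicolons, with optional hemisphere letters (the pattern and function name are illustrative assumptions):

```python
import re

# Illustrative pattern only: split a combined coordinate such as
# '42.45 N 76.48 W' or '42.45,-76.48' into (latitude, longitude).
_PAIR = re.compile(
    r'^\s*(-?\d+(?:\.\d+)?\s*[NS]?)[,;\s]+(-?\d+(?:\.\d+)?\s*[EW]?)\s*$',
    re.IGNORECASE)

def split_lat_long(value):
    match = _PAIR.match(value)
    if match is None:
        raise ValueError('unrecognized coordinate pair: %r' % value)
    return match.group(1).strip(), match.group(2).strip()
```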

Required Packages :
  • FCRepoKepler - script is based on the RowAnalyzer class in fcrepo.kepler.RowAnalyzer.
Source file :

SplitLatLongActor.py

Input port :
  • input : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. an ordered list of columns in the row.
Other inputs :

The "Split Lat/Long" script also needs a list of indexes for the columns that contain lat/long coordinates. This can be acquired in one of two ways:
    A string parameter on the PythonActor named 'indexes'
    OR
    A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that coordinates occur every 4 columns, starting with column 7, for a total of 10 columns. This would generate the list 7,11,15,19,23,27,31,35,39,43.

Output port :
  • output : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.

The "Split Lat/Long" PythonActor is used in the following workflows:

  • FCRepoLatLongSplitter
