1. Extra-repository access control

Issue

We want to apply access controls (ideally WebACs) not only to repository objects, but also to the repository's indexes.

E.g. access to resource http://myrepo.edu/private/res1 is controlled by access policies in the Fedora repo, but these policies are not honored in a triplestore or Solr index, where the metadata can be accessed by anyone.

We want policies to be stored in one place (Fedora) and re-used for both repo and index access.

Proposed solution

Enable single-point access to both repo resources and indexes, e.g.

http://myrepo.edu/fcr:search/triplestoreIndex01/ (where the query can be either appended as a GET parameter or sent as a POST payload)

Restrict direct client access to the indexes via firewall rules, so that the access controls cannot be bypassed.
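A minimal sketch of how such a single-point search endpoint could filter index hits against the repository policies. Everything here is an assumption made for illustration: PolicyStore, can_read and search are hypothetical names (not Fedora or WebAC APIs), policies are reduced to "URI prefix -> allowed readers", and resources without a policy are treated as public.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyStore:
    """Hypothetical stand-in for the access policies stored in the repo."""
    # Maps a resource URI prefix to the set of agents allowed to read it.
    readers_by_prefix: dict = field(default_factory=dict)

    def can_read(self, agent: str, uri: str) -> bool:
        # Assumption: the longest matching prefix wins and resources
        # without any matching policy are public.
        matches = [p for p in self.readers_by_prefix if uri.startswith(p)]
        if not matches:
            return True
        return agent in self.readers_by_prefix[max(matches, key=len)]


def search(index_hits, policies, agent):
    """Return only the index hits that the agent may read in the repo."""
    return [uri for uri in index_hits if policies.can_read(agent, uri)]


policies = PolicyStore({"http://myrepo.edu/private/": {"alice"}})
hits = ["http://myrepo.edu/public/res7", "http://myrepo.edu/private/res1"]
print(search(hits, policies, "alice"))  # both hits
print(search(hits, policies, "bob"))    # only the public hit
```

The point of the sketch is that the index itself stays policy-agnostic: the endpoint queries it, then consults the same policies that protect the repo before returning anything.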

2. Content models

Issue

We want to be able to CRUD complex resources with a single request or a minimal set of requests, without performing complex tasks on the client side (and rewriting the same implementation for each client).

E.g. a user sends metadata and binary files via a multipart form POST request to http://myrepo/fcr:model/myns:Image/; multiple resources and the relationships between them are created according to a content model configuration for the “myns:Image” resource type.

For the “myns:Image” model the service configuration would define:

  • GET:

    • return a representation of the resource and related resources according to a “default” transform program defined for myns:Image

    • this may include complex networks of related resources, retrieved via SPARQL or Solr queries on the indexes.

  • POST:

    • create a new UID from an external UID minter service;

    • create a pcdm:Object resource and populate it with metadata provided by a JSON object in the “metadata” POST field;

    • add the newly generated UID;

    • assign the resource the “myns:Image” rdf type;

    • create a second pcdm:Object resource and assign it a “myns:Instance” rdf type;

    • create a “myns:hasOriginalInstance” relationship between the myns:Image and the myns:Instance resources;

    • create an LDP-NR resource from the “original” bitstream provided by the POST request;

    • create a “pcdm:hasFile” relationship between the myns:Instance and the LDP-NR;

    • repeat this for other possible binaries provided by the request;

    • optionally wrap the whole operation in a transaction and roll back if a step fails.
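The POST steps above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not a Fedora implementation: InMemoryRepo, mint_uid and create_image are hypothetical names, the "transaction" is a plain snapshot/restore, and every binary is attached to the same myns:Instance for simplicity.

```python
import itertools


class InMemoryRepo:
    """Hypothetical stand-in for the repository; not a Fedora API."""

    def __init__(self):
        self.resources = {}   # uri -> set of rdf types
        self.triples = set()  # (subject, predicate, object)

    def snapshot(self):
        # Copy the state so it can be restored if a later step fails.
        return ({k: set(v) for k, v in self.resources.items()},
                set(self.triples))

    def restore(self, snap):
        self.resources, self.triples = snap


_counter = itertools.count(1)


def mint_uid():
    """Stand-in for the external UID minter service (assumption)."""
    return f"uid{next(_counter)}"


def create_image(repo, metadata, binaries):
    """Create a myns:Image, its myns:Instance and one file per binary."""
    snap = repo.snapshot()
    try:
        uid = mint_uid()
        image = f"/{uid}"
        repo.resources[image] = {"pcdm:Object", "myns:Image"}
        for prop, value in metadata.items():
            repo.triples.add((image, prop, value))
        repo.triples.add((image, "myns:uid", uid))

        instance = f"{image}/instance"
        repo.resources[instance] = {"pcdm:Object", "myns:Instance"}
        repo.triples.add((image, "myns:hasOriginalInstance", instance))

        for name, data in binaries.items():
            if not data:
                raise ValueError(f"empty bitstream: {name}")
            file_uri = f"{instance}/{name}"
            repo.resources[file_uri] = {"ldp:NonRDFSource"}
            repo.triples.add((instance, "pcdm:hasFile", file_uri))
        return image
    except Exception:
        # Roll back every step performed so far, then re-raise.
        repo.restore(snap)
        raise


repo = InMemoryRepo()
image_uri = create_image(repo, {"dc:title": "Sunset"}, {"original": b"\x89PNG"})
print(image_uri, sorted(repo.resources))
```

In the proposed service these steps would be read from the content model configuration rather than hard-coded as they are here.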

Also, we want to establish an RDFS-like type and sub-type hierarchy which can be reflected in the indexes.

E.g.

  • I define myns:Image as a subclass of myns:Asset;

  • I define myns:hasOriginalInstance as a sub-property of myns:hasInstance;

  • using the resource created above as an example, I should be able to search for all myns:Asset resources and find the myns:Image I created;

  • similarly, if I search for all the myns:hasInstance relationships for the image above, I should be able to discover the resource related by myns:hasOriginalInstance.
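One way to reflect such a hierarchy in the indexes is to materialize the inferred triples at indexing time. The sketch below assumes that approach; SUBCLASS_OF, SUBPROPERTY_OF, ancestors and expand_for_index are hypothetical names, and each term is limited to a single parent to keep the example short.

```python
# Hierarchy from the example above; a real service would read it from config.
SUBCLASS_OF = {"myns:Image": "myns:Asset"}
SUBPROPERTY_OF = {"myns:hasOriginalInstance": "myns:hasInstance"}


def ancestors(term, parent_of):
    """Return the term followed by all of its super-terms."""
    found = []
    # Stop at the top of the hierarchy, or if a cycle is detected.
    while term is not None and term not in found:
        found.append(term)
        term = parent_of.get(term)
    return found


def expand_for_index(triples):
    """Add the triples implied by the class and property hierarchies."""
    expanded = set()
    for s, p, o in triples:
        if p == "rdf:type":
            expanded.update((s, p, cls) for cls in ancestors(o, SUBCLASS_OF))
        else:
            expanded.update((s, prop, o) for prop in ancestors(p, SUBPROPERTY_OF))
    return expanded


stored = {
    ("/img1", "rdf:type", "myns:Image"),
    ("/img1", "myns:hasOriginalInstance", "/img1/instance"),
}
for triple in sorted(expand_for_index(stored)):
    print(triple)
```

With the expanded triples in the index, a search for all myns:Asset resources finds /img1, and a search for its myns:hasInstance relationships finds /img1/instance.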

Proposed solution

Create a configurable service that defines content models and the actions associated with the various HTTP methods that can be performed on each content model.

3. Data and structural validation

Issue

We want to enforce input validation outside of individual client systems. This is related to #2.

This validation may include constraints for property domain and range, cardinality, uniqueness, etc.

Range validation should include both data types for literal properties and class constraints for in-repo resource properties.

E.g.

  • restrict the “myns:created” property to xsd:dateTime;

  • restrict the “myns:hasInstance” property to resources of type “myns:Instance” or its subtypes (structural validation);

  • make myns:uid mandatory.

Proposed solution

I think that the RDFS/OWL syntax lends itself well as a framework for a configuration file that defines all these validation rules.

We can define which subset of RDFS/OWL the service supports and enforces.

E.g. the validation rules above would be expressed as RDFS/OWL statements:

[Prefix declarations]
 
myns:created rdfs:range xsd:dateTime .
myns:hasInstance rdfs:range myns:Instance .
_:r1 rdf:type owl:Restriction ;
  owl:onProperty myns:uid ;
  owl:cardinality "1"^^xsd:nonNegativeInteger .

 

This syntax could also be used for the content model configuration in #2.

A service would be written to parse this syntax and translate it into validation actions.
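As a rough sketch of what "validation actions" could look like once the configuration has been parsed, the example below hard-codes the three rules as a plain dictionary. The rule keys and the validate and is_datetime names are hypothetical; the xsd:dateTime check is only approximated with Python's ISO 8601 parser, and subtypes of myns:Instance are not taken into account.

```python
from datetime import datetime

# Hypothetical parsed form of the three example rules.
RULES = {
    "myns:created": {"datatype": "xsd:dateTime"},
    "myns:hasInstance": {"range_class": "myns:Instance"},
    "myns:uid": {"exact_cardinality": 1},
}


def is_datetime(value):
    """Approximate xsd:dateTime check using ISO 8601 parsing."""
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def validate(properties, types_of):
    """Return a list of error messages; an empty list means valid.

    properties: property name -> list of values for one resource
    types_of:   resource URI -> set of rdf types (in-repo lookup stand-in)
    """
    errors = []
    for prop, rule in RULES.items():
        values = properties.get(prop, [])
        expected = rule.get("exact_cardinality")
        if expected is not None and len(values) != expected:
            errors.append(f"{prop}: expected {expected} value(s), got {len(values)}")
        for value in values:
            if rule.get("datatype") == "xsd:dateTime" and not is_datetime(value):
                errors.append(f"{prop}: {value!r} is not a dateTime")
            if "range_class" in rule and rule["range_class"] not in types_of.get(value, set()):
                errors.append(f"{prop}: {value!r} is not a {rule['range_class']}")
    return errors


types_of = {"/img1/instance": {"pcdm:Object", "myns:Instance"}}
resource = {
    "myns:created": ["2015-06-01T12:00:00"],
    "myns:hasInstance": ["/img1/instance"],
    "myns:uid": ["uid1"],
}
print(validate(resource, types_of))
```

Note that this treats the rules as closed-world constraints: a missing myns:uid is reported as an error, not left open for inference.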

 

9 Comments

  1. Doron Shalvi, is #3 what you were looking for? 

    1. Stefano Cossu, yes, #3 would seem to handle validation of our data against our content models, thanks!  Separately, it would also be nice to have machinery for managing how often these checks are performed and where the results are stored (auditing I imagine).

  2. This is a serious misuse of OWL. OWL is absolutely not usable for validation, in ordinary form. The canonical semantics for OWL preclude any action except the materialization of new triples. For an alternative use, you will have to alter the semantics radically. People have done this, and there are other ways to do validation, but out-of-the-box OWL is definitely not the way to do this.

    1. Adam, thanks for the references. 

      The reason why I suggested using a subset of OWL to define constraints is that it is a familiar and expressive syntax. The fact that it would be used to enforce restrictions instead of inferring information can, I admit, be confusing. 

      A custom validation vocabulary using RDF syntax is possible; e.g. translating the examples in #3: 

      myns:created val:hasRange xsd:dateTime .
      myns:hasInstance val:hasRange myns:Instance .
      myns:uid val:exactCardinality "1"^^xsd:nonNegativeInteger .

      In any case, my point is to find a configuration syntax that expresses the current validation use cases and is expandable to further ones. 

      1. For JHU Data Conservancy-related use cases, SPIN appeared to be the most useful off-the-shelf technology for defining and enforcing constraints. (Incidentally, there is a set of SPIN constraints out there that can be used to apply closed-world semantics to OWL for the purpose of validation.) The Data Shapes working group is quite active right now, though it looks like it may be a while before they reach agreement on an initial public draft.

        So I think it would be useful to allow some choice in deploying different validation strategies, as validation of RDF is still at a volatile stage right now.

         

        1. Aaron, thanks for the references to SPIN and Shapes. They both seem to be very flexible languages since they are as expressive and familiar as SPARQL can be. I would definitely consider them as candidates. 

          OT: I wish WebAC would evolve toward a similar model too, providing the flexibility of XACML with a more manageable language such as SPARQL...
        2. SPIN is interesting, but it is essentially only a TopQuadrant product. I know of no other implementation, particularly no open source implementation. The SHACL standard should eventually be better supported, but right now I know of no plans for any production implementation at all. My point is that as useful as notions of validation are, it may be considerably too early to commit Fedora as a community to any plans.

          1. While there are no other implementations of it that I'm aware of, the core of SPIN is open source (Apache 2), and their IDE has a useful free version that they don't talk about much. 

            Despite the apparent impending obsolescence of currently available validation technologies, it is a business need that people do have.  Some institutions may view banding together on this topic (or at least exploring it together) to be worth the risk.

             

            1. Well, as far as committing to a particular technology goes, I would be looking for multiple open source implementations. But more importantly, I don't mean to prevent any institutions from banding together in the way you describe. I'm just concerned to avoid putting a community imprimatur on any such effort.