Title (Goal): Content and structural validation
Primary Actor: Information architect, developer
Scope: Component
Level: Summary
Author: Stefano Cossu
Story: Enable validation of content structure and properties.

I want to enforce input validation outside of individual client systems. This is related to the Content modeling use case.

This validation may include constraints for property domain and range, cardinality, uniqueness, etc.

Range validation should include both data types for literal properties and class constraints for in-repo resource properties.

Examples

  • Restrict the “myns:created” property to xsd:dateTime;

  • Restrict the “myns:hasInstance” property to resources of type “myns:Instance” or its subtypes (structural validation);

  • Make myns:createdDate single-valued (i.e. cardinality = 0..1)
  • Make myns:uid mandatory and single-valued (i.e. cardinality = 1..1)

  • Inherit property constraints from super-types
    • type myns:Document has a property definition for myns:uid as mandatory and myns:content as single-valued
    • type myns:TextDocument inherits these definitions
    • type myns:ImageDocument inherits myns:uid but overrides myns:content to being multi-valued
  • Ensure that no two resources with the same myns:uid are present in the repository (similar to a unique key constraint in a relational database)
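The example constraints above can be sketched in code. The following is a minimal, hypothetical Python sketch — the rule shapes, property names, and function names are invented for illustration and are not part of any proposed Fedora API:

```python
from datetime import datetime

# Hypothetical rule set mirroring the examples above.
RULES = {
    "myns:created":     {"datatype": "xsd:dateTime"},
    "myns:createdDate": {"min_card": 0, "max_card": 1},  # single-valued
    "myns:uid":         {"min_card": 1, "max_card": 1,   # mandatory, single-valued
                         "unique": True},
}

def is_xsd_datetime(value):
    """Rough xsd:dateTime check (illustrative only, not a full lexical test)."""
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

def validate(resource, existing_uids=frozenset()):
    """Return a list of violations for a resource given as {property: [values]}."""
    errors = []
    for prop, rule in RULES.items():
        values = resource.get(prop, [])
        if rule.get("datatype") == "xsd:dateTime":
            errors += [f"{prop}: not xsd:dateTime: {v}"
                       for v in values if not is_xsd_datetime(v)]
        if len(values) < rule.get("min_card", 0):
            errors.append(f"{prop}: mandatory but missing")
        if len(values) > rule.get("max_card", float("inf")):
            errors.append(f"{prop}: too many values ({len(values)})")
        # Uniqueness needs a repository-wide lookup, modeled here as a set of
        # already-stored values (analogous to a unique key constraint).
        if rule.get("unique") and any(v in existing_uids for v in values):
            errors.append(f"{prop}: value already present in repository")
    return errors
```

The uniqueness check illustrates why that constraint differs from the others: it cannot be evaluated against the submitted resource alone and requires querying the repository or its index.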

Roles of the API Extension Architecture configuration

  • Define HTTP methods that the Validation extension operates on
  • Enable/disable and define execution priority of Validation service (e.g. pipeline)

Roles of the API Extension engine

  • Forward data from user request or previous services
  • Handle response from Validation extension and forward to further services

Roles of the Validation extension configuration

  • Define content models which validation is performed on
    • For each model, define properties to be validated
    • For each property, define:
      • validation rules
      • RDF type of the resource or container that should be validated (optional)
      • Post-success actions (i.e. services to be called if all property values pass validation) (optional)
      • Post-failure actions (i.e. services to be called if any property value fails validation) (optional)
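For illustration, a configuration following the shape described above might look like the fragment below. The key names, namespace, and service URIs are invented for this sketch and do not represent a proposed format:

```json
{
  "myns:Document": {
    "myns:uid": {
      "rules": {"minCardinality": 1, "maxCardinality": 1, "unique": true},
      "appliesToRdfType": "myns:Document",
      "onSuccess": null,
      "onFailure": "http://example.org/svc/reject"
    },
    "myns:created": {
      "rules": {"datatype": "xsd:dateTime"}
    }
  }
}
```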

Roles of the Validation extension engine

  • Parse input from API-X engine and determine content model(s) of resource
  • Parse configuration for content model(s)
  • Loop over property validation rules in config file:
    • If an RDF type restriction (see point in config roles) is defined, query the repo or index to determine whether the resource or its container is of the RDF type specified in the config
      • If result is positive, apply validation rule against user-provided property value(s)
      • If negative, skip validation
    • If no RDF type restriction is defined, apply validation
    • If validation passes:
      • if a post-success action is defined, execute it
      • if no post-success action is defined, move on to next rule
    • If validation fails:
      • if a post-failure action is defined, execute it
      • if no post-failure action is defined, abort whole process and raise an exception
  • After all rules have been parsed, return to API-X engine
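The engine loop above can be sketched in Python. Everything here — the function name, the shape of the rule dicts, and the type-lookup and service callbacks — is hypothetical scaffolding for illustration:

```python
class ValidationError(Exception):
    """Raised when a rule fails and no post-failure action is configured."""
    pass

def run_validation(resource, rules, has_rdf_type, run_service=None):
    """Apply configured rules to a resource, following the loop described above.

    resource     -- {property: [values]}
    rules        -- list of dicts: {"property", "check", optional "rdf_type",
                    "on_success", "on_failure"}
    has_rdf_type -- callback: is the resource (or its container) of this type?
    run_service  -- callback invoked for post-success/failure actions
    """
    for rule in rules:
        # RDF type restriction: skip the rule if the resource doesn't match.
        required = rule.get("rdf_type")
        if required is not None and not has_rdf_type(required):
            continue
        values = resource.get(rule["property"], [])
        if rule["check"](values):
            if rule.get("on_success") and run_service:
                run_service(rule["on_success"], resource)
        else:
            if rule.get("on_failure") and run_service:
                run_service(rule["on_failure"], resource)
            else:
                # No failure handler: abort the whole process.
                raise ValidationError(rule["property"])
    return True  # hand control back to the API-X engine
```

Note the asymmetry the roles above describe: a passing rule simply moves on unless a post-success action is configured, while a failing rule with no post-failure action aborts the whole request.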

Note

Due to the long discussion in the comments, this use case has been split into three child pages. See links below.

Comments

  1. Hi Stefano Cossu, can you describe what you mean by "in-repo resource properties" above?

    1. I mean property values that are URIs of resources living in the same repository.

      E.g. if I have

      <https://myrepo.org/res1> pcdm:hasRelatedFile <https://myrepo.org/res2>

      I want to be able to specify that https://myrepo.org/res2 must have an rdf:type of myns:Image.

  2. Though not stated explicitly here, I can see two related or sub- use cases in this example:

    1. Validating the contents of a Fedora resource as prerequisite to a CRUD operation, enacting some policy in response to validation success/failures (e.g. reject the CRUD operation if invalid, allow it but flag the object as invalid/incomplete, etc)
    2. Exposing a validation service to report on the validity of an object already in the repository

     

    For #1 (CRUD), one could imagine the following roles and responsibilities

    • API Extension architecture
      • Intercept incoming POST requests to containers for whom this extension provides validation services when creating new objects.  
        • "containers for whom this extension provides validation services" can possibly be determined by rdf:type.  (e.g. a container with type of cma:validatable will have its members validated.)
      • Intercept PUT and PATCH requests to resources within a container for whom this extension provides validation services
      • Direct intercepted requests to a validation service defined by the extension
      • Forward the response back to the client
    • Validation extension
      • Define/choose a specification for representing validation constraints (SHACL, SPIN, closed-world assumption OWL, etc.) or a specification for representing a model that can be validated against.
      • Define/choose a specification for indicating which constraints apply to a particular validatable object (perhaps a relationship <> <cma:hasValidationConstraints> <constraints> or <> <cma:hasModel> <model>)
      • Define/choose a specification for representing policy when validation fails (i.e. reject request, flag resource as 'invalid', etc)
      • Validate the resource relevant to the POST, PUT, or PATCH request, according to the indicated constraints.
      • Allow Fedora to complete the request if validation succeeds, and produce/forward an appropriate response
      • Enact specified policy if validation does not succeed, and produce an appropriate response.
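    To make the first bullet concrete: if SHACL were the chosen specification, a constraints document covering the examples earlier on this page might look roughly like the following (the prefixes and shape name are illustrative only):

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix myns: <http://example.org/myns#> .

myns:DocumentShape
    a sh:NodeShape ;
    sh:targetClass myns:Document ;
    sh:property [
        sh:path myns:uid ;
        sh:minCount 1 ;          # mandatory
        sh:maxCount 1 ;          # single-valued
    ] ;
    sh:property [
        sh:path myns:created ;
        sh:datatype xsd:dateTime ;
    ] .
```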

    For #2 (validation service) one could imagine the following roles and responsibilities:

    • API Extension architecture
      • Expose a URI (object path + segment, e.g. /path/to/object/cma:validation) to the validation service, for objects that can be validated
        • "objects that can be validated" may possibly be determined by rdf:type, e.g. rdf:validatable as above.
      • Forward requests and replies to/from the validation service extension implementation.
    • Validation Extension
      • Define specifications for representing validation constraints, and binding a set of constraints to an object (as above)
      • Define specification for a validation request (simple GET on URI, perhaps additional parameters indicating aspects to ignore or include in validation)
      • Define a specification for representing validation results
      • Validate the resource when requested, produce validation results, and return a representation of them
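    As an illustration of the last two bullets, a representation of validation results could look something like the fragment below (all field names are hypothetical):

```json
{
  "resource": "https://myrepo.org/res1",
  "valid": false,
  "violations": [
    {
      "property": "myns:uid",
      "constraint": "minCardinality",
      "message": "myns:uid is mandatory but missing"
    }
  ]
}
```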

    Stefano Cossu: Is the above consistent with your conceptualization of how validation may be realized in the context of API extensions?

    1. Though not stated explicitly here, I can see two related or sub- use cases in this example:

      1. Validating the contents of a Fedora resource as prerequisite to a CRUD operation, enacting some policy in response to validation success/failures (e.g. reject the CRUD operation if invalid, allow it but flag the object as invalid/incomplete, etc)

      Yes. Basically validation would be just another service that you call on CUD operations and that can determine the lifecycle and outcome of the operation.

      1. Exposing a validation service to report on the validity of an object already in the repository

      If validation is extended to all PUT, POST, PATCH and DELETE operations, what is the reason for checking the validity of an existing resource when you are not modifying it?

      For #1 (CRUD), one could imagine the following roles and responsibilities

      • API Extension architecture
        • Intercept incoming POST requests to containers for whom this extension provides validation services when creating new objects. 

      Should also be PUT, PATCH and DELETE, i.e. any HTTP request meant to change the state of the resource.



          • "containers for whom this extension provides validation services" can possibly be determined by rdf:type.  (e.g. a container with type of cma:validatable will have its members validated.)

      You could also just define validation for arbitrary rdf types, so you don't need to assign an extra rdf type to your resources. Also just a "cma:validatable" type would not be enough to specify which validation to perform.


        • Intercept PUT and PATCH requests to resources within a container for whom this extension provides validation services

      It may be useful to abstract the concept of "container" away from the API-X. Let's say you want to create an "image" resource: you POST an image file and some metadata and say, "this is an image", and the API-X validates your input and creates containers and binary resources for you.


        • Direct intercepted requests to a validation service defined by the extension
        • Forward the response back to the client
      • Validation extension
        • Define/choose a specification for representing validation constraints (SHACL, SPIN, closed-world assumption OWL, etc.) or a specification for representing a model that can be validated against.

      +1 - We will need further research and discussion on this topic once we get to the details.


        • Define/choose a specification for indicating which constraints apply to a particular validatable object (perhaps a relationship <> <cma:hasValidationConstraints> <constraints> or <> <cma:hasModel> <model>)
        • Define/choose a specification for representing policy when validation fails (i.e. reject request, flag resource as 'invalid', etc)
        • Validate the resource relevant to the POST, PUT, or PATCH request, according to the indicated constraints.
        • Allow Fedora to complete the request if validation succeeds, and produce/forward an appropriate response
        • Enact specified policy if validation does not succeed, and produce an appropriate response.

      For #2 (validation service) one could imagine the following roles and responsibilities:

      • API Extension architecture
        • Expose a URI (object path + segment, e.g. /path/to/object/cma:validation) to the validation service, for objects that can be validated
          • "objects that can be validated" may possibly be determined by rdf:type, e.g. rdf:validatable as above.
        • Forward requests and replies to/from the validation service extension implementation.
      • Validation Extension
        • Define specifications for representing validation constraints, and binding a set of constraints to an object (as above)
        • Define specification for a validation request (simple GET on URI, perhaps additional parameters indicating aspects to ignore or include in validation)
        • Define a specification for representing validation results
        • Validate the resource when requested, produce a validation results and return a representation of them

      See above - I cannot see a use case for this.

      Stefano Cossu: Is the above consistent with your conceptualization of how validation may be realized in the context of API extensions?

      Yes!

       

            • "containers for whom this extension provides validation services" can possibly be determined by rdf:type.  (e.g. a container with type of cma:validatable will have its members validated.)

        You could also just define validation for arbitrary rdf types, so you don't need to assign an extra rdf type to your resources. Also just a "cma:validatable" type would not be enough to specify which validation to perform.

        I see two separate concerns here:

        (1) Identifying which objects in the repository can be bound to a validation extension/service that may operate on them.  (The assumption here is that a repository can contain objects that it cannot validate, and/or objects where validation is not wanted or desired; hence the need to bind objects to a validation service wherever validation is desired.  This is analogous to the existing practice of marking objects with predicate http://fedora.info/definitions/v4/indexing#Indexable in cases where the object is intended to be indexed.)

        (2) Identifying which model an object should be validated against

         

        So (1) is a concern of the API extension architecture, as it needs to know when it has to invoke a particular extension (without knowing any details of how that extension works or is configured).  This is a general capability.  I was suggesting that "presence of a specific rdf type" could be a sufficiently general means to bind the API extension Architecture to a particular service.  In particular, presence of 'cma:validatable' would bind a validation extension to objects so marked.

        Item (2) is a concern of the internal operation or specification of the validation extension, and as you note is also necessary information.  The API extension architecture itself does not need to know or care how this is achieved.

        1. Aaron,

          If we suppose that:

          • every Fedora resource is assigned a content model (a default one if this is unspecified),
          • this content model is an RDF type belonging to a special set of classes (e.g. cma:*) or the value of a special property (e.g. fedora:contentModel, maybe easier to manage)

          Then you could use a configuration that specifies validation rules or validation service bindings for all content models. If the content model of the resource you want to validate is not in the configuration or no rules are specified, no validation happens.

          Does that address both your concerns?

          1. So in this example, would the API extension architecture always invoke the validation extension, and the validation extension determines, based upon its configuration, the object's stated model, etc.,  whether it performs validation?  

            See my comment to Elliot.  My concern is really that I'd like to explore ways to avoid enabling an extension globally for an entire repository, in cases where it makes sense to do so.  Likewise, let's suppose an object has a content model that the validation extension is appropriately configured to know how to validate.  Is it reasonable to have a scenario where a repository manager might still not want it validated, depending on where it is put in the repository?  

            I think these validation use cases are very interesting from the perspective of binding extensions to objects.  Ideally (in my opinion) the API extension architecture itself should know as little as possible about a particular extension's business rules;  the mechanism of binding an extension to an object needs to be simple, easy to reason about, and have a path forward to a fast and efficient implementation.

            1.  Is it reasonable to have a scenario where a repository manager might still not want it validated, depending on where it is put in the repository?  

              I think this depends on what the repo manager is trying to achieve and what you mean by "where" (see my other comment).

              I can see three different scenarios depending on this conversation: one where validation is mandatory across the whole repository (my case); one where it is recommended but not mandatory, at least for a temporary situation (Elliot's case here); and one where it is mandatory or recommended only in one part of the repository (your case above).

              Can this summarize our discussion on this point?

      1.  

          • Intercept PUT and PATCH requests to resources within a container for whom this extension provides validation services

         

        It may be useful to abstract the concept of "container" away from the API-X. Let's say you want to create an "image" resource: you POST an image file and some metadata and say, "this is an image", and the API-X validates your input and creates containers and binary resources for you.

        Want to +1 this statement.  It would be useful to be able to validate a domain-specific representation of an object before it was mapped into LDP.

            • Intercept PUT and PATCH requests to resources within a container for whom this extension provides validation services

           

          It may be useful to abstract the concept of "container" away from the API-X. Let's say you want to create an "image" resource: you POST an image file and some metadata and say, "this is an image", and the API-X validates your input and creates containers and binary resources for you.

          Want to +1 this statement.  It would be useful to be able to validate a domain-specific representation of an object before it was mapped into LDP.

          Fedora 4's object model is, for better or for worse, a hierarchy.  Creating a new object in Fedora 4 inherently involves adding it as a child of some resource that already exists in the repository.  So technically I should have used the word 'resource' rather than 'container' so as not to bring LDP into the mix.

          Given that clarification, my comment "container for whom this extension provides validation services" is related to my desire to be able to specify in some way when to invoke the validation extension, and when not to. 

          So when depositing a new object, the API extension architecture needs to answer the question "which extensions do I need to invoke in order to handle this request".  I believe the answer to that question shouldn't have to be "always invoke the validation extension".  Ideally, in my opinion the API extension architecture should allow the answer to be "invoke the validation extension for the subset of the repository for which I want validation".

          Because Fedora is inherently hierarchical, and because depositing a new object inherently involves specifying a parent resource to create a new child underneath, I was thinking that the parent resource would be a natural place to have a marker to indicate "please validate objects put here".  This marker could be a cma:validatable rdf:type.

          So if a repository manager's policy is to validate all objects, then place the marker on the root node in the repository.  If the manager's policy is to validate objects in /public/images, place a marker there.  This is where I was going with "container for whom this extension provides validation services"  

           

          1. Fedora's internal structure is hierarchical indeed; however, that is the JCR layer, which should ideally not be exposed to the client. As of late, tying functionality to the JCR hierarchy has generally been avoided (the document you quote probably pre-dates the decision to move to a full LDP-based model and abstract the JCR machinery, if I remember correctly, right Andrew Woods?). LDP has the concept of containment, which would be more appropriate for your case.

            Back to the main point, it seems like the main debate here is having validation depending on containment (or hierarchy, if you don't agree with the above statement) or RDF class or similar membership. I think this is a good discussion to bring forward when we start laying out an implementation plan.

            1. Yes, the hierarchical nature of F4 should be viewed through the lens of LDP containment... which does not seem to invalidate Aaron Birkland's "containment" proposal.

            2. This raises an interesting point that will likely have to be addressed over the course of this work.  It is a little unclear to me how much of JCR is 'accidental' (i.e. an implementation detail), and how much of it is 'essential' to the Fedora 4 model.  fcrepo-kernel-* contains the public Java API to Fedora core.  Many current extension modules use this API.  This API exposes Fedora objects/resources explicitly in terms of JCR (see, for example, FedoraResource.java from the kernel API).  The only implementation of this API is based on Modeshape, and I believe at present there isn't any way to use fcrepo-kernel without actually being deployed as part of the Fedora webapp.

              So where does this leave API extension modules?  To a great extent, they may rely on the HTTP+LDP API of Fedora.  In my mind, though, it may be useful to look to fcrepo-camel (which provides a client API based on HTTP), or even fcrepo4-client.  It might be nice to be able to recommend to extension developers a client library that exposes Fedora 4's conceptual model free of JCR or HTTP concepts, with implementations based on fcrepo-kernel or fcrepo-camel/fcrepo4-client, depending on where a particular extension is deployed.

              1. I agree that we would want to recommend patterns for writing extensions, including recommending client libraries.

                If the Extensions are "air-gapped" from Fedora (e.g. not running in the same JVM, or running in a separate servlet container or web-app), then it seems unlikely that they would have access to APIs in fcrepo-kernel-*.  If Extensions did have access to those APIs, that would make me uncomfortable because that increases coupling between Extensions and Fedora (so the coupling, regardless of the exposure of JCR, is what makes me uncomfortable in that scenario).

                So I agree that we would want another integration point as you mention above: HTTP/LDP, Camel+HTTP, or another existing library.

  3. Aaron Birkland, Stefano Cossu: I had made these notes for the roles of this use case.  I think they align with everyone's thoughts.  Some notes:

    • I didn't consider what to do with invalid objects; I wonder if valid objects should have some event emitted and exposed in a provenance stream
    • While rdf:type is one way (and an intuitive one at that) to associate the objects being validated with a validation service, I think we are in agreement that we prefer to think of a more abstract policy concept; a policy is determining what objects are being validated by what service, and rdf:type is one possible implementation path.
    • Differentiate between native support for constraint languages vs allowing developers to plug in their own particular ad hoc validation services
    • Other than ingest and invocations of a validation service, are there other times when validation might be performed?  E.g. when retrieving an object or when storing a copy of the object?  If so, maybe a consideration for some type of policy that determines when validation is invoked (on 'ingest', on 'copy').

    API Extensions Architecture

    1. Provides - or proxies the request to - a runtime environment for validating an instance of a model against constraints that:

      1. supports configuration or policy that governs whether an instance is subject to validation, and when (upon ingesting an object, upon retrieving an object) validation is performed

      2. provides native support for certain constraint languages like SPARQL Inferencing Notation (SPIN) or Shapes Constraint Language (SHACL)?

      3. supports a plugin architecture for validating model instances on a per-model basis?

      4. provides access to the result of validation attempts

      5. optionally generates validation events and stores them as provenance for the object being validated?

    2. May provide - or proxy - a service which can validate an object on request (e.g. a request to /path/to/object/svc:validate)

      1. because an object may conform to multiple models, the requestor may be allowed to specify the type of model to validate against

    Fedora

    1. Answers requests for resources, as normal

    2. Provides storage for objects supporting validation: policy, models, and constraints

    Information architect/developer

    1. Defines the model, and its constraints

    2. Expresses the constraints in a manner supported by the API Extensions Architecture

      1. by developing a custom plugin to perform validation

      2. by using a constraint language supported by the API Extensions Architecture

    3. Configures the API Extensions Architecture by defining a policy determining which model(s) are subject to validation, and when.

    4. Develops a repository client that may be responsible for periodically invoking validation on objects?
    1. While rdf:type is one way (and an intuitive one at that) to associate the objects being validated with a validation service, I think we are in agreement that we prefer to think of a more abstract policy concept; a policy is determining what objects are being validated by what service, and rdf:type is one possible implementation path.

       

      • Expresses the constraints in a manner supported by the API Extensions Architecture

       

        1. by developing a custom plugin to perform validation

        2. by using a constraint language supported by the API Extensions Architecture

       

      • Configures the API Extensions Architecture by defining a policy determining which model(s) are subject to validation, and when.

       

      Not to be pedantic, but aren't the tasks of "defining a policy determining which model(s) are subject to validation" and "using a constraints language" under the purview of a validation extension itself, rather than the API extension architecture?   Stated another way, the API extension architecture itself does not know anything about validation or content models or constraint languages, but it does know what an object is, what an HTTP request is, and how to route requests to services.  

      So if the underlying issue is figuring out a better method than an rdf:type marker to determine whether the API extension architecture binds a specific extension to a request on an object, we can try to do that.  To me, the intuitiveness, simplicity, and explicitness of an rdf:type marker makes it an attractive piece of data the API Extension architecture can use to bind a particular object to a particular service in response to an appropriate request.  I suppose I'm having a hard time understanding where this breaks down for the validation use case(s)?  

      1. Not to be pedantic, but aren't the tasks of "defining a policy determining which model(s) are subject to validation" and "using a constraints language" under the purview of a validation extension itself, rather than the API extension architecture?   Stated another way, the API extension architecture itself does not know anything about validation or content models or constraint languages, but it does know what an object is, what an HTTP request is, and how to route requests to services.  

        No, the pedantry is welcome.  I think I've had the role of the extension architecture and individual extensions themselves conflated in my head.  So what you say makes sense.

         I suppose I'm having a hard time understanding where this breaks down for the validation use case(s)?  

        What happens if you want to only validate myns:Image objects that are submitted to a particular collection?  If I understand what you're saying, the architecture routes the request to the validation extension, and it is the extension's responsibility to determine whether or not to perform the validation?

        1. What happens if you want to only validate myns:Image objects that are submitted to a particular collection?  If I understand what you're saying, the architecture routes the request to the validation extension, and it is the extension's responsibility to determine whether or not to perform the validation?

          If the extension architecture always invokes the validation extension, for every resource under every circumstance (as I believe Stefano is suggesting it should), then yes, it would be entirely the extension's responsibility to determine whether to validate.  I'm suggesting that (short of designing a validation extension as part of this work) it would also be reasonable to design a validation extension where this isn't the case.  If the API Extension architecture binds a request to a particular extension only where there is a marker present (like rdf:type of cma:validatable), then the scenario you describe above can be handled by placing rdf:type of cma:validatable on the collection(s) you want to validate, and omitting it on collections you don't want validated at all.  

          Both validation extension scenarios may be reasonable.  If there is agreement that this is the case, then they both can be used to generate requirements for the API Extension Architecture (e.g. 'it shall be possible to bind an extension so that it is always enabled globally' and 'it shall be possible to bind an extension based on the presence of a specific rdf:type marker', etc.).  Then whatever approach an actual validation extension takes is irrelevant, because we will expect it to work either way.

    2. some type of policy that determines when validation is invoked (on 'ingest', on 'copy').

      +1

      In the case of a UUID, for example, I want to allow the client to set it on resource creation, but after the resource is created, that property cannot be changed.

      I would be tempted to map different validation scenarios to HTTP methods, but that may not work with the example above because PUT can both create and update a resource. In that case I would need some extra logic (possibly provided by a plugin as you suggest).
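      That mapping, and the extra logic PUT needs, could be sketched like this (the function names and scenario labels are hypothetical, purely to illustrate the idea):

```python
def validation_scenario(method, resource_exists):
    """Map an HTTP request to a validation scenario.

    PUT is ambiguous (create or update), so the extension also needs to
    know whether the target resource already exists -- the extra logic
    mentioned above.
    """
    if method == "POST":
        return "create"
    if method == "PUT":
        return "update" if resource_exists else "create"
    if method == "PATCH":
        return "update"
    if method == "DELETE":
        return "delete"
    return None  # e.g. GET: no state change, nothing to validate

def uuid_change_rejected(method, resource_exists, touches_uuid):
    """The UUID example: settable on creation, immutable afterwards."""
    scenario = validation_scenario(method, resource_exists)
    return scenario == "update" and touches_uuid  # True = reject the change
```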

  4. Possible reasons you may want to expose a validation service for ad hoc invocations is because:

    • objects in the repository might be modified by a client that is not knowledgeable of the content model (how this happens, I don't know)
    • validation on ingest was disabled for a period of time (say during a bulk ingest) and was later re-enabled, so you want to validate the objects that were ingested.
    • you have new constraints you want to place on objects already within the repository, so you want to validate the existing objects against the new constraints.
      1. Exposing a validation service to report on the validity of an object already in the repository

      If validation is extended to all PUT, POST, PATCH and DELETE operations, what is the reason for checking the validity of an existing resource when you are not modifying it?

      Also, the ad-hoc 'validation service' supports inherently asynchronous workflows.  In that case, a repository manager might not want synchronous validation on POST, PUT, PATCH, DELETE at all.  Imagine a scenario where contributors deposit initial/incomplete content that is subsequently updated/refined until it reaches a publication point.  In that scenario, it may be perfectly reasonable for the repository to at least persist and make accessible 'invalid' content so that it may be fixed at some future point.  

      1. Aaron, Elliot,

        I think we are talking about two different concepts.

        What I mean by a "valid" resource is a resource that is eligible to be classified with a certain content model. A "work-in-progress" item may still have some hard constraints (e.g. a UID) which, if not satisfied, should prevent it from being stored in the repository, as it is "bad data".

        That is why I think points 1 and 2 in Elliot's scenario should not happen in a "normal" scenario. IMO, if you are relying on a content model, you would expect all stored resources to follow it all the time. That is, validation may be asynchronous, but the resource should not be persisted until validation has passed.

        As for point #3, that is a very normal scenario.

        That said, I agree with the need for ad-hoc validation. In all cases, we need to be able to re-validate part or all the repository.

      2. Imagine a scenario where contributors deposit initial/incomplete content that is subsequently updated/refined until it reaches a publication point.  

        Your use case could maybe be resolved by using a "basic resource" content model to which a "publish-ready resource" content model is added at a later time: the former has constraints mandatory for that resource to even be ingested, the latter additional constraints to classify the resource as "completed". This implies the use of multiple content models on a resource, which may need some extra discussion.

        1. This implies the use of multiple content models on a resource, which may need some extra discussion

          I had assumed, perhaps mistakenly, that a resource could participate in multiple models at once, because nothing prevents a Fedora resource from having multiple rdf:types, and presumably any one of those types could be linked to a content model.

          1. I assumed that possibility too, but I wanted to make it clear whether this may be the actual case or not; if it were, there may be more things to consider, such as how to deal with conflicting validation rules.