Title (Goal): Amherst - JSON-LD compaction service
Primary Actor: Developer
Scope: Component
Level:
Author: Unknown User (acoburn)
Story (A paragraph or two describing what happens): In order to improve front-end (read) performance, it would be useful to store Fedora resources as JSON in a key-value store (Riak, MongoDB, CouchDB, etc.). That way, objects can be delivered to a web-based framework more efficiently, without needing to access Fedora at all. Fedora already generates JSON-LD in expanded form, but for application-specific use (applications that don't necessarily understand RDF), a compact form would be preferable. Producing it simply involves applying a context file to the expanded JSON-LD.

A sample implementation is available here: https://github.com/acoburn/repository-extension-services/, in particular the acrepo-jsonld-service and acrepo-jsonld-cache modules.
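To illustrate the core idea of "applying a context file to the expanded JSON-LD form", here is a minimal sketch in Python. The real services use a full JSON-LD processor (the sample implementation is Java-based); this hand-rolled version handles only the simplest case, and the context and resource URIs shown are illustrative assumptions.

```python
# Minimal sketch of JSON-LD compaction: replace full predicate IRIs in an
# expanded document with short terms from a context. A real implementation
# follows the complete JSON-LD API compaction algorithm; this shows only
# the core substitution step. Context and IRIs below are illustrative.

EXAMPLE_CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
}

def compact(expanded, context):
    """Compact a single expanded JSON-LD node using a term -> IRI context."""
    iri_to_term = {iri: term for term, iri in context.items()}
    result = {"@context": context}
    for node in expanded:                     # expanded form is a list of nodes
        for key, values in node.items():
            if key == "@id":
                result["@id"] = values
                continue
            term = iri_to_term.get(key, key)  # fall back to the full IRI
            literals = [v.get("@value", v) for v in values]
            result[term] = literals[0] if len(literals) == 1 else literals
    return result

expanded = [{
    "@id": "http://localhost:8080/rest/object1",
    "http://purl.org/dc/terms/title": [{"@value": "An example resource"}],
}]

print(compact(expanded, EXAMPLE_CONTEXT)["title"])  # → An example resource
```

The compacted output is what an RDF-unaware web application would consume: plain keys like "title" instead of full predicate IRIs.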

Web Resource interaction

This service would expose an HTTP endpoint that converts Fedora resources into a compact JSON-LD representation.

Deployment or Implementation notes

This service would be deployed separately from Fedora, possibly on a separate machine. I envision it being implemented as a combination of OSGi services and Camel routes, written in Java and Blueprint XML, that can be deployed in any OSGi container. The implementation would require access to Fedora's HTTP API.

API-X Value Proposition

The primary use of this service in the context of API-X would be to allow for service discovery.

5 Comments

  1. Ah, so does this encompass two potential use cases?

    1. Providing a means to expose a representation of Fedora resources as compacted JSON 
      • Maybe filtering responses and translating them on the fly, perhaps in response to a Prefer header, or some other indicator that compact form is desired
      • Maybe exposing a URI to a compacted representation
    2. Directing requests to the cache where appropriate
      1. Filter incoming requests for simple GETs. If a request is deemed satisfiable by a cache lookup, poll the cache for the object and return it.

    If (2), then would this be in addition to (and behind) a caching proxy such as squid?
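A rough sketch of the routing decision described in (2) might look like the following. The function name, header choices, and defaults here are assumptions for illustration, not part of the sample implementation.

```python
# Illustrative sketch of cache routing: only simple GET requests with no
# cache-busting headers are candidates for a cache lookup; everything else
# passes through to Fedora. Header handling here is a simplified assumption.

def satisfiable_by_cache(method, headers):
    """Decide whether a request may be answered from the JSON-LD cache."""
    if method != "GET":
        return False                        # writes must reach Fedora
    if "no-cache" in headers.get("Cache-Control", ""):
        return False                        # client demanded a fresh copy
    # Only the compact JSON-LD representation lives in the cache.
    accept = headers.get("Accept", "application/ld+json")
    return "json" in accept

print(satisfiable_by_cache("GET", {"Accept": "application/ld+json"}))  # → True
print(satisfiable_by_cache("POST", {}))                                # → False
```

Requests that fail this predicate would be proxied to Fedora unchanged, so the cache stays transparent to clients.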

  2. Unknown User (acoburn)

    Yes, this does encompass both cases. See implementations here: https://gitlab.amherst.edu/acdc/repository-extension-services/tree/master/acrepo-jsonld-cache and https://gitlab.amherst.edu/acdc/repository-extension-services/tree/master/acrepo-jsonld-service

    Related to your question about (2), one could use a caching proxy as part of this, but I see it as unnecessary. In our current (Fedora 3) repository, we use Riak as a kind of cache. Riak (like other such systems) has several advantages over a simple caching proxy (Squid, Varnish, etc.), including the ability to shard and replicate data over an arbitrary number of back-end nodes (providing both higher throughput and better fault tolerance) and support for map-reduce operations over arbitrary sets of data in the cluster, which a simple proxy cache cannot do.

    In my experience, Riak's read performance is so good that an additional proxy is really unnecessary.

  3. Fascinating that performance is so good! Your implementation (from what I understand, having quickly looked through the code) could be deployed on an arbitrary Karaf instance in someone's back-end infrastructure, with the caching service available via requests to http://${some.host}:${some.port}/jsonld. Maybe you have several of these services running on different hosts.

    How would you envision API-X making cached representations of objects available to the public? Would it be by filtering incoming GET requests to the repository and polling the caching service (as speculated in my initial comment above), so that it happens transparently? By providing additional representations of the object, at their own URIs, backed by the cache? Both?

  4. Unknown User (acoburn)

    Performance is excellent, and if you need to handle higher throughput, you just add more back-end nodes. Typically, with Riak, you have an arbitrary number of nodes (it's masterless and can scale up or down easily) and one or more reverse proxies (e.g. HAProxy) pointing at that cluster, so your service points to that single location (I've never needed more than one instance of HAProxy running). So yes, you can have one or more instances of Karaf running, each pointing to its own local instance of HAProxy (which points to the Riak cluster). To start, I don't imagine needing more than a single instance of Karaf for this, but this architecture is embarrassingly easy to scale, even with a single instance of Fedora.

    For API-X, I'd have incoming requests pull the data directly from Riak. If that fails (a 404 or otherwise), the request would fall back to fetching the resource directly from Fedora. But yes, I believe your earlier speculation about how it works is correct. (I also store thumbnails and other small binary objects there, since throughput is so much better than Fedora 3; that does change with Fedora 4, but I will probably still cache small binaries like this.)
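The cache-first read path described above (pull from Riak, fall back to Fedora on a miss) can be sketched roughly as follows. The class and function names are assumptions, and the fetch function stands in for real HTTP calls to Riak and to Fedora's REST API.

```python
# Sketch of a cache-first read with Fedora fallback, as described in the
# comment above: try the cache, go to Fedora on a miss, and write the
# result through so the next reader gets a hit. The names here are
# illustrative assumptions, not the sample implementation's API.

class CacheFirstReader:
    def __init__(self, cache, fetch_from_fedora):
        self.cache = cache                    # e.g. a dict, or a Riak client
        self.fetch_from_fedora = fetch_from_fedora

    def get(self, uri):
        doc = self.cache.get(uri)
        if doc is not None:                   # cache hit: Fedora is never touched
            return doc
        doc = self.fetch_from_fedora(uri)     # miss (404 or otherwise): fall back
        self.cache[uri] = doc                 # write through for the next reader
        return doc

# Stand-in for an HTTP GET against Fedora, recording how often it is called.
calls = []
def fake_fedora(uri):
    calls.append(uri)
    return {"@id": uri, "title": "from fedora"}

reader = CacheFirstReader({}, fake_fedora)
reader.get("http://localhost:8080/rest/object1")  # miss: hits Fedora
reader.get("http://localhost:8080/rest/object1")  # hit: served from cache
print(len(calls))  # → 1
```

The second request never reaches Fedora, which is the whole point of the cache: repeated reads are absorbed by the key-value store.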

  5. This seems like a special case of a more general idea: "Use sophisticated caching (equipped with minimizing abilities) in front of Fedora." I'm not sure in what way it "extends the Fedora API"? There are no new functions here...