Blog from September, 2008

Week of 2008-09-22

For Akubra, I started this week by scouring the web for existing blob storage APIs.  I've been keeping disorganized notes here and there, and decided it was past time to start a wiki page on the topic.  As the links started piling up, I realized it would be really useful to have a "capability matrix" for the APIs as well as existing implementations.  It's not finished yet, but here's where I'm keeping it: Analysis of Existing Approaches. I hope to have the matrix finished by Monday night.

This week, I also:

  • Reviewed Eddie's Mulgara/SPARQL update for 2.2.4.
  • Updated the subversion config in src/build to use correct line endings for .sh/.bat.  Bill updated the svn properties on the existing files.
  • Got an open source/nonprofit license for the Gliffy Confluence Plugin.  Dan installed it.  I really like the way these plugins can be installed and enabled straight through the Confluence admin interface, while it's running.
  • Merged in FCREPO-253 after Bill's review.
  • Started reviewing Eddie's branch, FCREPO-254.
  • Did some top-secret firewall configuration stuff.  I suppose it's not good for me to go into details publicly.

Week of 2008-09-15

This week, I:

  • Concentrated almost exclusively on Akubra (see below)
  • Did some tweaks on our custom "Reviewer" workflow in our Jira installation
  • Set up nightly postgres snapshots on the production box since we're now in production with Confluence and Jira
  • Copied Eddie's test "Developer's Blog" wiki page and added an Atom feed.  This now lives in, and is linked from, the Developer's Forum space in Jira.  We're now each doing weekly updates as News items in our personal space, which then get aggregated to this page.

Akubra progress

  • Got initial File System (org.fedoracommons.akubra.fs) implementation done
  • Gave an update on Akubra at the weekly architecture meeting on Tuesday
  • Had some good discussions on the Akubra Developer's Mailing List.
  • Learned about the API behind the Storage Resource Manager (SRM), which is used by the Large Hadron
    Collider Computing Grid (LCG).

I also had an initial exchange with Richard Rodgers regarding DSpace-Fedora Commons collaboration on low-level storage. Richard has been working on Pluggable Stores for DSpace 2.0, and from the look of it, we have a lot of the same sensibilities about this layer being thin and free of higher-level OAIS semantics. I'm looking forward to more discussions to come.

Akubra Intro and Update

Below are the notes I used for the Akubra update in today's architecture meeting.

What's Akubra About?

  • A standard Java interface for reading/writing files, but at a different level of abstraction than a filesystem
  • Transactional by design (but implementations may ignore transaction semantics)
  • Exploring web-based exposure
  • From the Akubra wiki: Requirements and Goals

Note: The Akubra wiki is hosted at http://topazproject.org/akubra

Anyone is welcome to sign up to the dev and general mailing lists.

Filesystem vs. BlobStore

Common filesystems:

  • Have directories
  • Can provide system metadata about files (e.g. size, modified date)
  • Allow partial reads and writes of files

An Akubra BlobStore:

  • Has a collection of URI-addressable bitstreams (no "directories")
  • Only provides the size of each file -- is not concerned with other system metadata (yet?).
  • May allow partial reads (InputStreams can skip()...)
  • Does not allow partial writes
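
The "partial reads via skip()" point can be illustrated with plain java.io. This is a minimal sketch, not Akubra code: PartialReadDemo and readRange are hypothetical names, and a ByteArrayInputStream stands in for whatever stream a blob would hand back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not part of Akubra): emulates a partial read on any
// InputStream by skipping to an offset and then reading a bounded number of bytes.
class PartialReadDemo {

    static byte[] readRange(InputStream in, long offset, int length) throws IOException {
        // skip() may skip fewer bytes than requested, so loop until the offset is reached.
        long remaining = offset;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip() returned nothing: read one byte to detect end of stream.
                if (in.read() < 0) {
                    return new byte[0]; // offset lies past the end of the stream
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        // Read up to `length` bytes; stop early if the stream ends.
        byte[] buf = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(buf, total, length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "0123456789".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new ByteArrayInputStream(content)) {
            byte[] part = readRange(in, 3, 4);
            System.out.println(new String(part, StandardCharsets.US_ASCII)); // prints 3456
        }
    }
}
```

Whether skipping is cheap depends on the stream: a file-backed stream can usually seek, while other sources may have to read and discard the skipped bytes.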

Java API

This is in flux.  We are currently testing the design with a simple filesystem implementation.

Blob (A finite, readable bitstream)
BlobStore (For getting connections)
BlobStoreConnection (For CRUD operations) 
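
To make the three roles concrete, here is a rough sketch of how they might fit together. Since the API is in flux, every method name and signature below is an assumption made for illustration, not the actual Akubra interface; MemoryBlobStore is a toy implementation that exists only to exercise the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** A finite, readable bitstream. (Hypothetical signature.) */
interface Blob {
    URI getId();
    long getSize() throws IOException;
    InputStream openInputStream() throws IOException;
}

/** Entry point used only to obtain connections. (Hypothetical signature.) */
interface BlobStore {
    BlobStoreConnection openConnection() throws IOException;
}

/** CRUD operations on whole blobs; no partial writes. (Hypothetical signature.) */
interface BlobStoreConnection extends AutoCloseable {
    Blob putBlob(URI id, InputStream content) throws IOException; // create or replace
    Blob getBlob(URI id) throws IOException;                      // null if absent
    void removeBlob(URI id) throws IOException;
    @Override
    void close();
}

/** Toy in-memory store, only here to exercise the interfaces above. */
class MemoryBlobStore implements BlobStore {
    private final Map<URI, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public BlobStoreConnection openConnection() {
        return new BlobStoreConnection() {
            @Override
            public Blob putBlob(URI id, InputStream content) throws IOException {
                // Copy the whole stream; a blob is always written in one piece.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = content.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                blobs.put(id, out.toByteArray());
                return getBlob(id);
            }

            @Override
            public Blob getBlob(final URI id) {
                final byte[] data = blobs.get(id);
                if (data == null) {
                    return null;
                }
                return new Blob() {
                    @Override public URI getId() { return id; }
                    @Override public long getSize() { return data.length; }
                    @Override public InputStream openInputStream() {
                        return new ByteArrayInputStream(data);
                    }
                };
            }

            @Override
            public void removeBlob(URI id) {
                blobs.remove(id);
            }

            @Override
            public void close() {
                // Nothing to release for the in-memory toy.
            }
        };
    }
}
```

A caller would open a connection in a try-with-resources block, call putBlob with a URI and a stream, and read the content back through getBlob(id).openInputStream().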

Transactions

  • Level of support varies per implementation (some can "fake it")
  • Why: To execute a mixed set of CRUD operations on several files as one atomic unit of work.
  • Observation: We can build a transactional blob store on top of a non-transactional one...with the help of a DB.

Example non-transactional BlobStore: FSBlobStore (see FSBlobStoreConnection)

Higher-level BlobStore TBD:

  • Uses FSBlobStore to persist data
  • Uses database to support transactions (via id mapping)
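
One way the id-mapping idea could work is sketched below. This is only an illustration under stated assumptions: IdMappedStore and Tx are hypothetical names, a HashMap stands in for the non-transactional store (FSBlobStore in the real design), and a second HashMap stands in for the database table that maps public blob ids to internal ids. The sketch is single-threaded and ignores crash recovery.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: transactions layered over a non-transactional store via id mapping.
class IdMappedStore {
    // Stand-in for the non-transactional store, keyed by internal id.
    private final Map<String, byte[]> rawStore = new HashMap<>();
    // Stand-in for the database table: public blob id -> internal id.
    private final Map<URI, String> idMap = new HashMap<>();

    class Tx {
        // Changes made in this transaction; a null value marks a delete.
        private final Map<URI, String> pending = new HashMap<>();

        void put(URI id, byte[] content) {
            // Write under a fresh internal id so committed data is never overwritten.
            String internalId = UUID.randomUUID().toString();
            rawStore.put(internalId, content.clone());
            String superseded = pending.put(id, internalId);
            if (superseded != null) {
                rawStore.remove(superseded); // earlier write in the same transaction
            }
        }

        void delete(URI id) {
            String superseded = pending.put(id, null);
            if (superseded != null) {
                rawStore.remove(superseded);
            }
        }

        byte[] get(URI id) {
            // A transaction sees its own pending changes first.
            String internalId = pending.containsKey(id) ? pending.get(id) : idMap.get(id);
            return internalId == null ? null : rawStore.get(internalId);
        }

        void commit() {
            // With a real database, this loop would be a single DB transaction.
            for (Map.Entry<URI, String> e : pending.entrySet()) {
                String old = (e.getValue() == null)
                        ? idMap.remove(e.getKey())
                        : idMap.put(e.getKey(), e.getValue());
                if (old != null) {
                    rawStore.remove(old); // old content is now unreachable
                }
            }
            pending.clear();
        }

        void rollback() {
            // Discard content written by this transaction; the mapping was never touched.
            for (String internalId : pending.values()) {
                if (internalId != null) {
                    rawStore.remove(internalId);
                }
            }
            pending.clear();
        }
    }

    Tx begin() {
        return new Tx();
    }

    // Non-transactional read of committed state.
    byte[] read(URI id) {
        String internalId = idMap.get(id);
        return internalId == null ? null : rawStore.get(internalId);
    }
}
```

The point of the indirection is that the underlying store only ever sees whole-blob writes and deletes, while atomicity comes entirely from updating the mapping in one step.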

Other possible storage Plug-Ins:

  • S3 (anything based on current LLStore should be easy to port over)
  • ZFS (already transactional, does not need layering)
  • Centera (content-addressable...ids not available till content is written)
  • SAM/QFS (hierarchical storage implies graceful handling of delays...not a use case we've factored in yet)
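
The content-addressable case is worth a small illustration, because it inverts the usual "caller picks the id" flow. The sketch below is an assumption-laden toy: ContentAddressedStore is a hypothetical class, and deriving the id from a SHA-256 digest under a made-up "urn:sha256:" prefix is just one way to do content addressing, not a description of how Centera computes its addresses.

```java
import java.net.URI;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a content-addressed store: the id is derived from the
// content, so the caller only learns it after the write has happened.
class ContentAddressedStore {
    private final Map<URI, byte[]> blobs = new HashMap<>();

    /** Stores the content and returns the id computed from it. */
    URI write(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            // "urn:sha256:" is an illustrative prefix, not a registered scheme.
            URI id = URI.create("urn:sha256:" + hex);
            blobs.put(id, content.clone());
            return id;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the stored content, or null if the id is unknown. */
    byte[] read(URI id) {
        byte[] data = blobs.get(id);
        return data == null ? null : data.clone();
    }
}
```

An interface where the caller supplies the URI up front does not fit this model directly, so a plug-in of this kind would need either store-assigned ids or a mapping layer between caller-chosen ids and content addresses.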

Web-based exposure?

  • Opens up use of akubra-java impls to other (remotely-running) programs
  • Allows an akubra impl that acts as a client to a remote akubra instance
  • Lots of interesting possibilities!

Week of 2008-09-08

This week I completed the initial migration from the Sourceforge Trackers to our Jira installation.  They're now under the Jira Project, FCREPO (view all issues).  This process gave me a chance to learn quite a bit about Jira and how to customize things. 

Bill and I talked late in the week about how to use Jira to replace our current branch log (a history of who worked on what branches, when).  What we finally came up with was a new issue type, "Code Task", whose id will serve as the id for the branch (e.g. fedora/branches/fcrepo-123).  We also added a "review" step to the workflow for these types of issues.  Besides being easier than hand-updating the old branch-log.txt file, we now have an easy way to list the outstanding branches (view outstanding branches).

I also started reviewing many of the long-outstanding feature requests and found a lot that we need to revisit and either specify better or decide if they're still relevant.  To logically separate these from those features we know are on the immediate horizon, I've given them a "Pending Feature Request" issue type.

On Thursday, I applied a patch submitted by Atos, which fixes a thread-safety issue with file uploads and improves performance when many uploads are happening at once.  This has gone into my bugfix branch for 3.1, which I'll need to get reviewed soon.

On Friday, Aaron and I talked with members of the Digital Library Infrastructure team at Stanford.  They're getting up and running with Fedora and are in the process of deciding how to model their objects.  They have quite a large collection of digitized books, image collections, special collections, and archives, and will be using Fedora as an object registry in their architecture.  One thing that came up in the call was that the out-of-box modeling language in Fedora 3.0 doesn't have ordinality/cardinality constraints.  I don't know how the ds-composite-model schema will evolve in future releases, so I recommended for the time being that they use a separate datastream in the CModel object to hold such constraints.

This morning I've been looking into doing a scripted migration of our Sourceforge trackers to Jira.  I wasn't able to find a working migration utility that does this, but it looks like it won't be too bad.  Sourceforge allows the export of all project data (including trackers) to a big XML file, and Jira has several import options, the most palatable being the Jelly scripting facility.  Right now it's a matter of doing the right mapping.