Time: 9:00am PDT / Noon EDT
Call-In Info: 1-641-715-3660, access code 651025
Moderator: Justin Coyne
Notetaker: Jennifer Lindner
- LaRita Robinson (Notre Dame)
- Lynette Rayle (Cornell)
- Adam Wead, Carolyn Cole, Mike Tribone, Dan Coughlin (Penn State)
- Collin Brittle (Emory)
- Michael J. Giarlo (Stanford)
- Justin Coyne (Stanford)
- Trey Pendragon (Princeton)
- Jennifer Lindner (DCE)
- Peter Binkley (Alberta)
- Chris Colvard (Indiana University)
- Steven Ng (Temple)
- Roll call by timezone, in the following order; ensure the notetaker is present (moderator)
  - folks outside North and South America
  - folks who were missed or who dialed in during roll call
- Welcome all newcomers!
- Agenda (moderator)
- Moderator/notetaker for next time (moderator)
- After call, this week's notetaker should create the agenda for the next call.
B. Proposed changes to how Hyrax handles uploaded files (Joe Atzberger)
The current code assumes that workers run on the same box as the uploaded files. That is okay for some things but not all, and it does not work for remote files. To support files that are not on the same box we need an abstraction layer. We propose using the CarrierWave model in the places where it isn't currently used. It is used in most of the code, just not all the way through into things like update jobs. There are tradeoffs: at the scripting level it's easier to think about local files, and if you have remote storage you have to copy files into the CarrierWave cache, which means more IO. In some cases that's not a big deal, but it can slow down a job.
Trey - We ran into this with 4 workers that used NFS-stashed files -- we want to make sure we don't get a significant slowdown.
Is there a way to have the abstraction layer support a local/shared filesystem as well? Joe - CarrierWave does this; we'd need to implement it.
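To make the tradeoff in this exchange concrete, here is a minimal sketch of what such an abstraction layer could look like. It is not Hyrax or CarrierWave code: the class names `LocalSource` and `RemoteSource`, the method `with_local_path`, and the simulated fetcher are all hypothetical, and only the Ruby standard library is used. The point is that a local or shared-filesystem source can hand back its existing path with no extra IO, while a remote source has to copy into a cache first.

```ruby
require "tmpdir"

# Hypothetical adapter for files already on a local or shared filesystem (e.g. NFS).
class LocalSource
  def initialize(path)
    @path = path
  end

  # Yields the existing path directly; nothing is copied, so no extra IO.
  def with_local_path
    yield @path
  end
end

# Hypothetical adapter standing in for remote storage. The remote fetch is
# simulated by a callable that writes the file's bytes to a given IO.
class RemoteSource
  def initialize(name, fetcher)
    @name = name
    @fetcher = fetcher
  end

  # Copies the file into a temporary cache directory, yields the cached path,
  # and removes the cache directory when the block returns.
  def with_local_path
    Dir.mktmpdir("ingest-cache") do |dir|
      cached = File.join(dir, @name)
      File.open(cached, "wb") { |io| @fetcher.call(io) }
      yield cached
    end
  end
end

# A job written against the shared interface does not need to know
# where the file actually lives.
def byte_size_of(source)
  source.with_local_path { |path| File.size(path) }
end
```

In this sketch the slowdown Trey is concerned about would only apply to `RemoteSource`; jobs reading from a shared filesystem would keep their current cost.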
It was asked whether Adam or others who use a shared filesystem could document how they use it in file ingest. In general it might be nice to have a wiki page about this.
Anna - I advocate for considering the cost of the additional abstraction, and so oppose the option of supporting the local filesystem.
Joe - The use case for this new work in Hyrax is to cater to high-end systems and high-volume jobs, so I advocate for this one; we get a better return by serving the more common, prioritized use case.
There's a PR -- I think it's 1206 -- and one benefit is that it really simplifies the interface. Using CarrierWave is much better than what we have now, which is passing arguments to each job that needs them.
Another important drawback of the current code: we cast any file we ingest to a string and try to hold it all in memory. Obviously that's a problem with a 50GB file; anything would be an improvement.
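As an illustration of the alternative to holding a whole file in memory as a string, the following sketch copies an IO in fixed-size chunks. It is plain Ruby using only the standard library, not the code from the PR; the method name `stream_copy` and the 1 MiB chunk size are assumptions made for the example.

```ruby
require "digest"

CHUNK_SIZE = 1024 * 1024 # 1 MiB per read; an arbitrary value chosen for this sketch

# Copies source_io to dest_path in fixed-size chunks and returns
# [bytes_copied, sha1_hex]. Only one chunk is held in memory at a time,
# so memory use does not grow with the size of the file.
def stream_copy(source_io, dest_path, chunk_size: CHUNK_SIZE)
  digest = Digest::SHA1.new
  bytes = 0
  File.open(dest_path, "wb") do |dest|
    # IO#read with a positive length returns nil at end of file.
    while (chunk = source_io.read(chunk_size))
      dest.write(chunk)
      digest.update(chunk)
      bytes += chunk.bytesize
    end
  end
  [bytes, digest.hexdigest]
end
```

With this approach a 50GB file would be read in many small pieces rather than as a single string, at the cost of a loop in place of one read call.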
CarrierWave has two directories, one for store and one for cache, so it should work for the local filesystem use case.
Action items -- call for review of the current PR, Joe's code. Right now it prevents failure if the file isn't local, but it runs jobs inline. After this gets merged, we'll need new code with an abstraction layer for remote files, and that will require testing.
Merging PR 1206 means we will no longer be able to run async jobs. A local or shared filesystem is not a requirement of Hyrax. It was not intentional and not explicit, but there is a dependency on shared filesystems. We changed the code so that we could use S3, but missed the use case of updating a file, and now we have the bug that Joe wants to fix.
So we could say that the PR is blocked until the next PR is ready, put it in a branch, and not merge until the next release. But it does solve a real bug, just at a cost in performance. Let's put these concerns in the PR and discuss them there.
Update on the Collections enhancement working group -- we're in our 2nd phase, mockups for the UI. The third phase is development sprints, and we have timelines in place. In early July the organizational work starts, getting issues written up, then two two-week sprints, for a total of six weeks. There's a place to sign up by entering your name and institution. Michigan is hosting August 21st. People can sign up to say it's of interest, but that doesn't commit anyone. A number of documents and requirements came out of the working group, one of which is collection type. So, for instance, you could create a user collection type with configuration values such as discoverable. Applications could create an exhibit type or a user type.