Time: 1:00 PM EST / 10:00 AM PST
Call-In Info: 712-775-7035 (Access Code: 960009)
Moderator: Eben English (Boston Public Library)
Notetaker: Cliff Wulfman (Etherpad link: https://etherpad.wikimedia.org/p/Samvera_Newspapers_Interest_Group_Call__2018-01-04)
- Nick Homenda (Indiana)
- Gordon Leacock (Michigan)
- Brian McBride (Utah)
- Clifford Wulfman (Princeton)
- Sean Upton (Utah)
- Eben English (BPL)
- Ingest Scenarios: https://docs.google.com/document/d/10KnzsHubEeRRVH1K8CoCVzq5295ONzq94VHDUMzAskg/edit?usp=sharing
Working on documentation related to ingest use cases (PDF, NDNP data, etc.)
What is required for each?
What files already exist?
What derivatives need to be created?
- How should full text be indexed?
How should batch ingest UX be structured?
- Content Examples: https://drive.google.com/drive/folders/0BwKKtxaBVqjEbE5zMFdWUEU4WGM?usp=sharing
- PDF issues
- Still need:
- Design Overview docs
- NewspaperWorks (admin gem): https://docs.google.com/document/d/1X6OLz9OfoyMyUBsCuLUMROe9EBqF6upDqPhPKFQxLAY/edit?usp=sharing
- NewspaperViews (display gem): https://docs.google.com/document/d/1LorDyCVB9UW6exfA1y5fG2bQb6ASXLHa03nNgQYrtZA/edit?usp=sharing
- Intel sharing from other groups/members
- Next meeting: Thursday February 1, 1 PM EST
Grant updates: The kick-off has happened and the project is under way. Work is starting with the models laid out in the PCDM docs, implemented as Ruby code, along with the relations among objects. Getting things wired up is the first order of business. Next up will be create and edit forms for newspapers at the title level, then proceeding down the chain to file attachments. There is not much code to share yet, but GitHub repos have been set up and will be shareable shortly.
PCDM Profile: There have been updates since the last meeting (see links above). The most significant is the addition of an object for newspaper-article images (images that represent just an article). These are PCDM filesets that are children of newspaper article objects. There was discussion of whether to order articles in an issue (use cases: presenting a TOC, or presenting the user with an ordered list of articles on a page). Proxies for article and issue were added to support that ordering.
For newspaper container objects (microfilm reels, bound volumes), which sometimes have non-newspaper content (e.g., reference targets or title cards), we've added a fileset that is a child of the newspaper container object and that can act as a parent for all those things. Ordering does not seem important for these.
For the newspaper-issue object, you might have a PDF or full text of that object, so there's a fileset for those, too.
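As an illustration only (this is not code from the project, which is being written in Ruby), the relationships described above could be sketched roughly as follows. The class names, the `kind` labels, the file names, and the use of a plain ordered list in place of real ordering proxies are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSet:
    """A PCDM-style fileset: a bundle of related files (image, PDF, text...)."""
    label: str
    files: List[str] = field(default_factory=list)


@dataclass
class PcdmObject:
    """A PCDM-style object with child objects, filesets, and optional ordering."""
    kind: str   # hypothetical labels: "issue", "article", "container"
    title: str
    members: List["PcdmObject"] = field(default_factory=list)
    filesets: List[FileSet] = field(default_factory=list)
    # Simplified stand-in for ordering proxies: member titles in display order.
    ordered_members: List[str] = field(default_factory=list)

    def add_member(self, child: "PcdmObject", ordered: bool = False) -> "PcdmObject":
        """Attach a child object, optionally recording its position."""
        self.members.append(child)
        if ordered:
            self.ordered_members.append(child.title)
        return child


def table_of_contents(issue: PcdmObject) -> List[str]:
    """Return article titles in their recorded order (the TOC use case)."""
    return list(issue.ordered_members)


# An issue with its own fileset (PDF and full text of the whole issue).
issue = PcdmObject("issue", "Example issue")
issue.filesets.append(FileSet("issue files", ["issue.pdf", "issue.txt"]))

# An ordered article whose image lives in a child fileset.
article = issue.add_member(PcdmObject("article", "Example headline"), ordered=True)
article.filesets.append(FileSet("article image", ["article-0001.tif"]))

# A container whose fileset holds non-newspaper content; no ordering needed.
reel = PcdmObject("container", "Example microfilm reel")
reel.filesets.append(FileSet("non-newspaper content", ["target.tif", "title-card.tif"]))
```

Calling `table_of_contents(issue)` returns the article titles in the order they were added, which is the behavior the article and issue proxies are meant to support.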
Question: where would the OCR for a particular article or a particular page go?
Answer: A page is a PCDM fileset that contains any combination of files you might have. That's at the page level. At the article level, you might have OCR for an article without an image. It might be possible to store the OCR in a coordinate-based format, or something other than plain text; i.e., wouldn't this just be another file in the fileset (an ALTO file, an hOCR file, etc.)? Plain text makes indexing or reindexing much easier, though.
Maybe page isn't the right level for articles: maybe change Newspaper Article Image to Newspaper Article Fileset?
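To make the trade-off concrete: if only a coordinate-based OCR file such as ALTO is stored in the fileset, plain text for indexing can still be derived from it. The following is a minimal sketch, not the group's chosen approach. The function names and the sample document are hypothetical, and hyphenation (`HYP`) elements and other ALTO details are ignored.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_plain_text(alto_xml: str) -> str:
    """Join the CONTENT of each String element, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(word for word in words if word))
    return "\n".join(lines)


# Hypothetical sample: two text lines with word coordinates.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Boston" HPOS="10" VPOS="20" WIDTH="60" HEIGHT="12"/>
      <SP WIDTH="5"/>
      <String CONTENT="Gazette" HPOS="75" VPOS="20" WIDTH="70" HEIGHT="12"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Morning" HPOS="10" VPOS="40" WIDTH="70" HEIGHT="12"/>
      <SP WIDTH="5"/>
      <String CONTENT="edition" HPOS="85" VPOS="40" WIDTH="65" HEIGHT="12"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

plain_text = alto_to_plain_text(SAMPLE_ALTO)
```

Here `plain_text` holds the two lines "Boston Gazette" and "Morning edition". Storing the derived text as its own file would avoid re-parsing the ALTO on every reindex, at the cost of keeping two files in sync.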
Redundancy of the diagram: it's evolved over time and gotten a bit cluttered... Eben will come up with a simplified version.
(See doc above). Developed as part of grant application to flesh out some of the use cases for batch ingest, etc. What would actually have to happen in the application to fulfill these scenarios? What is an administrator's user experience when performing an ingest? This was half-brainstorming, half getting into the nitty-gritty of what batch ingest of particular formats actually means.
(Eben reviews the document; comments welcome)
Google Drive folders have been set up (see link) so people can provide samples of real-world products from vendors. We could really use some CONTENTdm, Olive, or Veridian exports. There are different flavors of METS/ALTO, too. Please feel free to add!
Design Overview document
Members are encouraged to review this Agile-like doc, which includes a rough roadmap.
(none at this time)