Time/Place

Attendees

  • Andrew Woods
  • Esme Cowles
  • Chris Beer (star)
  • Daniel Davis *
  • Declan Fleming
  • A. Soroka
  • Benjamin Armintor
  • Zhiwu Xie
  • Neil Jefferies (having issues with my data connection - may not be on the call if I can't fix it!)
  • Yinlin Chen

Note-taker = (star)
Previous note-taker = *

Agenda

  1. Review of areas of assessment
    Action Item: Enhance descriptions of different areas (particularly 6, 7, and 8)
  2. Architecture walk-through
    Notes as comments on the wiki page.
  3. Review to-date performance testing summary
  4. Assign owners to (some number of) areas of assessment

  5. Thought exercise: What would be the technical "risks" of releasing 4.0 Production *now*?
    1. Or, put another way: Where do we want to put next sprint's dev energy?

Discussion

  • Architecture walk-through
    • Message-emitter should be added to F4 diagram
    • Do we need two diagrams? - no
      • as implemented
      • aspirational
    • It would be beneficial to define ci-tests
  • There was interest in testing how to extend the code
  • Next meeting: Wednesday next week, 8/27, at noon ET

Actions

  • (tick) Esme to investigate current ModeShape development roadmap and how it aligns with F4
    • clustering, etc
  • Adam and Ben to assess REST-API (goal of versioning this API)
  • (tick) Dan to enhance descriptions of "Areas of Assessment" numbers 6, 7, and 8

  • Neil to define initial set of system CI tests

 

2 Comments

  1. Some illustrative digital collection profiles for the Bod...

    Department         | Digital Collection      | Size [TB] | Items       | Ave Item Size (MB) | Comments
    -------------------+-------------------------+-----------+-------------+--------------------+------------------------------
    Bodleian Libraries | Digitized Treasures     | 55        | 1,800,000   | 30.556             | 5-year growth factor: 2
                       | Google Books            | 47        | 106,200,000 | 0.443              |
                       | Ephemera                | 0.4       | 10,083      | 39.671             |
                       | Music                   | 20.48     | 14,000      | 1462.857           | 5-year growth factor: 5
                       | Text Corpora (TEI)      | 0.01      | 25,000      | 0.400              |
                       | Maps                    | 300       | 1,000,000   | 300.000            | Number estimated
                       | ORA                     | 0.5       | 155,500     | 3.215              |
                       | WW1 Archive             | 1         | 14,909      | 67.074             |
                       | Archival Records        | 2.05      | 1,000,000   | 2.050              | BEAM (number estimated)
                       | Main Bodleian Inventory | 0.1       | 13,000,000  | 0.008              | MARC catalogue
                       | Letters                 | 3.07      | 80,000      | 38.375             | EMLO, 5-year growth factor: 3
                       | Special Catalogues      | 0.1       | 62,000      | 1.613              | TEI and MLGB3 based
    Ashmolean          | Images                  | 10        | 400,000     | 25.000             | Includes videos and audio
    M. History Science | Records                 | 1.5       | 40,000      | 37.500             |
    Natural History M. | Records/Audio/Video     | 1.5       | 201,000     | 7.463              |
    Pitt Rivers        | Records/Images          | 32        | 1,000,000   | 32.000             |
    Botanic Gardens    | Records/Videos/Images   | 9.2       | 226,000     | 40.708             | Includes Herberia and Bate

    These figures do not include research data, which has the potential to grow by approximately the same total volume as above per annum!
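
    The average item sizes in the profile can be re-derived from the size and item-count columns. The sketch below is only an illustration: the function name is hypothetical, and it assumes decimal units (1 TB = 1,000,000 MB), which is the conversion that reproduces the published figures.

```python
# Assumption: decimal units, 1 TB = 1,000,000 MB (this matches the table's figures).
MB_PER_TB = 1_000_000


def avg_item_size_mb(size_tb: float, items: int) -> float:
    """Average item size in MB for a collection of the given total size."""
    return size_tb * MB_PER_TB / items


# A few rows copied from the table: (collection, size in TB, item count).
rows = [
    ("Digitized Treasures", 55, 1_800_000),
    ("Google Books", 47, 106_200_000),
    ("Music", 20.48, 14_000),
    ("Maps", 300, 1_000_000),
]

for name, size_tb, items in rows:
    print(f"{name}: {avg_item_size_mb(size_tb, items):.3f} MB")
```

    For these four rows the computed values round to 30.556, 0.443, 1462.857 and 300.000 MB, in line with the "Ave Item Size (MB)" column.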

  2. Write performance

    Our most time-sensitive collection (where ingest performance and throughput are important) is a feed of scanned books from an external vendor. With Fedora 3, we managed to pull material from the vendor at a rate of 300 books/hour. Each book was estimated at about 50 MB and may easily contain several hundred page images. The entire dataset is likely around 5 million books.

    Most other collections have no ingest performance targets, other than "fast enough". 
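
    To put the scanned-book figures in perspective, here is a back-of-the-envelope sketch. It uses only the numbers quoted above (300 books/hour, about 50 MB/book, around 5 million books); the function names are hypothetical, and it assumes a constant ingest rate and decimal units (1 TB = 1,000,000 MB).

```python
BOOKS_PER_HOUR = 300      # Fedora 3 ingest rate quoted above
MB_PER_BOOK = 50          # estimated size per book
TOTAL_BOOKS = 5_000_000   # estimated size of the whole dataset


def ingest_throughput_mb_per_s(books_per_hour: float, mb_per_book: float) -> float:
    """Sustained data rate implied by a books/hour ingest rate."""
    return books_per_hour * mb_per_book / 3600


def full_ingest_days(total_books: int, books_per_hour: float) -> float:
    """Days needed to ingest everything at a constant rate (assumption)."""
    return total_books / books_per_hour / 24


def total_size_tb(total_books: int, mb_per_book: float) -> float:
    """Total dataset size, assuming 1 TB = 1,000,000 MB."""
    return total_books * mb_per_book / 1_000_000


print(f"{ingest_throughput_mb_per_s(BOOKS_PER_HOUR, MB_PER_BOOK):.2f} MB/s")
print(f"{full_ingest_days(TOTAL_BOOKS, BOOKS_PER_HOUR):.0f} days")
print(f"{total_size_tb(TOTAL_BOOKS, MB_PER_BOOK):.0f} TB")
```

    Under these assumptions, the quoted rate works out to roughly 4.17 MB/s, and the full dataset (about 250 TB) would take on the order of 694 days of continuous ingest.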


    Read performance

    Our repository currently averages 5 - 10 data change operations / minute, with regular bursts of changes. Indexing operations should be fast enough to keep up with these changes, and we should be able to scale the repository out to handle the read load.

    Currently, we can index about 10 objects / second (this includes pulling all the object metadata from the repository from ~10 XML datastreams, and often from a handful of other supporting objects such as collections and policies). At that rate, we can reindex our entire repository in under a couple of days. Fedora 4 should have comparable or better performance.
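
    The indexing figures can be related to each other with some simple arithmetic. This is a rough sketch only: the function names are hypothetical, "a couple of days" is read as 2 days (an assumption), and the indexing rate is treated as constant.

```python
INDEX_RATE_PER_S = 10     # objects indexed per second, as quoted above
SECONDS_PER_DAY = 86_400


def max_objects_reindexed(days: float, rate_per_s: float) -> float:
    """Upper bound on objects that can be reindexed in the given time."""
    return days * SECONDS_PER_DAY * rate_per_s


def indexing_headroom(changes_per_minute: float, rate_per_s: float) -> float:
    """How many times faster indexing is than the average change rate."""
    return rate_per_s / (changes_per_minute / 60)


# Assumption: "a couple of days" means 2 days.
print(f"{max_objects_reindexed(2, INDEX_RATE_PER_S):,.0f} objects in 2 days")

# Headroom at the upper end of the quoted 5 - 10 changes/minute range.
print(f"{indexing_headroom(10, INDEX_RATE_PER_S):.0f}x headroom over average changes")
```

    Read this way, 10 objects / second covers at most about 1.7 million objects in two days and runs about 60 times faster than the average change rate. Bursts of changes are not modelled here.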