2014-08-20 - Fedora Technical WG Meeting

Created by Andrew Woods, last modified on Oct 13, 2014

Time/Place

Time: 12:00pm Eastern Daylight Time US (UTC-4)
Place: Google-hangout, https://plus.google.com/hangouts/_/event/c1glu6soq43r1rr6ou17qtobug8

Attendees

Andrew Woods
Esme Cowles
Chris Beer
~~Daniel Davis~~ *
~~Declan Fleming~~
A. Soroka
~~Benjamin Armintor~~
Zhiwu Xie
Neil Jefferies (having issues with my data connection - may not be in the call if I can't fix it!)
Yinlin Chen

Note-taker =
Previous note-taker = *

Agenda

Review of areas of assessment
Action Item: Enhance descriptions of different areas (particularly 6, 7, and 8)
Architecture walk-through
Notes as comments on the wiki page.
Review to-date performance testing summary
Assign owners to (some number of) areas of assessment
Thought exercise: "What would be the technical 'risks' of releasing 4.0 Production *now*?"
1. Or, put another way: "Where do we want to put next sprint's dev energy?"

Discussion

Architecture walk-through
- Message-emitter should be added to F4 diagram
- Do we need two diagrams? - no
  - as implemented
  - aspirational
- It would be beneficial to define CI tests
- There was interest in testing how to extend the code
Next meeting: Wednesday, 8/27, at noon ET

Actions

Esme to investigate current ModeShape development roadmap and how it aligns with F4
- clustering, etc
~~Adam and Ben to assess REST-API (goal of versioning this API)~~
Dan to enhance descriptions of "Areas of Assessment" numbers 6, 7, and 8

~~Neil to define initial set of system CI tests~~


2 Comments

Neil Jefferies

Some illustrative digital collection profiles for the Bod...

Department	Digital Collection	Size [TB]	Items	Ave Item Size (MB)	Comments
Bodleian Libraries	Digitized Treasures	55	1,800,000	30.556	5-year growth factor: 2
	Google Books	47	106,200,000	0.443
	Ephemera	0.4	10,083	39.671
	Music	20.48	14,000	1462.857	5-year growth factor: 5
	Text Corpora (TEI)	0.01	25,000	0.400
	Maps	300	1,000,000	300.000	Number estimated
	ORA	0.5	155,500	3.215
	WW1 Archive	1	14,909	67.074
	Archival Records	2.05	1,000,000	2.050	BEAM (number estimated)
	Main Bodleian Inventory	0.1	13,000,000	0.008	MARC catalogue
	Letters	3.07	80,000	38.375	EMLO, 5-year growth factor: 3
	Special Catalogues	0.1	62,000	1.613	TEI and MLGB3 based
Ashmolean	Images	10	400,000	25.000	Includes videos and audio
M. History Science	Records	1.5	40,000	37.500
Natural History M.	Records/Audio/Video	1.5	201,000	7.463
Pitt Rivers	Records/Images	32	1,000,000	32.000
Botanic Gardens	Records/Videos/Images	9.2	226,000	40.708	Includes Herbaria and Bate

Does not include research data which has the potential to grow at approximately the same total volume as above per annum!
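The "Ave Item Size (MB)" column can be reproduced from the other two numeric columns. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB), which is the conversion that matches the figures in the table; the function name and the choice of sample rows are illustrative only:

```python
# Assumption: decimal units, 1 TB = 1,000,000 MB (this reproduces the table's figures).
MB_PER_TB = 1_000_000


def avg_item_size_mb(size_tb, items):
    """Average item size in MB, rounded to three decimals as in the table."""
    return round(size_tb * MB_PER_TB / items, 3)


# A few rows copied from the table: (collection, size in TB, item count)
rows = [
    ("Digitized Treasures", 55, 1_800_000),
    ("Music", 20.48, 14_000),
    ("Letters", 3.07, 80_000),
]

for name, size_tb, items in rows:
    print(f"{name}: {avg_item_size_mb(size_tb, items)} MB")
```

The same function could be applied to any other row to sanity-check an estimate when item counts or sizes are revised.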


Aug 20, 2014

Chris Beer
Write performance
Our most time-sensitive collection (where ingest performance and throughput are important) is a feed of scanned books from an external vendor. With Fedora 3, we managed to pull material from the vendor at a rate of 300 books/hour. Each book was estimated at about 50 MB and may easily contain several hundred page images. The entire dataset is likely around 5 million books.
Most other collections have no ingest performance targets, other than "fast enough".
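To put the ingest figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes the Fedora 3 rate is sustained around the clock and uses decimal units; neither assumption is stated in the comment, and the function name is illustrative:

```python
# Figures from the comment above; continuous 24x7 ingest is an assumption.
BOOKS_PER_HOUR = 300
MB_PER_BOOK = 50          # estimated size of one scanned book
TOTAL_BOOKS = 5_000_000   # approximate size of the full dataset


def ingest_estimate(books_per_hour, mb_per_book, total_books):
    """Return (MB per second, days to ingest everything, total size in TB)."""
    mb_per_second = books_per_hour * mb_per_book / 3600
    days = total_books / books_per_hour / 24
    total_tb = total_books * mb_per_book / 1_000_000  # decimal TB
    return mb_per_second, days, total_tb


mb_s, days, total_tb = ingest_estimate(BOOKS_PER_HOUR, MB_PER_BOOK, TOTAL_BOOKS)
print(f"{mb_s:.2f} MB/s sustained, {days:.0f} days, {total_tb:.0f} TB in total")
```

Under these assumptions the feed works out to roughly 4.2 MB/s sustained, about 694 days to ingest the full dataset, and about 250 TB in total.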

Read performance
Our repository currently averages 5-10 data change operations per minute, with regular bursts of changes. Indexing operations should be fast enough to keep up with these changes, and we should be able to scale the repository out to handle the read load.
Currently, we can index about 10 objects/second (this includes pulling all of the object metadata from the repository, from ~10 XML datastreams and often a handful of other supporting objects such as collections and policies). At that rate, we can reindex our entire repository in under a couple of days. Fedora 4 should have comparable or better performance.
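The indexing figures can be turned into a quick capacity check. This is only a sketch: the repository's object count is not given in the comment, so the 1,000,000-object input below is a hypothetical value, and the helper names are illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60
INDEX_RATE = 10        # objects per second, from the comment above
PEAK_CHANGE_RATE = 10  # data change operations per minute (upper end of 5-10)


def reindex_days(object_count, objects_per_second=INDEX_RATE):
    """Days needed to reindex object_count objects at a constant rate."""
    return object_count / objects_per_second / SECONDS_PER_DAY


def max_objects(days, objects_per_second=INDEX_RATE):
    """Largest repository that can be fully reindexed within the given days."""
    return int(days * SECONDS_PER_DAY * objects_per_second)


def indexing_headroom(objects_per_second=INDEX_RATE, changes_per_minute=PEAK_CHANGE_RATE):
    """How many times faster indexing is than the average change rate."""
    return objects_per_second * 60 / changes_per_minute


print(max_objects(2))                      # objects indexable in two days
print(f"{reindex_days(1_000_000):.2f}")    # hypothetical 1,000,000-object repository
print(indexing_headroom())                 # ratio of index rate to change rate
```

At 10 objects/second, two days of indexing covers 1,728,000 objects, and the index rate is about 60 times the upper end of the stated average change rate, which leaves room for the bursts mentioned above.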
- Aug 20, 2014

All content on the LYRASIS Wiki is licensed under the CC BY (Attribution) license, unless otherwise noted.