2023-06-13 Technical WG Agenda and Notes

Date

13 Jun 2023

Attendees

Karen Hanson, Dave Vieglais, Greg Janee, Donny Winston, Tom Creighton, John Kunze

Goals

ARK spec and NAAN schema transition

Discussion items

Item	Who	Notes
Announcements
Any news items we should blog about? Any calls for papers, submission deadlines, upcoming meetings we should note? Please add to Calendar of events. Please do a quick review of the community update draft.		dw: I'll be at IDW (International Data Week). gj: https://datacurationnetwork.org/events/annual-meeting/ dw, kh: blog post looks fine.
ARK spec transition plan. Tom Creighton: analysis of event date ordering and dependencies. Date of interest to NAAN group: when do we advise new NAAN holders to resolve both forms? Do we point them to one of the newer specs?		We didn't get to this. Will set up a separate meeting.
Topic: saving periodic dumps of ARK metadata. Any thoughts on this exchange around April 20 on the arks-forum?
AK: "My question was primarily focused on the long-term sustainability of what you are naming secondary content, that is, metadata. It is promising for the future of the ARK system, that there could be enhancements in the latest N2T software that may extend its capabilities in a way that opens the system to more ARK organizations, perhaps enabling them to deposit metadata in external storage."
DW: "One position on ensuring long-term metadata availability is the "Available data" bullet of <https://openscholarlyinfrastructure.org/#insurance>, i.e. "Underlying data should be made easily available via periodic data dumps." Crossref has adopted this position, and for its allocation of DOIs and stewardship of associated metadata, has so far provided three annual dumps available via torrent (last blog post, search for "Crossref" on academictorrents.com, landing page for their last (April 2022) dump). Their last dump, in April 2022, contained 134M records and is 160GB. DataCite has not yet provided a similarly clear dump of their DOI holdings, but someone has taken an interest in doing this for them, posting the dumps to archive.org, e.g. https://archive.org/details/datacite_dump_20221118 is the latest there. This, of course, is still fragmented for DOI holdings, i.e. one needs to gather such dumps from each DOI provider. This is perhaps a practically sustainable situation for the DOI system because the various providers are known and relatively (vs the ARK system) few in number. For the ARK community, I can see clear value in voluntary consolidation of e.g. CC0-licensable metadata across NAAs to a shared store on a periodic (e.g. quarterly or annual) basis. So yes, I also am interested in ongoing discussion on this topic. P.S. Another potential "leg" of redundancy is to use Amazon's current Open Data program (https://aws.amazon.com/opendata/) as e.g. the OpenAlex effort does for dumps (https://docs.openalex.org/download-all-data/download-to-your-machine). I stress "leg" here because by no means am I suggesting any singular dependence whatsoever on this large corporation's current offering of free hosting."
NT: "It's been interesting following this discussion. I'm glad that Donny is pointing to POSI. Might a lightweight approach to storing additional copies of the n2t metadata are public github and gitlab repositories that could be updated monthly or on some sort of periodic basis through a simple git commit and push? If additional preservation is desired, internet archive or OSF might be a good choice."
JK: "I'll make sure to add it to the agenda in the ongoing discussions we are having in the ARK Alliance Advisory Group."
		dw: I think it'd be useful to have a dump of the fact that, for example, ark:12090/ is intended to pass through to https://lib.cam.ac.uk/ark:$id . This is currently only known if https://n2t.net/ark:12090/ resolves. So at a minimum, a dump of https://n2t.net/e/pub/naan_registry.txt and each https://n2t.net/ark:<naan>/ response. One option is the Internet Archive, depending on our initial scale: https://archive.org/details/ia_biblio_metadata e.g. the ORCID public data file: https://archive.org/details/orcid-dump-2021
		dv: ideally, the NAAN registry will allow a preferred forwarding form, then new orgs could do it
		gj, tc: dumps may not be as useful as OAI-PMH
		tc: where do the dumps live, where do they come from?
		gj: POSI says this data should be accessible, without saying how
		dw: yes, e.g., ORCID stores annually in IA, signaling at least intent
		tc: would be reasonable for FamilySearch to provide metadata on demand
		dv: would be huge to have a list of ids minted
		dv: wonder if we could recommend that resolvers provide a list of ids
		dw: particularly for an opaque identifier, some descriptive information is helpful. For DOIs for publications, this metadata is typically citation metadata such as (author, title, journal, date, etc.), so the interest would be in archiving the equivalent of (who, what, when, where) as per e.g. the ERC metadata kernel advocated in the ARK draft spec: https://www.ietf.org/archive/id/draft-kunze-ark-37.html#name-the-electronic-resource-cita
		dw: can be hard to put up an API compared to a file
		kh: challenging to get everyone to do the same thing
		tc: we have to deal with attacks
		dv: a list of ARKs would increase the attack surface
		dv: It's more that advertising the list bypasses one chunk of work done by scrapers to discover the entry points to the data.
		dw: attack surface suggests that some ARKs aren't meant to be open
		tc: we have obligations to get data out to partner sites
Dave Vieglais: produce documentation and new NAAN schema for validation and form processing (Donny Winston can help)		dv: https://github.com/CDLUC3/naan_reg_public possibly part of Maria's new meeting on proposed NAAN schema changes
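
Illustration for the periodic-dumps topic: a minimal Python sketch of what dw's "at a minimum" dump might enumerate, plus one possible way to serialize the (who, what, when, where) kernel fields as a line in a flat file. Nothing here was agreed in the meeting. The function names, the JSON-lines layout, and the example ARK are hypothetical; only the two n2t.net URL forms come from the notes. The sketch does no network access and assumes the list of NAANs is obtained separately.

```python
import json
from dataclasses import asdict, dataclass

# URL forms quoted in the notes above.
REGISTRY_URL = "https://n2t.net/e/pub/naan_registry.txt"
RESOLVER_TEMPLATE = "https://n2t.net/ark:{naan}/"


def dump_targets(naans):
    """Return the URLs a minimal dump would capture: the registry file,
    then one resolver URL per distinct NAAN (sorted for stable output)."""
    urls = [REGISTRY_URL]
    urls.extend(RESOLVER_TEMPLATE.format(naan=n) for n in sorted(set(naans)))
    return urls


@dataclass
class KernelRecord:
    """Hypothetical container for the four kernel fields named in the notes."""
    who: str
    what: str
    when: str
    where: str


def to_dump_line(ark, record):
    """Serialize one ARK and its kernel fields as a single JSON line.
    The JSON-lines layout is an assumption, not an agreed dump format."""
    return json.dumps({"ark": ark, **asdict(record)}, sort_keys=True)


if __name__ == "__main__":
    for url in dump_targets(["12090"]):
        print(url)
    # "ark:12090/example" is a made-up identifier for illustration only.
    print(to_dump_line(
        "ark:12090/example",
        KernelRecord(who="(creator)", what="(title)", when="(date)", where="(location)"),
    ))
```

A flat file of this kind is closer to dw's point that a file can be easier to put up than an API; it does not address gj and tc's point that OAI-PMH may be more useful, or dv and tc's concerns about advertising a full list of ARKs.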

Action items
