Page tree
Skip to end of metadata
Go to start of metadata

What are ARKs?

ARKs (Archival Resource Keys) are high-functioning identifiers that lead you to things and to descriptions of those things. For example, this ARK,

     https://n2t.net/ark:/67531/metadc107835/

gets you to a dissertation, and adding a '?' on the end of the ARK should get you to its description:

     https://n2t.net/ark:/67531/metadc107835/?

What's an identifier?

On the internet, an identifier is a URL, or part of a URL. For example, this basic ARK identifier,

                            ark:/12148/btv1b8449691v/f29

appears inside two different URLs (Uniform Resource Locators, also known as web links or web addresses):

     https://gallica.bnf.fr/ark:/12148/btv1b8449691v/f29

            https://n2t.net/ark:/12148/btv1b8449691v/f29

ARKs are especially good at being persistent identifiers.

What's a persistent identifier?

The average lifetime of a URL was once said to be 44 days. At the end of its life, a URL link breaks, meaning it gives you the dreaded "404 Not Found" error that most of us have seen. Irritating as that may be, it's politically awkward when looking for publicly funded research, and it's a cultural disaster for libraries, archives, museums, and other memory organizations.

persistent identifier (sometimes abbreviated PID) is a link that in principle keeps working far into the future, even as things move between websites. Normally when things move, everyone who ever recorded the old links would need to be told what the new links are, which is next to impossible. That's where identifier resolvers come in.

What's a resolver?

resolver is a website that specializes in forwarding incoming identifiers (the ones originally advertised to users) to whichever websites are currently best able to deal with them. Overall, forwarding is called resolution; one step in a resolution process is called redirection

For a resolver to work, its hostname must be carefully chosen so it won't ever need to be changed. Memory organizations, some of them centuries old, tend to have hostnames well-suited to be resolvers. Some well-known, younger resolvers are n2t.net (the ARK resolver), identifiers.org, doi.org, handle.net, and purl.org.

What are ARKs used for?

For anything and everything. Uses of ARKs include

  • digital content, such as genealogical records (FamilySearch)
  • publisher content (Portico)
  • digitized manuscripts (Gallica)
  • texts (Internet Archive)
  • museum holdings (Smithsonian)
  • vocabulary terms (yamz.net, perio.do)
  • historical figures (snaccooperative.org)
  • datasets, journals, living beings, and more.

Why would I use ARKs compared to, for example, DOIs?

  • To keep costs down.
  • To work with exactly the metadata I want.
  • To be able to create identifiers without metadata.
  • To create an identifier as soon as I create the first draft of my data.
  • To keep that identifier private while the data and metadata evolve, and decide (maybe years) later, to publish or discard it.
  • To retain that identifier upon publication, perhaps then assigning an additional identifier, such as a DOI.
  • To integrate with the Data Citation Index ℠ and ORCID.org researcher profiles.
  • To link identifiers to different kinds of nuanced persistence commitments.
  • To be able to add queries (eg, ?lang=en) when resolving my identifiers.
  • To use open infrastructure consistent with my organization's values.
  • To create one identifier that enables millions (suffix passthrough).

What does ARK have in common with DOI, Handle, PURL, and URN?

These are the major persistent identifier types (or schemes). They have all been around since 2001 and they have much in common, starting with structure.

 https://n2t.net/ark:/99999/12345

   https://doi.org/10.99999/12345

https://handle.net/10.99999/12345

     https://purl.org/99999/12345

https://<various>/urn:99999:12345

As seen in these examples, they all have three parts:

  1. the protocol (https://) plus a hostname,
  2. just for ARK and URN, there's also a label ("ark:" or "urn:"),
  3. the name assigning authority (99999, 10.99999, or purl.org/99999), which is the organization that created a particular identifier,
  4. and finally, the name, or local identifier, that it assigned (12345). 

And they all have little effect on persistence.

Wait, are you saying ARK, DOI, Handle, PURL, and URN are useless?

No, that's too strong a statement. But let's keep these identifier schemes (types) in perspective.

  • They all fail to stop the major causes of broken links: loss of funding, natural disaster, war, deliberate removal, human error, and provider neglect.
  • They all require you, the end provider, to update forwarding tables as URLs change.
  • They all identify content that is subject to change or removal on future visits.
  • They all have identifiers that break regularly and in large numbers – many thousands and more.
  • They all give access to almost any kind of thing, whether digital, physical, abstract, person, group, etc.
  • A non-trivial fraction of each scheme's identifiers did, and will, fail permanently, requiring forwarding to "tombstone" pages.
  • They all rely on ordinary redirection built in to web servers since 1994 and provided for free by hundreds of URL shortening services.

Given how little the schemes do for you, when choosing one you'll likely want to consider factors such as cost, risk, and openness.

How do ARKs differ from identifiers like DOIs, Handles, PURLs, and URNs?

The short answer is that ARKs are the only mainstream, non-siloed, non-paywalled identifiers that you can register to use in about 48 hours. DOIs, Handles, and PURLs require resolution and other services to come from their respective centralized systems (silos). 

That's not to say that persistence is free. Making any identifier persistent burdens you, the provider, with the costs of content management, hosting, monitoring, and forwarding. You can do those things yourself or with help from a vendor. But with ARKs, just as with URLs, you will not be charged separately for your identifiers and you will not be locked in to a special-purpose resolution silo that also locks out other identifiers.

ARKs are unusual in being decentralized. While one can get resolution services from a global ARK resolver called n2t.net, over 90% of the ARKs in the world are published without reference to it. More than 500 registered organizations across the world have created an estimated 3.2 billion ARKs, and, as with URLs, no one has ever paid to create them. Of course maintaining them isn't free. It is never without cost to keep content access persistent in the long term, regardless of identifier type.

How else do the identifier types differ?

Here are some more differences between DOIs, Handles, PURLs, and URNs.

  • All things eventually pass, including hostnames and the web itself and the "https://" protocol. When that first part of the identifier ceases to have meaning, only ARKs and URNs will include the label (eg, "ark:") indicating the type of identifier that remains.
  • For DOIs, Handles, and PURLs, you are required to use their respective resolvers. ARKs and URNs, permit you to use your own resolver.
  • To create DOIs and Handles, you are required to pay a membership fee and, for DOIs, per-DOI charges. There are no fees for ARKs, PURLs, and URNs.
  • To create Handles, you are required to install and maintain a local Handle server, which gives you another system to monitor, patch, and troubleshoot.
  • Although you can use a local or vendor resolver for your ARKs and URNs, ARKs can be resolved via the global n2t.net resolver.
  • The envisioned URN resolution infrastructure was never built, so URNs are currently resolved as URLs, and there is no designated global URN-as-URL resolver. In order to register to create URNs, you must apply for a URN namespace.
  • Unlike DOIs and Handles, ARKs don't have metadata requirements. ARKs that haven't been released into the world are easy to delete.

For the purpose of supporting early object development, some distinguishing features of ARKs are that they can be deleted, they can exist with no metadata, and they can exist with any metadata you care to store.

Why are ARKs that haven't been "released into the world" easy to delete?

If no one knows about an identifier but you, there's no harm in deleting or withdrawing it. Stepping back, an identifier is actually an assertion that a given string of characters is associated with specific thing. The fewer people you tell, the easier it is to scrap that assertion. If you create a URL and share it only with your closest colleagues on an internal network, that is much easier to withdraw than if the URL appeared for a month on a public website where it was harvested by internet search engines. In contrast, it is hard to delete DOIs and Handles; once registered and made resolvable, they are essentially released to the world.

ARKs behave like URLs in this respect. Providers are free to create and share ARKs narrowly, in which case they're easy to delete. Even if shared more broadly, ARKs can come with persistence statements that tell you how much or how little commitment is made to them. ARKs were designed to articulate a variety of persistence statements. Persistent identifiers of other types exhibit a similar variety of commitment "flavors".

Finally, people make mistakes. ARKs, DOIs, Handles, PURLs, and URNs get broadcast in error and sometimes need to be withdrawn. When that happens, provider best practice is make a withdrawn identifier resolve to a page that explains and perhaps apologizes for the inconvenience. Despite what you may have heard, persistent identifiers are not guaranteed.

If ARKs don't require it, why would I bother to create metadata?

There are several key benefits to having metadata, which is strongly recommended for all ARKs that are mature (no longer under development) and released into the world. The standard way to retrieve the metadata is to resolve the ARK with a '?' or '??' appended to it.

First, no matter what the ARK redirects to, whether landing page or PDF file, metadata gives the user further information about the object, such as a description, other versions, etc. In contrast, so as make sure metadata is available, DOIs are required to redirect to landing pages (consistent with common publisher practice) and prohibited from linking directly to object content.

Second, it shores up the persistence of the binding between the ARK string and the identified object. The primary assertion of the binding is the experience of resolving the identifier to the object. A secondary, confirming assertion of the binding is the experience of resolving to its metadata. Any discrepancy between the the metadata and the object helps to detect changes in that binding.

Third, providing access to metadata demonstrates basic provider commitment. This adds credibility to the seriousness of its intentions since not every object provider can return object metadata. Finally, adding basic metadata, especially for objects that don't have textual representations, makes your objects more findable. 

What metadata is recommended for ARKs? xxx

ARKs were always meant to cover an enormous range of object types, xxx Dublin Core Kernel metadata.

What is a NAAN and how do I change it? xxx

NAAN stands for Name Assigning Authority Number. It is a 5-digit number that identifies an individual or organization that can assign its own ARKs.

Obtaining a NAAN is the only condition before you can start assigning ARKs. NAANs are freely granted by any individual or organization that request a NAAN by filling this form: https://n2t.net/e/naan_request.

NAANs ensure that ARKs don't collide at the global level, that is, that two separate organizations don't assign identical identifiers to two different things. In other words, NAANs are the cornerstone of global ARK unicity.

All NAANs are documented in a public NAAN registry.

What is meant by ARKs supporting early object development?

We need identifiers long before we know exactly what they refer to, or even if they refer to anything useful. An identifier that requires mature metadata cannot be used during early object development since little is known about the object. So object creators almost always initially assign identifiers that have no metadata requirements, such as URLs or ARKs.

If you start with an ARK, you benefit from being able to keep the original identifier through to public release as the metadata matures. Many objects go through intensive development and revision phases, sometimes lasting years, during which they are too immature to meet most metadata requirements. Nonetheless every object needs some sort of identifier from conception to maturity, where maturity could look like public release and further enhancement, or abandonment. It is easy to abandon ARKs that have not been released into the world.

Like the object itself, metadata elements need a flexible place to grow and mature over time:

  • starting in the planning phase, when it just needs an identifier,
  • at the moment of birth, when its first digital representation needs a redirection target URL,
  • after the first analysis, when its significance and a tentative title emerges,
  • when creating dozens of discipline-specific metadata elements that violate most metadata standards except your own,
  • during post-processing by a colleague whose name will be added as a creator,
  • when early feedback based on the tweeted identifier turns up a key insight and a new contributor,
  • and so forth, through public release, correction, revision, enhancement, etc.

Can an object have both an ARK and a DOI?

Yes. As mentioned above regarding early object development, if you start with an ARK, you benefit from being able to keep that original identifier through to public release as the metadata matures. For some of the objects you save you may also want a DOI.

Assuming you wish to maintain both the ARK and DOI identifiers (to avoid breaking links that your collaborators had stored and bookmarked), an easy way forward is to set up the DOI to redirect to the ARK. In this way you can ensure correct resolution of both identifiers while only having to maintain the ARK.

If ARKs can be deleted, how can they be trusted?

Only one resolver, n2t.net, supports all of these features, and it does so for any identifier stored with appropriate metadata. Contrary to popular belief, identifiers don't do anything – it's their resolvers that do or don't support these features. For example, suffix passthrough is a feature supported by n2t.net (and purl.org has something similar called "partial redirect"), but not by doi.org or handle.net.

By metadata flexibility is meant the ability to store any metadata you want, including repeated elements, such as multiple authors and forwarding URLs, or no metadata at all. N2T has full metadata flexibility, while Crossref and DataCite have specific requirements (eg, the DataCite schema) to create their DOIs.

What is an ARK "inflection" and how does it differ from "content negotiation"?

An inflection is a change to the ending of a word to express a shift in meaning, and it permits us to define a word such as "go" without also defining "goes" and "going". For an ARK that gets to an object, simply adding a '?' to the end (an example of an ARK inflection) permits us to request metadata without having to define a separate identifier for the object's metadata. This technique is simple enough to be used by humans using a web browser. The N2T resolver supports both inflections and content negotiation.

Content negotiation is a software technique for requesting alternate formats of an object, such as the PDF or RTF form of an HTML file. Although not designed for it, content negotiation has been twisted in certain contexts to request metadata under the kludgy assumption that formats often used to hold metadata are in fact metadata. Unlike inflections, "content negotiation for metadata" doesn't work at all for objects that have legitimate representations in those formats (the list of which is growing and known only by private agreement), nor can it be used directly by humans.

Although inflections are commonly associated with ARKs, they are not "owned" by ARKs. In fact, contrary to popular belief, identifiers don't do anything – it's their resolvers that do or don't support such features. So for example, inflections and suffix passthrough are supported by n2t.net for all identifier types, but not by doi.org or handle.net for any identifier types.

When should I use ARKs compared to DOIs, Handles, PURLs, or URNs?

There are no simple answers. Identifiers (not things, but their names) are tricky to talk about, so if you hear simple answers elsewhere, beware of common fallacies.

Nothing inherent in ARKs, DOIs, Handles, PURLs, or URNs makes them more or less fit for any particular field, domain, or sector. With an identifier resolver and administrative management service, they all provide the core service of resolution (and so do properly managed URLs). 

Generalizations about identifier types sometimes apply when resolution and management for that type is locked into one particular vendor or provider. For example, many PURL and Handle features and restrictions are well-defined by their respective administration silos. DOIs, which are built on top of Handles, have the same resolution features and restrictions as Handles, but metadata practices are diverse and evolving across registration agencies. 

The concrete differences that we experience, such as metadata, landing pages, and tool integration (eg, publishing tools), are not properties of identifier schemes per se, but properties of resolution, management, and citation services that various providers extend to or withhold from different identifier types. Those services are shaped in turn by communities of practice and by markets. Basic services are founded on a reliable database storing each identifier along with metadata elements (creator, title, date, redirection URL, etc) that describe the identified object. Extra services include link checking, duplicate detection, report generation, and searching.

What are usage trends for ARKs, DOIs, Handles, PURLs, and URNs?

As of 2019, purely on an incomplete and anecdotal level, here are a few trends that have been observed.

  • ARKs have seen broad adoption in cultural memory institutions – museums, archives, and libraries. There is strong adoption in France and francophone regions.
  • DOIs until recently have mostly been known as reliable identifiers for scientific and scholarly literature, when in fact this applies to a subset of DOIs assigned via Crossref. What it means to be a DOI is becoming harder to pin down because DOIs are being assigned to datasets, data management plans, field stations, etc. via DataCite, as well as to movies (eg, "Kung Fu Panda") via EIDRHaving said that, Crossref and DataCite DOIs have been successful in creating tools and services for scholarly publishers.
  • PURLs have seen lots of use in identifying metadata vocabulary and ontology terms.

If most ARKs run on their own resolvers, why is there also a global resolver for ARKs?

Most ARKs are created by organizations that advertise ("publish") them based at their own resolvers. For example, this ARK was published based at the gallica.bnf.fr resolver:

     https://gallica.bnf.fr/ark:/12148/btv1b8449691v/f29

Having to run and maintain your own resolver is the cost of complete autonomy. Using your own resolver also lets you do branding via the hostname, the downside being that brands are transient and tend to make identifiers fragile. Political and even legal (eg, trademarks) pressures may make supporting older branded hostnames, hence their identifiers, difficult.

That's another reason for having the global ARK resolver. People coming across a broken identifier in the future may find its hostname no longer exists, and if it's an ARK they can extract the core identity (starting with "ark:") and present it to the global n2t.net resolver, as in

            https://n2t.net/ark:/12148/btv1b8449691v/f29

To avoid the risk of future inconvenience, an organization – even one that runs its own resolver – may choose from the outset to advertise ("publish") its ARKs based at n2t.net.

Why doesn't the global ARK resolver (n2t.net) have the word "ARK" in it?

When demand for a global ARK resolver arose, basic principles of openness and generality prevented the designers from creating yet another silo in the DOI/Handle/PURL mold. Instead, the ARK resolver was built to be a generic, scheme-agnostic resolver called N2T (Name-to-Thing), which now resolves over 600 types of identifier, including ARKs, DOIs, Handles, PURLs, URNs, ORCIDs, ISSNs, etc. Resolution is essentially looking in a table for an identifier string, regardless of type, and redirecting it to the right place.

The same basic principles guided the design of an earlier tool called noid, which was built for ARKs but is also regularly used by organizations that mint Handles.

What do you mean by silos?

Typically, scheme-based services are designed as silos, or closed platforms, serving a particular identifier type such as Handle, DOI, or PURL. Each silo performs the same main functions – mapping names (identifiers strings) to things (objects or metadata). Excluding all but one type of identifier string may help to capture markets, but it's wasteful and non-inclusive. It requires building the same set of services over and over for each type and violates basic principles of openness.

In contrast the N2T (Name-to-Thing) resolver and EZID (identifiers made easy) management interface were designed to work with all identifiers. Effort put into any new feature can be efficiently leveraged across all types, which sometimes creates surprising flexibility. For example, ARKs are often stored in EZID with "DOI metadata", and every DOI stored in N2T can benefit from "ARK resolution features" such as inflections and suffix passthrough, which are not available via the main DOI resolver (doi.org).

I've heard of ORCIDs, RORs, and UUIDs – where do they fit in?

Those are special kinds of persistent identifiers. ORCIDs (Open Researcher and Contributor Identifiers) only identify researchers, and they link to research works using ARKs, DOIs, etc. ORCIDs look like

     https://orcid.org/0000-0001-7604-8041

ROR (Research Organization Registry) identifiers designate organizations. For example, here's the California Digital Library:

     https://ror.org/03yrm5c26

UUIDs are globally unique, 37-character strings that are easy for software to generate but only become usable as web addresses when made part of a URL, for example, in this ARK:

           https://somehost.example.com/3c2e39526-e0c3-41ae-be4f-07558a9458eb

While embedding a UUID in an ordinary URL makes it actionable ("clickable"), you could expect more if it were embedded in an ARK such as

           https://n2t.net/ark:/65665/3c2e39526-e0c3-41ae-be4f-07558a9458eb

As an ARK, for example, that UUID should return metadata (if available) and be insensitive to the hyphens, making this form equally viable:

     https://n2t.net/ark:/65665/3c2e39526e0c341aebe4f07558a9458eb


  • No labels

2 Comments

  1. Some really tiny fussy comments on this enormous work (smile)

    Why would I use ARKs compared to, for example, DOIs?

    I wonder if the bullet point "To integrate with the Data citation Index and ORCID.org researcher profiles" is not related to services like EZID rather that ARKs themselves?

    Another question: should the PURL example be https://purl.org/12345 or https://purl.org/99999/12345 ? PURL partitions its namespace to avoid collisions and ensure redirects, e.g. dc for dublin core properties, so I am wondering whether the name assigning authority is purl.org or the "dc" part? (But being no PURL expert I might get it wrong here)

  2. Did you see question 6 about ARKs vs DOIs? Perhaps it's not enough in-depth?

    The point about the Data Citation Index and ORCID.org is that ARKs are showing up in parts of the citation ecosystem and have appeared in both places. For example, I've entered ARKs via the ORCID drop-down menu (which lists ARK as a choice). It happens that EZID was the source for the ARKs I know about in the Data Citation Index, but they have no agreement with EZID and harvested from EZID independently. I don't know how the DCI operates exactly.

    Yes, you're right about the PURL example, which I will change!