Islandora can provide a robust and full-featured institutional repository solution. The following documentation inventories the range of modules and configurations that enable core institutional repository functionality in Islandora.
The Islandora Scholar Solution Pack is a suite of modules that provides the foundation for building your institutional repository in Islandora. The modules that comprise this solution pack:
A DOI can be added to a metadata entry/editing form as an identifier field (see Islandora XML Form Builder). A DOI batch ingest module allows a user to upload a .txt file of comma-separated DOIs or compile a list of DOIs in a GUI for ingest. Some community members have built custom workflows that mint DOIs when a research object is published or embargoed.
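Before uploading a .txt file of comma-separated DOIs, it can help to check the list for obvious typos. The following is a minimal sketch, not part of any Islandora module: the function name `parse_doi_list` and the loose DOI pattern are assumptions made for illustration, and the sample DOIs are placeholders.

```python
import re

# Loose pattern (an assumption): "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def parse_doi_list(text):
    """Split comma-separated DOIs and sort them into valid and rejected lists."""
    valid, rejected = [], []
    # Treat line breaks like commas so multi-line files also work.
    for raw in text.replace("\n", ",").split(","):
        doi = raw.strip()
        if not doi:
            continue
        # Strip a resolver prefix if someone pasted a full DOI URL.
        doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
        (valid if DOI_PATTERN.match(doi) else rejected).append(doi)
    return valid, rejected


sample = "10.1000/xyz123, https://doi.org/10.5555/12345678,\nnot-a-doi"
valid, rejected = parse_doi_list(sample)
print("valid:", valid)
print("rejected:", rejected)
```

Entries that land in the rejected list can be corrected by hand before the file is handed to the batch ingest.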
The Islandora Scholar Profiles module uses Person and Organization entities to support the creation of individual author profiles. Profiles can display biographical information drawn from MADS metadata and a list of all items in the repository that have an association with the author. The module also automatically generates a Scholar’s Directory page that displays a searchable, sortable, filterable list of all authors in the repository. The Scholar Profiles Module:
The ISLE Matomo Docker image and Islandora Matomo module and optional plug-in capture usage statistics and automatically generate author dashboards and real-time usage maps. Matomo can be configured to collect information for individual authors, departments, publications or other segments of your repository. The module also generates automatic, monthly readership reports that can be sent to authors via email.
The Islandora Scholar suite of modules provides support for theses, dissertations, and citations. ETDs can be batch ingested via DOI, PMID, EndNote, RDF, etc. The Islandora Webform Module supports self-deposit workflows.
The Islandora Metadata Export module automatically creates an embeddable block with metadata downloads in a range of formats, including BibTeX, EndNote, MARC, MARCXML, Dublin Core, RIS, and JSON.
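To give a sense of what one of these export formats contains, here is a minimal sketch that serializes a small metadata record as RIS. It is not the Metadata Export module's code or its exact output; the `to_ris` function and the shape of the input dictionary are assumptions for illustration.

```python
def to_ris(record):
    """Serialize a simple metadata dict as one RIS record."""
    # RIS lines are a two-letter tag, two spaces, a hyphen, a space, then the value.
    lines = ["TY  - " + record.get("type", "GEN")]
    for author in record.get("authors", []):
        lines.append("AU  - " + author)
    if "title" in record:
        lines.append("TI  - " + record["title"])
    if "year" in record:
        lines.append("PY  - " + str(record["year"]))
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines)


print(to_ris({
    "type": "JOUR",
    "authors": ["Tansley, Robert", "Donohue, Tim"],
    "title": "Ensuring your DSpace is indexed",
    "year": 2018,
}))
```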
Search engines like Google, Google Scholar, and Bing likely drive significant traffic to your institutional repository. To return relevant search results, search engines crawl and index site content. Repository administrators can implement a few key practices to ensure that search engines index their sites properly. Proper indexing improves your repository's visibility on the web, drives more traffic to your site, and encourages future deposits. The following practices promote efficient site indexing, making your content more discoverable to crawlers while reducing strain on your site.
Google Scholar requires a browse interface that allows its search robots to discover the URLs of articles in your repository. The Islandora Scholar module automatically generates the following views of your content, as recommended by Google Scholar.
Google Scholar recommends that the URL of every article is reachable from the homepage by following at most ten simple HTML links. Some institutions report success with placing hidden links to the above landing pages on prominent pages in their repository.
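One way to check the ten-link recommendation is to compute the click depth of each article page from the homepage. The sketch below runs a breadth-first search over a link graph that you would have to build yourself (for example from a crawl of your own site); the function names and the sample paths are hypothetical.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: fewest link clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links, home, articles, limit=10):
    """Return articles that are unreachable or more than `limit` clicks from home."""
    depths = click_depths(links, home)
    return [a for a in articles if depths.get(a, limit + 1) > limit]


# Hypothetical link graph: page path -> paths it links to.
site = {
    "/": ["/browse-by-year"],
    "/browse-by-year": ["/year/2018"],
    "/year/2018": ["/article/1"],
}
print(click_depths(site, "/"))
print(too_deep(site, "/", ["/article/1", "/article/orphan"]))
```

Pages reported by `too_deep` are candidates for an extra browse link closer to the homepage.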
The Scholar solution pack also adds meta tags to citation and thesis object pages so that they can be crawled by Googlebot. These tags are also recognized by Zotero.
Customizing your repository's look and feel can make it harder for search engines (and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank) to correctly identify item metadata fields. To address this, the Islandora Google Scholar Module automatically places item metadata in the <head> element of each item's HTML display page.
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />
If you have heavily customized your metadata fields away from Dublin Core, you can modify the crosswalk that generates these elements by modifying [islandora]/config/crosswalks/xhtml-head-item.properties.
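Conceptually, a crosswalk of this kind maps local metadata fields to meta tag names. The following sketch illustrates the idea only; it does not reproduce the format of the properties file, and the `CROSSWALK` mapping and `head_meta_tags` function are hypothetical.

```python
from html import escape

# Hypothetical crosswalk: local field name -> meta tag name.
CROSSWALK = {
    "type": "DC.type",
    "contributor": "DCTERMS.contributor",
}


def head_meta_tags(item):
    """Render one <meta> tag per value of every mapped field."""
    tags = []
    for field, meta_name in CROSSWALK.items():
        for value in item.get(field, []):
            # Escape quotes and angle brackets so values cannot break the attribute.
            tags.append('<meta name="%s" content="%s" />'
                        % (meta_name, escape(value, quote=True)))
    return tags


for tag in head_meta_tags({"type": ["Article"], "contributor": ["Tansley, Robert"]}):
    print(tag)
```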
Google Scholar requires publications to have title, at least one author, and a publication date for inclusion. The Islandora Google Scholar Module includes these required fields in each item's HTML display page.
<meta content="Tansley, Robert" name="citation_author" />
<meta content="Donohue, Tim" name="citation_author" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />
<meta content="2018" name="citation_publication_date" />
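To spot-check that an item page carries the three fields Google Scholar requires, you can scan its HTML for the corresponding meta tags. This is a minimal sketch using Python's standard `html.parser`; the `missing_citation_tags` helper is hypothetical, and fetching the page is left out so the example stays self-contained.

```python
from html.parser import HTMLParser

REQUIRED = ("citation_title", "citation_author", "citation_publication_date")


class MetaCollector(HTMLParser):
    """Collect the names of all <meta> tags that have non-empty content."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content"):
                self.names.add(attrs["name"])


def missing_citation_tags(html):
    """Return the required citation_* tag names that the page lacks."""
    collector = MetaCollector()
    collector.feed(html)
    return [name for name in REQUIRED if name not in collector.names]


page = """<html><head>
<meta content="Tansley, Robert" name="citation_author" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />
</head><body></body></html>"""
print(missing_citation_tags(page))
```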
Tip:
The Islandora JSON-LD Module creates a standards-compliant JSON-LD record for an object based on a set of predefined MODS XPaths. Metadata structured as JSON-LD is increasingly important for dataset discovery through services like Google’s Dataset Search.
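As an illustration of the general approach, the sketch below reads a few fields from a MODS record and emits a schema.org-style JSON-LD object. The paths, the choice of the `Dataset` type, and the `mods_to_jsonld` function are assumptions made for this example; they are not the module's actual predefined XPaths or output.

```python
import json
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}


def mods_to_jsonld(mods_xml):
    """Map a handful of MODS fields to a schema.org JSON-LD dict (illustrative mapping)."""
    root = ET.fromstring(mods_xml)
    title = root.findtext("mods:titleInfo/mods:title", namespaces=NS)
    names = [n.text for n in root.findall("mods:name/mods:namePart", NS)]
    date = root.findtext("mods:originInfo/mods:dateIssued", namespaces=NS)
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "creator": [{"@type": "Person", "name": n} for n in names],
        "datePublished": date,
    }
    # Drop fields that were not present in the MODS record.
    return {key: value for key, value in record.items() if value}


sample = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example survey data</title></titleInfo>
  <name><namePart>Tansley, Robert</namePart></name>
  <originInfo><dateIssued>2018</dateIssued></originInfo>
</mods>"""
print(json.dumps(mods_to_jsonld(sample), indent=2))
```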
The XML Sitemap module allows Islandora to expose its content in a way that search engines can easily crawl. Sitemaps allow crawlers to index your site without having to visit every page in your repository, meaning they can do their work more quickly and efficiently. Without sitemaps, search engine indexing activity may significantly tax your repository. The Islandora XML Sitemap module works in conjunction with the Drupal XML Sitemap Custom module to automatically include Islandora objects in the Drupal sitemap. Detailed documentation on installing and configuring this module is available at https://wiki.lyrasis.org/display/ISLANDORA/Islandora+XML+Sitemap.
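For readers unfamiliar with the format, the sketch below builds a small sitemap document following the sitemaps.org protocol. The module generates sitemaps for you, so this is only an illustration of the structure; the `build_sitemap` function and the object URL are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a sitemap XML string from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # must be a full URL
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap([
    ("http://repo.foo.edu/islandora/object/example:1", "2018-01-01"),
    ("http://repo.foo.edu/islandora/object/example:2", None),
]))
```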
Even if you've enabled the Islandora XML Sitemap module, search engines may not locate your sitemaps unless you provide a direct link. There are two main options for directing search engines to your sitemaps.
1. Provide a hidden link to the sitemaps on your repository's homepage. If you've customized your site's theme, ensure that there is a link to /htmlmap on your front or home page.
<a href="/htmlmap"></a>
2. Announce your sitemap in your robots.txt. For example:
# The FULL URL for your sitemaps (HTML and XML)
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[islandora.url]" with the value of your 'islandora.url' setting in your islandora.cfg file.
Sitemap: [islandora.url]/sitemap.xml
Sitemap: [islandora.url]/htmlmap
Whichever option you choose, confirm that the sitemap was successfully submitted to Google via the Search Console (Crawl > Sitemap). Note that it may take a few days for Google to start crawling, and perhaps many days for it to finish, based on the priority Google assigns your content for indexing. If necessary, you can also manually submit a sitemap through the Search Console: visit the Sitemap Report and enter the relative URL for the sitemap. Google should process the sitemap immediately, but site indexing may take some time. Follow the same steps to resubmit a sitemap if there are significant changes to your site structure.
Tips:
"Sitemap:" lines can be placed anywhere in your robots.txt file. You can specify multiple "Sitemap:" lines, so that search engines can locate both the XML and HTML formats. For more information, see: http://www.sitemaps.org/protocol.html#informing
Use the full URL in each "Sitemap:" line. Relative paths are not supported.
New Islandora releases may include improvements to indexing tools.
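The full-URL requirement for "Sitemap:" lines is easy to verify with a short script. This is a minimal sketch under the assumption that you have the robots.txt content as a string; the `sitemap_problems` function is hypothetical.

```python
from urllib.parse import urlparse


def sitemap_problems(robots_txt):
    """Return (line_number, value) for Sitemap: lines that are not full http(s) URLs."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        key, _, value = line.partition(":")
        if key.strip().lower() != "sitemap":
            continue
        url = value.strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((number, url))
    return problems


robots = """User-agent: *
Sitemap: http://repo.foo.edu/sitemap.xml
Sitemap: /htmlmap
"""
print(sitemap_problems(robots))
```

An empty result means every "Sitemap:" line uses a full URL.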
Your site should include a robots.txt file, which indicates to search engine crawlers which pages or files they can request from your site. A robust robots.txt file must strike a balance between overloading your server with crawler requests and ensuring access to the content needed to comprehensively index your site. In the case of a repository, crawlers should be able to index item, collection and community pages, and all bitstreams within items. Crawlers access your site as an anonymous user; they will not be able to access restricted content.
Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/islandora/robots.txt. If your Islandora instance is served from e.g. http://repo.foo.edu/islandora/, you'll need to add /islandora to all the paths in the examples below (e.g. /islandora/browse-subject).
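The two rules above (robots.txt lives at the host root; rule paths carry the instance's base path) can be sketched as follows. Both helper functions are hypothetical and only use Python's standard URL utilities.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url):
    """robots.txt always lives at the root of the host, whatever the site's base path."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def prefix_rule_path(site_url, rule_path):
    """Prepend the site's base path (if any) to a robots.txt rule path."""
    base = urlsplit(site_url).path.rstrip("/")
    return base + rule_path


print(robots_txt_url("http://repo.foo.edu/islandora/"))
print(prefix_rule_path("http://repo.foo.edu/islandora/", "/browse-subject"))
print(prefix_rule_path("http://repo.foo.edu/", "/browse-subject"))
```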
You may wish to block crawlers from accessing some URLs in your repository that do not contain relevant information, such as log-in or registration pages or contact and feedback forms. However, blocking certain URLs can impede crawlers from properly indexing your site. Never put the following paths on "Disallow:" lines, or your repository might not be indexed properly:
Tips:
Paths in "Disallow:" lines are case sensitive.
Full-text access is critically important for effective indexing, enabling keyword searching and citation analysis among other functions.
The following example robots.txt includes highly recommended settings (uncommented) and additional optional settings (in comments). Your local configuration will determine whether you choose to enable optional settings. To do so, uncomment the corresponding "Disallow:" line.
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used: http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
User-agent: *
Crawl-delay: 10
# The FULL URL to the Islandora sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[islandora.url]" with the value of your 'islandora.url' setting in your islandora.cfg file.
Sitemap: [islandora.url]/sitemap.xml
Sitemap: [islandora.url]/htmlmap
# CSS, JS, Images
Allow: /misc/*.css$
Allow: /misc/*.css?
Allow: /misc/*.js$
Allow: /misc/*.js?
Allow: /misc/*.gif
Allow: /misc/*.jpg
Allow: /misc/*.jpeg
Allow: /misc/*.png
Allow: /modules/*.css$
Allow: /modules/*.css?
Allow: /modules/*.js$
Allow: /modules/*.js?
Allow: /modules/*.gif
Allow: /modules/*.jpg
Allow: /modules/*.jpeg
Allow: /modules/*.png
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /themes/*.css$
Allow: /themes/*.css?
Allow: /themes/*.js$
Allow: /themes/*.js?
Allow: /themes/*.gif
Allow: /themes/*.jpg
Allow: /themes/*.jpeg
Allow: /themes/*.png
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
# trailing slash sections courtesy of advice from https://www.volacci.com/blog/fix-problems-drupal-default-robotstxt-file
##########################
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
# Paths (clean URLs) – no trailing forward slash
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – no trailing forward slash
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
WARNING: For your additional "Disallow:" statements to be recognized under the "User-agent: *" group, they cannot be separated by blank lines from the declared "User-agent: *" block. A blank line indicates the start of a new user agent block. Without a leading user-agent declaration on the first line, blocks are ignored. Comment lines are allowed and will not break the user-agent block.
This is OK:
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
Disallow: /statistics
Disallow: /contact
This is not OK, as the two lines at the bottom will be completely ignored.
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter

Disallow: /statistics
Disallow: /contact
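The grouping rule in the warning above can be checked mechanically. The sketch below flags rule lines that follow a blank line without a new "User-agent:" declaration, which is the strict interpretation this page describes (individual crawlers may be more lenient). The `orphaned_rules` function is hypothetical.

```python
def orphaned_rules(robots_txt):
    """Return (line_number, text) for rule lines that sit outside any User-agent group."""
    orphans = []
    in_group = False
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        if not raw.strip():
            in_group = False  # a blank line closes the current group
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line: does not close the group
        field = line.split(":", 1)[0].strip().lower()
        if field == "user-agent":
            in_group = True
        elif field in ("allow", "disallow", "crawl-delay") and not in_group:
            orphans.append((number, line))
    return orphans


broken = """User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter

Disallow: /statistics
Disallow: /contact
"""
for number, rule in orphaned_rules(broken):
    print("line %d is outside any User-agent group: %s" % (number, rule))
```

Running the same check on a file with no blank line inside the group returns an empty list.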
Tips:
Some repositories use custom URL redirects in order to facilitate capturing web analytics (e.g., Google Analytics). While these URL redirects may seem harmless, they may be flagged as cloaking or spam by Google, Google Scholar and other major search engines. This may hurt your site's search engine ranking or even cause your entire site to be flagged for removal from the search engine. Make sure that you never redirect "direct file downloads" (i.e. users who directly jump to downloading a file, often from a search engine) to the associated Item's splash/landing page.
Development of the Islandora JSON-LD module, Islandora Matomo Docker image, Islandora Matomo module and real-time map plug-in, Islandora Scholar Profiles module, and Islandora Metadata Export module, as well as enhancements to the Islandora Webform module and the Islandora Scholar solution pack, was supported by a grant from the Andrew W. Mellon Foundation. To learn more about this work, visit the LASIR project page.
Parts of this documentation were adapted from the Samvera Community’s Digital Commons Feature Matrix.