Ensuring your DSpace is indexed

Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant share (and in many cases a majority) of visitors arrive via a search engine such as Google or Yahoo. Hence, to help maximize the impact of content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.

DSpace comes with tools that ensure major search engines (Google, Bing, Yahoo, Google Scholar) are able to easily and effectively index all your content. However, many of these tools require some basic setup. Here's how to ensure your site is indexed.

For optimum indexing, you should:

Keep your DSpace up to date. We are constantly adding new indexing improvements in new releases.
Ensure your DSpace is visible to search engines.
Enable the sitemaps feature – this does not require e.g. registering with Google Webmaster tools.
Ensure your robots.txt allows access to item "splash" pages and full text.
Ensure item metadata appears in HTML headers correctly.

As an aside, it's worth noting that OAI-PMH is generally not useful to search engines. OAI-PMH has its own uses, but do not expect search engines to use it.

Keep your DSpace up to date

We are constantly adding new indexing improvements to DSpace. In order to ensure your site gets all of these improvements, you should strive to keep it up-to-date. For example:

As of DSpace 1.7, DSpace has improved how its Item-level metadata is made available to Google Scholar. For the 1.7.0 release, the DSpace Developers worked directly with the Google Scholar developers, to ensure DSpace is generating the "citation_*" HTML "<meta>" tags (i.e. Highwire Press tags) that Google Scholar recommends in their Indexing Guidelines.
As of DSpace 1.5, DSpace has support for sitemaps (both simple HTML pages of links, as well as the sitemaps.org protocol). It also includes item metadata in the HTML HEAD element of item display pages, ensuring that the metadata can be effectively indexed no matter what changes you might have made to your DSpace's layout or style.
As of DSpace 1.4, DSpace has support for the "if-modified-since" HTTP header. This basically means that if an item (or bitstream therein) has not changed since the last time a search engine's crawler indexed it, that item/bitstream does not have to be re-retrieved, sparing your server.

Additional minor improvements and bug fixes have been made in more recent releases of DSpace.
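To illustrate the "if-modified-since" behaviour described above, here is a minimal sketch of the decision a server makes for a conditional request. It is not DSpace's own code: the function name `status_for_conditional_get` and its parameters are hypothetical, and the sketch only assumes that the crawler sends a standard HTTP date in the header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def status_for_conditional_get(if_modified_since, last_modified):
    """Return 304 if the resource is unchanged since the crawler's copy, else 200.

    if_modified_since: raw header value sent by the crawler, or None.
    last_modified: timezone-aware datetime of the item's last change.
    """
    if not if_modified_since:
        return 200  # no conditional header: send the full response
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # unchanged: the crawler can reuse its copy
    return 200
```

A 304 response carries no body, which is what spares the server from re-sending an unchanged item or bitstream.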

Ensure your DSpace is visible to search engines

First ensure your DSpace instance is visible, e.g. with: https://www.google.com/webmasters/tools/sitestatus

If your site is not indexed at all, all search engines have a way to add your URL, e.g.:

Google: http://www.google.com/addurl
Yahoo: http://siteexplorer.search.yahoo.com/submit
Bing: http://www.bing.com/docs/submit.aspx

Enable the sitemaps feature

DSpace provides a sitemap feature that we highly recommend you enable to ensure proper indexing. Sitemaps allow DSpace to expose its content in a way that makes it easily accessible to search engine crawlers. Sitemaps also help ensure that crawlers do NOT have to visit every page in your DSpace (which means the crawlers can get in and get out quickly, without taxing your site). Without sitemaps, search engine indexing activity may impose significant loads on your repository.

HTML sitemaps provide a list of all items, collections and communities in HTML format, whilst Google sitemaps provide the same information in gzipped XML format.

To enable sitemaps, all you need to do is run [dspace]/bin/dspace generate-sitemaps once a day.

Just set up a cron job (or scheduled task in Windows), e.g. (cron):

# Regenerate sitemaps at 6:00 AM local time each morning
0 6 * * * [dspace]/bin/dspace generate-sitemaps

Once you've enabled your sitemaps, they will be accessible at the following URLs:

HTML Sitemaps: [dspace.url]/htmlmap
Google (XML) Sitemaps: [dspace.url]/sitemap

So, for example, if you have "dspace.url = http://mysite.org/xmlui" in your "dspace.cfg" configuration file, then the HTML Sitemaps would be at: "http://mysite.org/xmlui/htmlmap".
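The URL rule above can be sketched in a few lines. The helper name `sitemap_urls` is hypothetical; it simply appends the two sitemap paths to whatever value you have set for "dspace.url".

```python
def sitemap_urls(dspace_url):
    """Build the HTML and XML sitemap URLs from the dspace.url setting."""
    base = dspace_url.rstrip("/")  # tolerate a trailing slash in the setting
    return {
        "html": base + "/htmlmap",  # HTML sitemaps
        "xml": base + "/sitemap",   # Google (XML) sitemaps
    }


if __name__ == "__main__":
    for kind, url in sitemap_urls("http://mysite.org/xmlui").items():
        print(kind, url)
```

Requesting both URLs in a browser after the first run of generate-sitemaps is a quick way to confirm that the feature is working.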

Make your sitemap discoverable to search engines

Even if you've enabled your sitemaps, search engines may not be able to find them unless you provide them with a link. There are two main ways to notify a search engine of your sitemaps:

Provide a hidden link to the sitemaps in your DSpace's homepage. If you've customized your site's look and feel (as most have), ensure that there is a link to /htmlmap in your DSpace's front or home page. By default, both the JSPUI and XMLUI provide this link in the footer:
```
<a href="/htmlmap"></a>
```
Announce your sitemap in your robots.txt. Most major search engines will also automatically discover your sitemap if you announce it in your robots.txt file. For example:
```
Sitemap: http://my.dspace.url/htmlmap
```
1. NOTE that you need to replace "http://my.dspace.url" above with the full URL of your DSpace instance (this should correspond to the "dspace.url" setting in your dspace.cfg file).
2. This "Sitemap:" line can be placed anywhere in your robots.txt file. For more information, see: http://www.sitemaps.org/protocol.html#informing

Search engines will now look at /htmlmap, which serves one or more pre-generated HTML files linking directly to items, collections and communities in your DSpace instance. Because these files are pre-generated, serving them has minimal impact on your hardware. Crawlers will not have to work their way through any browse screens, which are intended more for human consumption and are more expensive for the server.
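If you announce the sitemap in robots.txt, you can check that the "Sitemap:" line is actually picked up by a parser. The sketch below uses Python's standard urllib.robotparser module (its site_maps() method needs Python 3.8 or later). The function name `announced_sitemaps` is hypothetical, and the robots.txt content is passed in as text so that no network access is assumed.

```python
from urllib.robotparser import RobotFileParser


def announced_sitemaps(robots_txt):
    """Return the list of sitemap URLs announced in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no "Sitemap:" line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    example = (
        "User-agent: *\n"
        "Disallow: /discover\n"
        "Sitemap: http://my.dspace.url/htmlmap\n"
    )
    print(announced_sitemaps(example))
```

An empty list means the file does not announce a sitemap, so crawlers would have to rely on the hidden link instead.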

Create a good robots.txt

The trick here is to minimize load on your server, but without actually blocking anything vital for indexing. Search engines need to be able to index item, collection and community pages, and all bitstreams within items – full-text access is critically important for effective indexing, e.g. for citation analysis as well as the usual keyword searching.

If you have restricted content on your site, search engines will not be able to access it; they access all pages as an anonymous user.

Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).

DSpace 1.5 and 1.5.1 ship with a bad robots.txt file. Delete it, or at least remove the line that says Disallow: /browse. If you do not, your site will not be correctly indexed.

NEVER BLOCK THESE PATHS

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

/bitstream
/browse (UNLESS USING SITEMAPS)
/*/browse (UNLESS USING SITEMAPS)
/browse-date (UNLESS USING SITEMAPS)
/*/browse-date (UNLESS USING SITEMAPS)
/community-list (UNLESS USING SITEMAPS)
/handle
/html
/htmlmap
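One way to guard against blocking these paths by accident is to test a robots.txt file against them before deploying it. The following is a rough sketch using Python's standard urllib.robotparser. The names `REQUIRED_PATHS` and `blocked_required_paths` are hypothetical. Only the unconditional paths from the list above are checked: the "UNLESS USING SITEMAPS" entries depend on your setup, and the standard parser matches rules by simple prefix, so the wildcard forms such as /*/browse are left out.

```python
from urllib.robotparser import RobotFileParser

# Paths from the list above that must always stay reachable.
REQUIRED_PATHS = ["/bitstream", "/handle", "/html", "/htmlmap"]


def blocked_required_paths(robots_txt, paths=REQUIRED_PATHS, user_agent="*"):
    """Return the required paths that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


if __name__ == "__main__":
    bad = "User-agent: *\nDisallow: /handle\n"
    print(blocked_required_paths(bad))
```

If your DSpace instance is served from a sub-path such as /dspace, the paths passed in would need the same prefix.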

Example good robots.txt

Below is an example good robots.txt. The highly recommended settings are uncommented. Additional, optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.

User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover 
Disallow: /search-filter

# This should be the FULL URL to your HTML Sitemap.  
# Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
Sitemap: [dspace.url]/htmlmap

# If you have configured DSpace (Solr-based) Statistics to be publicly accessible,
# then you likely do not want this content to be indexed
# Disallow: /displaystats

# Uncomment the following line ONLY if sitemaps.org or HTML sitemaps are used
# and you have verified that your site is being indexed correctly.
# Disallow: /browse

# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content:
# Disallow: /advanced-search
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# Disallow: /search

Ensure Item Metadata appears in the HTML HEAD

It's possible to greatly customize the look and feel of your DSpace, which makes it harder for search engines, and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace (both XMLUI and JSPUI) includes item metadata in the <head> element of each item's HTML display page.

<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />

If you have heavily customized your metadata fields away from Dublin Core, you can modify the crosswalk that generates these elements by modifying [dspace]/config/crosswalks/xhtml-head-item.properties.

Google Scholar Metadata in HTML HEAD

In addition to Dublin Core <meta> tags in the HTML HEAD, DSpace also includes Google Scholar specific metadata fields in each item's HTML display page.

<meta content="Tansley, Robert; Donohue, Timothy" name="citation_authors" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />

These meta tags are the "Highwire Press tags" which Google Scholar recommends. If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, they are configurable in [dspace]/config/crosswalks/google-metadata.properties.

Much more information is available in the Configuration section on Google Scholar Metadata Mappings.
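To confirm that an item page really carries these tags after you have customized your theme, you can extract them from the page's HTML. The sketch below uses only Python's standard html.parser. The names `MetaTagCollector` and `item_metadata` are hypothetical, and the prefixes are taken from the examples shown above ("DC.", "DCTERMS." and "citation_"). The HTML is passed in as a string, so fetching the page is left to you.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed through this method.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.tags.append((name, content))


def item_metadata(html, prefixes=("DC.", "DCTERMS.", "citation_")):
    """Return the metadata <meta> tags found in an item page's HTML."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return [(n, c) for n, c in collector.tags if n.startswith(prefixes)]


if __name__ == "__main__":
    page = '<head><meta name="DC.type" content="Article" /></head>'
    print(item_metadata(page))
```

An empty result for an item page suggests that a theme customization has dropped the metadata from the HTML HEAD.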

In general, OAI-PMH is not useful to Search Engines

Feel free to support OAI-PMH, but be aware that in general it is not useful for search engines:

There is no reliable way to determine the OAI-PMH base URL for a DSpace site.
There is no standard or predictable way to get to the item display page or full text from an OAI-PMH record, which makes effective indexing and presenting meaningful results difficult.
In most cases OAI-PMH provides access only to simple Dublin Core, a subset of the available metadata.

NOTE: Back in 2008, Google officially announced they were retiring support for OAI-PMH based Sitemaps. So, OAI-PMH will no longer help you get better indexing through Google. Instead, you should be using the DSpace 'generate-sitemaps' feature described above.
