By Robert Tansley, Google engineer and architect of DSpace 1.0

Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics) will notice that a significant (and in many cases a majority) of visitors arrive via a search engine such as Google or Yahoo. Hence, to help maximize the impact of content and thus encourage further deposits, it's important to ensure that your DSpace instance is indexed effectively. Here is how.

Very briefly:

DSpace 1.5 and 1.5.1 ship with a bad robots.txt. Remove the line that reads Disallow: /browse. If you do not, your site will not be correctly indexed.

Upgrade to the latest possible DSpace.
Ensure that your DSpace is visible to search engines.
Use the simple HTML sitemap feature – this does not require e.g. registering with Google Webmaster tools.
Ensure your robots.txt allows access to item "splash" pages and full text.
Ensure item metadata appears in HTML headers correctly.
Don't worry about OAI-PMH; it is not particularly useful for indexing. Really.

Contents

Upgrade to the latest possible DSpace.

As of DSpace 1.4, DSpace has support for the "if-modified-since" HTTP header. This basically means that if an item (or bitstream therein) has not changed since the last time a search engine's crawler indexed it, that item/bitstream does not have to be re-retrieved, sparing your server.

As of DSpace 1.5, DSpace has support for sitemaps (both simple HTML pages of links, as well as the sitemaps.org protocol). It also includes item metadata in the HTML HEAD element of item display pages, ensuring that the metadata can be effectively indexed no matter what changes you might have made to your DSpace's layout or style.

As of DSpace 1.7, DSpace has improved how its Item-level metadata is made available to Google Scholar. For the 1.7.0 release, the DSpace Developers worked directly with the Google Scholar developers, to ensure DSpace is generating the "citation_*" HTML "<meta>" tags (i.e. Highwire Press tags) that Google Scholar recommends in their Indexing Guidelines.

Ensure your content can be indexed

First ensure your DSpace instance is visible, e.g. with: https://www.google.com/webmasters/tools/sitestatus

If your site is not indexed at all, all search engines have a way to add your URL, e.g.:

Google: http://www.google.com/addurl
Yahoo: http://siteexplorer.search.yahoo.com/submit
Bing: http://www.bing.com/docs/submit.aspx

Add HTML Sitemap support.

This is as simple as running \[dspace\]/bin/generate-sitemaps once a day.

Just set up a cron job (or scheduled task in Windows), e.g. (cron):

{{0 6 * * * \[dspace\]/bin/generate-sitemaps}}

Also, if you've customized your site's look and feel (as most have), ensure that there is a link to \[dspace-url\]/htmlmap in your DSpace's front or home page. (This is present in the footer of the default template.) E.g.:

<a href="/htmlmap"></a>

Search engines will now look at /htmlmap, which (after generate-sitemaps has been run) serves one or more pre-generated (and thus served with minimal impact on your hardware) HTML files linking directly to items, collections and communities in your DSpace instance. Crawlers will not have to work their way through any browse screens, which are intended more for human consumption, and more expensive for the server.

Create a good robots.txt

DSpace 1.5 and 1.5.1 ship with a bad robots.txt file. Delete it, or specifically the line that says Disallow: /browse. If you do not, your site will not be correctly indexed.

The trick here is to minimize load on your server, but without actually blocking anything vital for indexing. Search engines need to be able to index item, collection and community pages, and all bitstreams within items – full-text access is critically important for effective indexing, e.g. for citation analysis as well as the usual keyword searching.

If you have restricted content on your site, search engines will not be able to access it; they access all pages as an anonymous user.

Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).

DO NOT BLOCK THESE

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

/bitstream
/browse UNLESS USING SITEMAPS
/*/browse UNLESS USING SITEMAPS
/browse-date UNLESS USING SITEMAPS
/*/browse-date UNLESS USING SITEMAPS
/community-list UNLESS USING SITEMAPS
/handle
/html
/htmlmap

Example good robots.txt

If you do not provide sitemaps, the only way for crawlers to reach content in this case is via the browse pages. You can use this to prevent crawlers hitting all but the date browse pages (the most reliable for indexing) in DSpace 1.4 or earlier. (Selective blocking is not possible in 1.5, but if you're on 1.5 you can use sitemaps.)

 User-agent: *
 Disallow: /browse-subject
 Disallow: /browse-author
 Disallow: /*/browse-subject
 Disallow: /*/browse-author

Metadata in the HTML HEAD

It's possible to greatly customize the look and feel of your DSpace, which makes it harder for search engines, and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace 1.5 includes item metadata in the HEAD element of each item's HTML display page.

If you're using DSpace 1.5 and the JSP UI, or 1.5.2 with either UI, you should already see this if you VIEW SOURCE in your browser on an item display page, e.g.:

 <meta name="DC.type" content="Article" />
 <meta name="DCTERMS.contributor" content="Tansley, Robert" />

If you don't see anything like this, ensure that the following is in your site's layout/header-default.jsp, within the <head> element:

 <% if (extraHeadData != null)
    { %>
    <%= extraHeadData %>
 <% } %>

If you have heavily customized your metadata fields away from Dublin Core, you can modify the crosswalk that generates these elements by modifying {{\[dspace\]/config/crosswalks/xhtml-head-item.properties}}.

In general, OAI-PMH isn't useful to Search Engines

Feel free to support OAI-PMH, but be aware that in general it is not useful for search engines:

No reliable way to determine OAI-PMH base URL for a DSpace site.
No standard or predictable way to get to item display page or full text from an OAI-PMH record, making effective indexing and presenting meaningful results difficult.
In most cases provides only access to simple Dublin Core, a subset of available metadata.

NOTE: Back in 2008, Google officially announced they were retiring support for OAI-PMH based Sitemaps. So, OAI-PMH will no longer help you get better indexing through Google. Instead, you should be using the DSpace 'generate-sitemaps' feature described above.

Other methods

See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/Web_Analytics