...

Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics) will notice that a significant proportion (and in many cases a majority) of visitors arrive via a search engine such as Google or Yahoo. Hence, to help maximize the impact of content and thus encourage further deposits, it's important to ensure that your DSpace instance is indexed effectively. Here is how.

Very briefly:

  • DSpace 1.5 and 1.5.1 ship with a bad robots.txt. Remove the line that reads Disallow: /browse; if you do not, your site will not be indexed correctly.

...

Just set up a cron job (or a scheduled task on Windows), e.g. with cron:

 0 6 * * * dspace/bin/generate-sitemaps
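
On Windows, a scheduled task can do the same job. A rough equivalent using schtasks, assuming a generate-sitemaps.bat wrapper exists in your installation (the exact script name and path may differ), would be:

 schtasks /Create /TN "DSpace sitemaps" /SC DAILY /ST 06:00 /TR "C:\dspace\bin\generate-sitemaps.bat"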

Also, if you've customized your site's look and feel (as most have), ensure that there is a link to dspace-url/htmlmap in your DSpace's front or home page. (This is present in the footer of the default template.) E.g.:

 <a href="/htmlmap"></a>

Search engines will now look at /htmlmap, which (after generate-sitemaps has been run) serves one or more pre-generated HTML files, with minimal load on your hardware, linking directly to items, collections and communities in your DSpace instance. Crawlers will not have to work their way through the browse screens, which are intended more for human consumption and are more expensive for the server to render.
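
A quick way to confirm the endpoint works after generate-sitemaps has run is to fetch it directly (dspace.example.edu is a placeholder for your own DSpace URL):

 curl -s http://dspace.example.edu/htmlmap

You should get back HTML containing links to individual items, collections and communities.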

...

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

  • /bitstream
  • /browse (UNLESS USING SITEMAPS)
  • /browse-date (UNLESS USING SITEMAPS)
  • /community-list (UNLESS USING SITEMAPS)
  • /handle
  • /html
  • /htmlmap

Example good robots.txt

If you do not provide sitemaps, the only way for crawlers to reach your content is via the browse pages. In DSpace 1.4 or earlier, you can use a robots.txt like the one below to keep crawlers away from all but the date-based browse pages (the most reliable for indexing). (This kind of selective blocking is not possible in 1.5, but if you're on 1.5 you can use sitemaps instead.)


 User-agent: *
 Disallow: /browse-subject
 Disallow: /browse-author
 Disallow: /*/browse-subject
 Disallow: /*/browse-author
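
If you are using sitemaps, the browse pages no longer need to be crawlable, so a robots.txt along the following lines is one option. Note that the Sitemap: directive comes from the sitemaps.org protocol, and the /sitemap path is an assumption: check whether your DSpace version actually exposes an XML sitemap there before advertising it.

 User-agent: *
 Disallow: /browse
 Disallow: /*/browse
 Sitemap: http://dspace.example.edu/sitemap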

Metadata in the HTML HEAD

...

If you're using DSpace 1.5 and the JSP UI, or 1.5.2 with either UI, you should already see this if you VIEW SOURCE in your browser on an item display page, e.g.:


 <meta name="DC.type" content="Article" />
 <meta name="DCTERMS.contributor" content="Tansley, Robert" />
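
You can also check from the command line; the handle below is hypothetical, so substitute an item from your own repository (and your own hostname):

 curl -s http://dspace.example.edu/handle/123456789/1 | grep -i 'meta name="DC'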

If you don't see anything like this, ensure that the following is in your site's layout/header-default.jsp, within the <head> element:


 <% if (extraHeadData != null)
    { %>
 <%= extraHeadData %>
 <% } %>

If you have heavily customized your metadata fields away from Dublin Core, you can adjust the crosswalk that generates these elements by editing dspace/config/crosswalks/xhtml-head-item.properties.
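
The exact property syntax varies between DSpace versions, so check the comments at the top of that file. The general shape is a mapping from a DSpace metadata field to the meta element name it should be exposed as; the lines below are purely illustrative and not copied from any particular release:

 # Illustrative only: expose local fields under Dublin Core meta element names
 dc.contributor.author = DC.contributor
 dc.date.issued = DCTERMS.issued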

Don't worry about OAI-PMH

...

  • No reliable way to determine OAI-PMH base URL for a DSpace site.
  • No standard or predictable way to get to item display page or full text from an OAI-PMH record, making effective indexing and presenting meaningful results difficult.
  • In most cases, provides access only to simple Dublin Core, a subset of the available metadata.

...