Info |
---|
Please be aware that individual search engines also have their own guidelines and recommendations for inclusion. While the guidelines below apply to most DSpace sites, you may also wish to review these guidelines for specific search engines:
|
Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant proportion (in many cases a majority) of visitors arrive via a search engine such as Google or Yahoo. Hence, to help maximize the impact of content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.
DSpace comes with tools that help major search engines (Google, Bing, Yahoo, Google Scholar) index all your content easily and effectively. However, many of these tools require some basic setup. Here's how to ensure your site is indexed.
For optimum indexing, you should:
...
...
...
We are constantly adding new indexing improvements to DSpace. In order to ensure your site gets all of these improvements, you should strive to keep it up-to-date. For example:
Additional minor improvements / bug fixes have been made to more recent releases of DSpace.
First ensure your DSpace instance is visible, e.g. with: https://www.google.com/webmasters/tools/sitestatus
...
Once you've enabled your sitemaps, they will be accessible at the following URLs:
XML sitemaps: [dspace.url]/sitemap
HTML sitemaps: [dspace.url]/htmlmap
So, for example, if you have set "dspace.url = http://mysite.org/xmlui" in your "dspace.cfg" configuration file, then the HTML Sitemaps would be at: "http://mysite.org/xmlui/htmlmap"
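As a minimal sketch of how these addresses are derived, the following Python helper joins a `dspace.url` value with the two sitemap paths named above. The function name `sitemap_urls` is hypothetical and is not part of DSpace.

```python
def sitemap_urls(dspace_url: str) -> dict:
    """Return the XML and HTML sitemap URLs for a given dspace.url value.

    Hypothetical helper for illustration only; not a DSpace API.
    """
    base = dspace_url.rstrip("/")  # tolerate a trailing slash in the setting
    return {
        "xml": base + "/sitemap",
        "html": base + "/htmlmap",
    }


if __name__ == "__main__":
    # Example dspace.url value taken from the text above
    for kind, url in sitemap_urls("http://mysite.org/xmlui").items():
        print(kind, url)
```

Requesting both resulting URLs in a browser is a quick way to confirm that sitemaps are actually enabled.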
...
Provide a hidden link to the sitemaps in your DSpace's homepage. If you've customized your site's look and feel (as most have), ensure that there is a link to /htmlmap in your DSpace's front or home page. By default, both the JSPUI and XMLUI provide this link in the footer:
Code Block |
---|
<a href="/htmlmap"></a> |
Announce your sitemap in your robots.txt. Most major search engines will also automatically discover your sitemap if you announce it in your robots.txt file. By default, both the JSPUI and XMLUI provide these references in their robots.txt file. For example:
Code Block |
---|
# The FULL URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
Sitemap: [dspace.url]/sitemap
Sitemap: [dspace.url]/htmlmap |
Search engines will now look at your XML and HTML sitemaps. These are pre-generated XML or HTML files (and are therefore served with minimal impact on your hardware) that link directly to items, collections and communities in your DSpace instance. Crawlers will not have to work their way through any browse screens, which are intended more for human consumption and are more expensive for the server.
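One way to double-check the announcement is to parse your robots.txt and list the sitemaps it declares. The sketch below uses Python's standard `urllib.robotparser` module (the `site_maps()` method needs Python 3.8 or later). The host `my.dspace.url` is a placeholder, and `announced_sitemaps` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content; substitute your own dspace.url value
ROBOTS_TXT = """\
Sitemap: http://my.dspace.url/sitemap
Sitemap: http://my.dspace.url/htmlmap
"""


def announced_sitemaps(robots_txt: str) -> list:
    """Return the Sitemap URLs declared in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in announced_sitemaps(ROBOTS_TXT):
        print(url)
```

An empty result would suggest that the `Sitemap:` lines are missing or malformed.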
...
Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).
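The placement rule can be sketched in a few lines of Python using the standard `urllib.parse` module. Both function names are hypothetical and serve only to illustrate where robots.txt must live and how the paths are prefixed when DSpace is served from a sub-path.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url: str) -> str:
    """Return the top-level robots.txt URL for the host serving site_url."""
    parts = urlsplit(site_url)
    # robots.txt always sits at the root of the host, never under a sub-path
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def prefixed_path(site_url: str, path: str) -> str:
    """Prefix a robots.txt path with the sub-path DSpace is served from."""
    prefix = urlsplit(site_url).path.rstrip("/")
    return prefix + path


if __name__ == "__main__":
    site = "http://repo.foo.edu/dspace/"
    print(robots_txt_url(site))
    print(prefixed_path(site, "/browse-subject"))
```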
Warning |
---|
DSpace 1.5 and 1.5.1 ship with a bad robots.txt file. Delete it, or specifically the line that says Disallow: /browse. If you do not, your site will not be correctly indexed. |
Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.
...
Below is an example of a good robots.txt. The highly recommended settings are uncommented. Additional, optional settings are displayed in comments. Based on your local configuration, you may wish to enable them by uncommenting the corresponding "Disallow:" line.
Code Block |
---|
# The FULL URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
Sitemap: [dspace.url]/sitemap
Sitemap: [dspace.url]/htmlmap

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
# For JSPUI, replace "/search-filter" above with "/simple-search"
#
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register |
WARNING: for your additional disallow statements to be recognized under the User-agent: * group, they cannot be separated by blank lines from the declared User-agent: * block. A blank line indicates the start of a new user agent block. Without a leading user-agent declaration on the first line, blocks are ignored. Comment lines are allowed and will not break the user-agent block.
This is OK:
Code Block |
---|
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
Disallow: /statistics
Disallow: /contact |
This is not OK, as the two lines at the bottom will be completely ignored.
Code Block |
---|
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter

Disallow: /statistics
Disallow: /contact |
To identify if a specific user agent has access to a particular URL, you can use this handy robots.txt tester.
For more information on the robots.txt format, please see the Google Robots.txt documentation.
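The blank-line pitfall described above can also be reproduced locally. The sketch below uses Python's standard `urllib.robotparser`, whose parser happens to treat a blank line as the end of a user-agent group. It illustrates that one parser's behaviour and is not a statement about how any particular search engine parses the file. The host `repo.foo.edu` and the helper name `allowed` are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# All rules in one uninterrupted group
GOOD = """\
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
Disallow: /statistics
Disallow: /contact
"""

# Same rules, but a blank line splits the group in two
BAD = """\
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter

Disallow: /statistics
Disallow: /contact
"""


def allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://repo.foo.edu" + path)


if __name__ == "__main__":
    for name, text in (("GOOD", GOOD), ("BAD", BAD)):
        print(name, "/discover allowed:", allowed(text, "/discover"))
        print(name, "/contact allowed:", allowed(text, "/contact"))
```

With this parser, `/contact` is blocked by the first file but remains crawlable under the second, because the rules after the blank line have no user-agent declaration to attach to.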
It's possible to greatly customize the look and feel of your DSpace, which makes it harder for search engines, and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace (both XMLUI and JSPUI) includes item metadata in the <head> element of each item's HTML display page.
Code Block |
---|
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" /> |
If you have heavily customized your metadata fields away from Dublin Core, you can modify the crosswalk that generates these elements by modifying [dspace]/config/crosswalks/xhtml-head-item.properties.
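To see what a crawler or a tool such as Zotero would find in an item page, the `<meta>` elements can be collected with Python's standard `html.parser` module. This is a minimal sketch. `MetaCollector` and `extract_meta` are hypothetical names, and the sample page reuses the two tags shown above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed through this method
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.fields.append((attributes["name"], attributes["content"]))


def extract_meta(html_text: str) -> list:
    """Return all named <meta> fields found in an HTML document."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.fields


SAMPLE_PAGE = """\
<html><head>
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, content in extract_meta(SAMPLE_PAGE):
        print(name, "=", content)
```

Running the same function over the saved HTML of one of your own item pages shows whether a customized theme still emits the expected fields.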
In addition to Dublin Core <meta> tags in the HTML HEAD, DSpace also includes Google Scholar specific metadata fields in each item's HTML display page.
Code Block |
---|
<meta content="Tansley, Robert" name="citation_author" />
<meta content="Donohue, Timothy" name="citation_author" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />
... |
These meta tags are the "Highwire Press tags" which Google Scholar recommends. If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, they are configurable in [dspace]/config/crosswalks/google-metadata.properties.
Much more information is available in the Configuration section on Google Scholar Metadata Mappings.
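The shape of these tags, with one `citation_author` element per author, can be sketched as follows. This is only an illustration of the expected output format. DSpace itself generates the tags from the mappings in google-metadata.properties, and the function name `citation_meta_tags` is hypothetical.

```python
from html import escape


def citation_meta_tags(title: str, authors: list) -> list:
    """Render Highwire Press style <meta> tags, one tag per author."""
    tags = [
        f'<meta content="{escape(author, quote=True)}" name="citation_author" />'
        for author in authors
    ]
    tags.append(
        f'<meta content="{escape(title, quote=True)}" name="citation_title" />'
    )
    return tags


if __name__ == "__main__":
    for tag in citation_meta_tags(
        "Ensuring your DSpace is indexed",
        ["Tansley, Robert", "Donohue, Timothy"],
    ):
        print(tag)
```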
Make sure that you never redirect "direct file downloads" (i.e. users who directly jump to downloading a file, often from a search engine) to the associated Item's splash/landing page. In the past, some DSpace sites have added these custom URL redirects in order to facilitate capturing statistics via Google Analytics or similar.
While these URL redirects may seem harmless, they may be flagged as cloaking or spam by Google, Google Scholar and other major search engines. This may hurt your site's search engine ranking or even cause your entire site to be flagged for removal from the search engine.
If you have these URL redirects in place, it is highly recommended to remove them immediately. If you created these redirects to facilitate capturing download statistics in Google Analytics, you should consider upgrading to DSpace 5.0 or above, which is able to automatically record bitstream downloads in Google Analytics (see DS-2088) without the need for any URL redirects.
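When auditing a site for such redirects, a rough heuristic is to request a bitstream URL without following redirects and inspect the status code and Content-Type of the response. The sketch below covers only the classification step. The function name and the rule itself are assumptions, and a bitstream that is itself an HTML file would be flagged incorrectly.

```python
# HTTP status codes that indicate a redirect
REDIRECT_CODES = {301, 302, 303, 307, 308}


def download_served_directly(status: int, content_type: str) -> bool:
    """Heuristic: True if the response looks like the file itself.

    Returns False for redirects and for HTML responses, which may indicate
    that the download was sent to a splash/landing page instead.
    """
    if status in REDIRECT_CODES:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type != "text/html"


if __name__ == "__main__":
    print(download_served_directly(200, "application/pdf"))
    print(download_served_directly(302, "text/html; charset=UTF-8"))
```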
While DSpace offers a PDF Citation Cover Page option, this option may affect your content's visibility in search engines like Google Scholar. Google Scholar (and possibly other search engines) specifically extracts metadata by analyzing the contents of the first page of a PDF. Dynamically inserting a custom cover page can break the metadata extraction techniques of Google Scholar and may result in all or much of your site being dropped from the Google Scholar search engine.
For more information, please see the "Indexing Repositories: Pitfalls and Best Practices" talk from Anurag Acharya (co-creator of Google Scholar) presented at the Open Repositories 2015 conference.
...