This documentation relates to an old version of DSpace, version 5.x.

DSpace 3.0 added an optional statistics engine based on Elasticsearch, which may be enabled as an alternative to the default SOLR Statistics engine (based on Apache SOLR). The motivation was to find a statistics processing engine that could handle the workload of a large amount of statistics data. The Elasticsearch statistics display also offers another way to run statistical queries against your data. Elasticsearch Usage Statistics was contributed by Peter Dietz of Ohio State University's Knowledge Bank. The data source for Elasticsearch Statistics is DSpace Usage Events, where a Usage Event is a view or download of a DSpace Object (Bitstream, Item Page, Collection Page, Community Page). Elasticsearch Statistics is bundled with DSpace and requires no additional software; it only needs to be enabled. It is only available for use with XMLUI.

What data is being recorded?

The information below is what DSpace records about a Usage Event by default. In DSpace 3.0, the set of fields collected cannot be changed through a configuration setting.

Information about the User Requesting the Content

  • IP Address
  • Time of Request
  • DNS / Hostname
  • User Agent
  • isBot, a flag indicating whether DSpace thinks the user is a robot
  • Geographical Information about where the user is located:
    • Continent
    • Country
    • Country Code
    • City
    • Geographical Latitude/Longitude

Information about the DSpace Resource that was used

  • DSpace Object ID
  • DSpace Object Type: (Item, Bitstream, Collection, or Community)
  • If it is relevant, we also store the hierarchy of where this object exists within DSpace
    • Owning Community
    • Owning Collection
    • Owning Item

Enabling Elasticsearch Statistics

Elasticsearch Statistics is disabled by default in DSpace 3.0. The following steps enable it so that you can collect data and present statistics reports.

 

Modify dspace/config/xmlui.xconf: uncomment the "Statistics - Elasticsearch" aspect and, as the inline note in the file says, comment out the default "Statistics" aspect.

Enable Elasticsearch Statistics in dspace/config/xmlui.xconf
        <!--
             If you prefer to use "Elasticsearch" Statistics, you can uncomment the below
             aspect and COMMENT OUT the default "Statistics" aspect above.
             You must also enable the ElasticSearchLoggerEventListener.
        -->
        <!-- <aspect name="Statistics - Elasticsearch" path="resource://aspects/StatisticsElasticSearch/" /> -->

 

Modify dspace-xmlui/src/main/webapp/WEB-INF/spring/applicationContext.xml and uncomment the following code block for ElasticSearchLoggerEventListener

Enable ElasticSearchLoggerEventListener
<!-- Elasticsearch -->
<!--<bean class="org.dspace.statistics.ElasticSearchLoggerEventListener">
        <property name="eventService">
            <ref bean="dspace.eventService" />
        </property>
    </bean>-->

 

After making these two changes, you will then need to rebuild and restart DSpace.
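
Before rebuilding, the two edits above can be sanity-checked from the shell. The following is a minimal sketch, not a DSpace tool: the function name es_stats_enabled is hypothetical, the two file paths are passed in as arguments, and the check is a naive line-based heuristic (a line counts as active only if it carries no "<!--" on the same line), so it will misjudge a comment that opens on an earlier line.

```sh
#!/bin/sh
# Hypothetical helper: report whether the Elasticsearch statistics aspect and
# listener appear uncommented. Line-based heuristic only, not an XML parser.
# Usage: es_stats_enabled <path to xmlui.xconf> <path to applicationContext.xml>
es_stats_enabled() {
    xconf=$1
    appctx=$2
    # Drop every line containing "<!--", then look for the aspect path.
    grep -v '<!--' "$xconf" | grep -q 'aspects/StatisticsElasticSearch/' || {
        echo "aspect not enabled in $xconf"
        return 1
    }
    # Same test for the listener bean class.
    grep -v '<!--' "$appctx" | grep -q 'org.dspace.statistics.ElasticSearchLoggerEventListener' || {
        echo "listener not enabled in $appctx"
        return 1
    }
    echo "Elasticsearch statistics aspect and listener look enabled"
}
```

Call it with the paths from the two steps above, for example es_stats_enabled dspace/config/xmlui.xconf dspace-xmlui/src/main/webapp/WEB-INF/spring/applicationContext.xml from the source directory. A non-zero exit status means at least one of the two edits appears to be missing.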

 

Importing Legacy Data into Elasticsearch Statistics

Once Elasticsearch Statistics has been enabled, it will begin adding all new Usage Events to its data store. To import your legacy data, you will need to import the data from the dspace.log files. There is no tool yet that converts SOLR statistics data to Elasticsearch statistics data.

From the (Windows / Linux) terminal, you will need to use the DSpace Command Launcher to convert the dspace.log files to a statistics log format. Then you will need to import the statistics log format files into DSpace Statistics.

The Log Converter program converts log files from dspace.log into an intermediate format that can be inserted into Elasticsearch Statistics.

 

Command used: [dspace]/bin/dspace stats-log-converter

Java class: org.dspace.statistics.util.ClassicDSpaceLogConverter

Arguments (short and long forms), with descriptions:

  • -i or --in: Input file
  • -o or --out: Output file
  • -m or --multiple: Adds a wildcard at the end of input and output, so if -i dspace.log -m was specified, dspace.log* would be converted (i.e. all of the following: dspace.log, dspace.log.1, dspace.log.2, dspace.log.3, etc.)
  • -n or --newformat: If the log files have been created with DSpace 1.6 or newer
  • -v or --verbose: Display verbose output (helpful for debugging)
  • -h or --help: Help

An example form of this command would be [dspace]/bin/dspace stats-log-converter -i dspace.log -o statistics.log -m -n
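
To see in advance which files a run with -m is likely to pick up, the wildcard described above can be approximated with ordinary shell globbing. This is only a sketch under an assumption: the converter does its own matching inside the Java class, so the real set of files may differ from what the shell expands. The function name preview_multiple is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: list the files that a trailing wildcard on $1 would
# match, approximating what -m is described as doing. Plain shell globbing.
preview_multiple() {
    prefix=$1
    for f in "$prefix"*; do
        # When nothing matches, the shell leaves the pattern unexpanded;
        # the existence test filters that case out.
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}
```

For example, preview_multiple /dspace/log/dspace.log prints one matching file per line, or nothing if no file starts with that name. The /dspace path here is an assumed install location.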


The Log Importer program takes the intermediate format data produced in the previous step, and imports it into Elasticsearch Statistics.

Command used: [dspace]/bin/dspace stats-log-importer-elasticsearch

Java class: org.dspace.statistics.util.StatisticsImporterElasticSearch

Arguments (short and long forms), with descriptions:

  • -i or --in: Input file
  • -m or --multiple: Adds a wildcard at the end of the input, so if -i statistics.log -m was specified, statistics.log* would be imported (i.e. all of the following: statistics.log, statistics.log.1, statistics.log.2, statistics.log.3, etc.)
  • -s or --skipdns: Skip the reverse DNS lookups that work out where a user is from. (The DNS lookup finds information about the host from its IP address, such as geographical location. This can be slow, and will not work on a server that is not connected to the internet.)
  • -v or --verbose: Display verbose output (helpful for debugging)
  • -h or --help: Help

An example form of this command would be [dspace]/bin/dspace stats-log-importer-elasticsearch -i statistics.log -m
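
The two steps (convert, then import) can be chained in one small script. The following is a sketch under stated assumptions, not an official tool: DSPACE_BIN defaults to /dspace/bin/dspace, which is only an assumed install location, the function name import_legacy_logs is hypothetical, and -n is passed on the assumption that the logs were written by DSpace 1.6 or newer. Only the flags documented above are used.

```sh
#!/bin/sh
# Assumption: adjust this to your own [dspace]/bin/dspace path.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# Hypothetical wrapper: convert dspace.log* and import the result.
#   $1 = input log, e.g. /dspace/log/dspace.log
#   $2 = intermediate file, e.g. /dspace/log/statistics.log
#   $3 = optional literal "skipdns" to pass -s to the importer
import_legacy_logs() {
    in_log=$1
    out_log=$2
    skip_flag=""
    if [ "${3:-}" = "skipdns" ]; then
        skip_flag="-s"
    fi

    # Step 1: dspace.log* -> statistics.log* (intermediate format).
    # Stop here if the conversion fails, so nothing partial is imported.
    "$DSPACE_BIN" stats-log-converter -i "$in_log" -o "$out_log" -m -n || return 1

    # Step 2: load the intermediate files into Elasticsearch Statistics.
    # $skip_flag is unquoted on purpose: an empty value adds no argument.
    "$DSPACE_BIN" stats-log-importer-elasticsearch -i "$out_log" -m $skip_flag
}
```

A typical call would be import_legacy_logs /dspace/log/dspace.log /dspace/log/statistics.log skipdns. Leaving out the third argument keeps the reverse DNS lookups, which gives richer location data at the cost of a slower import.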

Viewing Data in Elasticsearch Statistics

In XMLUI, while logged in as an administrator, the Context Panel will have an additional "View Statistics" link when you browse to a Community, Collection, or Item.

The Statistics Report includes:

  • Bitstreams with Most Downloads, for all time.
  • Bitstreams with Most Downloads, previous month.
  • Total Number of Downloads to Bitstreams within this container, broken down by month.
  • Number of hits per Country

This data is presented as either a Table or Line Graph, and requires JavaScript to draw the graphics.


1 Comment

    • dspace.log contains a lot of ES messages with priority INFO. Document how to turn them down.
    • location of the ES core?
      • [dspace]/elasticsearch
      • Explain structure. What are shards (indexes) and why is there more than 1? How is data distributed among them? By time? Will we need to raise the file descriptor limit when the index grows?
    • I'm seeing the link in admin menu, but not any statistics accessible for the visitors (using Mirage). How can I enable it to the visitors?
      • Document the authorization.admin.* options in usage-statistics.cfg
      • Document elastic-search-statistics.cfg, usage-statistics.cfg and solr-statistics.cfg
    • How does the statistics_viewer group work? Is it possible to display stats publicly by default? What about admins? The statistics_viewer group doesn't exist by default. How does it play together with authorization.admin.*?
    • http vs. https? https://www.google.com/jsapi

    • Is there an admin/search interface like with Solr? Is there REST API? What are its access restrictions?
    • How to use elasticsearch-head?
    • How to use the CSV writer?
      • /handle/123456789/18588/stats/csv/topCountries
      • topCountries / fileDownloads / topDownloads
    • Statistics of deleted items?
    • Statistics show up on communities, but why not at site root?
    • TODO Statistics for views, not just downloads.
    • TODO On the dashboard page, the chart also has space for negative values if no data has been recorded. There shouldn't be much use for negative downloads.
    • TODO i18n
    • TODO Top unique IPs are available in DRI - not good
    • If the ES node crashes (in my case due to running out of disk space) how do I restart the node without restarting DSpace?
    • stats-log-importer-elasticsearch is slooooow
      • Basically, this will get unbearably slow after a while:
        /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log -m
      • Better do this instead:
        /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.2012-12 -m
        /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.2012-11 -m
        /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.2012-10 -m
        ... 
      • I think the problem might be that you're adding each document (input line) as a bulk item. It would be better to add whole input files or, perhaps even better, a constant number of documents at once:
        https://github.com/DSpace/DSpace/blob/dspace-3.0/dspace-api/src/main/java/org/dspace/statistics/util/StatisticsImporterElasticSearch.java#L284
      • For now, here's a workaround shell script:

        #!/bin/sh
        for year in `seq 2011 2012`; do
          for month in `seq -w 1 12`; do
            if [ "$year" -eq "2011" ]; then
              if [ $month -lt 10 ]; then
                continue
              fi
            fi
            echo "Importing $year-$month"
            /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.$year-$month-0 -m
            /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.$year-$month-1 -m
            /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.$year-$month-2 -m
            /dspace/bin/dspace stats-log-importer-elasticsearch -i /dspace/log/statistics.log.$year-$month-3 -m
          done
        done

    • TODO export from Solr to the intermediate format (statistics.log) that stats-log-converter produces and stats-log-importer-elasticsearch consumes?
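
For the CSV writer paths mentioned in the comment above, the URL can be assembled with a small shell function. This is a sketch built only from the path and the three report names quoted in that comment; the base URL is an assumption about your installation, the function name stats_csv_url is hypothetical, and whether the endpoint is reachable depends on access restrictions that are not documented here.

```sh
#!/bin/sh
# Hypothetical helper: build the CSV export URL for one statistics report.
#   $1 = XMLUI base URL (assumption, e.g. http://localhost:8080/xmlui)
#   $2 = handle of the Community, Collection, or Item, e.g. 123456789/18588
#   $3 = report name: topCountries, fileDownloads, or topDownloads
stats_csv_url() {
    case $3 in
        topCountries|fileDownloads|topDownloads) ;;
        *) echo "unknown report: $3" >&2; return 1 ;;
    esac
    # ${1%/} strips one trailing slash from the base URL, if present.
    printf '%s/handle/%s/stats/csv/%s\n' "${1%/}" "$2" "$3"
}
```

The result can be passed to a download tool, for example curl -o topCountries.csv "$(stats_csv_url http://localhost:8080/xmlui 123456789/18588 topCountries)".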