Info

The following instructions describe how to run the Mill for development purposes. If you are deploying the Mill in a production environment, see the documentation here: https://github.com/duracloud/deployment-docs/blob/master/mill-setup.md.


This article describes the necessary steps for configuring and running the DuraCloud Mill. The DuraCloud Mill is a collection of applications that work in concert with DuraCloud to perform a variety of bulk processing tasks, including audit log writing, manifest maintenance, duplication, and bit integrity checks. While the Mill has been designed to run in an auto-scalable environment such as AWS EC2, it can also be run as a suite of stand-alone applications on any machine with sufficient computing resources. This article only describes the various applications and how to configure them; it does not cover approaches to leveraging AWS autoscaling or supporting DevOps tools such as Puppet.

If you are not yet familiar with the DuraCloud Mill, please refer to the DuraCloud Architecture document, which describes the purpose and primary components of the Mill.

...

Once you have an instance of workman running, you can perform an explicit duplication run. Spaces that have been configured with duplication policies (see the Mill Overview for details) generate duplication events when the audit tasks associated with them are processed. If you add a duplication policy to a space that already has content items, you'll need to perform a duplication run to ensure that the items already in that space get duplicated. The loopingduptaskproducer fulfills this function. Based on the set of duplication policies, it generates duplication tasks for all matching spaces.

The loopingduptaskproducer keeps track of which accounts, spaces, and items have been processed in a given run, so it does not need to run in daemon mode. It runs until the queue holds the maximum allowable number of items and then exits. The next time it is run, it picks up where it left off. You may want to lower the max queue size if you have so many items, and so little computing power to process them, that tasks could sit on the queue longer than the maximum life of an SQS message (14 days).

Items are added roughly one thousand at a time for each space, in round-robin fashion, so that all spaces are processed in a timely way. This strategy ensures that small spaces flanked by large spaces are processed quickly. It is also important that only one instance of loopingduptaskproducer is running at any moment in time. There are two settings to be concerned with when it comes to the looping dup task producer:

...
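The batching behavior described above (round-robin across spaces, roughly one thousand items per turn, stopping at the max queue size and resuming on the next run) can be illustrated with a minimal sketch. This is not the Mill's actual code: the function name, the in-memory `queue` list, and the `cursors` dictionary are hypothetical stand-ins for the real SQS queue and the producer's persisted run state.

```python
from collections import deque

BATCH_SIZE = 1000  # "roughly one thousand at a time", per the description above


def produce_round_robin(spaces, cursors, queue, max_queue_size,
                        batch_size=BATCH_SIZE):
    """Illustrative sketch only; all names here are hypothetical.

    spaces:  dict mapping space name -> list of item ids
    cursors: dict mapping space name -> index of the next unprocessed item;
             it is updated in place so a later run resumes where this one
             stopped
    queue:   list standing in for the duplication task queue
    Returns the number of tasks added in this run.
    """
    added = 0
    # Only spaces that still have unprocessed items take part in the rotation.
    pending = deque(
        name for name in spaces if cursors.get(name, 0) < len(spaces[name])
    )
    while pending and len(queue) < max_queue_size:
        name = pending.popleft()
        start = cursors.get(name, 0)
        room = max_queue_size - len(queue)
        # Take at most one batch, and never exceed the remaining queue room.
        batch = spaces[name][start:start + min(batch_size, room)]
        queue.extend((name, item) for item in batch)
        cursors[name] = start + len(batch)
        added += len(batch)
        # A space with items left goes to the back of the rotation.
        if cursors[name] < len(spaces[name]):
            pending.append(name)
    return added
```

Calling the function a second time with the same `cursors` dictionary, after the queue has been drained, continues from the recorded positions rather than starting over. Because the state is only safe with a single writer, the sketch also hints at why only one producer instance should run at a time.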