...

Bit Integrity Check Task Processor

Bit integrity check tasks operate on a single content item at a time. The task processor downloads the content item, calculates a checksum on the downloaded file, and then compares that value to the storage provider's checksum as well as the checksums stored for the item in the audit log and content index. The result, pass or fail, is recorded in the BitLog database. See the table below for various error conditions, how they might have come about, and how they are resolved.
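The comparison step can be sketched as follows. This is an illustrative outline, not DuraCloud's actual code; the class and method names are hypothetical. A check passes only when the freshly computed checksum matches all three stored values:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Objects;

// Illustrative sketch of the comparison described above; class and
// method names are hypothetical, not DuraCloud's actual API.
public class BitIntegrityCheck {

    // Compute the hex-encoded MD5 checksum of the downloaded content.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // Pass only if the computed checksum matches the storage provider's
    // value and those stored in the content index and audit log.
    static boolean passes(String computed, String providerChecksum,
                          String contentIndexChecksum, String auditLogChecksum) {
        return Objects.equals(computed, providerChecksum)
                && Objects.equals(computed, contentIndexChecksum)
                && Objects.equals(computed, auditLogChecksum);
    }
}
```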

| # | Content | Storage | Content Index | Audit Log | Outcome | How did it happen? |
|---|---------|---------|---------------|-----------|---------|--------------------|
| 1 | (tick) | (tick) or N/A | (tick) | (tick) | Add bit log item: success. | All went as planned. |
| 2 | (error) | (tick) | (tick) | (tick) | Add bit log item: failure. Add item to the ResolutionTask queue (may be internally resolvable if a secondary store is available). | The content went sour. |
| 3 | (tick) | (error) | (tick) | (tick) | Add bit log item: failure. Add item to the ResolutionTask queue (externally resolvable: contact the storage provider). | The storage provider's checksum process failed. |
| 4 | N/A | (error) | (tick) | (tick) | If not the last retry, wait 5 minutes before trying again. If the last retry, generate a bit error. | The storage provider's checksum process failed, the audit log is backed up, or an audit task was dropped. |
| 5 | (tick) | (tick) | (error) or null | (tick) | Add bit log item: failure. Update the content index in place. If the audit log properties are null, use the storage provider's properties to patch the audit log item and the content index. | The content index was corrupted because an update failed, or the checksum itself was corrupted in the process of the update. |
| 6 | (tick) or N/A | (tick) | (tick) | (error) | Add bit log item: failure. Add item to the ResolutionTask queue (internally resolvable: audit log out of sync). | The audit log item was corrupted because an insert failed, or the checksum itself was corrupted in the process of insertion into Dynamo. |
| 7 | (tick) or N/A | (tick) | (tick) | null | Add bit log item: failure. Add item to the audit queue. | The audit log item is missing because an insert failed silently under the AWS covers, or the item was manually deleted. |
| 8 | 404 | 404 | (tick) | (tick) | If penultimate retry, wait 5 minutes before putting the task back on the queue. If the last retry, generate a bit error. | The item was removed in the storage provider but not yet captured by DuraCloud. |
| 9 | (tick) | (tick) | null | null | If penultimate retry, wait 5 minutes before putting the task back on the queue. Otherwise log an error and add the item to the audit queue. | The item was added in the storage provider but not yet captured by DuraCloud. |
| 10 | (tick) | (tick) | (error) | (error) | If penultimate retry, wait 5 minutes before putting the task back on the queue. Otherwise log an error and add the item to the audit queue. | The item was updated in the storage provider but not yet captured by DuraCloud. |
| 11 | 404 | 404 | null | null | Do nothing. | Bit integrity processing is behind fully processed deletes. |
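The table above amounts to decision logic over the four checksum states. The sketch below covers rows 1, 2, 3, and 11 as an illustration; the `Status` values and outcome strings are assumptions, not DuraCloud identifiers, and the remaining rows would extend the same pattern:

```java
// Hypothetical decision sketch for a few rows of the table above;
// the Status values and outcome strings are assumptions, not
// identifiers from DuraCloud's codebase.
public class BitCheckOutcome {

    enum Status { OK, ERROR, NULL, NOT_FOUND, NA }

    static String outcome(Status content, Status storage,
                          Status index, Status audit) {
        // Row 11: a fully processed delete, nothing left to verify.
        if (content == Status.NOT_FOUND && storage == Status.NOT_FOUND
                && index == Status.NULL && audit == Status.NULL) {
            return "do nothing";
        }
        // Row 1: every available checksum agrees.
        if (content == Status.OK && storage != Status.ERROR
                && index == Status.OK && audit == Status.OK) {
            return "bit log: success";
        }
        // Row 2: the stored content itself is bad.
        if (content == Status.ERROR) {
            return "bit log: failure; queue ResolutionTask (internal)";
        }
        // Row 3: the provider's checksum disagrees with everything else.
        if (storage == Status.ERROR && content == Status.OK
                && index == Status.OK && audit == Status.OK) {
            return "bit log: failure; queue ResolutionTask (external)";
        }
        // Rows 4-10 (retries, in-place repairs, audit-queue re-adds)
        // would be handled here.
        return "retry or repair per table rows 4-10";
    }
}
```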


Worker Manager

The Worker Manager, a.k.a. Workman, is the heart of the system, or perhaps more aptly, the digestive system. Workman is responsible for managing a pool of worker threads that are in turn responsible for processing different kinds of tasks. Multiple instances of workman may run at the same time, whether on the same machine or on separate machines; in fact, the scalability of the system depends on the ability to scale up worker nodes to process queue items in parallel. The workman process attempts to read the high priority queue first, reading up to 10 tasks at a time and distributing them to the pool of workers. If no high priority tasks are available, it attempts to read the low priority queue. Should that queue be empty as well, workman backs off exponentially before retrying, waiting initially for 1 minute and never longer than 8 minutes. Once a task worker thread receives a task to process, it monitors the progress and ensures that the visibility timeout is extended as necessary to prevent the item from reappearing on the queue. Once the task has been processed, the task worker then deletes the item from the queue. If the task could not be successfully processed, it is placed at the back of the queue for reprocessing. If it has failed three times, the task is placed on the Dead Letter Queue for human review.
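The empty-queue backoff described above can be sketched as follows; the 1-minute floor and 8-minute ceiling come from the description, while the doubling step is an assumed reading of "exponentially", and the class is illustrative rather than workman's actual code:

```java
// Sketch of workman's empty-queue backoff: wait 1 minute after the
// first empty poll, double on each subsequent empty poll, and cap
// the wait at 8 minutes. The doubling step is an assumption; the
// 1- and 8-minute bounds come from the description above.
public class PollBackoff {

    static final long INITIAL_WAIT_MS = 60_000L;  // 1 minute
    static final long MAX_WAIT_MS = 8 * 60_000L;  // 8 minutes

    private long currentWaitMs = 0;

    // Returns how long to wait after another empty poll.
    long nextWaitMs() {
        currentWaitMs = (currentWaitMs == 0)
                ? INITIAL_WAIT_MS
                : Math.min(currentWaitMs * 2, MAX_WAIT_MS);
        return currentWaitMs;
    }

    // A successful read from either queue resets the backoff.
    void reset() {
        currentWaitMs = 0;
    }
}
```

Successive empty polls would then wait 1, 2, 4, and 8 minutes, holding at 8 until a task arrives and the backoff is reset.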

...