...

Data Structure Approaches

 

The table below shows how well each combination of triple store data structure and in-memory data structure supports each in-memory processing operation. Column headings read "triple store structure / in-memory structure", where DLL = doubly linked list, DLL+idx = doubly linked list + order index, array = convert to array on fetch, and hash = convert to hash on fetch (hash key = proxy URI).

| In-memory Processing | DLL / DLL | DLL / array | DLL / hash | DLL+idx / DLL | DLL+idx / array | DLL+idx / hash |
| --- | --- | --- | --- | --- | --- | --- |
| Direct access at loaded position | NO | FAST | NO | SLOW | FAST | SLOW |
| Direct access by URI as ID (pre-loaded) | SLOW | SLOW | FAST | SLOW | SLOW | FAST |
| Iteration forward | FAST | FAST | FAST | FAST | FAST | FAST |
| Iteration backward | FAST | FAST | FAST | FAST | FAST | FAST |
| Append after last loaded | FAST | FAST | FAST | SLOW | SLOW | SLOW |
| Append after global last | FAST | FAST | FAST | SLOW | SLOW | SLOW |
| Insert after loaded item | FAST | FAST | FAST | SLOW | SLOW | SLOW |
| Insert before loaded item | FAST | FAST | FAST | SLOW | SLOW | SLOW |
| Insert after loaded position | NO | FAST | NO | SLOW | SLOW | SLOW |
| Insert before loaded position | NO | FAST | NO | SLOW | SLOW | SLOW |
| Delete loaded item | FAST | FAST | FAST | SLOW | SLOW | SLOW |
| Delete from loaded position | NO | FAST | NO | SLOW | SLOW | SLOW |
| Fetch one by global position | NO | NO | NO | SLOW | SLOW | SLOW |
| Fetch one by URI | FAST | FAST | FAST | FAST | FAST | FAST |
| Fetch range | SLOW | SLOW | SLOW | FAST | FAST | FAST |
| Fetch next range | SLOW | SLOW | SLOW | SLOW | SLOW | SLOW |
| Persist one at global position | NO | NO | NO | SLOW | SLOW | SLOW |
| Persist one by URI | FAST | FAST | FAST | FAST | FAST | FAST |
| Persist range | SLOW | SLOW | SLOW | SLOW | SLOW | SLOW |
| List size | NO | NO | NO | FAST | FAST | FAST |

 

Analysis of Triple Store Data Structure

| Triple Store Data Structure | Pros | Cons |
| --- | --- | --- |
| doubly linked list | All updates affect at most 3 items (i.e., prev, current, and next). | Requires traversal of links during fetch (i.e., each item is fetched individually, one query per item). |
| doubly linked list + order index | Much faster fetch when retrieving all items in a range (a single query gets all items). | Adding/removing items can be very slow when the item is early in the list (worst case touches all list items). |

 

Analysis of In-memory Data Structure

| In-memory Data Structure | Pros | Cons | Comments |
| --- | --- | --- | --- |
| doubly linked list | Least amount of processing during fetch (probably not significant relative to overall processing required). | Iteration requires traversing next links. | |
| convert to array on fetch | Allows fast access by position to loaded items; faster iteration than traversing next links (may not be significant). | | |
| convert to hash on fetch | Allows fast access by proxy URI to loaded items. | Iteration requires traversing next links. | key = proxy URI |

I anticipate the need for direct access by proxy URI more often than access by position.

 


...

The ORE implementation in the triple store will maintain a doubly linked list using the IANA ontology's first and last predicates on the list and the next and prev predicates on each proxy.  Once loaded into memory, the ORE triples will be converted into a List Header Info hash data structure holding an array of items, each of which is a List Item Info hash data structure.  Testing was executed to analyze the efficiency of working with arrays of List Item Info hash data structures.
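For reference, here is a minimal sketch of what these in-memory structures might look like. The field names are assumptions drawn from the use cases later on this page, not the actual structures produced by the ORE gem.

```ruby
# Illustrative sketch only -- field names are assumptions based on the
# use cases below, not the actual structures produced by the ORE gem.

# List Header Info: one hash per list, holding list-level links and the
# loaded items.
list_header_info = {
  uri:               'http://example.org/lists/1',  # hypothetical aggregation URI
  first:             nil,  # URI of globally first proxy (iana first predicate)
  last:              nil,  # URI of globally last proxy (iana last predicate)
  first_loaded:      nil,  # first proxy currently loaded in memory
  last_loaded:       nil,  # last proxy currently loaded in memory
  resume_next_token: nil,  # where "fetch next range" would resume
  resume_prev_token: nil,  # where "fetch prev range" would resume
  items:             []    # array of List Item Info hashes
}

# List Item Info: one hash per proxy in the loaded range.
list_item_info = {
  uri:       'http://example.org/proxies/42',  # hypothetical proxy URI
  proxy_for: 'http://example.org/works/7',     # aggregated resource (ore:proxyFor)
  prev:      nil,  # URI of previous proxy (iana prev predicate)
  next:      nil,  # URI of next proxy (iana next predicate)
  position:  0     # position within the loaded range
}

list_header_info[:items] << list_item_info
```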

Limit testing for array in-memory data structure

Test descriptions:

  • array_create - use the Array fill method to add array items with values from 0 to max_items
  • array_move - use Array insert(to, delete_at(from)) to move an item from the end of the filled array to the beginning (worst case scenario)
  • list_create - create a list header data structure and list item data structures with sample real-world data, using items[i] = item_info to add each item to the items array
  • list_move - use Array insert(to, delete_at(from)) to move an item from the end of the filled array to the beginning AND update the prev and next links
  • list_find - use 0.upto(items.size-1) to check the value of items[i][:uri] against the search URI; the test searches for the last item in the list
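As a rough illustration of the list tests, here is a minimal sketch of the list_create, list_move, and list_find operations over an array of item hashes. It is a reconstruction for this page, not the code in the gist linked below.

```ruby
# Sketch of the list_create, list_move, and list_find tests -- an
# illustration only, not the actual code in the linked gist.
require 'benchmark'

max_items = 1_000_000
uri = ->(i) { "http://example.org/proxies/#{i}" }  # hypothetical proxy URIs

# list_create: array of item hashes with prev/next links (as URIs)
items = Array.new(max_items) do |i|
  { uri: uri.call(i),
    prev: (uri.call(i - 1) unless i.zero?),
    next: (uri.call(i + 1) unless i == max_items - 1) }
end

# list_move: move the last item to the front (worst case) and repair the
# prev/next links of the moved item and its old and new neighbours
move_time = Benchmark.realtime do
  moved = items.delete_at(items.size - 1)
  items.insert(0, moved)
  items[-1][:next] = nil            # old neighbour becomes the new last
  moved[:prev]     = nil
  moved[:next]     = items[1][:uri]
  items[1][:prev]  = moved[:uri]
end

# list_find: linear scan for the last item's URI (worst case)
target    = items[-1][:uri]
find_time = Benchmark.realtime do
  0.upto(items.size - 1) { |i| break if items[i][:uri] == target }
end

puts format('list_move: %.4fs  list_find: %.4fs', move_time, find_time)
```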

Environments for testing:

  • Laptop
    • RAM: 16 G  (approximately 4.5 G is allocated to other running programs prior to running tests)
    • Processor:  2.3 GHz quad core
  • DEV VM
    • RAM: 2 G  (approximately 0.6 G is allocated to other running programs prior to running tests)
    • Processor:  2.3 GHz dual core

Target production system:

  • RAM: 2-8 GB
  • Processor: 2.3 GHz dual core

Test code:  https://gist.github.com/elrayle/f5f559f8c10243600dc6

 

Results:

Each cell shows Laptop (16 GB) / DEV VM (2 GB); "-" = test not run.

| Max Items | array_create | array_move | list_create | list_move | list_insert | list_append | list_find_last | list_find_random |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| 10,000 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| 100,000 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| 500,000 | 0 / 0 | 0 / 0 | 1 / 2 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| 1,000,000 | 0 / 0 | 0 / 0 | 4 / 5 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| 2,000,000 | 0 / 0 | 0 / 0 | 11 / 13 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
| 4,000,000 | 0 / 0 | 0 / 0 | 35 / OOM | 0 / OOM | 0 / OOM | 0 / OOM | 1 / OOM | 0 / OOM |
| 8,000,000 | 1 / 1 | 0 / 0 | 115 / - | 0 / - | 0 / - | 0 / - | 2 / - | 0 / - |
| 10,000,000 | 1 / 1 | 0 / 0 | 167 / - | 0 / - | 0 / - | 0 / - | 2 / - | 0 / - |
| 20,000,000 | 1 / 2 | 0 / 0 | OOM / - | OOM / - | OOM / - | OOM / - | OOM / - | OOM / - |
| 40,000,000 | 4 / 4 | 0 / 0 | - | - | - | - | - | - |
| 80,000,000 | 6 / 8 | 0 / 0 | - | - | - | - | - | - |
| 100,000,000 | 8 / 11 | 1 / 1 | - | - | - | - | - | - |

* Time measured in seconds

* OOM - Out of Memory

 

Analysis:

Converting triples data into an array of items is constrained by memory and by the execution time of retrieving data from the triple store and constructing the list.  A modest 2 GB of RAM will support holding all data for lists of up to approximately 100,000 items while maintaining processing efficiency.  Implementations supporting a list of more than 100,000 items, or many concurrent lists with smaller item counts, will likely require a more sophisticated approach to retrieving and storing items in memory.

 

Alternate Approaches:

  • Combine ActiveTriples with Solr.  In this implementation, ActiveTriples would be used for managing and updating individual resources or a few resources at a time (e.g., the Aggregation resource and any modified Proxy resources).  For viewing and paging through lists, data would be stored in Solr, using built-in Solr features for sorting, requesting pages of data, paging forward and back, and extending Proxy resource data with data about the aggregated item that comes from different data stores.  (A paging sketch follows this list.)

  • Include an order predicate in the triples for the list AND/OR in the Solr fields for the proxies.  More exploration is needed to determine the feasibility of including an order triple in the list triples: any change to the order could require a change to potentially every proxy triple in the list.  The same scale of change would be required on the Solr side, but it could be done asynchronously.  In this approach the triple store is immediately correct upon persistence of the change, and the Solr index may lag.  Since the Solr index is used primarily for UI display, the lag is likely acceptable, with the proxies near the current display being updated first and the out-of-date proxies being farther from those being displayed.
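For the first alternative (ActiveTriples + Solr), a minimal sketch of how sorted paging might look using the RSolr client. The Solr URL, core name, field names, and query are assumptions for illustration only, not an agreed design.

```ruby
# Sketch of the Solr side of the first alternative above. The Solr URL,
# core name, and field names (aggregation_ssi, title_ssi, proxy_uri_ssi)
# are assumptions for illustration only.
require 'rsolr'

solr = RSolr.connect(url: 'http://localhost:8983/solr/proxies')

page      = 4    # user requested page X
page_size = 20   # N items per page

response = solr.get('select', params: {
  q:     'aggregation_ssi:"http://example.org/lists/1"',  # proxies in this list
  sort:  'title_ssi asc',                                  # Solr handles ordering
  start: (page - 1) * page_size,                           # Solr handles paging
  rows:  page_size,
  fl:    'proxy_uri_ssi,title_ssi'
})

response['response']['docs'].each do |doc|
  puts "#{doc['title_ssi']} (#{doc['proxy_uri_ssi']})"
end
```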

 

...

Use Cases:

Use Case: Fetch the first N items in the list and display to user; Fetch next N and display; Fetch prev N and display; Fetch page X and display

...

  • Fetch range starting from first and retrieving the next N items
  • Iterate over list from ListHeaderInfo.first_loaded to last_loaded, using ListItemInfo.next to traverse the list (see the sketch after this list)
    • Display items in order with value=ListItemInfo.proxyFor URI and id=ListItemInfo.uri.  (In practice the display value will be the title from the proxyFor resource, but that title is not available from List Item Info and requires extra processing beyond the ORE gem.)
  • User presses next button causing fetch of next N items starting from ListHeaderInfo.resume_next_token
  • Iterate and display
  • User presses previous button causing fetch of prev N items starting from ListHeaderInfo.resume_prev_token
  • User presses button for page X; Find first item on page X
    • For doubly linked list + order index, the first item on page X with page size N can be calculated directly (position of first item = (X - 1) * N + 1).
    • For doubly linked list, this is not directly supported.  The list could be traversed (potentially asynchronously) to determine the first item of pages beyond and preceding the current page.
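A minimal sketch of the iterate-and-display step for the hash in-memory structure, assuming the List Header Info / List Item Info fields sketched earlier plus an assumed items_by_uri field for the hash variant; this is not the ORE gem's API.

```ruby
# Iterate the loaded range from first_loaded to last_loaded by following
# next links. Assumes the hash in-memory structure (items keyed by proxy
# URI in an assumed items_by_uri field); not the ORE gem's API.
def each_loaded_item(header)
  uri = header[:first_loaded]
  while uri
    item = header[:items_by_uri][uri]
    yield item
    break if uri == header[:last_loaded]
    uri = item[:next]
  end
end

each_loaded_item(list_header_info) do |item|
  # id = proxy URI; value = proxyFor URI (title lookup needs extra processing)
  puts "id=#{item[:uri]} value=#{item[:proxy_for]}"
end
```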
Analysis:
  • In-memory Data Structure
    • All three will perform equally well for iteration from first to last.
  • Triple Store Data Structure
    • Doubly linked list + order index will perform better for fetching N items.  However, this is unlikely to be significant for small N; it is not yet clear how large N must be before the difference matters.
    • The doubly linked list + order index triple store structure will perform much better for fetching page X.

...

  • Fetch occurs according to previous user case.
  • Before display,
    • traverse the loaded items, fetch(rdf_subject=ListItemInfo.proxyFor)
    • get object values to be displayed
    • add to ListItemInfo hash structure
  • Send updated ListItemInfo structures to UI for display
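A sketch of the enrich-before-display step above, reusing each_loaded_item from the earlier iteration sketch. The fetch_resource lambda is a stand-in stub for whatever call actually retrieves the proxyFor resource from the triple store (e.g. via ActiveTriples); it is not a real gem API.

```ruby
# Sketch of the enrich-before-display step. fetch_resource is a stub
# standing in for the real triple store fetch of the proxyFor resource.
fetch_resource = ->(uri) { { title: "Title for #{uri}" } }   # stub

def add_display_values(header, fetcher)
  each_loaded_item(header) do |item|
    resource = fetcher.call(item[:proxy_for])   # one fetch per loaded item
    item[:title] = resource[:title]             # object value(s) to display
  end
  header
end

# send the updated ListItemInfo structures to the UI for display
updated = add_display_values(list_header_info, fetch_resource)
```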
Analysis:
  • In-memory Data Structure
    • All three will perform equally well for traverse and update.
    • See also analysis of Fetch N items use case.
  • Triple Store Data Structure
    • Both will perform equally well for traverse and update
    • See also analysis of Fetch N items use case.

...

  • KNOWN: ListItemInfo.uri of item being moved (from id of item in display list) – could also use ListItemInfo.position
  • KNOWN: ListItemInfo.uri of item preceding position of insert – could also use ListItemInfo.position
  • update links of old prev item, old next item, new prev item, new next item, and item being moved (and potentially list first or last if needed)
  • persist all modified resources (perhaps triggered by user clicking Save button or auto-save with each change)
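A sketch of the link updates for moving an item in the hash in-memory structure (keyed by proxy URI, as sketched earlier). Persisting the modified proxy resources back to the triple store is not shown, and the function name and fields are illustrative assumptions.

```ruby
# Sketch of the link updates when moving an item to sit after a new prev
# item, using the hash in-memory structure sketched earlier. Persistence
# of the modified proxies is not shown.
def move_item_after(header, moved_uri, new_prev_uri)
  items    = header[:items_by_uri]
  moved    = items[moved_uri]
  old_prev = moved[:prev]
  old_next = moved[:next]

  # unlink the moved item from its old position
  items[old_prev][:next] = old_next if old_prev && items[old_prev]
  items[old_next][:prev] = old_prev if old_next && items[old_next]
  header[:first_loaded]  = old_next if header[:first_loaded] == moved_uri
  header[:last_loaded]   = old_prev if header[:last_loaded]  == moved_uri

  # relink it after the new prev item
  new_prev     = items[new_prev_uri]
  follower     = new_prev[:next]        # item that will follow the moved item
  moved[:prev] = new_prev_uri
  moved[:next] = follower
  items[follower][:prev] = moved_uri if follower && items[follower]
  new_prev[:next] = moved_uri
  header[:last_loaded] = moved_uri if follower.nil?
end
```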
Analysis:
  • In-memory Data Structure
    • Hash is most efficient with direct access by ListItemInfo.uri (key of hash) and no moves are required within the data structure.  Only links are updated.
    • Doubly linked list is less efficient as the list has to be traversed to find the items to update.  No moves are required within the data structure as only links are updated.
    • Array is the least efficient.  Direct access using ListItemInfo.position can be used to locate the items, but the array will need to be reordered to reflect the change in order of the list items.  (Assumes reorder operation is more expensive than traversal operation.)
  • Triple Store Data Structure
    • Both will perform equally well.

...

  • create new instance of ORE::Proxy
  • fetch ListHeaderInfo.last
  • update the list items' prev and next links and the list's last link
  • persist all three resources
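A sketch of the in-memory side of appending after the global last, using the hash structure sketched earlier. Creating the new ORE::Proxy and persisting the three modified resources happen through the ORE gem and are only indicated by comments; the function name and fields are illustrative assumptions.

```ruby
# Sketch of the in-memory link updates for appending after the global
# last. Creating the ORE::Proxy and persisting the modified resources
# are done through the ORE gem and are not shown here.
def append_item(header, new_uri, proxy_for_uri)
  items    = header[:items_by_uri]
  old_last = header[:last]                       # fetch ListHeaderInfo.last

  new_item = { uri: new_uri, proxy_for: proxy_for_uri, prev: old_last, next: nil }
  items[new_uri] = new_item

  items[old_last][:next] = new_uri if old_last && items[old_last]
  header[:first] ||= new_uri                     # list was empty
  header[:last]    = new_uri
  header[:last_loaded] = new_uri if header[:last_loaded].nil? || header[:last_loaded] == old_last

  # persist all three resources: the new proxy, the old last proxy, and
  # the aggregation (list) resource
  new_item
end
```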
Analysis:
  • In-memory Data Structure
    • All three will perform equally well.
  • Triple Store Data Structure
    • Both will perform equally well.

...

  • How to query against a single triplestore, TS?  Query: fetch where TS.uri in TS.uri.aggregates sort_by TS.title start_at 60 limit 20
  • How to query across two triplestores, TS1 for the list and TS2 for bibliographic references?  Query: fetch where TS2.uri in TS1.uri.aggregates sort_by TS2.title start_at 60 limit 20
  • How to query from Solr?  How to add to Solr?  How to make incremental updates?
Analysis:
  • In-memory Data Structure
    • ???
  • Triple Store Data Structure
    • ???

...