...

DSpace currently uses the DSpace Intermediate Metadata (DIM) format to store information about the items in its archive. DIM is designed as an internal format, hidden from external view. This proposal aims to translate this private metadata format into an accepted metadata publishing standard, and to integrate it into the web of data in accordance with Linked Data best practices (http://linkeddata.org/). With dereferenceable URIs (served as RDF/XML, N3, HTML etc.) for everything in the metadata store, a user can browse the data store by related concepts ("This Author", "This Subject" and the like) simply by clicking on a metadata item in the Manakin display.
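Serving one URI in several representations is usually done with HTTP content negotiation. The following is a minimal sketch of that selection step, not part of the proposal itself: the function name choose_format, the SUPPORTED table, and the choice of media types (application/rdf+xml, text/n3, text/html) are assumptions made for illustration, and wildcard types such as */* simply fall back to the default.

```python
# Hypothetical mapping from media type to the serialization named in the text.
SUPPORTED = {
    "application/rdf+xml": "RDF/XML",
    "text/n3": "N3",
    "text/html": "HTML",
}


def choose_format(accept_header, default="text/html"):
    """Pick the supported media type with the highest q-value.

    Ties are broken by order of appearance in the header. Wildcards are
    not expanded in this sketch; they fall through to the default.
    """
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type in SUPPORTED and q > 0:
            # Negate q so that min() returns the most preferred entry.
            candidates.append((-q, position, media_type))
    if not candidates:
        return default
    return min(candidates)[2]
```

A browser sending a text/html preference would receive the Manakin page, while an RDF client sending application/rdf+xml would receive the same resource as RDF/XML.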

...

  • Must be XSLT'able
  • Ideally fast; streaming straight from the database would be preferable
  • TriX has some promise (http://www.hpl.hp.com/techreports/2004/HPL-2004-56.html)
  • Possible solution: "R3X", a subset of RDF/XML with no grouping or nesting: http://www.wasab.dk/morten/blog/archives/2004/05/30/transforming-rdfxml-with-xslt
    • Initial experiments (using Jena) suggest that the built-in RDF/XML-ABBREV serializer is slower than the built-in RDF/XML serializer, as expected
    • An R3XWriter has been written to write out any Jena Model, based on the standard RDF/XML writer
    • StreamingStatementModel implementation written, to stream Triples in the base Graph using an iterator (only good for a single use, then the iterator is consumed)
    • Working from in-memory data, this appears to be a small (but consistent) percentage slower than the RDF/XML writer. Profiling shows this to be a result of the extra calls needed to write out the extra rdf:Description for each statement. This holds even when the standard Model construction time (from a List<Triple>) is factored in.
    • Streaming an SQL ResultSet from MySQL one row at a time appears to vary in speed; neither technique appears to have a statistically significant speed advantage
    • Conclusion: There appears to be no conclusive evidence on which to recommend the R3X serialization format: it is more verbose, and so produces a larger amount of data to work with, while the overhead of grouping statements with the same subject appears to be minimal (thanks to Jena's internal architecture). The only compelling reason to opt for R3X would appear to be memory usage: if a single model holds so many statements that it has to be streamed, R3X may be needed to keep memory usage down. Even in that case, however, it should be feasible to use ORDER BY in the SQL and group statements in the DBMS. A variation on this streaming code could easily be made to produce output similar to that of the RDF/XML writer while still streaming results.
    • Potential ToDos: Investigate whether there is something to be gained by re-attempting this experiment with a different RDF API / DBMS. Group streamed results in the R3XWriter?
  • Possible solution: Stream grouped results, using ORDER BY in SQL
    • Assume grouped ordering in R3XWriter in order to group results about resources in output, thus minimising the repetitive use of <rdf:Description /> tags
    • This appears to be slower than the standard Jena RDF/XML writer for models over 10,000 triples. For smaller models, it affords a 2-3x speedup (this is counting the cost of retrieving results from the database). The most likely cause of the slowdown is the SQL ordering in the database.
    • Updated StreamingStatementModel to cache results from iterator in memory, and to deliver a union of these graphs
    • StreamingStatementModel can now be used for more than one iteration over its contents (but with added memory overhead, once the iterator has been consumed, as no further statements are streamed)
    • Once the iterator is consumed, it behaves exactly as if it were a normal in-memory Model; after partial iterations, it is a hybrid (half cached results, half streamed)
    • This causes the R3XWriter to perform marginally slower on writes, but allows the StreamingStatementModel to be plugged in to the standard Jena writers
    • The new StreamingStatementModel outperforms Jena's standard in-memory model within the RDF/XML writer by a factor of 1.5-2x for larger models (10,000 triples and up)
    • Conclusion: The new Model implementation suffers a slight penalty in object creation, but is probably an improvement for its flexibility. If models to be worked with are relatively small (< 10,000 triples), the new grouping R3XWriter offers an improvement. Otherwise, a StreamingStatementModel works best with the standard Jena Writer.
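
The ideas compared in the list above can be illustrated with a short sketch. It is not the Jena-based R3XWriter or StreamingStatementModel described here: the names write_r3x, write_grouped and CachingStream are hypothetical, triples are simplified to (subject URI, predicate QName, literal value) tuples, and Python is used only to keep the example self-contained. It shows the flat one-description-per-statement output, the grouped output that relies on subject-ordered input (as ORDER BY would give), and a stream that caches statements so it can be iterated more than once.

```python
from itertools import groupby
from xml.sax.saxutils import escape, quoteattr


def write_r3x(triples):
    """Flat, R3X-style output: one rdf:Description per statement."""
    for subject, predicate, value in triples:
        yield (f"<rdf:Description rdf:about={quoteattr(subject)}>"
               f"<{predicate}>{escape(value)}</{predicate}>"
               f"</rdf:Description>")


def write_grouped(sorted_triples):
    """Grouped output; assumes the input is already ordered by subject."""
    for subject, group in groupby(sorted_triples, key=lambda t: t[0]):
        yield f"<rdf:Description rdf:about={quoteattr(subject)}>"
        for _, predicate, value in group:
            yield f"  <{predicate}>{escape(value)}</{predicate}>"
        yield "</rdf:Description>"


class CachingStream:
    """Re-iterable view over a single-use iterator.

    Statements already read are replayed from an in-memory cache; the rest
    are pulled from the source on demand, so a partially consumed stream is
    a hybrid of cached and streamed results.
    """

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []

    def __iter__(self):
        index = 0
        while True:
            if index < len(self._cache):
                yield self._cache[index]  # replay a cached statement
            else:
                try:
                    item = next(self._source)  # stream a new statement
                except StopIteration:
                    return
                self._cache.append(item)
                yield item
            index += 1
```

The grouped writer holds only one subject's statements at a time, which is why it depends on the ordering being done upstream; the caching stream trades that memory saving for the ability to be read repeatedly.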

...

References and Other Research

References

http://www.tomjewett.com/dbdesign/dbdesign.php?page=intro.html Tom Jewett provides an excellent beginner's overview of relational modeling in this tutorial. I've found that the patterns found there manifest themselves throughout the DSpace database. --Mark Diggory 18:32, 26 May 2008 (EDT)

http://hcs.science.uva.nl/usr/Schreiber/docs/owl-uml/owl-uml.html Some alignment details between UML and OWL Lite.

Other Research

http://simile.mit.edu/reports/stores/ and http://esw.w3.org/topic/RdfStoreBenchmarking seem well executed, if potentially out of date. Is there any value in repeating this work with DSpace-specific instance data? – Peter Coetzee

D2RQ

http://www.w3.org/2007/03/RdfRDB/papers/d2rq-positionpaper/

  • Example of D2RQ mapping rendered for the dspace.mit.edu database. Mapping_n3.n3mht  

OpenLink Virtuoso

http://virtuoso.openlinksw.com/wiki/main/Main/VOSSQLRDF