Calls are held every Thursday at 1 pm Eastern Time (GMT-5) – convert to your time at http://www.thetimezoneconverter.com – and note that the U.S. is already on daylight saving time

These calls now use WebEx and have no limit on the number of attendees – see the "Call-in Information" at the bottom of this page.

Updates

  • Brown – 
  • Colorado (Stephen) – did a test of a new data release yesterday that went well; an intermediate step between the original Selenium scripts and a fully automated Harvester-based ingest process.  Also getting ready for the Implementation Fest in 3 weeks.
  • Cornell – (Tim) still working on changes we're making to the home page – map changes are complete with a global, U.S., and a New York State view.  Now working on changes to statistics shown about counts in VIVO and some randomized selections of faculty, research facilities, etc.  Other sites would be able to swap in other states or regions on the maps.
  • Duke – (Patrick) Jim is working on getting a SPARQL endpoint up with Fuseki and running into memory problems running certain kinds of queries.  Looking at running multiple instances; not immediately going to have sites consuming data from the endpoint(s) but want to encourage migration of data feeds from SDB to the SPARQL endpoint.
  • Florida – (Nicholas) Using Python scripts to clean up data
  • Indiana – (Chin Hua and Ronan) – Did a couple of experiments to see whether the problems generating Map of Science visualizations are due to simultaneous requests; observed that each request completed on its own, but when they put in 20-25 requests for the same visualization simultaneously, no visualizations were generated – perhaps because the application may in this case be making requests of the cache before it has completed.  A question: how many people do you expect would be accessing VIVO at the same time? 10-20 (Nicholas). How are the services queued up? Found some code snippets that they believe relate to caching the Jena model from the RDB store in memory and directing SPARQL queries directly to the RDB model.  For certain large visualizations they need to find out more about the caching – perhaps an application for Memcached?
    • Found that most of the time is being spent during SPARQL queries – would like some help examining these to see if they could be made more efficient
    • Will post what they've learned to a wiki page
  • Johns Hopkins – We're unable to hop on the call today (we have Epic EMR going live at many campuses) but could use a little guidance. 
    • We keep bumping into issues with harvesting PubMed data through Harvester. We've broken up the data into smaller chunks and have successfully(ish) completed an update. The problem is that we now have duplicate articles, authors, etc. (see screenshot)
      • What are we missing?
    • What recommendations do you have for the frequency of updates? Should we consider less frequent updates (semi-annual? annual?) for older publications?
    • Why does it take nearly twice as long to run an update through Harvester? Our first pass through took about 16 hours. The second, over 20. (Stephen) Each part of the process takes longer as the dataset grows
  • Memorial University – 
  • NYU – 
  • Scripps – (Michaeleen)
    • Articles with 100 or more authors: removing publications with no Scripps authors; these were included in the source publication database due to errors in disambiguation of Scripps authors and/or Scripps as an organization in literature searches
    • All articles: verifying the match to faculty profiles and making corrections in the case of misattributions
    • Missing authorship entries: ingest code updated; added missing authors and author attributions into VIVO
    • Grant ingest from NIH RePORTER: working on how to put all sub-projects related to one project serial number as a single VIVO grant record; most sub-projects have identical titles and abstracts but some do not (not sure how to handle multiple titles; can create multiple abstracts and perhaps include the sub-project title as first sentence of second abstract entry); will also explore how to connect grants to publications using PMIDs to identify the linkages
  • Stony Brook – 
  • SUNY Buffalo – (Mark) A couple of dead VIVOs – the index indicates there is content but the application doesn't show it.  Looking at logs for clues; has backups. (Stephen) Has run into a couple of issues with Solr – deleting the Solr WAR file, the web app, and the Solr data directory, then redeploying, restarting, and rebuilding the index made the problem go away. (Nicholas) If you navigate through your class hierarchy you can verify that you actually have data. (Tim) Once also had to delete the files in the localhost directory under conf and work. (Jim) Thinks the problem Tim mentioned went away with Tomcat 1.7
  • UCLA – 
  • UCSF – (Eric) Getting ready to add the OpenSocial feature to Profiles and trying to store the OpenSocial data using the ontology rather than in a separate relational table, so that it can show up in a search. How does VIVO avoid exposing certain data as Linked Open Data or to SPARQL queries? (Brian) We handle user accounts and their password hashes in the old RDB tables rather than in the main SDB (or optionally other) triple store. There is no easy mechanism in VIVO to direct data to RDB instead of SDB – the application is hardwired to write only certain data there. (Eric) With the CTSA recommendation to produce SPARQL endpoints for researcher networking systems, this question will keep coming up. (Stephen) May be able to write the data to a separate graph and configure Fuseki not to query that graph. (Eric) Wants to stay with the storage model established in the application – VIVO is all RDF, while Profiles still has relational tables that are synced with an RDF layer, and Loki stores everything as relational data and produces RDF only on the fly. (Jim) As you said, you could create an RDB model if you want to dig in that far. I would prefer you store that data in RDB, but we need to be able to provide you an easier way to do that.
  • WashU – (Kristi) 
  • Weill Cornell – (Paul) Ironing out kinks in publications ingest workflow (getting data from Scopus and Pubmed) and are reasonably happy with it, and will next turn to more systematic automation so are doing incremental updates and deletes, even as the individual faculty-level queries may change. Will be happy to share when are further along.
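
The result-caching question Indiana raises above (simultaneous requests for the same Map of Science visualization, and whether Memcached would help) can be sketched as simple memoization with a TTL, keyed by the query string. This is an illustrative in-process stand-in, not VIVO or Memcached code; all names here (TTLCache, expensive_sparql) are hypothetical.

```python
import hashlib
import threading
import time

class TTLCache:
    """Tiny in-process stand-in for a Memcached-style result cache."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}          # key -> (expires_at, value)
        self._lock = threading.Lock()

    def _key(self, query):
        return hashlib.sha1(query.encode("utf-8")).hexdigest()

    def get_or_compute(self, query, compute):
        key = self._key(query)
        with self._lock:          # serialize simultaneous identical requests
            now = time.time()
            hit = self._store.get(key)
            if hit and hit[0] > now:
                return hit[1]     # fresh cached result: skip the expensive query
            value = compute(query)
            self._store[key] = (now + self.ttl, value)
            return value

cache = TTLCache(ttl_seconds=60)
calls = {"n": 0}

def expensive_sparql(query):
    # placeholder for a slow SPARQL query against the store
    calls["n"] += 1
    return "results-for:" + query

first = cache.get_or_compute("SELECT ?s WHERE { ?s ?p ?o }", expensive_sparql)
second = cache.get_or_compute("SELECT ?s WHERE { ?s ?p ?o }", expensive_sparql)
```

Because the lock serializes identical requests, 20-25 simultaneous requests for the same visualization would compute the result once and serve the rest from cache.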
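
On the Johns Hopkins duplicate problem: one common cause is that successive Harvester runs mint new URIs for records the match stage failed to link to existing ones. A hedged sketch (sample data and field names are illustrative, not Harvester output) of flagging incoming duplicates by PubMed ID before loading:

```python
from collections import defaultdict

# Hypothetical ingest records keyed by the identifier used for matching.
incoming = [
    {"uri": "vivo:pub1", "pmid": "12345", "title": "Alpha"},
    {"uri": "vivo:pub2", "pmid": "67890", "title": "Beta"},
    {"uri": "vivo:pub3", "pmid": "12345", "title": "Alpha"},  # same PMID: duplicate
]

def find_duplicates(records, key="pmid"):
    """Group records by identifier; any group larger than one is a duplicate set."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get(key):
            groups[rec[key]].append(rec["uri"])
    return {k: uris for k, uris in groups.items() if len(uris) > 1}

dups = find_duplicates(incoming)
```

A duplicate set like this would need to be merged to a single URI (or the new record dropped) before the triples reach VIVO.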
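
The sub-project rollup Scripps describes for NIH RePORTER data can be sketched as: group rows by project serial number, collect distinct titles, and keep distinct abstracts, prefixing each abstract with its sub-project title when the titles differ (per the approach discussed above). Field names here are assumptions, not RePORTER's schema.

```python
from collections import defaultdict

subprojects = [
    {"serial": "CA123456", "title": "Core A", "abstract": "Shared abstract."},
    {"serial": "CA123456", "title": "Core A", "abstract": "Shared abstract."},
    {"serial": "CA123456", "title": "Core B", "abstract": "A different abstract."},
]

def rollup(rows):
    """Merge sub-projects sharing a serial number into one grant record."""
    grants = defaultdict(lambda: {"titles": [], "abstracts": []})
    for row in rows:
        g = grants[row["serial"]]
        if row["title"] not in g["titles"]:
            g["titles"].append(row["title"])
        abstract = row["abstract"]
        # when sub-project titles differ, keep the title as context for the abstract
        if len({r["title"] for r in rows if r["serial"] == row["serial"]}) > 1:
            abstract = row["title"] + ": " + abstract
        if abstract not in g["abstracts"]:
            g["abstracts"].append(abstract)
    return dict(grants)

merged = rollup(subprojects)
```

The result is one grant record per serial number, with duplicate abstracts collapsed and divergent sub-project titles preserved.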
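
Stephen's suggestion in the UCSF discussion – write sensitive data to a separate named graph and configure the endpoint not to query it – can be illustrated with a toy quad store in plain Python. This is not a real triple store or Fuseki configuration; the graph names and exclusion list are assumptions.

```python
# (graph, subject, predicate, object) quads; the "private" graph holds account data
quads = [
    ("urn:graph:public",  "ex:person1", "rdfs:label",        "Jane Doe"),
    ("urn:graph:public",  "ex:person1", "vivo:overview",     "Researcher"),
    ("urn:graph:private", "ex:person1", "auth:passwordHash", "9f3a1c"),
]

EXCLUDED_GRAPHS = {"urn:graph:private"}

def query(subject, quads, excluded=EXCLUDED_GRAPHS):
    """Return predicate/object pairs for a subject, skipping excluded graphs."""
    return [(p, o) for g, s, p, o in quads
            if s == subject and g not in excluded]

visible = query("ex:person1", quads)
```

In a real deployment the same effect would come from the endpoint's configuration (e.g. which graphs a Fuseki service exposes), not from application-level filtering.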

Upcoming events

2013 Implementation Fest – April 25-26 at CU Boulder

  • A community hands-on development day is planned as an optional additional activity on Saturday the 27th, e.g. working on internationalization
    • We have interest in principle from Costa Rica, Mexico, and France – one room option has H.323 videoconferencing
  • 2013 VIVO Implementation Fest page has information on transportation and hotels
  • Registration page – note that there is no registration fee for the workshop
  • A DRAFT schedule is available for preview on sched.org (including for mobile view) and as a Google Doc

Security issue 

Arve Solland of Griffith University points out a security issue via an email to VIVO developers at Cornell. The issue affects release 1.5.1. The fix will be easy to develop, but how should we distribute it? What level of publicity/documentation is appropriate?

  • Jim expects there to be 3-4 lines to change in several controllers – not big, but more than a single file to deploy.
  • We have the issue that a lot of people use released files rather than checking out from GitHub. How would people like to receive the patch?
    • A TAR file with everything – for Colorado it would be easier to drop in and replace the whole build than to substitute individual files, though they are doing a 3-tier build


Notable implementation and development list traffic


New:
  • (Alex) Question about licensing of linked data in the context of historical persons and events, where publishers have been asked for permission
    • http://vivo.cornell.edu/termsOfUse
    • http://creativecommons.org/licenses/by/3.0/
    • http://creativecommons.org/publicdomain/zero/1.0/
    • (Paul) Here's an agreement for use of Scopus data in VIVO for Weill
    • (Alex) Another use case might be for using existing controlled vocabularies
    • (Jon) May also be necessary to post attribution. (Alex) Daniel Hook has found that a link back to the source is a good carrot approach
  • (Michael) Restrictions that return two types of classes – 
    • (Brian) – What you're supposed to be able to do (and has worked in the past) is to create a restriction with a union of multiple classes in another editor, such as Protege, and upload it to VIVO.  Unfortunately, it looks like there may have been a regression somewhere in version 1.5 because I can't seem to get the form for adding properties to an individual to offer the right values now when I use a union.  (The ontology editor properly displays the union class, so I know it's in there.) I'll have to look into this further, and at the very least I'll create a JIRA issue to make sure the latter piece gets unbroken for 1.6.
  • (Stephen) Found the source code for the DV-Docs to generate CVs from VIVO
  • (Kelly) Harvester duplicates and speed issues for pulling from PubMed
  • (Mark) I stopped and started Tomcat when VIVO 1.5.1 got slow when editing. Now we get: "There is currently no Graduate Student content in the system", "There is currently no Undergraduate Student content in the system", "There is currently no Person content in the system" – I've rebuilt the index and it rebuilds with no change in results. I don't have any explicit documentation on backup/recovery, but I have recent MySQL dumps and a copy of /usr/local/vivo. Not sure where to go from here. Sure don't want to re-start from bare metal.
Still open:
  • Adding a field to the Solr Index (Gawri)
    • I need some help with the Solr index build. I added a new field called "APPROVED_RECORD" to schema.xml, and also added some code to the addObjectPropertyText() function in the IndividualToSolrDocument.java file. (Jon) Will forward her question to Brian Caruso
  • Conditional validation (Tom) 
    • I was wondering if there was a way to change which validation is run based on the submitting button on an EditConfigurationGenerator form?
    • I basically have the situation where I've used a custom BaseEditSubmissionPreprocessorVTwo to allow the user to submit the edit and go forward to the created/linked object or backward to the subject. This is working fine, but the cancel link on the form calls the PostEditCleanupController directly, which doesn't call the preprocessor to set the correct entity URI to redirect to. One solution to this is to move the cancel link to a cancel button, but this, predictably, causes the FieldVTwo validation to fail.
    • The other solution is to work out how to forcibly invoke the preprocessor, but this seems as though it would be a much bigger issue, so conditional validation seems the easier way to achieve the same effect.

Readings

Under the hood -- indexing and ranking in Facebook's graph search

Call-in Information

Topic: VIVO weekly call

Date: Every Thursday, no end date

Time: 1:00 pm, Eastern Daylight Time (New York, GMT-04:00)

Meeting Number: 641 825 891

To join the online meeting

Go to https://cornell.webex.com/cornell/e.php?AT=WMI&EventID=167096322&RT=MiM2

If requested, enter your name and email address.

Click "Join".

To view in other time zones or languages, please click the link: https://cornell.webex.com/cornell/globalcallin.php?serviceType=MC&ED=167096322&tollFree=1

If those links don't work, please visit the Cornell meeting page and look for a VIVO meeting.

To join the audio conference only

To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.

Call-in toll-free number (US/Canada): 1-855-244-8681

Call-in toll number (US/Canada): 1-650-479-3207

Global call-in numbers: https://cornell.webex.com/cornelluniversity/globalcallin.php?serviceType=MC&ED=161711167&tollFree=1

Toll-free dialing restrictions: http://www.webex.com/pdf/tollfree_restrictions.pdf

Access code: 645 873 290
