Techniques and tools for name disambiguation and entity resolution
Tools already actively in use in the VIVO community
prior to ingest
during ingest
after ingest in VIVO
Additional tools that may prove useful
borrowed from http://rawpatentdata.blogspot.com/2013/01/datamining-and-entity-resolutions-some.html
Name Cleaver
The OrganizationNameCleaver class has methods to reduce a name to only the "kernel" of the name, and also to expand all abbreviations (that Name Cleaver knows of), useful for matching tasks.
OYSTER entity resolution
OYSTER (Open sYSTem Entity Resolution) is an entity resolution system that supports probabilistic direct matching, transitive linking, and asserted linking. To facilitate prospecting for match candidates (blocking), the system builds and maintains an in-memory index of attribute values to identities. Because OYSTER has an identity management system, it also supports persistent identity identifiers. OYSTER is unique among other ER systems in that it is built to incorporate Entity Identity Information Management (EIIM). OYSTER supports EIIM by providing methods that enforce identifiers to be unique among identities, maintain persistent IDs over the life of an identity, and allowing the ability to fix false-positive and false-negative resolutions, which cannot be done with matching rules, through the use of assertion, traceability, and other features.
OPENCALAIS
Calais Web Service by Thomson Reuters. The web service is an API that accepts unstructured text (like news articles, blog postings, etc.), processes them using natural language processing and machine learning algorithms, and returns RDF-formatted entities, facts and events.
OpenCalais supports three types of entity disambiguation: Company disambiguation, Geographical disambiguation and Product (Electronics) disambiguation.
Disambiguation of company names - such as determining whether the company Olympus refers to Olympus Optical Co. Ltd. or Olympus Life and Material Science Europa. The resolution output for a given company mention includes:
- A URI that is unique and uniform across documents
- The formal English legal name of the company
- The company's ticker symbol (for public companies)
For company names that cannot be disambiguated, the returned results will include no resolution information.
AgroTagger
Used for indexing information resources, Agrotagger is a keyword extractor that uses the AGROVOC thesaurus as its set of allowable keywords. It can extract from Microsoft Office documents, PDF files and web pages.