Distilling and exploring nuggets from a corpus
Abstract
This paper describes a live and scalable system that automatically extracts information nuggets for entities/topics from a continuously updated corpus for effective exploration and analysis. A nugget is a piece of semantic information that (1) must be mapped semantically to the transitive closure of a pre-defined ontology, (2) is explicitly supported by text, and (3) has a natural language description that completely conveys its semantic to a user. Fig. 1 shows a type of nugget "involvement in events" for a person entity (Leon Panetta): each nugget has a short description ("meeting", "news conference") with a list of supporting passages. Our key contributions are (1) We extract nuggets and remove redundancy to produce a summary of salient information with supporting clusters of passages. (2) We present an entity/topic centric exploration interface that also allows users to navigate to other entities involved in a nugget. (3) We use the statistical NLP technologies developed over the years in the ACE ,GALE and TAC-KBP programs, including parsing, mention detection, within and cross document coreference resolution, relation detection and slot filler extraction. (4) Our system is flexible and easily adaptable across domains as demonstrated on two corpora: generic news and scientific papers. Search engines such as Google News and Scholar do not retrieve nuggets, and only remove redundancy at document level. News aggregation applications such as Evri categorize news articles based on the entities of topics but do not extract nuggets. Other systems extract richer information, but not all of it has clear semantics; e.g., Silobreaker presents results as "the relationship between X and Y in the context of [keyphrase]", leaving users with the task of interpreting the semantics as it is not tied to a clear ontology. In contrast we remove redundancy, summarize results and present nuggets that have clear semantics.