General-Purpose Semantic Parsing

The project builds on decades of prior work on computational grammars by project partners: the English Resource Grammar (ERG; Flickinger, 2000), under continuous development at Stanford University since 1993. The grammar implements Head-Driven Phrase Structure Grammar (HPSG), arguably among the most influential grammatical theories for HLT, and runs on specialized software: the PET parser (Callmeier, 2002). These resources embody the ‘deepest’ broad-coverage approaches to parsing available today and have long been key elements of the multi-national DELPH-IN collaboration.

The ERG outputs meaning representations in the widely used framework of Minimal Recursion Semantics (MRS; Copestake, Flickinger, Pollard, & Sag, 2005). Below is a simplified MRS for our example sentence: Cisco, CNN reports, sets out to acquire Tandberg in December.

⟨ h1,
  { h7:named(x5, CNN), h8:_report_v(e2, x5, h10),
    h15:named(x13, Cisco), h16:_set+out_v_aim(e17, x13, h18),
    h19:_acquire_v(e20, x13, x21), h21:named(x21, Tandberg),
    h19:_in_p_temp(e20, x23), h26:mofy(x23, 12) },
  { h1 =q h8, h18 =q h19, h10 =q h16 } ⟩

Ignoring formal detail, structural relations hold between logical variables: for example, the entity x5 participates in both the naming (as CNN) and the reporting relation, and the acquiring event e20 is temporally restricted to the twelfth month of the year (‘mofy’). The shared agent of the aiming and acquiring events is captured by variable identity (x13), and the linking of the ‘handle’ variables h10 and h16 embeds the ‘setting out’ as the semantic object of the reporting. Representations of this general form will serve as the interface representations produced by the semantic parsers.
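To make the shape of such representations concrete, here is a minimal Python sketch (our own illustration, not the actual DELPH-IN tooling) that transcribes the example MRS as plain data and recovers shared arguments by variable identity:

    from dataclasses import dataclass

    @dataclass
    class EP:
        """An elementary predication: a labelled relation over variables."""
        label: str        # handle, e.g. 'h19'
        predicate: str    # e.g. '_acquire_v'
        args: tuple       # argument variables/constants, e.g. ('e20', 'x13', 'x21')

    # The example MRS, transcribed predication by predication.
    mrs = [
        EP('h7',  'named',          ('x5', 'CNN')),
        EP('h8',  '_report_v',      ('e2', 'x5', 'h10')),
        EP('h15', 'named',          ('x13', 'Cisco')),
        EP('h16', '_set+out_v_aim', ('e17', 'x13', 'h18')),
        EP('h19', '_acquire_v',     ('e20', 'x13', 'x21')),
        EP('h21', 'named',          ('x21', 'Tandberg')),
        EP('h19', '_in_p_temp',     ('e20', 'x23')),
        EP('h26', 'mofy',           ('x23', '12')),
    ]

    # Handle constraints (the '=q' links) from the example.
    hcons = [('h1', 'h8'), ('h18', 'h19'), ('h10', 'h16')]

    def predications_over(variable):
        """All relations a given logical variable participates in."""
        return [ep.predicate for ep in mrs if variable in ep.args]

    # Variable identity captures shared arguments, e.g. the common agent
    # of the aiming and acquiring events:
    print(predications_over('x13'))  # ['named', '_set+out_v_aim', '_acquire_v']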

The project requires a limited amount of grammar extension and adaptation, in particular the addition of analyses for UGC-specific grammatical constructions and non-sentential utterances. The core of the proposed research, however, will focus on technology improvements and experimentation aimed at increased input robustness, output precision, and parsing efficiency, as well as techniques for adaptation and configuration.

Part of Speech (Super)Tagging

The ERG parser now standardly applies a so-called Part of Speech (PoS) tagger to pre-process its input; PoS tagging is a (statistical) sequence classification process that assigns coarse-grained lexical categories (e.g. noun, verb, or adjective) to input tokens prior to full grammatical analysis. For words unknown to the ERG lexicon, PoS tags are used to robustly provide under-specified, fall-back lexical entries. However, given the massive lexical ambiguity of English and other human languages (forms like dances can be nominal or verbal), reliable PoS tags could in principle also be used to prune the search space of the semantic parser, or to guide the statistical ranking of alternate analyses. Thus, besides increased robustness, novel methods for tighter integration of PoS tagging and parsing will likely yield improvements in output precision and parsing efficiency.
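As a rough illustration of the fall-back mechanism, the sketch below maps Penn Treebank-style PoS tags to underspecified generic entries for out-of-vocabulary words; the tag-to-entry mapping and all entry names are hypothetical, not the actual ERG inventory:

    # Hypothetical mapping from coarse PoS tags to underspecified,
    # generic lexical-entry templates (names are illustrative only).
    GENERIC_ENTRIES = {
        'NN':  'generic_noun',
        'NNP': 'generic_proper_noun',
        'VBZ': 'generic_verb_3sg',
        'JJ':  'generic_adjective',
    }

    def lexical_candidates(token, pos_tag, lexicon):
        """Return native entries for known words; otherwise fall back
        to an underspecified entry derived from the PoS tag."""
        if token in lexicon:                    # covered by the grammar
            return lexicon[token]
        generic = GENERIC_ENTRIES.get(pos_tag)  # robustness fall-back
        return [generic] if generic else []

    # 'dances' illustrates lexical ambiguity (entry names again invented);
    # a reliable tag could also prune such native entries, not only fill gaps.
    lexicon = {'dances': ['noun_plural_entry', 'verb_3sg_entry']}
    print(lexical_candidates('dances', 'VBZ', lexicon))     # both entries kept
    print(lexical_candidates('tweetstorm', 'NN', lexicon))  # ['generic_noun']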

An extension of PoS tagging is so-called supertagging: sequence classification of parser inputs with much more fine-grained lexical categories. Where standard PoS tag sets make a few dozen distinctions, supertag inventories tend to number in the hundreds or thousands. Albeit abstractly similar to the use of PoS tags, supertagging is a more difficult preprocessing task, and making optimal use of supertags in semantic parsing remains an unsolved problem. A recent dissertation on parsing with the ERG (Dridan, 2009) suggests potential for substantial improvements in robustness, precision, and efficiency; its preliminary experiments, however, remain isolated along one dimension of variation at a time and limited to formal English. The project will address remaining issues in this work, develop new techniques for combining supertagging with the ERG, and seek to validate these in multi-dimensional experiments.
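One common pruning regime is multi-tagging with a relative probability threshold: per token, keep every supertag whose probability comes within a fixed factor of the most likely one. A sketch, with invented category names and probabilities:

    def prune_supertags(distributions, beta=0.1):
        """Per token, keep every supertag whose probability comes within
        a factor 'beta' of the most likely one; 'distributions' maps
        each token to a {supertag: probability} dictionary."""
        pruned = {}
        for token, dist in distributions.items():
            best = max(dist.values())
            pruned[token] = [tag for tag, p in dist.items() if p >= beta * best]
        return pruned

    # Illustrative tagger output for 'dances' (category names invented):
    dists = {'dances': {'noun_intrans': 0.55, 'verb_trans': 0.40,
                        'verb_intrans': 0.05}}
    print(prune_supertags(dists, beta=0.2))
    # {'dances': ['noun_intrans', 'verb_trans']} -- the rare reading is pruned

A tighter threshold prunes more aggressively, improving efficiency at the risk of discarding the correct lexical category; this is exactly the kind of robustness vs. precision trade-off the planned experiments will need to quantify.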

Robust Parsing

Another area of parser hybridization for increased robustness to out-of-scope inputs is so-called robust parsing. Here, the project will experimentally contrast the approach dominant in parsing with the XLE, i.e. heuristic integration of parse fragments, with that of Zhang & Kordoni (2008), a PCFG fall-back combined with robust semantic construction; it will seek a synthesis of the strong points of both techniques and document robustness vs. precision trade-offs in parsing user-generated content. Any resulting algorithmic enhancements will need to be compatible with the approach to improved XLE parsing efficiency of Cahill, Maxwell, Meurer, Rohrer, & Rosén (2008).
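The fragment-combination idea can be sketched as follows; the greedy heuristic and scoring below are a deliberate simplification for illustration, not the actual XLE algorithm:

    def cover_with_fragments(n_tokens, fragments):
        """Greedily assemble a full-input analysis from partial parses.
        'fragments' is a list of (start, end, score) spans; repeatedly
        pick the longest, highest-scoring fragment at the current
        position, falling back to a one-token chunk where nothing fits."""
        analysis, i = [], 0
        while i < n_tokens:
            candidates = [f for f in fragments if f[0] == i]
            if candidates:
                best = max(candidates, key=lambda f: (f[1] - f[0], f[2]))
                analysis.append(best)
                i = best[1]
            else:
                analysis.append((i, i + 1, 0.0))  # unanalyzed-token fall-back
                i += 1
        return analysis

    # Fragments over a 6-token input: spans [0,3) and [4,6) parse; token 3 does not.
    frags = [(0, 3, 0.9), (0, 2, 0.8), (4, 6, 0.7)]
    print(cover_with_fragments(6, frags))
    # [(0, 3, 0.9), (3, 4, 0.0), (4, 6, 0.7)]

A PCFG fall-back in the spirit of Zhang & Kordoni (2008) would instead replace the fragment inventory with a complete, if approximate, context-free analysis from which semantics is constructed robustly.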

Parse Disambiguation

Disambiguation is the task of selecting the ‘correct’ (most probable) parse(s) from the space of candidate analyses delineated by the grammars. Whether or not a parser is rule-based at its core, statistical disambiguation is a central component of all computational parsing today, typically by training high-dimensional Maximum Entropy or SVM machine learning models on treebanks (Abney, 1997; Johnson, Geman, Canon, Chi, & Riezler, 1999; Toutanova, Manning, Flickinger, & Oepen, 2005; inter alios). For inputs of about 17 words average length, current ERG parse disambiguation ranks the correct analysis highest in less than sixty per cent of cases, even though it is among the top ten candidates in nine out of ten cases. Fujita, Bond, Oepen, & Tanaka (2007) demonstrate the efficacy of semantic properties in parse disambiguation (for Japanese), but current models used with the ERG do not exploit such information. For the simpler case of context-free syntactic parsing, Charniak & Johnson (2005) suggest a range of relevant properties, applied in a two-stage disambiguation setup. The proposed research will synthesize these approaches into a uniform framework, seeking to substantially improve parse selection accuracy for the ERG through semantic and additional non-local properties, and to combine the benefits of improved precision with (super)tagging and early pruning in a cascaded approach to disambiguation.
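At its core, Maximum Entropy parse selection scores each candidate analysis by an exponentiated weighted feature sum and normalizes over the parse forest. A minimal sketch, with invented feature names and hand-set weights (real models estimate many thousands of weights from a treebank):

    import math

    def rank_parses(parses, weights):
        """Score each candidate as exp(w . f) and normalize, yielding a
        conditional distribution over the parse forest.  Each parse is a
        feature-count dictionary; 'weights' maps features to weights."""
        scores = [math.exp(sum(weights.get(f, 0.0) * c for f, c in p.items()))
                  for p in parses]
        z = sum(scores)
        return [s / z for s in scores]

    # Two candidate analyses of an ambiguous input, described by (invented)
    # local syntactic and non-local/semantic features:
    parses = [
        {'rule:head-comp': 2, 'sem:report_v_embeds_event': 1},
        {'rule:head-comp': 1, 'rule:noun-noun-compound': 1},
    ]
    weights = {'rule:head-comp': 0.3, 'sem:report_v_embeds_event': 1.2,
               'rule:noun-noun-compound': -0.5}
    print(rank_parses(parses, weights))  # the semantic feature boosts parse 1

In a two-stage setup in the style of Charniak & Johnson (2005), a model with such non-local and semantic features would re-rank an n-best list produced by a first, more local model.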

Finally, parse disambiguation models are standardly trained in a supervised manner, i.e. from annotated training data, which inhibits adaptability across domains and genres. Plank & van Noord (2008) and Rimell & Clark (2008) suggest semi-supervised techniques that assume only small quantities of domain-specific annotation. The project will adapt these and similar techniques to semantic parsing with the ERG, experimentally contrast them on various types of UGC, and seek a further reduction in adaptation cost through combination with so-called self-training (McClosky, Charniak, & Johnson, 2008), a technique that Dridan (2009) suggests can be especially beneficial in supertagging.
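In outline, self-training extends a small annotated seed set with the parser's own most confident analyses of unlabeled, in-domain text; in the sketch below, train, parse, and confidence are placeholder interfaces, not actual tooling:

    def self_train(seed_data, unlabeled, train, parse, confidence,
                   rounds=3, threshold=0.9):
        """Iteratively extend a small annotated seed set with the
        parser's own most confident analyses of unlabeled in-domain
        text.  'train', 'parse', and 'confidence' stand in for the real
        model interfaces; 'threshold' trades added data against noise."""
        data = list(seed_data)
        model = train(data)
        for _ in range(rounds):
            auto = [(s, parse(model, s)) for s in unlabeled]
            confident = [(s, a) for s, a in auto
                         if confidence(model, s, a) >= threshold]
            model = train(data + confident)
        return model

The same loop accommodates the semi-supervised variants above by varying where the seed annotations come from and how per-sentence confidence is estimated.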

By Stephan Oepen, Lilja Øvrelid
Published Sep. 27, 2012 1:21 PM