ATHENA++: natural language querying for complex nested SQL queries
Abstract
Natural Language Interfaces to Databases (NLIDB) systems eliminate the requirement for an end user to use complex query languages like SQL, by translating the input natural language (NL) queries to SQL automatically. Although a significant volume of research has focused on this space, most state-of-the-art systems can at best handle simple select-project-join queries. There has been little to no research on extending the capabilities of NLIDB systems to handle complex business intelligence (BI) queries that often involve nesting as well as aggregation. In this paper, we present Athena++, an end-to-end system that can answer such complex queries in natural language by translating them into nested SQL queries. In particular, Athena++ combines linguistic patterns from NL queries with deep domain reasoning using ontologies to enable nested query detection and generation. We also introduce a new benchmark data set (FIBEN), which consists of 300 NL queries, corresponding to 237 distinct complex SQL queries on a database with 152 tables, conforming to an ontology derived from standard financial ontologies (FIBO and FRO). We conducted extensive experiments comparing Athena++ with two state-of-the-art NLIDB systems, using both FIBEN and the prominent Spider benchmark. Athena++ consistently outperforms both systems across all benchmark data sets with a wide variety of complex queries, achieving 88.33% accuracy on FIBEN benchmark, and 78.89% accuracy on Spider benchmark, beating the best reported accuracy results on the dev set by 8%.