Consistency-based Black-box Uncertainty Quantification for Text-to-SQL
Abstract
When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides an estimate of the confidence in an LLM's generated output and is therefore increasingly recognized as a crucial component of trusted AI systems. UQ is particularly important for complex generative tasks such as text-to-SQL, where an LLM helps users gain insights about data stored in noisy and large databases by translating their natural language queries into structured query language (SQL). In this paper, we investigate the effectiveness of black-box UQ techniques for text-to-SQL, where the consistency between a generated output and other sampled outputs is used as a proxy for estimating its confidence. We propose a high-level similarity aggregation approach suitable for complex generative tasks, including specific techniques that train confidence estimation models on small training sets. Through an extensive empirical study over various text-to-SQL datasets and models, we provide recommendations for the choice of sampling technique and similarity metric. The experiments demonstrate that our proposed similarity aggregation techniques yield better-calibrated confidence estimates than the closest baselines, but also highlight that there is room for improvement on downstream tasks such as selective generation.
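To make the consistency idea concrete, the following is a minimal illustrative sketch (not the paper's exact method): a generated SQL query's confidence is estimated as its average similarity to other outputs sampled for the same question. The token-level Jaccard metric, function names, and example queries below are assumptions chosen for brevity, standing in for the similarity metrics and aggregation models studied in the paper.

```python
# Minimal sketch of consistency-based (black-box) confidence estimation for
# text-to-SQL: the confidence of a generated query is its average similarity
# to the other sampled generations. The Jaccard metric and example queries
# below are illustrative assumptions, not the paper's exact metrics or models.

def jaccard_similarity(sql_a: str, sql_b: str) -> float:
    """Token-level Jaccard similarity; a stand-in for richer SQL-aware metrics."""
    tokens_a, tokens_b = set(sql_a.lower().split()), set(sql_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_confidence(idx: int, samples: list[str]) -> float:
    """Confidence of samples[idx] = its average similarity to the other samples."""
    others = [s for j, s in enumerate(samples) if j != idx]
    if not others:
        return 1.0
    return sum(jaccard_similarity(samples[idx], s) for s in others) / len(others)


if __name__ == "__main__":
    # Pretend these were sampled from an LLM at nonzero temperature for one question.
    samples = [
        "SELECT name FROM employees WHERE salary > 50000",
        "SELECT name FROM employees WHERE salary > 50000",
        "SELECT employees.name FROM employees WHERE salary >= 50000",
        "SELECT COUNT(*) FROM employees",
    ]
    for i, sql in enumerate(samples):
        print(f"{consistency_confidence(i, samples):.3f}  {sql}")
```

In this toy example, the two identical queries receive higher confidence scores than the outlier `SELECT COUNT(*)` query, reflecting the intuition that outputs consistent with the rest of the sample set are more likely to be correct.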