Publication
IUI 2024
Demo paper

EvaluLLM: LLM assisted evaluation of generative outputs

Abstract

With the rapid improvement in large language model (LLM) capabilities, it is becoming more difficult to measure the quality of outputs generated by natural language generation (NLG) systems. Conventional metrics such as BLEU and ROUGE are bound to reference data and are generally unsuitable for tasks that require creative or diverse outputs. Human evaluation is an option, but manually evaluating generated text is difficult to do well and expensive to scale and repeat as requirements and quality criteria change. Recent work has focused on the use of LLMs as customizable NLG evaluators, and initial results are promising. In this demonstration we present EvaluLLM, an application designed to help practitioners set up, run, and review evaluations over sets of NLG outputs, using an LLM as a custom evaluator. Evaluation is formulated as a series of choices between pairs of generated outputs, conditioned on user-provided evaluation criteria. This approach simplifies the evaluation task and obviates the need for complex scoring algorithms. The system can be applied to general evaluation, human-assisted evaluation, and model selection problems.
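As a rough illustration of the pairwise formulation described in the abstract, the sketch below shows how an LLM judge might be asked to choose between two candidate outputs given a user-provided criterion, and how win counts could be aggregated across all pairs. The `call_llm` function, the prompt wording, and the ranking helper are hypothetical placeholders for illustration only; they are not EvaluLLM's actual prompts or implementation.

```python
# Minimal sketch of pairwise LLM-as-judge evaluation (illustrative only;
# not EvaluLLM's actual prompts or API).
from collections import Counter
from itertools import combinations


def call_llm(prompt: str) -> str:
    """Hypothetical completion call; replace with a real LLM client."""
    raise NotImplementedError


def judge_pair(criterion: str, output_a: str, output_b: str) -> str:
    """Ask the judge model which output better satisfies the criterion."""
    prompt = (
        f"Evaluation criterion: {criterion}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output better satisfies the criterion? Answer 'A' or 'B'."
    )
    answer = call_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"


def rank_by_pairwise_wins(criterion: str, outputs: dict[str, str]) -> list[tuple[str, int]]:
    """Compare every pair of candidate outputs and rank them by win count."""
    wins = Counter({name: 0 for name in outputs})
    for (name_a, text_a), (name_b, text_b) in combinations(outputs.items(), 2):
        winner = judge_pair(criterion, text_a, text_b)
        wins[name_a if winner == "A" else name_b] += 1
    return wins.most_common()
```

Because each judgment is a simple binary choice conditioned on the criterion, no calibrated scoring scale is needed; rankings emerge from aggregated pairwise preferences.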
