Conference paper

Finding the Conversation: A Method for Scoring Documents for Natural Conversation Content

Abstract

With generative AI, acquiring the right training data is a critical part of designing the user experience. Training large language models to talk like humans requires exposing them to the interaction patterns distinctive of natural conversation. Although models are typically fine-tuned on question-answer or instruction pairs, they are less often trained on real-time human conversations. Natural conversation data are hard to find, and the term "conversation" is used to mean very different kinds of interaction or content. We demonstrate a method for scoring language content using generic conversational phrase detection. We generate three scores: 1) the range of unique features, 2) the density of features within sections of the content, and 3) an overall score combining the two. Using our method, we score over 27,000 documents from 6 datasets, which vary widely in whether they contain conversation content. Our results show that this approach is effective in distinguishing conversation content from both non-conversation and conversation-like content.
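
To make the three scores concrete, the following is a minimal illustrative sketch, not the implementation evaluated in this paper. Everything in it is an assumption made for illustration: the phrase patterns in `PATTERNS`, the fixed 50-word sections, the normalization of range by the number of patterns, the definition of density as the share of sections with at least one match, and the use of a simple mean as the combined score.

```python
import re

# Hypothetical conversational phrase patterns (assumed for illustration only;
# the actual feature set used by the method is not specified in this abstract).
PATTERNS = {
    "greeting": r"\b(hi|hello|hey)\b",
    "acknowledgment": r"\b(okay|ok|yeah|uh huh)\b",
    "repair": r"\b(what do you mean|say that again|pardon)\b",
    "thanks": r"\b(thanks|thank you)\b",
    "closing": r"\b(bye|goodbye|see you)\b",
}
COMPILED = {name: re.compile(p, re.IGNORECASE) for name, p in PATTERNS.items()}


def split_sections(text, words_per_section=50):
    """Split text into consecutive sections of at most `words_per_section` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_section])
        for i in range(0, len(words), words_per_section)
    ]


def score_document(text, words_per_section=50):
    """Return assumed range, density, and overall scores, each in [0, 1]."""
    sections = split_sections(text, words_per_section)
    if not sections:
        return {"range": 0.0, "density": 0.0, "overall": 0.0}

    features_found = set()
    sections_with_hits = 0
    for section in sections:
        matched = {name for name, rx in COMPILED.items() if rx.search(section)}
        if matched:
            sections_with_hits += 1
        features_found |= matched

    # Range: share of distinct feature types seen anywhere in the document.
    range_score = len(features_found) / len(COMPILED)
    # Density: share of sections containing at least one feature.
    density_score = sections_with_hits / len(sections)
    # Overall: simple mean of the two (an assumed combination rule).
    overall = (range_score + density_score) / 2
    return {"range": range_score, "density": density_score, "overall": overall}


if __name__ == "__main__":
    print(score_document("Hi there. Okay, thanks a lot. Bye!"))
    print(score_document("The committee reviewed the annual budget."))
```

Under these assumptions, a short chat-like exchange scores high on both range and density, while expository text with no conversational phrases scores zero on all three. Separating the two component scores matters because a long document can contain many feature types concentrated in one passage (high range, low density), which is one way conversation-like content can differ from sustained natural conversation.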

Related