Publication
PROPOR 2024
Short paper

Training Large Language Encoders with the Curated Carolina Corpus

Abstract

In this paper we present the results of training large language encoders on the curated Carolina Corpus. Large language models (LLMs) are trained on very large amounts of data, which can be expensive to collect. We show that a curated corpus can be used to train models whose performance is comparable to that of models trained on datasets almost three times larger.