Publication
PROPOR 2024
Short paper

Training Large Language Encoders with the Curated Carolina Corpus

Abstract

In this paper we present the results of training large language encoders on the curated Carolina Corpus. Large language models (LLMs) are trained on very large amounts of data, which can be expensive to collect. We show that a curated corpus can be used to train models whose performance is comparable to that of models trained on datasets almost three times larger.