Publication
PROPOR 2024
Short paper
Training Large Language Encoders with the Curated Carolina Corpus
Abstract
In this paper, we present the results of training large language encoders on the curated Carolina Corpus. Large language models (LLMs) are trained on very large amounts of data, which can be expensive to collect. We show that a curated corpus can be used to train models whose performance is comparable to that of models trained on datasets almost three times larger.