Research

Can LLMs learn social skills by playing games?

A new open-source framework, TextArena, pits large language models against each other in competitive environments designed to test and improve their communication skills.

Large language models are moving beyond fixed question-and-answer tasks to more complex, open-ended problems. TextArena is part of a new class of benchmarks designed to test and stretch their capabilities.

TextArena is an interactive open-source platform, developed by researchers at Singapore’s Agency for Science, Technology and Research (A*STAR) and IBM Research, in which LLM agents compete against each other in more than two dozen text-based games. Reminiscent of bare-bones computer games from the 1980s, most of the puzzles, card games, and board games on the site demand logic, strategy, and negotiation skills to succeed.

These kinds of soft skills have been difficult to teach, let alone evaluate, in traditional language models. As LLMs now transition into AI agents, a variety of new training and evaluation paradigms are emerging.

These more complex environments measure how well LLM agents can plan, call tools, execute tasks, and interact with each other to accomplish real-world tasks, from diagnosing and remediating IT issues to managing industrial machinery across their life cycle.

TextArena marks a return to the controlled environments of a decade ago, when specialized AI models achieved superhuman performance in narrow domains like chess, Go, and Pac-Man. This time, though, the ‘players’ are statistical models of language, and the objective is to learn a variety of games that invoke different soft skills, from negotiation to empathy to conflict resolution.

LLMs today are great at generating text and code, essentially paraphrasing knowledge they hoovered up from the internet. But they still struggle with the kinds of unpredictable scenarios and social interactions that define real life. TextArena was designed to fill this void.

“A lot of what we want to do with agents is open-ended, and we’re only starting to learn how to both train and evaluate those types of skills,” says Leshem Choshen, an IBM researcher who co-founded the site with A*STAR researchers Leon Guertler and Bobby Cheng.

At NeurIPS this December, researchers will get a chance to test the social intelligence of their models in the inaugural Theory-of-Mind challenge, pitting contestants against each other in four TextArena-hosted games.

“We now have AIs that are superhuman at chess, but we've never had one that’s superhuman at negotiation,” said Leon Guertler. “We don't even know what that would look like.”

‘A gym of sorts’

A mutual interest in “tiny” language models brought Choshen, an expert on AI evaluation, and Guertler together. They had separately been experimenting with tiny models and trying to figure out how to fairly compare models large and small. At some point, their conversation shifted to dynamic benchmarks, and the potential of using reinforcement learning (RL) to test and improve the cognitive skills of LLM agents.

Building an RL platform for training and evaluation had numerous benefits.

Evaluations could run continuously, with the leaderboard updating in real time. The platform could simulate the real world more closely than a quiz or an exam, allowing LLMs to learn through trial and error rather than by imitating moves or examples provided by humans. It could also generate a nearly limitless amount of data they could pour back into training.

Best of all, the researchers believed they could challenge LLMs in ways that fixed benchmarks, and old-school RL environments without language, could not. “We now have AIs that are superhuman at chess, but we've never had one that’s superhuman at negotiation,” said Guertler. “We don't even know what that would look like.”

They decided to name the platform TextArena, bought the domain name, and designed the website to look like early AI “gyms,” where AI enthusiasts would go to hone their model’s skills at specific tasks.

The site received an unexpected boost a few weeks after its launch in January, when AI pioneer Andrej Karpathy urged his 1.2 million followers on X to build more open RL environments, which he likened to a “gym of sorts,” where LLMs could improve their cognitive strategies.

Guertler responded almost immediately:

Perfect timing, we are just about to publish TextArena. A collection of 57 text-based games (30 in the first release) including single-player, two-player and multi-player games. We tried keeping the interface similar to OpenAI gym, made it very easy to add new games, and created…
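The gym-style interface Guertler describes typically means each game exposes `reset` and `step` methods that agents interact with in a loop. The sketch below illustrates that pattern with a toy two-player negotiation game; the class and method names are illustrative assumptions for this article, not TextArena's actual API.

```python
# A minimal sketch of a gym-style, two-player text environment.
# The environment here is a toy invented for illustration, not one
# of TextArena's games.

class TinyNegotiationEnv:
    """Toy game: each player claims a share of a 10-point pot.
    If the two claims fit within the pot, the deal goes through."""

    def __init__(self, pot=10):
        self.pot = pot

    def reset(self):
        """Start a new episode and return the first observation."""
        self.claims = {}
        self.current_player = 0
        return f"Player 0: claim 0-{self.pot}."

    def step(self, action):
        """Gym convention: return (observation, reward, done, info)."""
        self.claims[self.current_player] = int(action)
        if len(self.claims) < 2:
            self.current_player = 1
            return (f"Player 1: claim 0-{self.pot}.", 0, False, {})
        fits = self.claims[0] + self.claims[1] <= self.pot
        reward = self.claims[1] if fits else 0
        return ("Game over.", reward, True, {"fits": fits})

# Example episode: two scripted 'agents' split the pot 6/4.
env = TinyNegotiationEnv()
obs = env.reset()
obs, r, done, info = env.step("6")
obs, r, done, info = env.step("4")
print(done, info["fits"])  # True True
```

In a real TextArena run, the scripted actions above would instead come from LLMs reading the text observation and replying in natural language, which is what makes the loop a training signal for soft skills rather than board positions.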

Since then, 216 LLMs have competed in more than 100,000 games, and the TextArena library has been downloaded for AI training purposes more than 65,000 times. Anthropic's Claude-3.5-Sonnet currently holds first place, with a TrueSkill score of 29.6, followed by Alibaba's Qwen-Max, with 28.1, and Google's Gemini 2.0 model, with 27.5.
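Leaderboard ratings like these update after every head-to-head game. As a simplified, self-contained illustration of how such skill scores move (using the simpler Elo scheme rather than TrueSkill, with an illustrative K-factor of 32):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one game.
    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    # Expected score for A under the standard Elo logistic curve.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two evenly matched models; A wins, so A gains 16 points and B loses 16.
a, b = elo_update(1500, 1500, 1.0)
print(round(a), round(b))  # 1516 1484
```

TrueSkill refines this idea by tracking an uncertainty alongside each rating, so new or rarely played models move faster up and down the board than established ones.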

A magnet for video game and AGI enthusiasts

AI hobbyists from the open-source community have enthusiastically embraced TextArena, implementing new games, reporting and fixing bugs, and building out the user interface.

Alex Duffy, a consultant at Every Inc, a startup that provides AI-related back-office solutions, is currently adding an open-source version of the game Diplomacy to the site. He’s excited to see how quickly LLMs can learn negotiating skills, as well as other cognitive skills needed to succeed at games on the site.

“In the early gym environments, you tried to be good at one game,” he says. “Here, you’re changing arenas and changing the prompts. It’s much more flexible.”

Another volunteer, Simone Romeo, has designed more than 30 games targeting skills not yet on the site. They include adaptations of prisoner’s dilemma and a take on the modern elevator pitch. In his day job, he’s a Milan-based learning and development consultant who develops games to help employees practice things like sales pitches and leadership skills.
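The classic prisoner's dilemma that such games adapt fits in a few lines. The payoff values below are the standard textbook ones, not necessarily those used in Romeo's versions on the site:

```python
# Payoffs (points) for (my_move, their_move); moves are
# "C" (cooperate) or "D" (defect). Standard textbook values:
# defecting is always individually tempting, but mutual
# cooperation beats mutual defection.
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # I cooperate, they defect
    ("D", "C"): (5, 0),  # I defect, they cooperate
    ("D", "D"): (1, 1),  # mutual defection
}

def play(move_a, move_b):
    """Return the (player_a, player_b) payoffs for one round."""
    return PAYOFF[(move_a, move_b)]

print(play("C", "C"))  # (3, 3)
print(play("D", "C"))  # (5, 0)
```

Phrased as a text game, the interesting part for an LLM is not the matrix itself but the dialogue around it: whether an agent can build, test, and honor trust over repeated rounds.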

He’s intrigued by the challenge of teaching LLMs soft skills at a time when so many are “crashing all of the reasoning benchmarks.” “We want to use AI to improve our lives, so the better they can interact with us, the more useful they will be,” he says.

With the help of the open-source community, the team continues to add new games to TextArena, with a focus on games requiring cooperation and theory of mind. Try some of the games at textarena.ai.
