CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
- Ruchir Puri
- David Kung
- et al.
- 2021
- NeurIPS 2021
Translating source code from one high-level language to another is a long-standing problem in the programming language and software engineering communities. This task has become more important due to the need to translate code from newer languages like Dart and Go to more established languages like Java and Python, and to migrate legacy mainframe applications to the hybrid cloud.
Traditional approaches to code translation are rule-based: the translation rules are written by hand, which is time-consuming and expensive. More recently, the AI and natural language processing communities have explored using large language models (LLMs) for this task.
While LLMs are less expensive than rule-based methods, they have their own challenges. They require large quantities of training data, and as probabilistic models they do not consistently produce high-quality code. The data requirement is a particular problem for code translation because less data is available for less common languages.
To address this issue, the project aims to develop LLM-based code translation tools for low-resource programming languages. The team is prioritizing a tool that translates COBOL applications into modern Java, which would make it possible to build modern replacements for vital infrastructure that currently relies on COBOL.
The project team has also published CodeNet, a large-scale dataset intended to teach AI to code. It contains over 14 million code samples in 55 different programming languages.