Publication
ICSE 2023
Tutorial
The Landscape of Source Code Representation Learning in AI-Driven Software Engineering Tasks
Abstract
Appropriate representation of source code and its relevant properties forms the backbone of Artificial Intelligence (AI)/Machine Learning (ML) pipelines for various software engineering (SE) tasks such as code classification, bug prediction, code clone detection, and code summarization. In the literature, researchers have experimented extensively with different kinds of source code representations (syntactic, semantic, integrated, customized) and ML techniques such as pre-trained BERT models. In addition, it is common for researchers to create hand-crafted, customized source code representations tailored to a specific SE task. In a 2018 survey [1], Allamanis et al. listed nearly 35 different ways of representing source code for different SE tasks, such as Abstract Syntax Trees (ASTs), customized ASTs, Control Flow Graphs (CFGs), Data Flow Graphs (DFGs), and so on. The main goal of this tutorial is two-fold: (i) to present an overview of the state of the art in source code representations and the corresponding ML pipelines, with an explicit focus on the merits and demerits of each representation; and (ii) to discuss practical challenges in infusing different code views into state-of-the-art ML models, along with future research directions.
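As a rough illustration of the simplest syntactic representation named above, the sketch below derives an AST from a small snippet and summarizes it as node-type counts, the kind of crude feature vector a code classifier might start from. It is not taken from the tutorial: the choice of Python's standard `ast` module, the sample snippet, and the helper names `node_type_sequence` and `node_type_counts` are all assumptions made for illustration.

```python
import ast
from collections import Counter

# Hypothetical sample input; any syntactically valid Python source would do.
SOURCE = """
def add(a, b):
    return a + b
"""


def node_type_sequence(source):
    """Parse the source into an AST and list the type name of every node.

    ast.walk does not guarantee a traversal order, so treat the result
    as a bag of node types rather than an ordered sequence.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def node_type_counts(source):
    """Summarize the AST as a multiset of node types (a bag-of-nodes view)."""
    return Counter(node_type_sequence(source))


if __name__ == "__main__":
    for name, count in sorted(node_type_counts(SOURCE).items()):
        print(name, count)
```

A bag of node types discards the tree structure entirely, which is one reason richer views such as CFGs and DFGs, or learned encoders over the tree itself, are used in practice.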