CodeCommons: Opportunities for Generative AI and Software Engineering

by Mathieu Acher
19/12/2024
DiverSE Coffee
Rennes, France

Abstract

The CodeCommons project builds on the Software Heritage initiative to consolidate and scale up a unique digital commons of open-source code. Over 24 months, it seeks to accelerate code collection, introduce a new unified data model, and extensively characterize source files to enhance the training of next-generation, truly open-source AI models, while ensuring transparency and efficiency. In this talk, I will first briefly present the specific involvement of the DiverSE team in this project. Then, I will review recent work on (1) generative AI, from the training of open-source large language models (LLMs) for code, such as StarCoder2 and OpenCoder, to their concrete use cases; and (2) software engineering, especially empirical studies that leverage software repositories. Finally, and most importantly, I will invite everyone to actively share their use cases, ideas, requirements, and wish lists regarding CodeCommons, encouraging all participants to become involved in shaping its future.