Replicating Science in the Context of Machine Learning and Software


Expected start date: 2020-05-01

Estimated duration: 3 months


One of the best ways to understand and contribute to science is to replicate existing studies. For instance, a scientific paper may propose a new algorithm and report impressive improvements over the state of the art; we can then ask whether the claimed properties really hold, especially in another context.

The aim of this internship is to read specific papers, investigate whether their results can be fully reproduced, and then vary the experimental conditions (e.g., the data used in the experiment).
By replicating studies, the intern will gain a deep understanding of a specific scientific subject and also learn about the scientific method itself. The internship is very practical: we will select papers for which the code and data are available.
The main task will be to reuse and customize the existing code and data in order to produce and discuss results.

For this fall/summer, we are specifically interested in replicating studies that apply machine learning algorithms (usually written in Python or R) to real-world software systems (e.g., Linux, ffmpeg). For instance, many learning-based algorithms claim to find optimal software configurations, but do these claims hold for the data we have in our research team? The candidate does not need in-depth knowledge of machine learning (the internship is a good excuse to acquire such knowledge!), but she/he should be passionate about software and about gathering and analyzing data.
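
To give a flavor of the kind of question involved, here is a minimal Python sketch of a replication check. It is only an illustration under stated assumptions, not the method of any particular paper: the "system" is a toy additive performance model, and the names `measure`, `learn_option_effects`, `predicted_best`, and `regret` are hypothetical. The sketch learns option effects from a sample of measured configurations, picks the configuration predicted to be fastest, and reports how far it is from the true optimum (the regret).

```python
import itertools
import random


def measure(config, weights):
    """Toy stand-in for a real measurement: time as a sum of option effects.

    Assumption: a real study would run the configured system instead.
    """
    return 10.0 + sum(w for on, w in zip(config, weights) if on)


def learn_option_effects(sample, n_options):
    """Estimate each option's effect as mean(time | on) - mean(time | off)."""
    effects = []
    for i in range(n_options):
        on = [t for c, t in sample if c[i]]
        off = [t for c, t in sample if not c[i]]
        if on and off:
            effects.append(sum(on) / len(on) - sum(off) / len(off))
        else:
            # Option never varied in the sample: no evidence either way.
            effects.append(0.0)
    return effects


def predicted_best(effects):
    """Enable exactly the options whose estimated effect reduces time."""
    return tuple(e < 0 for e in effects)


def regret(weights, n_train, seed):
    """Extra time of the predicted-best configuration over the true optimum."""
    n = len(weights)
    space = list(itertools.product([False, True], repeat=n))
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    train = rng.sample(space, n_train)
    sample = [(c, measure(c, weights)) for c in train]
    guess = predicted_best(learn_option_effects(sample, n))
    true_best = min(measure(c, weights) for c in space)
    return measure(guess, weights) - true_best


if __name__ == "__main__":
    # Two hypothetical "systems": same procedure, different data.
    systems = {"system_a": [-2.0, 3.0, -1.5, 0.5],
               "system_b": [0.8, -0.3, 2.5, -4.0]}
    for name, weights in systems.items():
        for n_train in (4, 8, 16):
            print(name, n_train, regret(weights, n_train, seed=0))
```

A regret of zero means the learned choice matches the true optimum. Repeating the run with different training budgets, seeds, and systems is the "vary the experimental conditions" step described above.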