Tera-scale graph processing on the Software Heritage graph

by Valentin Lorentz
15/02/2024
DiverSE Coffee
Rennes, France

Abstract

I will present my work on the compressed graph representation of the Software Heritage archive, whose latest version features 34 billion nodes and 520 billion edges.

I started by building on an existing PhD project written in Java, and I am now rewriting it in Rust, which yields a 2x to 6x performance gain.

Lessons learned include unexpected data structures, columnar file formats (ORC), annoying multi-threading bugs, and arguments for your future language flamewars.


Following Valentin’s talk, we will have a tutorial from 14:15 to 17:30 to get started with Software Heritage’s API. See the program below.

Tutorial: Querying Software Heritage Archive

With the help of Valentin Lorentz, we will explore the multiple ways data can be extracted from the Software Heritage archive: REST API, GraphQL API, remote queries of swh-graph via HTTP and gRPC, swh-graph direct API from Rust, Java and Python…

Make sure that you have followed these steps before the tutorial.

Program

  • List all directories and files from a given Git commit
  • List the URLs of known origins for a given Git commit
  • Search for origins
  • Query the swh-graph server on diverci
  • Write your own Rust package that uses swh-graph
  • Find the earliest revision/release containing a particular directory/content
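
As a warm-up for the first items of the program, here is a minimal Python sketch of how an object identifier (SWHID) could be turned into a REST API URL. It is an illustration under stated assumptions, not the tutorial's official material: the base URL and the "revision" and "directory" endpoint paths are assumed from the public archive's REST API, the helper names parse_swhid and api_url are hypothetical, and the hash in the usage note is a placeholder rather than a real commit. The sketch only builds URLs and does not contact the archive.

```python
import re

# Assumption: base URL of the public Software Heritage REST API.
API_ROOT = "https://archive.softwareheritage.org/api/1"

# A core SWHID looks like "swh:1:<type>:<40 lowercase hex digits>".
SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")

# Assumption: endpoint names for the two object types used in this sketch.
ENDPOINTS = {"rev": "revision", "dir": "directory"}


def parse_swhid(swhid):
    """Split a core SWHID into (object_type, hex_id), or raise ValueError."""
    match = SWHID_RE.fullmatch(swhid)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {swhid!r}")
    return match.group(1), match.group(2)


def api_url(swhid):
    """Build the (assumed) REST API URL describing a revision or directory."""
    obj_type, obj_id = parse_swhid(swhid)
    if obj_type not in ENDPOINTS:
        raise ValueError(f"object type {obj_type!r} is not handled by this sketch")
    return f"{API_ROOT}/{ENDPOINTS[obj_type]}/{obj_id}/"
```

For example, api_url("swh:1:rev:" + "a" * 40) builds the URL of a (placeholder) Git commit. Under the same assumptions, the JSON returned for a revision would name its root directory, whose entries could then be listed by building the matching "dir" URL.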

Resources