Tera-scale graph processing on the Software Heritage graph
by Valentin Lorentz
15/02/2024
DiverSE Coffee
Rennes, France
Abstract
I will present my work on the compressed graph representation of the Software Heritage archive, whose latest version features 34 billion nodes and 520 billion edges.
I started by building on top of someone’s PhD project (in Java), and am now rewriting it in Rust, leading to a 2x to 6x performance gain.
Lessons learned include unexpected data structures, columnar file formats (ORC), annoying multi-threading bugs, and arguments for your future language flamewars.
Following Valentin’s talk, we will have a tutorial from 14:15 to 17:30 to get started with Software Heritage’s API. See the program below.
Tutorial: Querying the Software Heritage Archive
With the help of Valentin Lorentz, we will explore the multiple ways data can be extracted from the Software Heritage archive: REST API, GraphQL API, remote queries of swh-graph via HTTP and gRPC, swh-graph direct API from Rust, Java and Python…
Make sure that you have followed these steps before the tutorial:
- Create an account on https://archive.softwareheritage.org/ so you can use the API with higher rate limits
- Send your account name to Valentin to get access to https://archive.softwareheritage.org/api/1/graph/
- Make sure you can ssh to the team’s dedicated machine
Program
- List all directories and files from a given Git commit
- List the URL of known origins for a given Git commit
- Search for origins
- Query the swh-graph server on diverci
- Write your own Rust package that uses swh-graph
- Find the earliest revision/release containing a particular directory/content
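As a rough preview of two of the items above (searching for origins, and listing the origin URLs of a commit), here is a minimal Python sketch that only builds the request URLs and sends nothing over the network. It assumes the public REST endpoint `origin/search` and the swh-graph `leaves` endpoint exposed under `/api/1/graph/`. The parameters `direction`, `return_types` and `resolve_origins` should be checked against the HTTP API documentation listed under Resources. The helper names and the all-zero SWHID are hypothetical placeholders, not part of any Software Heritage library.

```python
from urllib.parse import quote, urlencode

# Public Software Heritage API root (the graph endpoints need an authorized account).
ARCHIVE_API = "https://archive.softwareheritage.org/api/1"


def origin_search_url(pattern: str, limit: int = 10) -> str:
    """URL that searches origins whose URL matches `pattern` (hypothetical helper)."""
    query = urlencode({"limit": limit})
    return f"{ARCHIVE_API}/origin/search/{quote(pattern, safe='')}/?{query}"


def origins_of_revision_url(rev_swhid: str) -> str:
    """URL that walks the graph backward from a revision and keeps only origins.

    Assumption: `resolve_origins=true` asks the server to return origin URLs
    instead of origin SWHIDs; verify this against the swh-graph HTTP API docs.
    """
    params = {
        "direction": "backward",    # follow edges from the commit up to its origins
        "return_types": "ori",      # keep only origin nodes in the result
        "resolve_origins": "true",  # assumption, see docstring
    }
    return f"{ARCHIVE_API}/graph/leaves/{rev_swhid}/?{urlencode(params)}"


def auth_headers(token: str) -> dict:
    """HTTP header carrying an API bearer token generated from your account."""
    return {"Authorization": f"Bearer {token}"}


# Placeholder SWHID, not a real commit: replace it with the revision you care about.
PLACEHOLDER_REV = "swh:1:rev:" + "0" * 40

if __name__ == "__main__":
    print(origin_search_url("swh-graph", limit=5))
    print(origins_of_revision_url(PLACEHOLDER_REV))
```

The resulting URLs can be passed to any HTTP client (for example `curl -H "Authorization: Bearer <token>" <url>`). Building them separately from sending them makes it easy to inspect a query before it counts against your rate limit.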
Resources
- swh-graph documentation (excluding Java and Rust APIs)
- swh-graph Using the HTTP API (read first!)
- swh-graph Using the gRPC API
- swh-graph gRPC protocol full description
- swh-graph Rust API documentation (this is a temporary location for this workshop; it will be on docs.rs after we release swh-graph Rust)
- Data sources provided by SWH
- Software Heritage REST API
- Software Heritage GraphQL API
- sha1collisiondetection (as of now, it can’t be archived by SWH; see the relevant milestone)