grep2parquet

December 15, 2024

Within the Fedora Project, we have a Messaging Bus available for the various applications which are interacting within the community, sending out updates whenever something happens: package updates, builds, test results, forum posts, meetings, and more.

Coming from my data & reporting days, I know the priority for an organization is not necessarily getting the precise truth in every report. In many cases, an average or a figure with enough backing behind it is sufficient, and it lets the business focus on the more important question: what to do about the number.

In Fedora’s case, having access to the message bus allows us to tap into this knowledge and start answering questions about overall Community Health without requiring expensive integrations across the variety of platforms the project uses. Even better, the community already stores and logs this data for historical purposes in a tool called Datanommer, and makes it available to us via an HTTP REST API called Datagrepper.

What’s All This Data About?

Each message on the bus contains a variety of information mostly useful for other applications to take advantage of – but by topic we can see some important information start to emerge:

https://apps.fedoraproject.org/datagrepper/v2/id?id=1cf56046-167e-45b3-9f5c-4830720d6797&is_raw=true&size=extra-large

{
  "body": {
    "build": 8395718,
    "chroot": "fedora-rawhide-x86_64",
    "copr": "PyPI",
    "ip": "2620:52:3:1:dead:beef:cafe:c108",
    "owner": "@copr",
    "pid": 2480762,
    "pkg": "python-pytest-black",
    "status": 1,
    "user": "ksurma",
    "version": "0.4.0-1",
    "what": "build end: user:ksurma copr:PyPI build:8395718 pkg:python-pytest-black version:0.4.0-1 ip:2620:52:3:1:dead:beef:cafe:c108 pid:2480762 status:1",
    "who": "backend.worker-rpm_build_worker:8395718-fedora-rawhide-x86_64"
  },
  "headers": {
    "fedora_messaging_schema": "copr.build.end",
    "fedora_messaging_severity": 20,
    "fedora_messaging_user_ksurma": true,
    "priority": 0,
    "sent-at": "2024-12-15T17:09:56+00:00"
  },
  "id": "1cf56046-167e-45b3-9f5c-4830720d6797",
  "priority": 0,
  "queue": null,
  "topic": "org.fedoraproject.prod.copr.build.end"
}

In this message, we can tell from its topic that a COPR build just finished, that the message was sent on December 15th, 2024 at 5:09 PM UTC, and that user ksurma likely triggered it through a COPR build action. We also know the package in the message was “python-pytest-black”.
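To make that reading concrete, here is a minimal Python sketch that pulls the same fields out of a trimmed copy of the message above. The `summarize` helper and the output key names are my own for illustration; the `user` and `pkg` body keys are specific to this COPR message, and other topics will likely use different ones.

```python
import json
from datetime import datetime

# Trimmed copy of the message shown above; only the fields we read are kept.
RAW_MESSAGE = """
{
  "body": {"pkg": "python-pytest-black", "user": "ksurma", "status": 1},
  "headers": {"sent-at": "2024-12-15T17:09:56+00:00"},
  "id": "1cf56046-167e-45b3-9f5c-4830720d6797",
  "topic": "org.fedoraproject.prod.copr.build.end"
}
"""


def summarize(message: dict) -> dict:
    """Pick out the handful of fields discussed in the text (hypothetical helper)."""
    body = message.get("body", {})
    headers = message.get("headers", {})
    sent_at = headers.get("sent-at")
    return {
        "id": message.get("id"),
        "topic": message.get("topic"),
        # fromisoformat understands the "+00:00" offset used in the header.
        "sent_at": datetime.fromisoformat(sent_at) if sent_at else None,
        # These body keys belong to this COPR message; they may be missing
        # (and therefore None) for messages on other topics.
        "user": body.get("user"),
        "package": body.get("pkg"),
    }


summary = summarize(json.loads(RAW_MESSAGE))
print(summary["topic"], summary["user"], summary["package"], summary["sent_at"])
```

Using `.get()` throughout keeps the helper from failing on messages that lack one of these fields, which matters once many different topics are mixed together.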

If we wanted to start using this data, there’s already a trove of great information available to help answer questions like:

How many people are using COPR?
How many people (or bots) are sending messages?
How is our community growing? What services do they use?
And more…
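As a rough sketch of how the first two questions might be approached, the snippet below counts distinct users per service from a list of messages. It leans on two assumptions: that usernames can be read from header keys shaped like `fedora_messaging_user_ksurma` (as in the message above), and that topics follow an `org.fedoraproject.<env>.<service>...` layout. The sample messages are hand-written for illustration and are not real bus data.

```python
from collections import defaultdict

USER_HEADER_PREFIX = "fedora_messaging_user_"


def users_in(message: dict) -> set:
    """Usernames taken from headers like 'fedora_messaging_user_ksurma'."""
    headers = message.get("headers", {})
    return {
        key[len(USER_HEADER_PREFIX):]
        for key, value in headers.items()
        if key.startswith(USER_HEADER_PREFIX) and value
    }


def service_of(topic: str) -> str:
    """Assumption: topics look like org.fedoraproject.<env>.<service>.<...>."""
    parts = topic.split(".")
    return parts[3] if len(parts) > 3 else "unknown"


def distinct_users_by_service(messages) -> dict:
    """Map each service name to the number of distinct users seen on it."""
    seen = defaultdict(set)
    for message in messages:
        seen[service_of(message.get("topic", ""))].update(users_in(message))
    return {service: len(users) for service, users in seen.items()}


# Hand-written sample messages, for illustration only.
sample = [
    {"topic": "org.fedoraproject.prod.copr.build.end",
     "headers": {"fedora_messaging_user_ksurma": True}},
    {"topic": "org.fedoraproject.prod.copr.build.start",
     "headers": {"fedora_messaging_user_ksurma": True}},
    {"topic": "org.fedoraproject.prod.copr.build.end",
     "headers": {"fedora_messaging_user_alice": True}},
]
print(distinct_users_by_service(sample))
```

Telling people apart from bots would need more than this, for example a list of known service accounts, which I have left out here.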

How can we get access to this data?

To start thinking about this data we have a few ways users can get access to the message bus data:

Create a consumer application to listen on the bus and capture events (which is what Datanommer / Datagrepper do today)
Gain access to the Datanommer PostgreSQL database hosted on Fedora Infra
Consume the data from Datagrepper and make it available locally

Given where the community is today, creating a new tool and a new database just to examine community data isn’t a great way to go. We have the tools; we just need to get access to the data.

In Community Operations (CommOps), we’ve been looking to prioritize one of the last two options. The challenges with Infra access are:

Direct access to the PostgreSQL database is only available to infrastructure apprentices, meaning we have to place a high degree of trust in users to log in and use the production database. While I’ve done this myself, it’s made me nervous about how hard some of my queries might hit the system and how to scale this well; sharing passwords won’t work.
We also examined making a middle layer, like a BI platform, available to the community. Since the database is a full-fledged PostgreSQL database, we’ve looked at open source solutions like Metabase, Apache Superset, and even the Django SQL Explorer module. The problem is that hosting such an application near or next to this database is more of a challenge than we anticipated. While we are working with Infra to make a copy of Datanommer’s database in the cloud for the community to use, it’s likely months out or longer.

grep2parquet

That brings us to our last option, and to a quick tool I’ve thrown together to get some movement on this: “grep2parquet.” It fetches historical data from the Datagrepper REST API, basically getting a batch of messages by day, and transforms it into Parquet format, a common choice for data analysis that allows us to start doing something with this data.
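To give a feel for the transformation step, here is a minimal sketch of flattening raw messages into rows and writing them out as Parquet. This is not grep2parquet’s actual code: the column names, the `flatten` and `write_parquet` helpers, and the choice to keep the body as a JSON string are all assumptions made for illustration. Writing the file needs the third-party `pyarrow` package.

```python
import json


def flatten(message: dict) -> dict:
    """Turn one raw message into a flat row with a fixed set of columns."""
    headers = message.get("headers", {})
    return {
        "id": message.get("id"),
        "topic": message.get("topic"),
        "sent_at": headers.get("sent-at"),
        # Bodies differ per topic, so keep them as a JSON string rather than
        # forcing every possible key into its own column.
        "body_json": json.dumps(message.get("body", {}), sort_keys=True),
    }


def write_parquet(messages, path: str) -> None:
    """Write flattened rows to a Parquet file (requires pyarrow)."""
    # Imported here so flatten() stays usable without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    rows = [flatten(message) for message in messages]
    pq.write_table(pa.Table.from_pylist(rows), path, compression="zstd")
```

A call such as `write_parquet(messages, "2024-12-15.parquet")` (the file name is only an example) would then produce one file per day, which matches the batch-by-day fetching described above.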

The problem with this approach is it can be inequitable for some community members:

You must have enough disk space to store a copy of the data you need. Since you have to pull raw messages, this can grow quickly (a single day is around 50 MB of compressed data).
You must have enough bandwidth to download this much data. Since it’s a large volume, it might take some time just to get started.

This also puts a drain on Fedora’s resources, because we’re sending a lot of requests to the REST API in rapid succession: we can only get 100 messages at a time from the bus. For now, this approach works, but in the future I want to examine whether we can store pre-downloaded files somewhere in Fedora infra, which would require less machine processing and let users fetch just what they need.
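One way to soften the load is to pause between pages. The sketch below walks a time window 100 messages at a time and sleeps between requests. The search endpoint URL, the query parameter names (`start`, `end`, `page`, `rows_per_page`), and the response keys (`raw_messages`, `pages`) are my assumptions about the Datagrepper API and should be checked against its documentation; the helper names and the one-second pause are likewise illustrative.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint and parameter names; verify against the Datagrepper docs.
SEARCH_URL = "https://apps.fedoraproject.org/datagrepper/v2/search"
PAGE_SIZE = 100  # the per-request cap mentioned in the text


def fetch_page(start: str, end: str, page: int) -> dict:
    """Fetch one page of messages between two ISO timestamps."""
    query = urllib.parse.urlencode({
        "start": start,
        "end": end,
        "page": page,
        "rows_per_page": PAGE_SIZE,
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}", timeout=60) as resp:
        return json.load(resp)


def iter_messages(start, end, fetch=fetch_page, pause=1.0):
    """Yield every message in the window, pausing between requests."""
    page = 1
    while True:
        data = fetch(start, end, page)
        # Assumed response keys: "raw_messages" (list) and "pages" (int).
        yield from data.get("raw_messages", [])
        if page >= data.get("pages", 0):
            break
        page += 1
        time.sleep(pause)  # be gentle with the shared service
```

Passing the fetch function in as a parameter makes it easy to swap in a cached or fake source while experimenting, so the real API is only hit when needed.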

For now, this at least lets us start moving, and some progress is better than none. I hope to share some updates soon on next steps.