Can we see if the community comes back?

With data available from the message bus and user context extracted, we can now start to make sense of this information and use it in interesting ways. In a past career, we focused heavily on how new hires were performing week on week to understand whether changes in training programs were making a difference. Since I worked in the BPO (business process outsourcing) field, that meant we were hiring large volumes of people every week.

In the Fedora context, understanding whether someone creates a FAS account and then uses it again could be interesting: do people create an account, post on Fedora Discussion once, and then never come back? How does engagement lead folks to become new packagers? How are we engaging people who want to understand our community? And can we do this without looking at individual users, but at the community as a whole?

The Jupyter notebook cell below uses the data processed with the grep2parquet Git repo, plus some preprocessing I've done to make this data available. Since it's not perfect, it may not capture everything, but it should show us whether people return.
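As a quick sanity check before building anything, here is a minimal sketch that loads one of those Parquet files and peeks at the columns the analysis relies on. The parquet_output directory name matches the cell further down; if your grep2parquet output landed elsewhere, adjust the path.

# Peek at one preprocessed Parquet file (the path is the same assumption as in the main cell)
import os
import pandas as pd

parquet_dir = "parquet_output"
sample_file = next(f for f in os.listdir(parquet_dir) if f.endswith(".parquet"))
sample_df = pd.read_parquet(os.path.join(parquet_dir, sample_file))

# The retention analysis relies on these three columns being present
print(sample_df[['topic', 'username', 'sent_at']].head())
print(f"{len(sample_df)} messages in {sample_file}")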

We use the extracted user Parquet files to create weekly groups: everyone who created a FAS account in the same week gets lumped together into a cohort. Grouping by cohort means we look at the data on a weekly basis based on when someone joined, so we can see whether changing something about how we engage makes a difference. Maybe a new event? Or a badge series for new joiners?

We then look forward in the data to see whether that user account ever returned. For each subsequent week, we check whether that username showed up on the bus for any reason and count it in that week.
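To make the cohort math concrete, here is a tiny standalone sketch with made-up timestamps (not real bus data) showing how a creation time maps to a cohort week and how later activity maps to a "weeks since cohort" offset:

# Toy example of the cohort bucketing (timestamps are invented for illustration)
import pandas as pd

created = pd.Timestamp("2024-06-05 14:30:00")        # when the FAS account was created
cohort_week = created.to_period("W").start_time       # Monday of that week becomes the cohort
later_activity = pd.Timestamp("2024-06-20 09:00:00")  # a later message from the same user
weeks_since_cohort = (later_activity - cohort_week).days // 7

print(cohort_week.strftime("Week of %m/%d"))  # -> Week of 06/03
print(weeks_since_cohort)                     # -> 2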

The cell below will produce the graphic:

# Import required libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime, timedelta

# Set the path to the directory containing the parquet files
parquet_dir = "parquet_output"

# Read all parquet files from the directory and combine them
frames = []
for file in os.listdir(parquet_dir):
    if file.endswith(".parquet"):
        file_path = os.path.join(parquet_dir, file)
        df = pd.read_parquet(file_path)  # Read parquet file
        df['sent_at'] = pd.to_datetime(df['sent_at'], errors='coerce').dt.floor('s')  # Parse 'sent_at'
        frames.append(df)

# Combine everything into a single DataFrame
combined_df = pd.concat(frames, ignore_index=True)

# Drop rows with invalid 'sent_at' timestamps
combined_df.dropna(subset=['sent_at'], inplace=True)

# Determine the maximum date in the data
max_date = combined_df['sent_at'].max().date()
print(f"Maximum date in data: {max_date}")

# Filter new users based on the topic 'org.fedoraproject.prod.fas.user.create'
# (.copy() avoids pandas' SettingWithCopyWarning when we add cohort columns below)
new_users_df = combined_df[combined_df['topic'] == 'org.fedoraproject.prod.fas.user.create'].copy()

# Assign cohorts based on the week of user creation
new_users_df['cohort_week'] = new_users_df['sent_at'].dt.to_period('W').dt.start_time
new_users_df['cohort_label'] = new_users_df['cohort_week'].dt.strftime('Week of %m/%d')

# Merge cohort info with all activities
activity_with_cohorts = combined_df.merge(
    new_users_df[['username', 'cohort_week', 'cohort_label']], 
    on='username', 
    how='inner'
)

# Calculate weeks since cohort creation
activity_with_cohorts['week_since_cohort'] = (
    activity_with_cohorts['sent_at'] - activity_with_cohorts['cohort_week']
).dt.days // 7

# Group by cohort and week to calculate returning users
weekly_activity = (
    activity_with_cohorts
    .groupby(['cohort_label', 'week_since_cohort'])['username']
    .nunique()
    .reset_index()
)

# Pivot to create a retention table
cohort_retention_table = weekly_activity.pivot(
    index='cohort_label', 
    columns='week_since_cohort', 
    values='username'
).fillna(0)

# Get cohort sizes (number of new users in each cohort)
cohort_sizes = new_users_df.groupby('cohort_label')['username'].nunique()

# Ensure that cohort retention matches the cohort sizes
cohort_retention_table = cohort_retention_table.reindex(cohort_sizes.index)

# Use the maximum date from the data instead of the current system date
current_date = max_date

print(f"Using current_date as: {current_date}")

# Create an annotated table with returned/size and N/A for invalid future cells
# (object dtype so string annotations can replace the numeric values)
annotated_table = cohort_retention_table.copy().astype(object)

for cohort in annotated_table.index:
    # Get the cohort start date
    cohort_start = pd.to_datetime(new_users_df[new_users_df['cohort_label'] == cohort]['cohort_week'].iloc[0]).date()
    for col in annotated_table.columns:
        current_week_date = cohort_start + timedelta(weeks=col)
        if current_week_date > current_date:
            annotated_table.loc[cohort, col] = "N/A"
        else:
            returned = cohort_retention_table.loc[cohort, col]
            size = cohort_sizes[cohort]
            annotated_table.loc[cohort, col] = f"{int(returned)}/{int(size)}"

# Normalize weekly activity by cohort size to get retention rates
retention_rate = cohort_retention_table.div(cohort_sizes, axis=0) * 100

# Replace future weeks with NaN so they don't render in the heatmap
for cohort in retention_rate.index:
    cohort_start = pd.to_datetime(new_users_df[new_users_df['cohort_label'] == cohort]['cohort_week'].iloc[0]).date()
    for col in retention_rate.columns:
        current_week_date = cohort_start + timedelta(weeks=col)
        if current_week_date > current_date:
            retention_rate.loc[cohort, col] = np.nan  # Use NaN for blank cells

# Limit to a reasonable number of weeks (e.g., 12 weeks)
retention_rate = retention_rate.iloc[:, :12]
annotated_table = annotated_table.iloc[:, :12]

# Plot the heatmap for weekly cohort retention rates with annotations
plt.figure(figsize=(16, 10))
sns.heatmap(
    retention_rate, 
    annot=annotated_table,  # Display returned/size or N/A
    fmt="",                 # No specific number formatting
    cmap="Blues", 
    cbar_kws={'label': 'Retention Rate (%)'},
    linewidths=0.5,
    linecolor='gray'
)
plt.title('Weekly Cohort Retention Over Time')
plt.xlabel('Weeks Since Cohort Creation')
plt.ylabel('Cohort Week')
plt.xticks(ticks=[i + 0.5 for i in range(12)], labels=[f'Week {i}' for i in range(12)])  # center labels on heatmap cells
plt.yticks(rotation=0)  # Keep cohort labels horizontal
plt.tight_layout()
plt.show()

The image above shows a 12-week view into this data, for cohorts from June 1st, 2024 to December 15th, 2024.

I ran this example on December 15th, 2024, so future weeks may be blank because they haven't happened yet, but it does show a pretty strong drop-off of activity after only 3 weeks. Of the users who create a Fedora FAS account, on average we retain only 5-6 of them after 12 weeks. What happened to the other 300? Also, in the weeks of 12/02 and 12/09 there are a lot of messages for invalid accounts. Are we getting more spam sign-ups than actual users? More questions to be answered.
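One quick way to poke at the spam question is to look at raw sign-up volume per week, sketched below using the new_users_df frame built in the cell above. Volume alone is only a hint of spam, not proof.

# Count distinct new FAS accounts per calendar week to spot sign-up spikes
signups_per_week = (
    new_users_df
    .groupby(new_users_df['sent_at'].dt.to_period('W'))['username']
    .nunique()
    .sort_index()
)
print(signups_per_week.tail(8))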

Let's filter out anyone who has fewer than 2 events after their creation date from the cohort groups. That means we'll have fewer users in each week, but they're more likely to be real people.

# Identify users who have at least 2 events after their account creation
# (exclude the creation message itself before counting)
post_creation = activity_with_cohorts[
    activity_with_cohorts['topic'] != 'org.fedoraproject.prod.fas.user.create'
]
valid_users = (
    post_creation.groupby('username')
    .size()
    .reset_index(name='event_count')
    .query('event_count >= 2')['username']
)

# Filter the activity data for valid users only
activity_with_cohorts = activity_with_cohorts[activity_with_cohorts['username'].isin(valid_users)]

When we remove those users, we still see a similar pattern, and the last two weeks still look noisier than the rest of the dataset, so maybe we're seeing more spam on the network. When I look up one of those usernames in the Parquet files, I see a lot of events happening really quickly. For privacy, I removed the username from this view and hid the ID.
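For anyone who wants to do the same lookup, here is a rough sketch of how to pull one account's trail out of the combined data; the username below is a placeholder, not a real account:

# Inspect every message for one username, ordered by time
# ("some_username" is a placeholder -- substitute an account from your own data)
one_user = (
    combined_df[combined_df['username'] == 'some_username']
    .sort_values('sent_at')[['sent_at', 'topic']]
)
print(one_user.to_string(index=False))

# Gaps between consecutive events; a burst of near-zero gaps suggests automation
print(one_user['sent_at'].diff().dropna().describe())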

It looks like, in the last few weeks, users have been signing up to create Pagure events. This likely relates to the remarks folks have been making that Pagure is getting spammed with issues. This may inspire a next round of review to understand whether we can spot spammers on the bus and help the infrastructure team.
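As a starting point for that review, here is a rough sketch that estimates how much of the recent cohorts' activity lands on Pagure topics; the four-week window and the simple 'pagure' substring match are arbitrary choices for illustration, not an established heuristic:

# Share of recent-cohort activity hitting Pagure topics
# (4-week window and substring match are arbitrary, illustrative choices)
recent_cutoff = max_date - timedelta(weeks=4)
recent_users = new_users_df[new_users_df['cohort_week'].dt.date >= recent_cutoff]['username']

recent_activity = activity_with_cohorts[activity_with_cohorts['username'].isin(recent_users)]
pagure_share = recent_activity['topic'].str.contains('pagure', case=False, na=False).mean()
print(f"Pagure-related share of recent-cohort messages: {pagure_share:.1%}")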