Can we see trends in the topics?
With the topic data and users identified, we can start to spot patterns and trends in the data, such as topics whose activity fluctuates from week to week or topics that reveal our community's usage patterns.
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read and combine Parquet files
combined_df = pd.DataFrame()
for file in os.listdir("parquet_output"):
    if file.endswith(".parquet"):
        file_path = os.path.join("parquet_output", file)
        try:
            df = pd.read_parquet(file_path)
            df['sent_at'] = pd.to_datetime(df['sent_at'], errors='coerce')
            combined_df = pd.concat([combined_df, df], ignore_index=True)
            print(f"Successfully read {file}")
        except Exception as e:
            print(f"Error reading {file}: {e}")
# Clean data
initial_count = combined_df.shape[0]
combined_df.dropna(subset=['sent_at'], inplace=True)
cleaned_count = combined_df.shape[0]
print(f"Dropped {initial_count - cleaned_count} rows due to invalid 'sent_at'.")
# Assign week start and label
combined_df['week_start'] = combined_df['sent_at'].dt.to_period('W').dt.start_time
combined_df['week_label'] = combined_df['week_start'].dt.strftime('Week of %Y-%m-%d')
# Aggregate distinct users
aggregated_df = combined_df.groupby(['week_start', 'week_label', 'topic'])['username'].nunique().reset_index(name='distinct_user_count')
# Pivot for heatmap
heatmap_pivot = aggregated_df.pivot(index='week_start', columns='topic', values='distinct_user_count').fillna(0)
heatmap_pivot.sort_index(inplace=True)
heatmap_pivot.index = heatmap_pivot.index.strftime('Week of %Y-%m-%d')
# Select top N topics
top_topics = aggregated_df.groupby('topic')['distinct_user_count'].sum().nlargest(20).index
heatmap_top = heatmap_pivot[top_topics]
# Plot heatmap
plt.figure(figsize=(20, 12))
sns.heatmap(
    heatmap_top,
    annot=True,
    fmt=".0f",
    cmap='rocket_r',
    linewidths=0.5,
    linecolor='gray',
    cbar_kws={'label': 'Number of Distinct Users'}
)
plt.title('Weekly Distinct Users for Top 20 Topics', fontsize=18)
plt.xlabel('Topic', fontsize=14)
plt.ylabel('Week', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
In the example above, we can see that many users are consistently using the COPR build system, but there is not enough traffic on the Discourse side for it to reach the top of the list. This could mean people are contributing custom packages using Fedora build infrastructure, but maybe not engaging with the community members who use these packages.
We can also see that Pagure activity was fairly consistent until recently, with around 5-9 projects being added a week; the recent departure from that pattern is likely spam.
This view may be useful down the road for tracking how the community is engaging with Fedora.
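The Pagure observation above (steady weekly counts, then a change that is probably spam) was read off the heatmap by eye. A rough way to make such a check repeatable is to compare each week with the median of the weeks just before it. The following is only a minimal sketch under assumptions: it expects a week-by-topic table shaped like `heatmap_pivot`, and the function name `flag_spikes`, the 4-week window, the factor of 2.0, and the demo column names and numbers are all made up for illustration rather than taken from the analysis.

```python
import pandas as pd


def flag_spikes(weekly_counts: pd.DataFrame, window: int = 4, factor: float = 2.0) -> pd.DataFrame:
    """Mark weeks where a topic's count exceeds `factor` times the median
    of the preceding `window` weeks (hypothetical helper, not from the post)."""
    # shift(1) so each week is compared only against earlier weeks
    baseline = weekly_counts.shift(1).rolling(window, min_periods=window).median()
    # Weeks without a full baseline (NaN) compare as False and are never flagged
    return weekly_counts.gt(baseline * factor) & baseline.gt(0)


# Made-up counts for illustration only: one steady topic, one with a late jump
demo = pd.DataFrame(
    {
        "pagure_demo": [6, 7, 5, 8, 7, 40],
        "copr_demo": [30, 32, 29, 31, 33, 34],
    },
    index=pd.date_range("2024-01-01", periods=6, freq="W-MON"),
)

spikes = flag_spikes(demo)
# Show only the weeks in which at least one topic was flagged
print(spikes[spikes.any(axis=1)])
```

With the real data, `flag_spikes(heatmap_pivot)` could be called before the index is converted to string labels. The window and factor would need tuning, and a flagged week is only a prompt to look closer, not proof of spam.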