This is the second part of the post on the talk I gave at the 100th meeting of BangPypers. In the first part, I discussed how I extracted data from the BangPypers mailing list and prepared statistics such as top posters, email counts over the years and thread statistics. In this post we will look at how these statistics can be visualized.

Preparing the data

You can prepare the data discussed in this post by checking out the bangypc repo and running the script prepare_data.py in a Python 3 virtual environment. The script fetches the ML archives, classifies them by year, extracts the emails, builds the statistics and writes them out to a few JSON files.

After the script runs, you will find that the following files have been created.

  1. global_stats_year.json - Python dictionary of total emails per year with year as the key in the form YYYY.
  2. global_stats_month.json - Python dictionary of total emails per month, per year with the key in the form of YYYY/M.
  3. global_stats_sender.json - Python dictionary of posters for the entire life of the mailing list, with the sender’s email address as key and their total email count as value.
  4. global_thread_stats.json - Python dictionary with Email thread ids as keys and number of emails per thread as value. Thread ids are in the form YYYY/Month-Name/Message-ID.
  5. global_thread_graph.json - Python dictionary with Email thread ids as keys and a list of thread participants (email addresses) as values.
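The key formats listed above are easy to pick apart in code. Here is a minimal sketch, assuming the keys follow exactly the forms described in the list. The helper names and the sample keys are made up for illustration and are not part of the repo.

```python
def split_month_key(key):
    """Split a 'YYYY/M' key into an integer (year, month) pair."""
    year, month = key.split('/')
    return int(year), int(month)


def split_thread_id(thread_id):
    """Split a 'YYYY/Month-Name/Message-ID' thread id into its three parts.

    maxsplit=2 keeps any '/' inside the Message-ID part intact.
    """
    year, month_name, message_id = thread_id.split('/', 2)
    return int(year), month_name, message_id


# Hypothetical sample keys, for illustration only
print(split_month_key('2009/7'))
print(split_thread_id('2009/July/000123'))
```

Splitting the keys once like this avoids repeated string-prefix checks when grouping the data by year or month.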

Let us get to the visualizations! We will be using the versatile matplotlib library for this.

Requirements

If you plan to try out the code here, you need to install the matplotlib, pandas and networkx packages.

$ pip install matplotlib pandas networkx

Emails per year

The emails per year can be plotted as a Bar Graph. The code for this is shown below.

import json
import matplotlib.pyplot as plt
import numpy as np

def plot_emails_years(filename='global_stats_year.json'):
    """ Plot emails over years as a bar chart """

    with open(filename) as f:
        data = json.load(f)
    years = list(data.keys())
    emails = list(data.values())

    plt.bar(years, emails, align='center', alpha=0.8, edgecolor='000',
            color=randomize_colors(len(years)))
    # Pass rotation as a number; recent matplotlib versions may reject the string '45'
    plt.xticks(np.arange(len(years)), rotation=45)
    plt.ylabel('Email')
    plt.xlabel('Year')
    plt.grid(True)
    plt.title('Year-wise Emails')

    plt.show()

To make the graph a bit dynamic, we use a function randomize_colors to generate random colors for the bars every time the graph is created. This function is shown below.

import random

def randomize_colors(length):
    """ Create an array of random colors of given length """

    # Six random hex digits (0-9, a-f) per color
    return ['#' + ''.join(random.choice('0123456789abcdef') for _ in range(6))
            for _ in range(length)]

If you now run the plotting function, either as a script or in a Jupyter notebook, it will plot the emails per year as a bar graph as shown below.

Year-wise Emails Bar Graph

You can see that 2009 was the busiest year of the ML, with close to 2500 posts.

Top Posters

The top posters of the ML over its entire existence can be plotted as a Pie-chart. Here is the code for it.

import collections

def plot_top_posters_pie(filename='global_stats_sender.json', count=10, show_all=False):
    """ Plot top posters as a pie chart """

    with open(filename) as f:
        data = json.load(f)
    counter = collections.Counter(data)
    total_emails = sum(counter.values())
    print("Total emails =>",total_emails)
    top = dict(counter.most_common(count))
    
    labels = list(top.keys())
    nums = list(top.values())
    
    total_nums = sum(nums)
    total_rest = total_emails - total_nums
    print('Total at top =>',total_nums)
    
    if show_all:
        nums.append(total_rest)
        labels.append('Rest')
    
    plt.pie(nums, labels=labels,autopct='%1.2f',startangle=90, labeldistance=1.02,
            shadow=True, counterclock=True, colors=randomize_colors(len(nums)))

    plt.show()

Here is the Pie-chart this generates.

Top posters

You can see the email address noufal@gmail.com is the top Poster with about 23% share of all emails.

However, this is not the true picture of the mailing list as this chart only shows the top n posters and the percentage share is divided between them. To show how their posts contribute to the total share, the function can be executed with the show_all option set to True.

Here are the top 10 posters and their percent share of the entire mailing list as a Pie chart.

Top posters share

You can now see that the real share of the top Poster is around 7% of all emails. The top 10 posters make up around 30% of all emails posted to the group!
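The difference between the two charts is just a change of denominator: the first divides by the emails of the top n posters, the second by all emails. Here is a small sketch of that arithmetic, assuming the same sender-to-count dictionary as in global_stats_sender.json. The function name and the sample counts are made up for illustration and are not the real mailing list numbers.

```python
import collections


def poster_shares(data, count=10):
    """Return the top poster's share within the top `count` posters and
    within all posters, plus the top group's share of all emails (percent)."""
    counter = collections.Counter(data)
    total_emails = sum(counter.values())
    top = counter.most_common(count)
    top_total = sum(n for _, n in top)
    _, top_poster_count = top[0]
    return {
        'share_within_top': 100.0 * top_poster_count / top_total,
        'share_of_all': 100.0 * top_poster_count / total_emails,
        'top_group_share_of_all': 100.0 * top_total / total_emails,
    }


# Made-up counts, for illustration only
sample = {'a@example.com': 30, 'b@example.com': 100,
          'c@example.com': 70, 'd@example.com': 800}
print(poster_shares(sample, count=2))
```

The first value corresponds to what the pie chart shows without the "Rest" slice, and the second to what it shows with show_all set to True.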

Plotting Common Email Frequencies

We just saw, by way of the last two Pie charts, that the mailing list is more or less dominated by a few posters who make up around 30% of the total emails. So is there a way to find out what the "Rest" of the people were doing?

Yes, this can be done by plotting the most common email frequencies in the list and trying to find a correlation. Frequencies are best shown using a Histogram.

Email Frequency Histogram

The email frequencies can be plotted as a Histogram using the following function.

from matplotlib.ticker import PercentFormatter

def plot_post_frequency(filename='global_stats_sender.json'):
    """ Plot email frequencies as a Histogram """

    with open(filename) as f:
        data = json.load(f)
    # One entry per sender: the number of emails they posted
    counts = list(data.values())
        
    plt.hist(counts, bins=1000, density=True, facecolor='g')
    plt.xlabel('Count')
    plt.xlim(-10, 100)
    plt.grid(True)
    plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
    plt.ylabel('Percentage')
    plt.show()

This generates the following plot.

Email frequency Histogram

The Histogram tells us that nearly 45% of the members of the mailing list have posted just one email to the group. This means there is (was) a kind of core group inside the list that is very active and takes part in most discussions, surrounded by a rather non-participating outer layer of more passive members.
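A figure like this can also be cross-checked by tallying the counts directly. With density=True the bars are scaled so that the total area is 1, which means bar heights depend on the bin width, so an exact tally is a useful sanity check. The sketch below assumes the same sender-to-count dictionary as before; the function name and sample data are hypothetical.

```python
import collections


def posting_frequency(data):
    """Map each per-sender email count to the fraction of senders with that count."""
    freq = collections.Counter(data.values())
    total_senders = len(data)
    return {count: n / total_senders for count, n in sorted(freq.items())}


# Made-up counts, for illustration only
sample = {'a@example.com': 1, 'b@example.com': 1,
          'c@example.com': 3, 'd@example.com': 12}
print(posting_frequency(sample))
```

The value stored under key 1 is the fraction of members who posted exactly one email.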

We will now look at the Thread level statistics and Network graphs, which will support this hypothesis.

Untangling the Threads

The activity of a community is directly proportional to the frequency and depth of discussions in it. In a mailing list, such discussions can be seen most clearly as threads.

Threads are created when someone posts a topic of interest and many more people join the conversation, leading to a long email chain of discussions centered on a single topic. Sometimes threads branch off into sub-threads discussing related topics, and often they spawn entirely new threads.

When we prepared the data, we generated and saved such thread data in a couple of JSON files. We will now use these files to visualize thread statistics.

Top Email Threads

The following function plots a Bar graph showing year-wise thread counts with cut-off for a thread at 10 messages. In other words, it counts all mailman threads in the list which had a post count of 10 or more.

def plot_emails_threads(filename='global_thread_stats.json', cutoff=10):
    """ Plot year vs # threads with email count >= cutoff in the entire year """

    with open(filename) as f:
        data = json.load(f)

    years = []
    values = []

    for year in range(2007, 2020):
        year_total = 0
        years.append(str(year))
        
        for key in data.keys():
            if key.startswith(str(year)):
                val = data[key]
                if val >= cutoff:
                    year_total += 1
                
        values.append(year_total)

    print(dict(zip(years,values)))
    plt.bar(years, values, align='center', alpha=0.8, edgecolor='000',
            color=randomize_colors(len(years)))
    # Pass rotation as a number; recent matplotlib versions may reject the string '45'
    plt.xticks(np.arange(len(years)), rotation=45)
    plt.ylabel('Large Threads (>={})'.format(cutoff))
    plt.xlabel('Year')
    plt.grid(True)
    plt.title('Year-wise top threads')

    plt.show()

Running this creates a graph that looks like the one below (of course, the colors will vary!)

Email Threads Bar Graph

Again, 2009 proves to be an active year with 56 discussions showing a thread email count of 10 or more. Notice how the thread activity decreases towards the last few years along with email activity. This is kind of expected as the community began to move away from the mailing list towards other platforms as the group expanded and the range of activities broadened.
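If you only need the numbers and not the plot, the same tally can be done in a single pass over the thread dictionary. This is a sketch of an alternative, not the code used for the graph above; it assumes thread ids start with the year as described earlier, and the function name and sample ids are made up.

```python
import collections


def large_threads_per_year(data, cutoff=10):
    """Count threads with at least `cutoff` emails, grouped by the year
    prefix of the thread id."""
    totals = collections.Counter()
    for thread_id, email_count in data.items():
        if email_count >= cutoff:
            year = thread_id.split('/', 1)[0]
            totals[year] += 1
    return dict(sorted(totals.items()))


# Made-up thread ids and counts, for illustration only
sample = {'2009/July/0001': 12, '2009/August/0002': 3, '2010/May/0003': 10}
print(large_threads_per_year(sample))
```

Unlike the plotting function, this version leaves out years that have no large threads instead of reporting them as zero.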

Let us wrap up this post with a look at another very interesting visualization - the growth of a community visualized as a network graph.

Community visualization using Network graphs

As said above, email threads are excellent parameters for visualizing the health and activity of a community that lives on a mailing list. However, rather than plotting simple Bar graphs, we can use this data to plot the organic growth and activity of the community as a Network graph.

Here is the basic idea behind this.

  1. We assume a Poster is a Node in the graph. We use the poster’s email address as the unique ID for the node.
  2. When posters take part in an email thread, we use the email as an edge between posters taking part in the discussion.
  3. We draw an edge from the original poster of an email thread (the person who started the thread) to each and every other person involved in the thread.
  4. We then plot the nodes and edges as a Network graph.

We have already captured this data in a JSON file during the data preparation stage. Now we use the excellent networkx package to build a Graph with the nodes and edges discussed above and plot it.

Here is the code.

import networkx as nx

def plot_thread_network(filename='global_thread_graph.json',year=2009,month=None, save=False):
    """ Plot email threads and participants as a network graph """

    with open(filename) as f:
        data = json.load(f)

    G = nx.Graph()
    nodes = {}
    edges = []
    count = 0
    
    # A dense core with lots of edges indicate a very active year
    for key in data.keys():
        if key.startswith(str(year) + '/'):
            # Skip threads from other months when a month name is given
            if month is not None and '/' + month + '/' not in key:
                continue
            # Each participant is a node.
            participants = data[key]
            if len(participants) == 0: continue

            for person in participants:
                if person not in nodes:
                    # Add a node
                    G.add_node(person)
                    nodes[person] = 1
                    count += 1

            # Each thread defines edges.
            # First sender is the thread creator
            # Add edge from him/her to all else
            # in the thread
            sender = participants[0]
            for person in participants[1:]:
                # Dont add edge to same node
                if person != sender:
                    edge = (sender, person)
                    G.add_edge(*edge)
    
    options = {'node_size': 10,
               'width': 1}
    
    print(count)
    nx.draw(G, with_labels=False, font_weight='normal',
            node_color=range(count), cmap=plt.cm.Reds, **options)
    plt.text(-0.05, 0.75, str(year), size=20)
    if save: plt.savefig("network{}.png".format(year % 2005))

Here is how the community-as-a-network-graph looks in 2009.

Community Visualized as a Graph - 2009

You can see there is a very dense core containing lots of edges, surrounded by a sparse outer layer where people mostly hang around by themselves. The dense core represents high-frequency threads and the less dense outer layer represents occasional participants - the 45% who post one or two messages and then lurk or watch from the sidelines.
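The core and the outer layer can also be told apart numerically by counting how many edges each person has. The following is a rough sketch in plain Python, assuming the same thread-to-participants dictionary as in global_thread_graph.json and the same starter-to-participant edges as the plotting function. The function names and sample data are hypothetical; with networkx, the degrees of the graph G would give the same information.

```python
import collections


def thread_star_edges(threads):
    """Build undirected edges from each thread starter to every other participant."""
    edges = set()
    for participants in threads.values():
        if not participants:
            continue
        starter = participants[0]
        for person in participants[1:]:
            if person != starter:
                # frozenset makes (a, b) and (b, a) the same edge
                edges.add(frozenset((starter, person)))
    return edges


def degree_summary(threads):
    """Return (degree per connected person, sorted list of isolated people)."""
    people = set()
    for participants in threads.values():
        people.update(participants)
    degree = collections.Counter()
    for edge in thread_star_edges(threads):
        for person in edge:
            degree[person] += 1
    isolated = sorted(p for p in people if degree[p] == 0)
    return dict(degree), isolated


# Made-up threads, for illustration only
sample = {'2009/July/1': ['a', 'b', 'c'],
          '2009/July/2': ['b', 'a'],
          '2009/July/3': ['d']}
print(degree_summary(sample))
```

People with a high degree sit in the dense core, while the isolated list corresponds to the lone nodes in the outer layer.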

The inner core is very dense and has a lot of connections and nodes. Here is a visualization that zooms into the core a bit showing the dense connections. This is the core of the community at that point in time.

Core of the community - 2009

Contrast this with the network graph and the inner core respectively for the year 2019 when a lot of the discussion has moved away from the mailing list.

Here is the network-as-a-graph for year 2019.

Community Visualized as a Graph - 2019

And here is its inner core.

Core of the community - 2019

Now the inner core is very sparse and the outer layer seems to have more participants.

There are exactly 12 core nodes and 11 edges in the inner core. The outer layer is much denser than it was in 2009. The community itself has become thinner, with little core activity and most people just watching now.

As said before, this is not because the community has become weak, but most likely because the activity has spread to other platforms. In the case of this community, these seem to be the Meetup website, the website, Twitter and YouTube feeds etc. In fact, this kind of decentralization happens when a community becomes mature. In the case of BangPypers, this is most probably what happened.

Network Visualization

So how did the community network evolve from the beginning to the end of 2019? It would be tedious to show another 11 images here apart from the ones already shown. Instead, I am showing two concise but different graphic representations of the networks.

First is an overlay of the networks for every year from 2007 to 2019. This is just the graph plots of each year overlaid on one another to produce an aggregate.

Community Overlay 2009-2019

This graph represents all activity on the BangPypers mailing list from inception to end of 2019 in the form of a connected graph.

The next is a video depicting the graph for all years from 2007 to 2019. Note that this is not an overlay and shows how the community graph changed from year to year.

Online Communities

Communities which grow and become successful are not very different from the BangPypers mailing list. Here are some features of such communities which can be derived from the data and visualizations shown here.

  1. There is an almost sudden and explosive organic growth in the early years of the community. This happens as more and more people become aware of the community and join it at a rapid pace.
  2. The discussions are hectic and diverse in the early years due to the energy of the early pioneers who join and build the community. Core groups tend to form organically during such discussions. Whether such groups thrive or fight and die off depends on the policy, the openness and the nature of the people in the group. If it is the former, the group grows; otherwise the group stagnates.
  3. As the network visualizations show, the mere size of the group doesn't have a direct correlation with its activity. It is almost always a central core, and the sub-groups that form around this core, that perform the bulk of the activities/discussions. These nodes form what is known as a Community Structure, automatically organized around threads/topics.
  4. The activity of the central group can decrease, fluctuate around an average or increase over the years as the community matures or dies off. If it decreases, it may mean the group has become mature and diversified into other platforms (as likely happened here) or that it has become less active and possibly dormant.

References

The code repo has been updated with all code developed here including the Jupyter notebook used to prepare this post, along with the content presented during the talk.

