BangPypers Cover Logo

On Feb 22, 2020, I was invited to give the opening talk at the 100th meeting of BangPypers, the Python community I had founded in 2005. It is not every day that a community you helped bootstrap thrives for 15 years and you get the chance to address its 100th meeting.

It was indeed a unique experience and I wanted to prepare a unique talk for the event.

Background

The BangPypers community has a presence on multiple web platforms at present - the primary ones being the meetup site, the Twitter handle and the website. However, it had very humble origins in a Yahoo! group (now defunct) in Feb 2005, and later, from Sep 2007, as a mailing list on Python.org's Mailman server, which plays host to a number of such community groups.

The mailman archives are public and I thought it might be a good idea to analyze data from the archives over the years and present a talk which shows the growth, activity and dynamics of the group as graphs, perhaps with some topic modeling thrown in. This is what I ended up doing.

The task was not a minor one, and it took me around two weeks in total to collect the mailing list archives, extract the relevant data, perform the analysis and prepare the graphs. In this post and the next two, I will cover this in some detail with generous dollops of code thrown in. All done in Python, of course!

Mailman Archives

The Mailman Archives for the group were my starting point. For a given year and month, the gzipped archive is available (for this group) at the URL https://mail.python.org/pipermail/bangpypers/[year]-[month-name].txt.gz . For example, the archive for Oct 2007 is at the URL https://mail.python.org/pipermail/bangpypers/2007-October.txt.gz .

To get started on analyzing the archives, one has to download them first. This was the first step of the data preparation.

All code is in Python 3.

Downloading the Archives

The following Python function downloads all archives of BangPypers from 2007 till 2019 and writes them to the current folder.

import os
import calendar

import requests

def download_archives(start=2007, end=2019):
    """ Download all monthly archives between the given start and end years """

    for y in range(start, end + 1):
        for m in range(1, 13):
            month = calendar.month_name[m]
            url = 'https://mail.python.org/pipermail/bangpypers/{}-{}.txt.gz'.format(y, month)
            f = requests.get(url)

            if f.status_code == 200:
                print('Downloaded for {} {}'.format(month, y))
                # Write the gzipped archive and decompress it in place
                fname = url.rsplit('/')[-1]
                with open(fname, 'wb') as out:
                    out.write(f.content)
                os.system('gunzip {}'.format(fname))
            elif f.status_code == 404:
                print('Not found for {} {}'.format(month, y))

Once this function is executed, all archives from September 2007 to December 2019 are downloaded.
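A small aside: the function above shells out to gunzip, which assumes a Unix-like system. As an alternative sketch (not part of the original script), the same decompression can be done portably with the standard library's gzip module:

```python
import gzip
import shutil

def decompress_archive(gz_path):
    """ Decompress a downloaded .txt.gz archive in pure Python,
    instead of shelling out to gunzip """

    txt_path = gz_path[:-3]   # strip the trailing '.gz'
    with gzip.open(gz_path, 'rb') as f_in, open(txt_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return txt_path
```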

Classifying Archives by Year

To prepare year-wise and month-wise statistics and to plot graphs, I next had to group the archives by year. That was done by the following function.

import os
import shutil
import calendar

def classify_year(root='archives', start=2007, end=2019):
    """ Move the files and arrange according to years """

    for y in range(start, end+1):
        folder = os.path.join(root, str(y))
        if not os.path.isdir(folder): os.makedirs(folder)
        
        for m in range(1, 13):
            month = calendar.month_name[m]
            fname = '{}-{}.txt'.format(y, month)
            if os.path.isfile(fname):
                print('Moving {} to {}/{}'.format(fname, folder, fname))
                # Move it to the correct folder
                shutil.move(fname, folder)

Once this script ran, all the archives were grouped inside a folder named archives in the current folder, year-wise, each year folder containing the monthly archives for that year. Note that each archive is a text file.

$ ls archives/
2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019
$ ls -l archives/2010/
total 3336
-rw-rw-r-- 1 anand anand 391763 Mar  1 02:15 2010-April.txt
-rw-rw-r-- 1 anand anand 293335 Mar  1 02:15 2010-August.txt
-rw-rw-r-- 1 anand anand 185996 Mar  1 02:15 2010-December.txt
-rw-rw-r-- 1 anand anand 377002 Mar  1 02:15 2010-February.txt
-rw-rw-r-- 1 anand anand 309562 Mar  1 02:15 2010-January.txt
-rw-rw-r-- 1 anand anand 265406 Mar  1 02:15 2010-July.txt
-rw-rw-r-- 1 anand anand 469227 Mar  1 02:15 2010-June.txt
-rw-rw-r-- 1 anand anand 168021 Mar  1 02:15 2010-March.txt
-rw-rw-r-- 1 anand anand 286865 Mar  1 02:15 2010-May.txt
-rw-rw-r-- 1 anand anand 154441 Mar  1 02:15 2010-November.txt
-rw-rw-r-- 1 anand anand 317091 Mar  1 02:15 2010-October.txt
-rw-rw-r-- 1 anand anand  73176 Mar  1 02:15 2010-September.txt

Extracting Information

Each monthly Mailman archive is a single text file containing all emails of that month, with complete headers. For example, this is how the archive of September 2007 - the very first month of the mailing list - starts.

From abpillai at gmail.com  Fri Sep 14 10:18:24 2007
From: abpillai at gmail.com (Anand Balachandran Pillai)
Date: Fri, 14 Sep 2007 13:48:24 +0530
Subject: [BangPypers] Testing the list
Message-ID: <8548c5f30709140118t7eaf44b8y995d5a9073300018@mail.gmail.com>

Testing the list. Please ignore...

Thanks

-- 
-Anand

From stylesen at gmail.com  Fri Sep 14 10:22:09 2007
From: stylesen at gmail.com (Senthil Kumaran S)
Date: Fri, 14 Sep 2007 13:52:09 +0530
Subject: [BangPypers] Why is Bangpypers list private?
Message-ID: <5ef2113c0709140122m977e1beob6ca34b56d37668@mail.gmail.com>

Hi,

I would like to know, why bangpypers list in
http://mail.python.org/mailman/ is marked as a private list? Is there
any specific reason? If not, We need to make the list as a public
list, since anyone who wants to join the list must be able to do so
after browsing through the archives. Also it is good to open our list
for search engines :)

-- 
Senthil Kumaran S
http://www.stylesen.org

<snip>
...

To extract individual emails from this single text file, one needs the help of regular expressions.

As you can see, each email starts with two From lines - one containing a timestamp and the next, the name of the poster. Both lines contain the email address of the poster in the obfuscated form sender at domain.tld.

Following this grammar, the following regular expression can be prepared, which locates the From lines of an email and thereby marks its boundary.

from_sender_regex = re.compile(r'From\s([^\s]+\sat\s[^\.]+\.[a-z0-9]+)(.*)\nFrom\:\s[^\s]+\sat\s[^\.]+\.[a-z0-9]+',
                               re.MULTILINE|re.UNICODE)

This regular expression defines a group which allows one to extract the email address of the sender.

Every email in Mailman has a unique Message-ID which is very useful as a unique identifier for an email. The following regular expression helps to identify and extract the message-id in each email.

message_id_regex = re.compile(r'Message\-ID\:\s+\<([^<>]+)\>')
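As a quick sanity check, both regular expressions can be run against the header of the very first email shown above (a standalone sketch - the regexes are redefined here as raw strings):

```python
import re

from_sender_regex = re.compile(r'From\s([^\s]+\sat\s[^\.]+\.[a-z0-9]+)(.*)\nFrom\:\s[^\s]+\sat\s[^\.]+\.[a-z0-9]+',
                               re.MULTILINE | re.UNICODE)
message_id_regex = re.compile(r'Message\-ID\:\s+\<([^<>]+)\>')

# Header of the first email in the September 2007 archive
sample = (
    'From abpillai at gmail.com  Fri Sep 14 10:18:24 2007\n'
    'From: abpillai at gmail.com (Anand Balachandran Pillai)\n'
    'Date: Fri, 14 Sep 2007 13:48:24 +0530\n'
    'Subject: [BangPypers] Testing the list\n'
    'Message-ID: <8548c5f30709140118t7eaf44b8y995d5a9073300018@mail.gmail.com>\n'
)

# The first group of the sender regex is the obfuscated address
sender = from_sender_regex.findall(sample)[0][0].replace(' at ', '@')
msg_id = message_id_regex.findall(sample)[0]
print(sender)   # abpillai@gmail.com
print(msg_id)   # 8548c5f30709140118t7eaf44b8y995d5a9073300018@mail.gmail.com
```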

Sender, Monthly and Yearly Statistics

Using the from_sender_regex regular expression discussed above, it is possible to write a function that retrieves sender and email statistics from the mailing list over the years and writes them to JSON files.

import os
import calendar
import re
import json
from collections import defaultdict

from_sender_regex = re.compile(r'From\s([^\s]+\sat\s[^\.]+\.[a-z0-9]+)(.*)\nFrom\:\s[^\s]+\sat\s[^\.]+\.[a-z0-9]+',
                                re.MULTILINE|re.UNICODE)

def extract_email_stats(root='archives', start=2007, end=2019):
    """ Extract emails stats from archives """

    year_email_stats = defaultdict(int)
    month_email_stats = defaultdict(int)
    sender_stats = defaultdict(int)
    
    # Use "From <sender> at <origin><dot><tld>" as a regex to separate emails in text file
    for y in range(start, end+1):
        n_emails = 0
        for m in range(1, 13):
            month = calendar.month_name[m]
            fname = os.path.join(root, str(y), '{}-{}.txt'.format(y, month))
            if os.path.isfile(fname):
                data = open(fname, 'rb').read().decode('latin-1')
                # Find all emails using the sender regex
                from_parts = from_sender_regex.findall(data)
                # Count of emails
                n_emails = len(from_parts)

                # Extract sender email addresses
                senders = [x[0].replace(' at ','@').strip() for x in from_parts]
                for sender in senders:
                    sender_stats[sender] += 1
                    
                print('{} emails from {}'.format(n_emails, fname))
                month_email_stats['/'.join((str(y), str(m)))] = n_emails
                year_email_stats[str(y)] += n_emails

    json.dump(month_email_stats, open('global_stats_month.json','w'), indent=4)
    json.dump(year_email_stats, open('global_stats_year.json','w'), indent=4)
    json.dump(sender_stats, open('global_stats_sender.json','w'), indent=4)

The above function parses the email archives and, using the sender regex, gets a count of all emails sent in each month. It adds the email counts to separate per-year and per-month dictionaries.

It also extracts the sender addresses and increments a per-sender count of emails in a third dictionary. At the end of the function, these dictionaries are dumped to JSON files.
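Once the JSON dumps exist, they are easy to interrogate. For example, here is a small helper (not part of the original script) that lists the most prolific posters from the per-sender dump:

```python
import json

def top_senders(path='global_stats_sender.json', n=5):
    """ Return the n most prolific posters from the per-sender
    JSON dump written by extract_email_stats """

    stats = json.load(open(path))
    # Sort (sender, count) pairs by descending email count
    return sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[:n]
```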

Threads in Mailman

If a message is sent as a reply to another post, it starts an email thread in Mailman. Mailman identifies threads using the References field in the email header: the Message-ID of the original email being replied to is referenced in this field.

For example,

From stylesen at gmail.com  Fri Sep 14 10:34:17 2007
From: stylesen at gmail.com (Senthil Kumaran S)
Date: Fri, 14 Sep 2007 14:04:17 +0530
Subject: [BangPypers] Why is Bangpypers list private?
In-Reply-To: <8548c5f30709140129p419dc716uacaa69d747bd2876@mail.gmail.com>
References: <5ef2113c0709140122m977e1beob6ca34b56d37668@mail.gmail.com>
    <8548c5f30709140129p419dc716uacaa69d747bd2876@mail.gmail.com>
Message-ID: <5ef2113c0709140134u2074d126y5a97114524d76d19@mail.gmail.com>

On 9/14/07, Anand Balachandran Pillai <abpillai at gmail.com> wrote:
> Sen, this was because Jeff created it like that. I have changed
> it so that the list is visible to anyone, not just the members.

Thank you for that :)

> However email addresses will be slightly obfuscated so that
> we don't expose ourselves to spamming.

Yeah, this is important.

This is the very first thread in the mailing list. It references the following email.

From stylesen at gmail.com  Fri Sep 14 10:22:09 2007
From: stylesen at gmail.com (Senthil Kumaran S)
Date: Fri, 14 Sep 2007 13:52:09 +0530
Subject: [BangPypers] Why is Bangpypers list private?
Message-ID: <5ef2113c0709140122m977e1beob6ca34b56d37668@mail.gmail.com>

Hi,

I would like to know, why bangpypers list in
http://mail.python.org/mailman/ is marked as a private list? Is there
any specific reason? If not, We need to make the list as a public
list, since anyone who wants to join the list must be able to do so
after browsing through the archives. Also it is good to open our list
for search engines :)

-- 
Senthil Kumaran S
http://www.stylesen.org

In other words, the above email was the first one that started this thread.

Hence, by finding and parsing the References field, one can build the full picture of an email thread from the Mailman archives - including the posters, their email addresses, the size of the thread (the number of messages in it) and sub-threads, if any.

The following regex helps to identify the References field.

reference_id_regex = re.compile(r'References\:\s+\<([^<>]+)\>')
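Run against the reply shown above, the regex yields the Message-ID of the thread starter. Note that it captures only the first entry in the References list - the thread root - which is exactly what is needed for a thread key:

```python
import re

reference_id_regex = re.compile(r'References\:\s+\<([^<>]+)\>')

# Header of the reply from the first thread shown above
reply_header = (
    'Subject: [BangPypers] Why is Bangpypers list private?\n'
    'In-Reply-To: <8548c5f30709140129p419dc716uacaa69d747bd2876@mail.gmail.com>\n'
    'References: <5ef2113c0709140122m977e1beob6ca34b56d37668@mail.gmail.com>\n'
    '    <8548c5f30709140129p419dc716uacaa69d747bd2876@mail.gmail.com>\n'
)

# Only the first <...> after "References:" is captured - the thread root
thread_root = reference_id_regex.findall(reply_header)[0]
print(thread_root)  # 5ef2113c0709140122m977e1beob6ca34b56d37668@mail.gmail.com
```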

Putting it all together

Here is the code which puts together the pieces discussed above and extracts the individual emails from each monthly archive. Each email is written to a file named after the MD5 hash of its Message-ID. Apart from writing out individual emails, the code also builds thread statistics in two dictionaries, which are written out as JSON dumps.
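Before the full listing, the filename scheme in isolation - the Message-ID is hashed with MD5 and given an .eml extension:

```python
import hashlib

# Message-ID of the first email in the archive, shown earlier
msg_id = '8548c5f30709140118t7eaf44b8y995d5a9073300018@mail.gmail.com'
fname = hashlib.md5(msg_id.encode('utf-8')).hexdigest() + '.eml'
print(fname)  # a 32-hex-digit digest followed by '.eml'
```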

import os
import re
import calendar
import json
import hashlib
from collections import defaultdict

from_sender_regex = re.compile(r'From\s([^\s]+\sat\s[^\.]+\.[a-z0-9]+)(.*)\nFrom\:\s[^\s]+\sat\s[^\.]+\.[a-z0-9]+',
                                re.MULTILINE|re.UNICODE)
message_id_regex = re.compile(r'Message\-ID\:\s+\<([^<>]+)\>')
reference_id_regex = re.compile(r'References\:\s+\<([^<>]+)\>')

def extract_emails(root='archives', years=range(2007, 2020)):
    """ Extract emails from archives - also builds thread statistics """

    thread_stats = defaultdict(int)
    all_threads = defaultdict(list)
    sender_to_msgid = {}
    
    # Use "From <sender> at <origin><dot><tld>" as a regex to separate emails in text file
    for y in years:
        n_emails = 0
        m_count = 0

        for m in range(1, 13):
            month = calendar.month_name[m]
            fname = os.path.join(root, str(y), '{}-{}.txt'.format(y, month))
            if os.path.isfile(fname):
                m_count += 1
                data = open(fname, 'rb').read().decode('latin-1')

                emails = []
                count = 0

                month_dir = os.path.join(root, str(y), month)
                if not os.path.isdir(month_dir):
                    os.makedirs(month_dir)
                    
                print('Processing for {}/{}'.format(y, month))
                while True:
                    m1 = from_sender_regex.search(data)
                    if m1 is None:
                        print('No further emails')
                        break
                    
                    idx1_1 = m1.start()
                    idx1_2 = m1.end()
                    # Start of next email
                    data2 = data[idx1_2:]

                    m2 = from_sender_regex.search(data2)
                    if m2 is None:
                        print('No next emails')
                        # Append data till now
                        email_data = data[idx1_1:]
                        emails.append(email_data)
                        count += 1
                        # print('{} Email, length => {}'.format(count, len(email_data)))                      
                        break

                    idx2_1 = m2.start()
                    # Email text is between these two indices
                    email_data = data[idx1_1:idx1_2] + data2[:idx2_1]
                    emails.append(email_data)
                    
                    count += 1
                    # print('{} Email, length => {}'.format(count, len(email_data)))
                    data = data2[m2.start():]

                for email_data in emails:
                    # Extract message-id; skip any email that lacks one
                    try:
                        msg_id = message_id_regex.findall(email_data)[0].strip()
                    except IndexError:
                        continue

                    from_parts = from_sender_regex.findall(email_data)
                    sender = from_parts[0][0].replace(' at ','@').strip()
                    sender_to_msgid[msg_id] = sender

                    msg_idh = hashlib.md5(msg_id.encode('utf-8')).hexdigest()
                    email_path = os.path.join(month_dir, msg_idh + '.eml')
                    print('Writing message {}'.format(email_path))
                    open(email_path, 'w').write(email_data)

                    # Figure out references
                    try:
                        references_id = reference_id_regex.findall(email_data)[0]
                        orig_sender = sender_to_msgid.get(references_id)

                        thread_key = '/'.join((str(y), month, references_id))
                        thread_stats[thread_key] += 1

                        if orig_sender is not None:
                            coll = all_threads[thread_key]
                            # Put the original sender at the head of the thread
                            if not coll or coll[0] != orig_sender:
                                print('Inserting orig sender', orig_sender)
                                coll.insert(0, orig_sender)

                        if sender != orig_sender:
                            all_threads[thread_key].append(sender)
                    except IndexError:
                        # No References field - this is a standalone message
                        single_msg_key = '/'.join((str(y), month, msg_id))
                        all_threads[single_msg_key].append(sender)

    # Write global thread stats
    json.dump(thread_stats, open('global_thread_stats.json','w'), indent=4)
    # Write global thread graph
    json.dump(all_threads, open('global_thread_graph.json','w'), indent=4)

When you run this, you will find that it goes through each archive across the years, writes out the emails in separate files and also collects thread statistics, writing them to two separate JSON files.

$ python prepare_data.py
...
Processing for 2007/September
No next emails
Writing message archives/2007/September/2efd544e7e898bedab277dc525d16cc3.eml
Writing message archives/2007/September/b997876eed0f4994567650b51b3a4bf6.eml
Writing message archives/2007/September/d729aff7f251e4ba30f9c1680cf54dcd.eml
Writing message archives/2007/September/03950f198669718b742b9e3fe57923ee.eml
Writing message archives/2007/September/e3beee2ae03ee902856399e64e8e7356.eml
Writing message archives/2007/September/a58ebaddee233b28cfd5c9a4e314f352.eml
Writing message archives/2007/September/27682f5d6192a9b91b2896147640e802.eml
Writing message archives/2007/September/3225510ca09d1c6f97a69e9a449d3dea.eml
...
...
...
Writing message archives/2019/November/06c59955770fb8bacda459854435f573.eml
Writing message archives/2019/November/20d338e7a0165f583d8330eea1802bed.eml
Processing for 2019/December
No next emails
Writing message archives/2019/December/5b526d3403fa0b3d11142b0cc20be598.eml
Writing message archives/2019/December/76d3e8dbe475f54a808d788bc0d0013b.eml
Writing message archives/2019/December/82f910da593e6326d21df78835dbb28a.eml
Writing message archives/2019/December/39f1f41c8e514b08f662366a7b181499.eml
Writing message archives/2019/December/7a1d802c5b99c3d5f76a727bea8c1637.eml

Next Steps

At the end of the last step, we now have a script which allows us to download the email archives from Mailman, classify them by year, extract sender, email and thread statistics, and write out individual emails to disk. The data is now ready for analysis and visualization.

In the next post, I will take this data and discuss how I used it to develop visualizations of email, sender and email-thread statistics and a network visualization of the group activity.

For all code developed as part of this blog post, check out this repo.

