Symphony Church Connection Card Data

analysis

Importing modules

First, we need to import modules that we're going to need. Here, I may use numpy, pandas, and matplotlib.

In [1]:
# "as" is useful so that we can just call type "np" to use this module instead of typing numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# This allows us to plot things in the notebook directly.
%matplotlib inline

Importing data

There are many ways to import data. However, for data analysis, importing data as pandas's DataFrame structure is probably most useful. It's pretty good at importing the columns into correct data types. I use read_csv command available from pandas module. Data has already been cleaned to remove any error messages, fix typos in the data, and make data type consistent for each column. Identifiable personal information also have been removed.

In [2]:
# Make sure data.csv file is in the same folder
data = pd.read_csv('data.csv')
In [3]:
# We can now look at the data that's been imported.  .head() function on a pd.DataFrame shows first five rows.
data.head()
Out[3]:
id alumni birthday city state zipcode life_stage grad_year first_sunday_id second_sunday_id ... sunday_id commit_life family_group_interest member_interest serving_interest baptism_interest alpha_interest growth_module_interest service_site sunday
0 1 1.0 1989 Boston MA 2215 young_adult 2012 NaN NaN ... 3 NaN NaN NaN NaN NaN NaN NaN NaN 1/15/12
1 1 1.0 1989 Boston MA 2215 young_adult 2012 NaN NaN ... 4 NaN NaN NaN NaN NaN NaN NaN NaN 1/22/12
2 1 1.0 1989 Boston MA 2215 young_adult 2012 NaN NaN ... 5 NaN NaN NaN NaN NaN NaN NaN NaN 2/5/12
3 1 1.0 1989 Boston MA 2215 young_adult 2012 NaN NaN ... 7 NaN NaN NaN NaN NaN NaN NaN NaN 2/19/12
4 1 1.0 1989 Boston MA 2215 young_adult 2012 NaN NaN ... 8 NaN NaN NaN NaN NaN NaN NaN NaN 2/26/12

5 rows × 21 columns

Visualize

Now that we have imported the data succesfully, we can visualize some simple data.

Attendance by Sunday

First, let's look at number of connection cards filled out per Sunday. For this visualization, we use barplot, because we're counting by distinct states (Sundays).

In [4]:
'''
I choose to use sunday_id column instead of sunday column,
because sunday_id column is an array of numbers that increase chronologically,
whereas sunday column is an array of strings that loops every year (1/15/2016 is before 3/2/2012).
Then I create a list of strings that correspond to the sunday_id list.
'''
sundays = data.sunday_id
sundays_unique = sorted(set(sundays))
sundays_string = [list(data.sunday[data.sunday_id==i])[0] for i in sundays_unique]
In [5]:
count_sundays = [sum(sundays == i) for i in sundays_unique]
In [6]:
plt.figure(figsize=(15,10))
plt.bar(range(len(sundays_unique)),count_sundays)
ax = plt.gca()
ax.set_xticks(np.arange(0,237,26))
ax.set_xticklabels(np.array(sundays_string)[np.arange(0,237,26)])
plt.show; # the semicolon supresses the annoying message <function matplotlib.pyplot.show>

You can see the slowly increasing attendance until 2015. You can see the deep wide dips during summer every year, the deep narrow dips during winter every year. In the spring, there's typically a single high peak, which probably corresponds to Easter service. There's nothing knew here; it's showing us what we would've expected. But it's still nice to see in a plot.

Attendance by member

Now, let's look at attendance by members. Here, we'll categorize members by how many times they have attended. To visualize this data, we'll use histogram, since we're looking at frequency distribution.

We'll write a function to create histogram with a maximum count. I don't think we need to distinguish between people who attended 30 times or 50 times or 100 times. So we'll look at 1 through N-1 days individually, then N or more days.

In [7]:
ids = data.id
In [8]:
def hist(array,maximum=0):
    unique = sorted(set(array))
    if maximum>0:
        return [min(sum(array==i),maximum) for i in unique]
    else:
        return [sum(array==i) for i in unique]
In [9]:
hist_ids30 = hist(ids,maximum=30)
In [10]:
plt.figure(figsize=(8,6))
plt.hist(hist_ids30,bins=30)
plt.show;

There's a lot of 1-time comers, then continues to decrease significantly to higher attendances. We can zoom in a little more and make the max 20. (Feel free to zoom in even more)

In [11]:
hist_ids20 = hist(ids,maximum=20)
In [12]:
plt.figure(figsize=(8,6))
plt.hist(hist_ids20,bins=20)
plt.show;

Create new columns

It's sometimes useful to create new columns that's easier to work with. For example, I want to divide the Sundays into different years (Sept to Aug)---let's name them 0 (Jan 2012 - Aug 2012) to 4 (Sept 2015 to June 2016).

In [13]:
def year(datestr):
    mo, _, yr = datestr.split('/')
    return (int(yr)-12+int(int(mo)/9))
In [14]:
data['year'] = np.array([year(i) for i in data['sunday']])

Creating new DataFrame

We can continue to work with the DataFrame we have above, but it's a little annoying when you're trying to do studies about unique members. I'm going to make a new DataFrame with each row representing a unique member, copying over the member-specific information and adding some additional columns.

In [15]:
columns_to_copy = ['id','birthday','city','state','zipcode','life_stage','grad_year','first_sunday_id','second_sunday_id','how_hear']
data_member = data[columns_to_copy][0:1]
for i in data.index[1:]:
    if data.id[i-1] != data.id[i]:
        data_member = data_member.append(data[columns_to_copy][i:i+1],ignore_index=True)
In [16]:
data_member['attendance'] = np.array([sum(data.id==i) for i in data_member.id])
In [17]:
data_member.head()
Out[17]:
id birthday city state zipcode life_stage grad_year first_sunday_id second_sunday_id how_hear attendance
0 1 1989 Boston MA 2215 young_adult 2012 NaN NaN Other 49
1 2 1993 Brookline MA 2446 grad 2015 NaN NaN Other 93
2 3 1974 Allston MA 2134 grad 2012 NaN NaN Other 9
3 6 0 San Francisco CA 94116 grad 2014 NaN NaN Other 7
4 8 1992 NY NY 10016 grad 2014 NaN NaN Other 41

Life stage

We can see if attendance looks different for different life stages. First, it's important to see how many people are in each life stage.

In [18]:
life_stage = [x for x in list(set(data_member.life_stage)) if str(x) != 'nan']
# This gives us ['young_adult', 'married', 'high_school', 'grad', 'undergrad']
# Let's manually order it, since there's no way of automatically ordering it logically.
life_stage = ['high_school','undergrad','grad','young_adult','married']
life_stage_numbers = []
for i in life_stage:
    life_stage_numbers.append(sum(data_member.life_stage == i))
plt.pie(life_stage_numbers, labels=life_stage, startangle=90,
       autopct = lambda p: '{:.0f}'.format(p * len(data_member.life_stage) / 100));

It's hard to see, but count for high school is 5 and married is 16. It's very low, and analysis on these two groups probably won't mean much.

We can do many different analysis, but let's just do the simplest thing here at look at average attendance by each group. Of course, this is probably an unfair comparison, as there are probably a lot of 1's in the undergraduate population as they tend to church hop in big groups in the beginning of every school year.

In [19]:
life_stage_attendance = [data_member.attendance[data_member.life_stage==x].mean() for x in life_stage]
print(list(zip(life_stage,life_stage_attendance)))

plt.figure(figsize=(10,6))
plt.bar(range(len(life_stage)),life_stage_attendance)
ax = plt.gca()
ax.set_xticklabels(life_stage)
plt.show;
[('high_school', 1.2), ('undergrad', 8.796958174904942), ('grad', 17.063432835820894), ('young_adult', 25.988593155893536), ('married', 11.0)]

Final remarks

This was made as an instructional guide, and I did not really dive deep (as I need to focus on my PhD requirements at the moment...). I hope it was helpful. If you want to work more on it, please let me know.