Data Visualization

Movie data visualization using Dimple & PostgreSQL


Dimple is an easy library to use based on D3.js, a powerful JavaScript interactive visualization library. Above is a simple interactive visualization based on movie revenue data scraped from The Numbers. The data is served from AWS RDS (see below) as a dictionary of dictionaries—{‘Distributor 1’: {2017: 100000000, 2016: 120000000, …}, ‘Distributor 2’: {2017:130000000, 2016: 0, …}, …}. For example, data[‘Walt Disney’][2016] = 7192180925, meaning the Walt Disney Company made $7,192,180,925 in box office revenue for movies released in 2016.

I wrote several JavaScript callback functions to generate the plot. Here is the core of the Dimple code:

svg = dimple.newSvg('#svg', 800, 600);
chart = new dimple.chart(svg, grab_data(data,year));
x = chart.addCategoryAxis("x", 0);
y = chart.addMeasureAxis("y", 1);

where grab_data function transforms the original data to an array of [distributor,revenue] elements, which are used as ‘x’ & ‘y’ for the plot.

There is so much more to D3.js, especially with more data analysis, and I will add more visualizations later!

AWS web server & database

Another important aspect here is how the data is served. All the apps on this website are hosted on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) server, and the database is hosted on AWS Relational Database Service (RDS). Setting up the EC2 web server up along with other AWS services (RDS, S3, ELB, ACM, IAM) and Cloudflare (for DNS) can be a whole post, or more, on its own. For now, I will give a high level picture with minimal details.

Here are a list of minimal components:

The scraped movie data is stored in a PostgreSQL database on AWS RDS server. I want to access this data from my website, but I also do not want a completely open access to the public. This can be easily solved by hosting both EC2 and RDS on the same Virtual Private Cloud (VPC), and allowing access to the database only within the VPC so that my EC2 web server can access the database.

The process starts with an Ajax GET request to the web server. The web server is running Flask, a micro web server framework. It routes incoming HTTP requests, deciphers the requests, and sends HTTP responses back after doing some work. In this case, the work is a SQL query to our PostgreSQL database. The database is currently set up as a single table with columns title, distributor, release_year, worldwide_box_office (other columns omitted), where each row is information about a single movie. What I want in the end is total revenue by each distributor and release_year combination. This is pretty simple to do:

SELECT DISTINCT distributor, release_year, sum(worldwide_box_office)::bigint
FROM movie_table
GROUP BY distributor, release_year
ORDER BY distributor ASC, release_year DESC

However, I wanted the query to fill in the blanks—I wanted every distributor, release_year combination, even if the revenue is $0 (meaning the distributor did not release a movie that year). This was slightly more complicated, but here it is:

  WITH a AS (SELECT DISTINCT distributor from movie_table),
       b AS (SELECT DISTINCT release_year from movie_table)
  SELECT a.distributor, b.release_year
SELECT c.distributor, c.release_year, COALESCE(sum(worldwide_box_office)::bigint,0)
  LEFT OUTER JOIN movie_table
  ON (c.distributor = movie_table.distributor AND c.release_year = movie_table.release_year)
GROUP BY c.distributor, c.release_year
ORDER BY c.distributor, c.release_year;

This query returns a list of [distributor,release_year,revenue]. This is converted into dictionary of dictinaries ({distributor: {release_year: revenue}}), JSONified, and sent back. This is what your browser receives, which turns into a nice interactive plot above!