Google Cloud Platform BigQuery & Machine Learning - Guessing Baby Weight

I used Google Cloud Platform to extract data from a public dataset in BigQuery (using the BigQuery API and Dataflow), analyze it in a Jupyter notebook on Datalab, and train a DNN regressor on Cloud Machine Learning Engine.

Keywords: Google Cloud Platform (BigQuery, Dataflow, Datalab, Cloud Machine Learning), Python, SQL, Pandas, Matplotlib, TensorFlow, deep neural networks

Related projects: Titanic Survival Predictor

I’ve been meaning to learn cloud computing for some time… I also wanted to continue working on one of my previous projects that used Google BigQuery (I’ll actually do this and post about it later… It was with GDELT.)

I noticed BigQuery had a public dataset called ‘natality’ with all sorts of information about babies born in the United States. Here are the first few of its 137.8M rows:

| Row | source_year | year | month | day | wday | state | is_male | child_race | weight_pounds | plurality | apgar_1min | apgar_5min | mother_residence_state | mother_race | mother_age | gestation_weeks | lmp | mother_married | mother_birth_state | cigarette_use | cigarettes_per_day | alcohol_use | drinks_per_week | weight_gain_pounds | born_alive_alive | born_alive_dead | born_dead | ever_born | father_race | father_age | record_weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2005 | 2005 | 7 | null | 3 | null | false | null | 8.62889293468 | 1 | null | 9 | null | 78 | 34 | 41 | 09262004 | true | null | false | null | false | null | 57 | 9 | 0 | 0 | 10 | 78 | 38 | 1 |
| 2 | 2005 | 2005 | 4 | null | 6 | null | true | null | 2.6786164833 | 1 | null | 6 | null | 78 | 36 | 34 | 09012004 | true | null | false | null | false | null | 23 | 7 | 0 | 0 | 8 | 78 | 39 | 1 |
| 3 | 2006 | 2006 | 5 | null | 1 | null | true | null | 11.06279630716 | 1 | null | 9 | null | 68 | 38 | 41 | 08082005 | true | null | false | null | false | null | 11 | null | null | null | 8 | 68 | 41 | 1 |
| 4 | 2007 | 2007 | 3 | null | 2 | null | false | null | 5.43659938092 | 2 | null | 9 | null | 78 | 42 | 38 | 99999999 | true | null | false | null | false | null | 10 | null | null | null | 8 | 78 | 42 | 1 |

If you’re curious what the columns mean, you can read their description here.

Here, we will use various columns to see if we can predict what the weight of the baby will be. Other people have looked at this data too. Here’s an interesting article that looks at the distribution of births as a function of gestation period in days. Another blog post looks at how the mother’s age at her first child’s birth has been increasing.
I also found useful guides from Google Developers Codelabs. As part of the codelab, there’s a nicely organized Jupyter notebook that I borrowed a lot of code from.

A quick follow-up to this project might be predicting when a mother will give birth. I’m not sure if it would produce accurate results, but it would be pretty useful information.

Anyways, let’s get started.

Data Exploration

After many hours in BigQuery’s UI learning how to run SQL queries and exploring the dataset, I wrote Python scripts to automatically pull the average weight and its standard deviation grouped by a feature. This was all done on a Google Cloud Datalab virtual machine instance, which includes an in-browser Jupyter notebook (so convenient!). You can also connect to it through SSH with a little more work. Some features required extra tweaking, but here’s the generic SQL query:

SELECT
  column_name,
  AVG(weight_pounds) AS avg_wt,
  STDDEV(weight_pounds) AS std_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  column_name

where column_name is one of the table’s columns, such as is_male, child_race, plurality, gestation_weeks, etc.
You can combine this with the BigQuery API:

import google.datalab.bigquery as bq

# query is the SQL string above, with the column name substituted in
df = bq.Query(query).execute().result().to_dataframe()

and you end up with a Pandas dataframe. I did this with various features. Here’s a sample of 9:

Average weight vs. features
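
In case it’s useful, here’s a minimal sketch of how each panel can be produced (the feature list and plot styling here are illustrative, not my exact code):

import google.datalab.bigquery as bq
import matplotlib.pyplot as plt

# illustrative subset of the features; I looked at 9 in total
features = ['is_male', 'plurality', 'gestation_weeks']

query_template = """
SELECT
  {col},
  AVG(weight_pounds) AS avg_wt,
  STDDEV(weight_pounds) AS std_wt
FROM
  `publicdata.samples.natality`
WHERE
  year > 2000
GROUP BY
  {col}
ORDER BY
  {col}
"""

fig, axes = plt.subplots(1, len(features), figsize=(15, 4))
for ax, col in zip(axes, features):
    df = bq.Query(query_template.format(col=col)).execute().result().to_dataframe()
    # average weight per category, with the standard deviation as error bars
    ax.errorbar(range(len(df)), df['avg_wt'], yerr=df['std_wt'], fmt='o')
    ax.set_xticks(range(len(df)))
    ax.set_xticklabels(df[col].astype(str))
    ax.set_xlabel(col)
    ax.set_ylabel('avg weight (lb)')
plt.show()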

APGAR score is a basic five-category health indicator showing whether the baby needs medical attention. I didn’t end up studying the APGAR score, but the rest of this study can be copied almost exactly with APGAR score in place of weight. Here’s the same plot as above, but for APGAR score (just for fun):

Average APGAR score vs. features

So some of these features are pretty relevant, although for many of them the difference is within one standard deviation. Boys are slightly heavier than girls, premature babies are lighter, twins are lighter, and triplets are lighter still. There is also some relationship between the mother’s age and the baby’s weight.

Machine Learning

Something I didn’t mention earlier… You need to sign up for Google Cloud Platform and figure out how things work, which is not too bad. Follow the instructions about Datalab here. You’ll need to create a storage bucket.
When you sign up, Google gives you $300 in free credit to spend over a one-year trial period. You can do this entire project for under $10 of that pool.

Next, we’ll export the dataset as CSV files so that we can easily manage the training and evaluation sets and shuffle records as we read them. I did this using Apache Beam and Dataflow. Note that at the time of this writing, Apache Beam is only available for Python 2.7.

If you’re interested in the code, look here for now; that’s where I adapted my code from. I will clean up my own code and put it on GitHub as well.
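
In the meantime, here’s a heavily stripped-down sketch of the kind of Beam pipeline this involves (the project, bucket, and column list are placeholders; the real pipeline also writes a separate eval set and does the extra preprocessing I mention below):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

COLUMNS = ['weight_pounds', 'is_male', 'mother_age', 'plurality', 'gestation_weeks']

def to_csv(row):
    # turn one BigQuery result dict into a CSV line
    return ','.join(str(row[col]) for col in COLUMNS)

query = """
SELECT
  weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM
  `publicdata.samples.natality`
WHERE
  year > 2000
"""

options = PipelineOptions(
    runner='DataflowRunner',
    project='YOUR_PROJECT',
    temp_location='gs://YOUR_BUCKET/tmp/')

p = beam.Pipeline(options=options)
(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
 | 'to_csv' >> beam.Map(to_csv)
 | 'write' >> beam.io.WriteToText('gs://YOUR_BUCKET/babyweight/train',
                                  file_name_suffix='.csv'))
p.run()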

I’ll note one thing here. I wanted to use cigarettes/day and drinks/week as features, but they are null whenever cigarette_use or alcohol_use is false, so I wrote the queries to default them to zero:

SELECT
  IF(cigarette_use,cigarettes_per_day,0) as cigarettes_per_day,
  IF(alcohol_use,drinks_per_week,0) as drinks_per_week

Dataflow automatically scales the number of workers for your job. Here’s what it looks like for the query job I ran:
My job was capped at 8 workers because I’m on the free trial, so it took about 50 minutes.

Here is a summary of the machine learning part, based on the wide-and-deep model:

“Wide features”: [state, is_male, mother_race, plurality, mother_married]
“Deep features”: [month, mother_age, gestation_weeks, mother_race (embedded), cigarettes_per_day, drinks_per_week]
Using TensorFlow’s DNNLinearCombinedRegressor. Features were added as one of the following: real_valued_column, sparse_column_with_keys, or bucketized_column. The fully connected neural network had [64, 64, 32, 32] hidden units.
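
For the curious, here’s roughly what that model definition looks like with the old tf.contrib API (I’m only showing a subset of the columns, and the keys, boundaries, and paths are placeholders, not my exact setup):

from tensorflow.contrib import layers, learn

# sparse (categorical) columns for the wide/linear part
is_male = layers.sparse_column_with_keys('is_male', keys=['True', 'False'])
plurality = layers.sparse_column_with_keys('plurality', keys=['1', '2', '3', '4', '5'])
mother_race = layers.sparse_column_with_hash_bucket('mother_race', hash_bucket_size=100)

# real-valued columns for the deep part
mother_age = layers.real_valued_column('mother_age')
gestation_weeks = layers.real_valued_column('gestation_weeks')
cigarettes_per_day = layers.real_valued_column('cigarettes_per_day')
drinks_per_week = layers.real_valued_column('drinks_per_week')

# bucketize age so the linear part can learn a weight per age range
age_buckets = layers.bucketized_column(mother_age, boundaries=[18, 25, 30, 35, 40, 45])

wide = [is_male, plurality, mother_race, age_buckets]
deep = [mother_age, gestation_weeks, cigarettes_per_day, drinks_per_week,
        # the sparse race column gets a dense embedding on the deep side
        layers.embedding_column(mother_race, dimension=3)]

model = learn.DNNLinearCombinedRegressor(
    model_dir='gs://YOUR_BUCKET/babyweight/trained_model',
    linear_feature_columns=wide,
    dnn_feature_columns=deep,
    dnn_hidden_units=[64, 64, 32, 32])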

You have to write out a training pipeline to feed to gcloud ml-engine jobs submit training JOBNAME. Most of it can be followed from the notebook mentioned above. My code is basically the same, with some extra preprocessing, changes to some inputs, and different folder names.

Here’s what you can expect to see in the Machine Learning Engine page logs:

Once the job is finished, you can view some metrics on TensorBoard by calling

from google.datalab.ml import TensorBoard
TensorBoard().start('gs://YOURBUCKET/PROJECT_FOLDER/trained_model')

Here’s a screenshot of the RMSE over the 3M training steps (dark green is smoothed):

So it’s not that great at the moment: ~1 lb RMSE. That’s about the same as what GCP’s training analysis gets, so not much improvement. However, this isn’t finished yet; I’ll try to improve these results when I get the time.

Before I go! Another nice thing about Google Cloud Platform is that it’s easy to deploy your model. I’m not sure how to incorporate it with my website yet, but I’ll put it up when I learn how to do it. :)
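
For what it’s worth, once a model is deployed, getting an online prediction from Python looks roughly like this (the project/model names and instance fields are placeholders; the fields you actually send depend on the model’s serving input function):

from googleapiclient import discovery

# client for the Cloud ML Engine online prediction REST API
service = discovery.build('ml', 'v1')
name = 'projects/YOUR_PROJECT/models/babyweight'

instance = {'is_male': 'True', 'mother_age': 30.0, 'plurality': '1',
            'gestation_weeks': 39.0}

response = service.projects().predict(
    name=name, body={'instances': [instance]}).execute()
print(response['predictions'])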