I used Google Cloud Platform to extract data from a BigQuery database (using the BigQuery API and Dataflow), analyze it in a Jupyter notebook on Datalab, and train a DNN regressor on Cloud Machine Learning Engine.
Keywords: Google Cloud Platform (BigQuery, Dataflow, Datalab, Cloud Machine Learning), Python, SQL, Pandas, Matplotlib, TensorFlow, Deep neural networks
Related projects: Titanic Survival Predictor
I’ve been meaning to learn cloud computing for some time… I also wanted to continue working on one of my previous projects that used Google BigQuery (I’ll actually do this and post about it later… It was with GDELT.)
I noticed BigQuery has a public dataset called ‘natality’ with all sorts of information about babies born in the United States. Here are the first few of its 137.8M rows:
If you’re curious what the columns mean, you can read their description here.
Here, we will use various columns to see if we can predict what the weight of the baby will be. Other people have looked at this data too. Here’s an interesting article that looks at gestation period in days to see the distribution of births as a function of gestation days. Another blog looks at the increase in the mother’s age when she has her first child.
I also found useful guides from Google Developers Codelabs. As part of the codelab, there’s a nicely organized Jupyter notebook that I borrowed a lot of code from.
A quick follow-up to this project might be predicting when a mother will give birth. I’m not sure whether it can produce accurate results, but it would be pretty useful information.
Anyways, let’s get started.
After many hours on BigQuery’s UI learning how to run SQL queries and exploring the dataset, I wrote Python scripts to automatically pull the average weight and its standard deviation, grouped by a feature. This was all done on a Google Cloud Datalab virtual machine instance, which includes an in-browser Jupyter notebook (so convenient!). You can also connect to it through SSH with a little more work. Some features required extra tweaking, but here’s the generic SQL query:
SELECT
  column_name,
  AVG(weight_pounds) AS avg_wt,
  STDDEV(weight_pounds) AS std_wt
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  column_name
where column_name is a column name from the table, such as is_male, child_race, plurality, gestation_weeks, etc.
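Since only the column name changes from feature to feature, the query can be generated programmatically. Here’s a minimal sketch of how I’d do that (`build_stats_query` is my own helper name, not part of any API):

```python
def build_stats_query(column_name,
                      table="publicdata.samples.natality",
                      min_year=2000):
    """Build the generic average/stddev-by-feature query for one column."""
    return (
        "SELECT {col}, "
        "AVG(weight_pounds) AS avg_wt, "
        "STDDEV(weight_pounds) AS std_wt "
        "FROM {table} "
        "WHERE year > {year} "
        "GROUP BY {col}"
    ).format(col=column_name, table=table, year=min_year)

# One query string per feature of interest
queries = {col: build_stats_query(col)
           for col in ["is_male", "plurality", "gestation_weeks"]}
```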
You can combine this with BigQuery API:
import google.datalab.bigquery as bq

df = bq.Query(query).execute().result().to_dataframe()
and you end up with a Pandas dataframe. I did this with various features. Here’s a sample of 9:
Average weight vs. features
Average APGAR score vs. features
The APGAR score is a basic five-category health indicator showing whether the baby needs medical attention. I didn’t end up studying the APGAR score, but the rest of this study can be applied almost exactly to the APGAR score instead of weight. Here’s the same plot as above, but for the APGAR score (just for fun):
So some of these features are quite relevant, although for many of them the difference is within one standard deviation. Boys are slightly heavier than girls, premature babies are lighter, twins are lighter, and triplets are lighter still. There is also some relationship between the mother’s age and the baby’s weight.
Something I didn’t mention earlier… You need to sign up for Google Cloud Platform and figure out how things work, which is not too bad. Follow the instructions about Datalab here. You’ll need to create a storage bucket.
When you sign up, Google gives you $300 of free credit to spend over a one-year trial period. You can do this whole project for under $10 of that pool.
Next, we’ll export the dataset as CSV files so that we can easily manage training and evaluation sets and shuffle records during reads. This was done using Apache Beam on Dataflow. Note that at the time of this writing, Apache Beam is only available for Python 2.7.
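To keep the train/eval split reproducible, the split should be deterministic rather than random, e.g. by hashing a stable key for each record (such as the birth year and month). Outside of Beam, the same idea looks like this sketch (`assign_split` and the eval fraction are my own choices, not the exact pipeline logic):

```python
import hashlib

def assign_split(key, eval_fraction=0.2):
    """Deterministically route a record to 'train' or 'eval' by hashing a
    stable key, so repeated runs produce the identical split."""
    h = int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)
    return "eval" if (h % 100) < eval_fraction * 100 else "train"

# Toy records: (year-month key, weight in pounds)
rows = [("2003-07", 7.5), ("2004-01", 6.2), ("2005-11", 8.1)]
split = {"train": [], "eval": []}
for key, weight in rows:
    split[assign_split(key)].append((key, weight))
```

Hashing on a coarse key like the month also keeps all records from the same month on the same side of the split, which avoids leakage between the two sets.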
If you’re interested in the code, look here for now, which is where I adapted my code from. I will clean up my code and put it on Github as well.
I’ll note one thing here. I wanted to use the cigarettes/day and drinks/week features, so I made queries like the following:
SELECT
  IF(cigarette_use, cigarettes_per_day, 0) AS cigarettes_per_day,
  IF(alcohol_use, drinks_per_week, 0) AS drinks_per_week
Dataflow automatically scales the number of workers for your job. Here’s what it looks like for the query job I ran:
My job was capped at 8 workers because I’m on a free trial, so the export took about 50 minutes.
Here is a summary of the machine learning part, based on the wide-and-deep model:
“Wide” features: [state, is_male, mother_race, plurality, mother_married]
“Deep” features: [month, mother_age, gestation_weeks, mother_race (embedded), cigarettes_per_day, drinks_per_week]
Using TensorFlow’s DNNLinearCombinedRegressor (a regressor, since weight is a continuous target). Features were added as one of the following: real_valued_column, sparse_column_with_keys, or bucketized_column. The fully connected deep part had [64, 64, 32, 32] hidden units.
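As an aside, a bucketized column just maps a real value to the index of the bucket its value falls into, given a sorted list of boundaries. The same logic in plain Python (my own sketch, not TensorFlow’s implementation):

```python
import bisect

def bucketize(value, boundaries):
    """Return the bucket index for value, mimicking a bucketized feature
    column: values below the first boundary go to bucket 0, values at or
    above the last boundary go to bucket len(boundaries)."""
    return bisect.bisect_right(boundaries, value)

# e.g. mother_age bucketized on hypothetical boundaries
age_boundaries = [18, 25, 30, 35, 40]
bucketize(22, age_boundaries)  # bucket 1, since 18 <= 22 < 25
```

Bucketizing lets the linear (wide) part of the model learn a separate weight per age range instead of a single slope over all ages.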
You have to write out a pipeline to feed to gcloud ml-engine jobs submit training JOBNAME. Most of it can be followed from the notebook mentioned above. My code is basically the same, with some extra preprocessing and changes to some inputs and folder names.
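For reference, the submit command looks roughly like this. The bucket, folder, region, and the trainer’s own flags (everything after the bare `--`) are placeholders for illustration, not my exact values:

```shell
JOBNAME=babyweight_$(date +%Y%m%d_%H%M%S)
OUTDIR=gs://YOURBUCKET/PROJECT_FOLDER/trained_model

gcloud ml-engine jobs submit training $JOBNAME \
  --region=us-central1 \
  --module-name=trainer.task \
  --package-path=$(pwd)/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://YOURBUCKET \
  --scale-tier=STANDARD_1 \
  -- \
  --output_dir=$OUTDIR \
  --train_steps=3000000
```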
Here’s what you can expect to see in the Machine Learning Engine page logs:
Once the job is finished, you can view some metrics on TensorBoard by calling
from google.datalab.ml import TensorBoard

TensorBoard().start('gs://YOURBUCKET/PROJECT_FOLDER/trained_model')
Here’s a screenshot of the RMSE over the 3M training steps (dark green is smoothed):
So it’s not that great at the moment: ~1 lb RMSE. That’s about the same as what GCP’s training analysis gets, so not much improvement. However, this isn’t finished yet; I’ll try to improve these results when I get the time.
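For context, the RMSE curve in TensorBoard is just the root of the mean squared prediction error over the evaluation set; the toy numbers below are made up for illustration:

```python
import math

def rmse(predictions, labels):
    """Root-mean-square error between predicted and actual weights (lbs)."""
    n = len(labels)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n)

preds = [7.2, 6.8, 8.1, 5.9]
actual = [7.5, 6.5, 7.7, 6.4]
rmse(preds, actual)
```

An RMSE of ~1 lb means a typical prediction misses the true birth weight by about a pound.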
Before I go! Another nice thing about Google Cloud Platform is how easy it is to deploy your model. I’m not sure how to incorporate it into my website yet, but I’ll put it up when I learn how. :)