Thursday, May 05, 2016

My Python Capstone Project

I've done it! I've made it all the way through the Python for Everybody MOOC by the University of Michigan, with the très sympa Dr Chuck (Dr Charles Severance) at the helm. He took us through the highways and byways of Python and supported us with substantially complicated scaffolding. Much needed, in my case.

The last module in the 5-part course was the Capstone where we had the opportunity to do an optional project. Always game to get the most out of things, I decided to take the bull by the horns and sign up.

For the project, we had to find a data set, 'scrape' it to find some specific information, put that into a database, and finally, visualise the results. There are data sets about many different fascinating subjects such as:

  • the last words of inmates in Texas before execution since 1984
  • the "Million Base" of 2.2 million chess matches
  • a Twitter data set
  • a World Health Organisation data set
  • a family food data set
  • the Million Song data set

and so on. For my project, however, I chose the Transport for London (TfL) data set available through their Application Programming Interface (API). It provides access to real-time data on the most highly requested information across all modes of transport. It also provides data on accidents across London. I wanted to find out about bicycle accidents (just because), and discover where most accidents happen. I thought it would probably be the City of London, which is densely populated during the day and has high cycling activity (couriers etc.).

The first thing I had to do, I discovered, was apply for an API key giving me permission to scrape the data. Then I had to write the code, which I based on code we had seen during the course.

This code creates the database, connects to the TfL API, asks for the year to download and inserts the longitude, latitude, severity and victim (cycle, car, motorcycle, etc.) into the database for that year. Then it saves the data and closes the connection.
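For readers who can't see the screenshot, here is a minimal Python 3 sketch of the steps just described. It is a reconstruction, not my actual code. The URL pattern, the JSON field names (`lat`, `lon`, `severity`, `casualties`, `mode`) and the `Accidents` table layout are all assumptions that should be checked against the TfL API documentation.

```python
import json
import sqlite3
import urllib.request

# Assumed endpoint shape; check the TfL API documentation before relying on it.
API_URL = ("https://api.tfl.gov.uk/AccidentStats/{year}"
           "?app_id={app_id}&app_key={app_key}")


def create_table(conn):
    """Create the (hypothetical) Accidents table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Accidents (
               year INTEGER, lon REAL, lat REAL, severity TEXT, victim TEXT)"""
    )


def fetch_accidents(year, app_id, app_key):
    """Download one year of accident records and return them as a list of dicts."""
    url = API_URL.format(year=year, app_id=app_id, app_key=app_key)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def store_accidents(conn, year, records):
    """Insert one row per casualty and return the number of rows written.

    The field names used here are assumptions about the shape of the JSON.
    """
    rows = []
    for record in records:
        for casualty in record.get("casualties", []):
            rows.append((year, record.get("lon"), record.get("lat"),
                         casualty.get("severity"), casualty.get("mode")))
    conn.executemany("INSERT INTO Accidents VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()  # save the data
    return len(rows)
```

A run would then look something like `conn = sqlite3.connect("accidents.sqlite")`, `create_table(conn)`, `store_accidents(conn, year, fetch_accidents(year, app_id, app_key))` and finally `conn.close()`, where the file name and the credentials are placeholders.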

It took me a few days because I wasn't sure about exactly what I needed to do - did I need to create a dictionary, or two, or none...? That's half the problem actually, for me - identifying the structure of the code you need to write for the job you want to do.

My code is really simple too. It just asks for one year, not multiple years. It assumes there are no errors in the year entered (e.g. 2016 which is not available yet). I could make it more robust, but to start with, I just wanted to make it work!
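To give an idea of what "more robust" could mean, here is a small sketch that checks the year before anything is downloaded, and also accepts several years at once. The function names are made up for illustration, and the first and last available years are passed in as parameters because I don't want to guess the exact range the API covers.

```python
def parse_year(text, first_year, last_year):
    """Return the year as an int, or None if it is not a whole number in range."""
    try:
        year = int(text.strip())
    except ValueError:
        return None
    if first_year <= year <= last_year:
        return year
    return None


def parse_years(text, first_year, last_year):
    """Parse several years separated by commas or spaces.

    Raises ValueError on the first entry that is not a valid year.
    """
    years = []
    for part in text.replace(",", " ").split():
        year = parse_year(part, first_year, last_year)
        if year is None:
            raise ValueError("Not an available year: " + part)
        years.append(year)
    return years
```

With placeholder bounds of 2005 to 2015, `parse_year("2016", 2005, 2015)` returns `None` instead of sending a request that is bound to fail.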

This is what the data looks like in the API:

TfL API raw data

This is what it looks like 'pretty printed':

TfL API data in readable format

You can see more clearly the information I wanted to download in the 'pretty printed' format.
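Pretty printing is something Python's standard `json` module can do. The sketch below assumes the raw text is valid JSON, and the sample record is made up purely to show the effect.

```python
import json


def pretty(raw_text):
    """Re-serialise JSON text with four-space indentation and sorted keys."""
    return json.dumps(json.loads(raw_text), indent=4, sort_keys=True)


# A made-up record, squashed onto one line like the raw API output.
raw = '[{"lon": -0.1, "lat": 51.5, "severity": "Serious"}]'
print(pretty(raw))
```

Each key then sits on its own line, which makes it much easier to spot the fields worth saving.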

This is what the database my code created looks like; it has 23,116 rows of data:

I was astounded the first time the code worked and saw the database loaded with data. Someone I spoke to recently called the feeling a 'nerdy moment'. Never thought I'd ever have one of those, I must say!

Having got the data, I then had to write some code to select the cyclist accident set, choosing 'severe' accidents rather than fatal ones (too sad), iterate through the data, and write only the longitude and latitude of each location to a JavaScript file.

Code to select geolocation data
I was thankful to have some scaffolding to help me write that too!
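In outline, the selection step could look like the sketch below. It is not my actual code. It assumes a table called `Accidents` with `lat`, `lon`, `severity` and `victim` columns, and it assumes the map page reads a JavaScript variable called `myData`. The values to pass for `victim` and `severity` depend on what the API actually returned.

```python
import json
import sqlite3


def select_locations(conn, victim, severity):
    """Return [lat, lon] pairs for one victim type and one severity level."""
    cursor = conn.execute(
        "SELECT lat, lon FROM Accidents WHERE victim = ? AND severity = ?",
        (victim, severity),
    )
    return [[lat, lon] for lat, lon in cursor]


def format_js(locations):
    """Build the text of a JavaScript file holding the locations as an array."""
    return "myData = " + json.dumps(locations) + ";\n"


def write_js(locations, path):
    """Write the locations to a JavaScript file that a map page can load."""
    with open(path, "w") as handle:
        handle.write(format_js(locations))
```

Whether latitude or longitude comes first has to match what the map code expects, so it is worth checking that before writing the file.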

The geolocations of severe cycle accidents in London

Once I had the geolocation data, I then had to visualise it. I had already used some visualisation code earlier in the MOOC, so I just had to adapt it to visualise my data. It actually took me three days because I ran into a problem and had no idea what to do. The code was written in HTML, which I know nothing about. I hunted around for a solution on the internet, including that fabulous resource Stack Overflow, but couldn't find an answer.

I was stumped. Then I moaned to my DB about my problem, and he said that I should check the latitude and longitude coordinates because they might not be in the right format. And he was right! They were back-to-front in my code! Once I'd fixed that, up popped the little red labels as they should (with another nerdy moment).
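A quick sanity check would have caught this much sooner. London sits at roughly latitude 51.5 and longitude -0.1, so a point whose "latitude" is near zero and whose "longitude" is near 51 has almost certainly been swapped. The sketch below uses a deliberately generous box around London. The limits are my own rough assumption, not official boundaries.

```python
# Rough, generous limits around Greater London (an assumption for illustration).
LAT_RANGE = (51.0, 52.0)
LON_RANGE = (-1.0, 1.0)


def in_london_box(lat, lon):
    """True if the point falls inside the rough box around London."""
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


def looks_swapped(lat, lon):
    """True if the point only lands in London once lat and lon are exchanged."""
    return not in_london_box(lat, lon) and in_london_box(lon, lat)
```

Running `looks_swapped` over the first few rows before building the map is enough to flag back-to-front coordinates.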
Greater London severe accident sites
These maps show the severe accidents for 2008. There were 429, and you can see from the second map that the highest concentration was indeed the City of London.

Central London severe accident sites
Job done!

It's been a really satisfying few months, going from being a complete Python beginner/never having touched coding before, ever, and having been crap at maths, to producing an amazing, functioning final result that I had to understand to make work (more-or-less, let's just ignore the HTML...). Dr Chuck was an entertaining teacher who could engage with us across a screen (no mean feat), and who even set up live 'office hours' during the Capstone so we could interact with him directly. He was aided and abetted by a team of kindly mentors who were available to help us out and give advice in the forums.

I am very happy with the results, and aim to go on and tackle C# next!


  1. Well I have absolutely no idea what a lot of this means but the end result is impressive!
    I've been writing about Capstone today too - making notes about Capstone Hill in Ilfracombe!!

    1. Thanks Trish, it looks impressive to me too. :)

      I hope you walked to the top of Capstone Hill, and in a Buff or two... :)

  2. Well Sarah, that is one hell of an achievement. Python is a great language. I hope that you had a bottle of something really nice to celebrate?

    1. Thanks Nick, I had a glass of robust red organic wine. Very nice it was too. :)

  3. I stopped reading at about '' I've done it ! ''
    Well done if that is appropriate.

  4. As a programmer by trade I am really impressed you learned so much so quickly. Congratulations.

    1. Thank you for your kind words, Tim. :)

  5. Wow this looks so complicated but congratulations on learning it all. Have a good week Diane

    1. Hi Diane, thanks! It took a lot of gnashing of teeth. :)

  6. This makes much more sense since seeing the video. Still bleedin' awesome.

