Visualizing City of Toronto Restaurant Licenses

Open Data ETL: from raw CSV to a Google Maps visualization

So, assuming the role of a data activist, I will be analyzing the city's license-issuance data (perhaps to see the effect of certain urban changes on restaurant openings).

Why restaurants? I think they are quite a good indicator of spending and of the health of the economy in general. Just look at the difference between 1994, when the city issued 569 licenses, and 2014, when the number shot up to 1,519.
In this post we will walk through the steps I took to transform the CSV data available from the city into a heat map of licenses across the city.
I have done this for the period from 1980 to 2015 in five-year intervals, and the result is the following video:

Downloading the Code

The Code is available on GitHub

Data Acquisition

This may look like the easiest part: just go to the City of Toronto portal, where you can easily acquire the file called ‘Business Licenses.csv’.

If the data you require is not available on the city's website, you will have to contact the Open Data Office to see how you can get access to it. Data acquisition is a hurdle in any analytics project, if only for the fact that all other project phases depend on it.

The other file we will need contains all Canadian postal codes with their latitude and longitude. (I contemplated using the Google Maps APIs, but the sheer number of postal-code lookups made me realize I would exhaust my 2,400-calls-per-day quota in no time.)
Once you acquire the data, you can use the code in the PostalCodeBuilder.ipynb notebook to isolate ‘ontario.csv’ from the all-of-Canada file, which makes searches about three times faster (and believe me, you will need it!).
If you pull the whole project from GitHub, you will also find ontario.csv included.
This is the code involved (using Pandas, NumPy, etc.; you can see the details on GitHub).
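The actual notebook is on GitHub; a minimal sketch of the idea, assuming the Canada.csv header has ‘PostalCode’, ‘Latitude’, and ‘Longitude’ columns (the real file may differ), could look like this. It relies on the fact that Ontario postal codes begin with K, L, M, N, or P:

```python
import pandas as pd

def extract_ontario(canada_csv="Canada.csv", out_csv="ontario.csv"):
    """Keep only Ontario postal codes (they start with K, L, M, N, or P)."""
    canada = pd.read_csv(canada_csv)
    is_ontario = canada["PostalCode"].str[0].isin(list("KLMNP"))
    ontario = canada[is_ontario]
    ontario.to_csv(out_csv, index=False)
    return ontario
```

Writing the subset out once means every later lookup scans a file roughly a third the size.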

Munging the Data


Preparing the data for analytics.

The first step is to minimize the size of the files we are handling and focus on the data we actually need.

  • Extract Ontario-only postal code to latitude/longitude mappings from Canada.csv.
  • Extract only the fields required for processing from the business licenses file (license number, issue date, cancel date, name, postal code).
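A sketch of that column-pruning step (the column names below are assumptions; check the real header of ‘Business Licenses.csv’ before running this):

```python
import pandas as pd

# Assumed column names -- verify them against the actual city CSV header.
NEEDED = ["LicenceNo", "IssueDate", "CancelDate", "Name", "PostalCode"]

def extract_fields(licences_csv="Business Licenses.csv"):
    """Load only the columns needed for the heat map, ignoring the rest."""
    return pd.read_csv(licences_csv, usecols=NEEDED)
```

Using `usecols` keeps memory down because the unwanted columns are never materialized.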

Handling missing Data

The next step is to start formatting the data properly.

  • Change the postal code format from ‘X1X Y1Y’ to ‘X1XY1Y’ to make it easier to match against ontario.csv.
  • Extract the year from the issue/cancel date and expand it to full format, e.g. ‘6/12/95’ to ‘1995’.
  • Remove null records (records that have no postal code or issue date).
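Those three steps could be sketched like this (the ‘PostalCode’ and ‘IssueDate’ column names are assumptions, as is relying on pandas to expand two-digit years):

```python
import pandas as pd

def format_licences(df):
    """Normalize postal codes, pull a four-digit year out of the issue date,
    and drop records that are missing either."""
    out = df.copy()
    # 'X1X Y1Y' -> 'X1XY1Y' so the code matches the key used in ontario.csv
    out["PostalCode"] = out["PostalCode"].str.replace(" ", "", regex=False).str.upper()
    # '6/12/95' -> 1995; unparsable dates become NaT rather than raising
    out["IssueYear"] = pd.to_datetime(out["IssueDate"], errors="coerce").dt.year
    # Remove null records: no postal code or no parsable issue date
    out = out.dropna(subset=["PostalCode", "IssueYear"])
    out["IssueYear"] = out["IssueYear"].astype(int)
    return out
```

Note that `errors="coerce"` folds the date-cleaning and null-dropping into one pass: any record with a malformed issue date is removed along with the truly empty ones.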

Merge Data sources

Finally, merge against ontario.csv to create the final data format.


Finally, ‘ready_data.csv’ is created and can be processed to build heat maps for any year you choose.
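The merge itself is a straightforward pandas join; a sketch, again assuming both tables share a normalized ‘PostalCode’ column:

```python
import pandas as pd

def build_ready_data(licences, ontario_csv="ontario.csv", out_csv="ready_data.csv"):
    """Attach latitude/longitude to each licence record via its postal code."""
    ontario = pd.read_csv(ontario_csv)
    # An inner join silently drops licences whose postal code has no match;
    # switch to how="left" if you want to inspect the unmatched records.
    ready = licences.merge(ontario, on="PostalCode", how="inner")
    ready.to_csv(out_csv, index=False)
    return ready
```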

Civic Engagement and Community Activism in the age of Big Data

Open Data and the city of Toronto.

Toronto has joined a host of other Canadian and international cities posting city data for the public, on the Toronto Open Data Portal. This is becoming an increasingly important topic; just look at the news from the past few months alone.
The city has started encouraging third parties to use its openly published data; the most notable example is the TTC bus and train data currently used by multiple mobile apps.
Furthermore, the city has encouraged the community to get involved, and this is going to be the topic of my next few blog posts and a paper.

We the people.

Imagine, if you will, a community that wants to reduce speed limits on its streets, or that is concerned about the size of a planned mega-condo or the presence of a new mega-store at the heart of its area.
Any of these events could have a big effect on the quality of life in the neighbourhood, and the big business behind a project will come armed with ‘paid’ expert opinions and studies to support its case.

The goal of this work, and my hypothesis, is that we can use open data (traffic, licenses, accidents, weather, etc.) to give voice to the voiceless and to help those who need help by providing them with data that supports their well-being. The availability of such data will also help separate rational from emotional resistance to (or support for) many decisions, paving the road to a smooth process of community engagement in many projects.

The anticipated users will be:

  • Community organizers
  • Campaigners (Political, Social)
  • Individuals
  • Local small business owners
  • School boards
  • Local event boards
  • ..

Challenges for Open Data Providers

Government open data faces a lot of challenges, from regulations to considerations of safety and privacy, but the municipal level of government has some specific ones:
  • Limited resources (compared to Provincial and Federal levels of government).
  • Heightened privacy concerns, as the small size of the data set could expose personal information, especially in municipalities with small populations (more so, perhaps, in Georgetown or Woodstock than in Toronto or London).
  • The need not just to make more data available but to budget for and acquire new sets of data.
  • The need to establish a process for correcting problems found in the data (null values, missing dates, incorrect postal codes, etc.).

Civic Engagement effect on Government Open Data

The topic of big data is slowly moving from the hype stage into the mainstream, but public data in and of itself deserves a closer look at some of its attributes.
One of the most intriguing attributes of public data is that, so far, the type, quality, and size of the data available are determined in a bottom-up/inside-out process, where the city decides what data may be useful to the public and takes input from the technical startup community.
Once the public starts using the data, a new channel of feedback will begin to flow, with requests focusing on:
  • Quality
    • Field expansion.
    • Data integrity issues.
    • ..
  • Availability
    • Missing data.
    • New data acquisition (I just realized that pedestrian/traffic data is collected at an intersection only once a year?!).
  • Context
    • As the public starts using data, new contexts will appear as a result of mixing data sets (can we graph federal interest rates, household debt, and the number of new business licenses issued?).
    • Such results could pose a challenge, as they may require coordination between different levels of government.
    • Some of those contexts may pose threats to privacy, security, and/or regulation, so constant revision may be needed.
In response to those challenges, the city may need to partner with tech providers, the private sector, and the local tech community for ideas on how to fill the gaps and provide the best data assets to the public.

Data Activism!

So now we have the data, but how does one provide it in a way that helps the community? There are generally two types of approaches and... well, a hybrid third option.
    1. Ad-hoc approach.
      • In this approach the data is acquired and searched for a specific topic.
      • In the next post in this series I will use this approach to study new business licenses in a certain Toronto neighbourhood, finding out how many businesses opened in the area over the years, which could be used to demonstrate the effect of certain events on a neighbourhood's business health.
      • This approach is perfect for small, targeted issues, such as zoning, speed limits, or even city-councillor-level campaigning.
      • This graph shows the number of restaurant licenses issued from 1990 to date and was built using an IPython notebook, Pandas, and matplotlib.
    2. Software service approach.
      • In this approach the data is collected en masse, manipulated, hosted, and made available to the public.
      • For the example above, the data would be provided through a web interface where the user can see a heat map of the city and the business licenses opened in a given year, or pick a neighbourhood and a range of years and count the business licenses issued.
      • A variant of this is the TTC apps currently available from third-party vendors using TTC data from the City of Toronto (although in that case the data is collected at run time, on request).
      • This type of undertaking is large, and unless the site providing it has a revenue stream from future traffic, such an application is generally hard to sustain on a volunteer basis.
    3. Hybrid approach.
      • In this approach the developers pick certain sets of data they are interested in and provide them with a certain level of customization available to the user.
      • For example, you could provide data about accident reports and allow the user to choose their data set on a map.
      • A great example of this is the wonderful work at http://censusmapper.ca/. They are probably the original ‘data activists’ at the federal level, providing some of the census data to any consumer who wishes to view it, as a way to highlight the importance of the long-form census.
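As an aside, the per-year counts behind the ad-hoc graph described above could be computed and plotted with a sketch like this (the ‘IssueYear’ column name follows the munging step earlier and is an assumption):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def licences_per_year(ready_csv="ready_data.csv", since=1990):
    """Count licences issued per year, from `since` onward."""
    ready = pd.read_csv(ready_csv)
    return ready[ready["IssueYear"] >= since].groupby("IssueYear").size()

def plot_licences(counts, out_png="licences_per_year.png"):
    """Bar chart of licences issued per year."""
    counts.plot(kind="bar")
    plt.xlabel("Year")
    plt.ylabel("Licenses issued")
    plt.title("Toronto business licenses issued per year")
    plt.tight_layout()
    plt.savefig(out_png)
```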

Conclusion

Big data analytics is becoming an essential tool for decision making in every business and at every level of government, and it is about time this power was handed to the public in the most suitable way, from the community level to the federal level, from schools to political campaigning. Open data mixed with new technologies and a little bit of community give-back will reshape the face of civic engagement and community campaigning in the future.

Coming soon to a notebook near you!

A detailed blog post on using IPython notebooks to analyze some City of Toronto licensing data.