# Data science for parking availability prediction

## A quick insight into how I've built the engine and API dealing with the real-time analysis of parking data and occupancy prediction for the Save-a-Space project


Accelogress Ltd is a UK-based software consultancy developing mobile app solutions using data analysis, machine learning, and API technologies. Accelogress’ project, Save-a-Space, introduces a cloud-based mobile marketplace for real-time booking of parking spaces, which optimizes parking management (for parking operators) and the driving experience (for end users). Save-a-Space’s mobile app allows drivers to easily find and book the most suitable parking space according to their personal preferences.

During my summer internship at Accelogress Ltd, I was trusted with building the engine and API that deal with the real-time analysis of parking data and occupancy prediction for their Save-a-Space project. The API had to allow simple but flexible access to historical data, as well as to availability predictions.

# The API

Since the prediction engine was going to be built with Python, it made sense to build the API with a Python framework too, so that the two could be integrated easily. The API was built using Django REST framework, which sits on top of a Django server. This integrated setup also allowed for scheduled remote updating of prediction datasets. The API allows applications and end users to query historical data for any car park, as well as predictions of future occupancy. The Django application is served by Gunicorn, which in turn sits behind an nginx server. The whole environment runs in a Docker container, which makes it quick and easy to go from development to production.

# The Prediction Engine

Thanks to the open-source scikit-learn machine learning library, the implementation of the prediction algorithm was one of the most straightforward parts of this project. The goal was to predict the occupancy of a car park at some point in the future. We had years' worth of parking data, which was important for testing different algorithms and optimizing the ones that performed best. I chose the *mean squared error* (MSE) as the performance metric. The aim is then to minimize the MSE of the algorithm, since an MSE of zero corresponds to a perfect estimator.
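For reference, the MSE is the average of the squared differences between the observed and the predicted values. Here is a minimal sketch with NumPy; the occupancy numbers are made up purely for illustration and are not Save-a-Space data.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical occupancy fractions (0 to 1), for illustration only
observed = [0.50, 0.80, 0.30, 0.65]
predicted = [0.55, 0.75, 0.35, 0.60]

mse = mean_squared_error(observed, predicted)
# Comparing the observations with themselves gives the "perfect estimator" case
perfect = mean_squared_error(observed, observed)
print(mse, perfect)
```

Because the errors are squared, a few large misses weigh more heavily than many small ones, which suits a use case where badly wrong availability predictions are the ones that hurt.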

The algorithm takes input parameters such as the month, the day of the week, the hour, and whether the day is a bank holiday (True/False), and outputs the predicted availability. The choice of input parameters affects the performance of the algorithm. To find the optimal machine learning *smoothie*, I ran a brute-force test with multiple algorithms for multiple car parks with different combinations of input parameters, with the goal of minimizing the MSE. In the end, I found the decision tree algorithm to be the most suitable and reliable estimator. Decision trees allow us to easily visualize the underlying model, which is very useful for communicating the factors that most influence the occupancy of a car park. This can be especially useful for parking operators. Each algorithm also has its own set of extra parameters that can be tuned to reduce the MSE further. In the case of trees, we can use the *maximum depth* to limit the size of the tree and prevent overfitting. After tuning each ingredient I was able to get the MSE of our algorithm to be as low as 0.37%!
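The brute-force search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the feature names (`month`, `weekday`, `hour`, `bank_holiday`) are stand-ins, only one algorithm (a regression tree) is tried, and 5-fold cross-validation is my choice of evaluation scheme.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data: occupancy depends on weekday and hour only
features = {
    "month": rng.integers(1, 13, n),
    "weekday": rng.integers(0, 7, n),
    "hour": rng.integers(0, 24, n),
    "bank_holiday": rng.integers(0, 2, n),
}
weekend_factor = np.where(features["weekday"] >= 5, 0.6, 1.0)
daily_cycle = 0.5 - 0.5 * np.cos(2 * np.pi * features["hour"] / 24)
occupancy = 100 * weekend_factor * daily_cycle + rng.normal(0, 2, n)

results = []
names = list(features)
for k in range(1, len(names) + 1):
    for combo in combinations(names, k):
        X = np.column_stack([features[name] for name in combo])
        for max_depth in (2, 4, 8, None):
            model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
            # scikit-learn reports the negated MSE so that higher is better
            scores = cross_val_score(
                model, X, occupancy, cv=5, scoring="neg_mean_squared_error"
            )
            results.append((-scores.mean(), combo, max_depth))

# Keep the combination of features and depth with the lowest mean MSE
best_mse, best_features, best_depth = min(results, key=lambda r: r[0])
print(best_features, best_depth, round(best_mse, 2))
```

Scoring every combination on held-out folds rather than on the training data matters here, because an unconstrained tree can drive its training MSE to nearly zero simply by memorizing the data.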

## Working example

This is a quick working example of how to get started with the tree classifier from scikit-learn. In this first code block, I am generating some example data that we will need to train and test our chosen algorithm.

```
import numpy as np

# 28 days with 32 samples per day
SAMPLE_SIZE = 896
# 4-week range: 28 days times 2*pi radians per day
t_range = np.linspace(0, 56 * np.pi, SAMPLE_SIZE)
# Generate a sine wave for the occupancy percentage (peaks at mid-day)
occupancy = 100 * (0.5 * np.cos(t_range - np.pi) + 0.5)
# Reduce occupancy on weekends (day 5 = Saturday, day 6 = Sunday)
occupancy *= np.tile([1, 1, 1, 1, 1, 0.7, 0.5], 4).repeat(SAMPLE_SIZE // 28)
# Add some random noise
occupancy *= np.random.uniform(0.95, 1, SAMPLE_SIZE)
# Round the occupancy percentage to the nearest integer
occupancy = np.rint(occupancy)
# Convert from radians to days and wrap into a 7-day week
t_range_clipped = (t_range / (2 * np.pi)) % 7
# Day of the week as an integer between 0 (Monday) and 6 (Sunday)
weekdays = np.floor(t_range_clipped)
# Transform the fractional part of the day to a 1-based hour label
hour = np.rint(np.modf(t_range_clipped)[0] * 24) + 1
```

The code above generates a hypothetical dataset for a car park that peaks at mid-day and has less occupancy on weekends. In this example, I generated 4 weeks of data. The image below shows the first 14 days of the generated dataset.

As you’ll see below, getting started with the sklearn library is extremely easy. This is a very simple example of how we could use the data generated above to predict future availability. In the example below, the Decision Tree Classifier predicts that on Sundays at mid-day the occupancy is around 48%, and that on Tuesdays at 9 am it is around 70%. Comparing this prediction with the plot above, we see that the tree was accurate.

```
from sklearn.tree import DecisionTreeClassifier
# Create an instance of DecisionTreeClassifier
clf = DecisionTreeClassifier()
# Input parameters (features): day-of-week, hour-of-day
input_features = np.stack([weekdays, hour], axis=-1)
# Train the classifier
clf.fit(input_features, occupancy)
# Make a prediction for:
# - Sunday at 12pm: [6, 12]
# - Tuesday at 9am: [1, 9]
prediction = clf.predict([[6, 12], [1, 9]])
print(prediction)
```
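Since occupancy is a continuous quantity and the MSE is the metric discussed earlier, a regression tree is a natural alternative to the classifier used above. Below is a minimal sketch, assuming a simplified hourly version of the example dataset. The `max_depth=6` setting and the 75/25 train/test split are arbitrary choices for illustration. It also prints the top of the fitted tree with `export_text`, which is one way to show which factors drive occupancy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)

# Simplified stand-in dataset: 4 weeks of hourly samples
hours_total = np.arange(28 * 24)
weekday = (hours_total // 24) % 7  # 0 = Monday ... 6 = Sunday
hour = hours_total % 24            # 0 ... 23

# Occupancy peaks at mid-day and drops on Saturday and Sunday
weekend_factor = np.array([1, 1, 1, 1, 1, 0.7, 0.5])[weekday]
occupancy = 100 * (0.5 - 0.5 * np.cos(2 * np.pi * hour / 24)) * weekend_factor
occupancy *= rng.uniform(0.95, 1, occupancy.size)

X = np.column_stack([weekday, hour])
X_train, X_test, y_train, y_test = train_test_split(
    X, occupancy, test_size=0.25, random_state=0
)

# max_depth caps the size of the tree to limit overfitting
reg = DecisionTreeRegressor(max_depth=6, random_state=0)
reg.fit(X_train, y_train)

# Evaluate on data the tree has not seen during training
test_mse = mean_squared_error(y_test, reg.predict(X_test))
print(f"held-out MSE: {test_mse:.2f}")

# Text rendering of the first levels of the fitted tree
print(export_text(reg, feature_names=["weekday", "hour"], max_depth=2))

# Share of the total impurity reduction attributed to each feature
importances = dict(zip(["weekday", "hour"], reg.feature_importances_))
print(importances)
```

Evaluating on a held-out split gives a more honest picture than predicting points the tree was trained on, and the feature importances give a compact summary that could be shared with parking operators.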