Philadelphia Parking Violations (2017)

23 Dec 2019

R / tidytuesday / geocomputation

data visualization / tidytuesday / geospatial

Reading data and data manipulation
Plot 1: Histogram of ticket issue times (hour)
Plot 2: Day of week details
Plot 3: Mapping parking violations by zip code

A few weeks ago, I got around to participating in TidyTuesday for the first time. The data set for that week (2019-12-03) contained data related to parking violations in the city of Philadelphia during 2017.

The plots I made and the code for each are below.

Plots

Histogram of ticket issue times

Totals per day by Day of the week

Map showing total tickets per day by zip code

I just realized that I did not share the code for any of the plots that I made so I will write it up here.

Reading data and data manipulation

First, I:

read in the data using read_csv {readr}
use head {utils} to look at the first few rows of the data

library(readr)

df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-03/tickets.csv") 
head(df)

## # A tibble: 6 x 7
##   violation_desc   issue_datetime       fine issuing_agency   lat   lon zip_code
##   <chr>            <dttm>              <dbl> <chr>          <dbl> <dbl>    <dbl>
## 1 BUS ONLY ZONE    2017-12-06 12:29:00    51 PPA             40.0 -75.1    19149
## 2 STOPPING PROHIB… 2017-10-16 18:03:00    51 PPA             40.0 -75.2    19127
## 3 OVER TIME LIMIT  2017-11-02 22:09:00    26 PPA             40.0 -75.2    19127
## 4 OVER TIME LIMIT  2017-11-05 20:19:00    26 PPA             40.0 -75.2    19127
## 5 STOP PROHIBITED… 2017-10-17 06:58:00    76 PPA             40.0 -75.2    19102
## 6 DOUBLE PARKED    2017-10-02 10:40:00    51 POLICE          40.0 -75.1       NA

Next, I make the following changes to the data:

separate the issue_datetime column into Date and Time columns using separate {tidyr}
use mutate {dplyr} to:
- Make the Date column a Date object using ymd {lubridate}
- create a new column, day, indicating the day of the week (e.g. Monday, Tuesday…) each ticket was issued

library(dplyr)
library(tidyr)
library(lubridate)

df <- separate(df, issue_datetime, into = c("Date", "Time"), sep = " ") %>%
  arrange(Date, Time) %>% ungroup %>% mutate(Date = ymd(Date),
                                             day = weekdays(Date))

The first 6 rows of df now contain the following information:

violation_desc	Date	Time	fine	issuing_agency	lat	lon	zip_code	day
HP RESERVED SPACE	2017-01-01	00:00:00	301	POLICE	39.99941	-75.14233	19133	Sunday
CORNER CLEARANCE	2017-01-01	00:00:00	51	POLICE	39.95505	-75.23199	19143	Sunday
FIRE HYDRANT	2017-01-01	00:00:00	76	POLICE	39.97776	-75.24400	19151	Sunday
BLOCKING DRIVEWAY	2017-01-01	00:28:00	51	POLICE	39.97362	-75.16394	19121	Sunday
FIRE HYDRANT	2017-01-01	00:41:00	76	POLICE	40.03015	-75.08998	19124	Sunday
DOUBLE PARKED CC	2017-01-01	00:50:00	76	POLICE	39.94893	-75.16093	NA	Sunday

Plot 1: Histogram of ticket issue times (hour)

Adding new variables

For this plot, I create new variables to:

determine an offset corresponding to the day of the week
determine the hour of day (from 0-23) each ticket was issued
take the sum of the two values above so each of the 168 hours of the week is differentiated

Calculating an offset for each day

I created a data frame with the offset for each day:

day <- c("Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday")
offset <- (0:6)*24

daylookup <- data.frame(day, offset)
daylookup

##         day offset
## 1    Monday      0
## 2   Tuesday     24
## 3 Wednesday     48
## 4  Thursday     72
## 5    Friday     96
## 6  Saturday    120
## 7    Sunday    144

Assigning each ticket an incident hour from 0 to 167

To assign each ticket to one of the 168 hours of the week, I carried out the following steps:

create a new variable hr by splitting the existing variable Time
convert hr into a numeric variable
join the main data set to the table of offsets, daylookup
create a new variable, incident_hr, of values 0-167 corresponding to the hour of the week.

df_hms <- separate(df, Time, into = c("hr", "minute", "second")) %>%
  mutate(hr = as.numeric(hr)) %>% left_join(daylookup) %>%
  mutate(incident_hr = offset+hr)

Making the plot

Now, I can make the plot using ggplot {ggplot2} using a histogram geometry.

library(ggplot2)
library(scales)

ggplot(data  = df_hms, aes(x = incident_hr)) +
  geom_histogram(fill = "royalblue", colour = "black", bins = 168) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(label = comma) + xlab("") + ylab("") + theme_bw() +
  ggtitle("Philadelphia Parking Tickets by Time of Issue (2017)") +
  theme(plot.title = element_text(hjust = 0.5))

Annotating the plot

Then I annotated the plot by adding:

vertical lines to separate the days of the week using geom_vline {ggplot2}
text to insert the day names using annotate {ggplot2}

ggplot(data  = df_hms, aes(x = incident_hr)) +
  geom_histogram(fill = "royalblue", colour = "black", bins = 168) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(label = comma) + xlab("") + ylab("") + theme_bw() +
  ggtitle("Philadelphia Parking Tickets by Time of Issue (2017)") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_vline(xintercept = 23.5, linetype = "dashed") +
  geom_vline(xintercept = 47.5, linetype = "dashed") +
  geom_vline(xintercept = 71.5, linetype = "dashed") +
  geom_vline(xintercept = 95.5, linetype = "dashed") +
  geom_vline(xintercept = 119.5, linetype = "dashed") +
  geom_vline(xintercept = 143.5, linetype = "dashed") + 
  annotate("text", x = 12, y = 27000, label = "Mon") +
  annotate("text", x = 36, y = 27000, label = "Tue") +
  annotate("text", x = 60, y = 27000, label = "Wed") +
  annotate("text", x = 84, y = 27000, label = "Thu") +
  annotate("text", x = 110, y = 27000, label = "Fri") +
  annotate("text", x = 134, y = 27000, label = "Sat") +
  annotate("text", x = 158, y = 27000, label = "Sun")

Plot 2: Day of week details

As background work to create this plot, I:

create a 0-1 variable newday to flag all the rows of the data where the date is not the same as the date of the row above it. This identifies the start of the records for each day because the data has already been sorted by datetime.
calculate the day of the year, daynum for each row by taking the cumulative sums of the newday column
determine the week of the year by taking the remainder when daynum is divided by 7 (and adding 1 so the weeks start at 1 and not 0).

df$newday <- ifelse(df$Date == lag(df$Date), 0, 1)
df$newday <- ifelse(is.na(df$newday), 1, df$newday)

df$daynum <- cumsum(df$newday)
df$week <- ((df$daynum - 1) %/% 7) + 1

Finding total tickets per day

Next I did a few more data manipulation operations:

aggregating total tickets per day over the week and day variables
make day a factor with levels corresponding to the days of the week (Monday to Sunday)

agg_week <- group_by(df, week, day) %>% summarise(Tickets = n())
aw <- agg_week
aw$day <- factor(aw$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))

Making a faceted line chart

Now to make a ggplot {ggplot2} with * with a line geometry * faceted on the week variable

library(ggplot2)
library(scales)

wkday_plt <- ggplot(data = aw, aes(x = week, y = Tickets)) + geom_line() +
  facet_wrap(~day, scales="free_x") + theme_minimal() +
  ggtitle("Philadelphia Parking Violations (2017)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(label = comma) + xlab("") + ylab("")

wkday_plt

Alternative way to make a very similar chart

After making the chart, I realized that I could have made essentially the same chart simply by using Date as the x variable input. The only difference would be that the x-axis values would be dates instead of the week numbers. (I remove the dates on the x-axis anyway below.)

df_agg_day <- group_by(df, Date, day) %>% summarise(Tickets = n()) %>% ungroup()
df_agg_day$day <- factor(df_agg_day$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))

wkday_plt2 <- ggplot(data = df_agg_day, aes(x = Date, y = Tickets)) + geom_line() +
  facet_wrap(~day) + theme_minimal() +
  ggtitle("Philadelphia Parking Violations (2017)") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank()) +
  scale_y_continuous(label = comma) + xlab("") + ylab("")

wkday_plt2

Making an interactive chart for the web with plotly

In the time since I originally made this plot, I have been using the plotly library. I think this is a good opportunity to demonstrate how easily it allows you to make an interactive chart for the web.

It’s really very simple:

create a ggplot
use the plot created as input to ggplotly

library(plotly)
ggplotly(wkday_plt2)

You can now hover over the chart and see the total tickets issued on a specific day and see the exact date. It’s a lot of functionality for very little effort.

Plot 3: Mapping parking violations by zip code

The key steps for making this plot are:

finding the average number of tickets per day in each zip code
putting the zip codes into groups
merging the numerical data to geospatial data corresponding to the shapes of the different zip codes represented in the data

Calculating tickets per day and grouping zip codes

use group_by {dplyr} and summarise {dplyr} to find the average number of tickets per day. (And don’t forget to ungroup with ungroup {dply} when you are done.)
add a new variable, Tickets per day to bucket the zip codes into 5 ranges

df_zip <- group_by(df, zip_code) %>% 
  summarise(Tickets = n()/365) %>% 
  ungroup() %>% 
  mutate(`Tickets per day` = ifelse(Tickets < 10, "< 10", 
                                    ifelse(Tickets < 100, "10 to 99",
                                           ifelse(Tickets < 200, "100 to 199",
                                                  ifelse(Tickets < 300, "200 to 299", "300+")))))

I really should learn how to avoid so many nested ifelse statements. I’m sure R – or a widely-used package – has a case statement.

Reading in shapefile (.shp)

I was able to get a shapefile for the zip codes of Philadelphia from OpenDataPhilly.

To read it in, I used the readOGR {rgdal} function which reads the shapefile into a Spatial vector object.

library(rgdal) # will load sp package as well 
philly <- readOGR("Zipcodes_Poly/Zipcodes_Poly.shp")

Looking at the created object, philly in the Viewer window we can see that the CODE field (contained in data) contains the 48 zip codes corresponding to the 48 polygons in polygons. Viewing Spatial Object in Viewer window

Merging shapefile with ticket frequency data

Next we can merge the data frame, df_zip (containing the average tickets per day for each zip code) with the Spatial object, philly (containing the specification of the polygons for the zip codes).

We can do this using merge {sp}. I load the sf package as I will be using another function from it immediately after.

library(sf)
philly <- merge(philly, df_zip, by.x = "CODE", 
                        by.y = "zip_code")

Converting to an sf object

Now we convert philly to an sf (Simple Features) object which will work very easily with ggplot {ggplot2}.

phi_sf <- st_as_sf(philly)

Making the plot

Actually making the plot with ggplot is very simple: just use the geom_sf {ggplot2} to use a simple features geometry and indicate that Tickets per day is the variable we want to use as the fill color for each zip code.

library(ggplot2)
ggplot() + geom_sf(data = phi_sf, aes(fill = `Tickets per day` )) + theme_minimal() +
  ggtitle("Philadelphia Parking Violations (2017)") + 
  theme(plot.title = element_text(hjust = 0.5))

Strictly speaking, these per day averages may be incorrect as approximately 14% of the tickets in the raw data have no corresponding zip code. So these averages are really lower bounds for 2017.

So that’s it. The plots I made for TidyTuesday on 2019-12-03. Until next time.