Storytelling With Data (SWD) Challenge: Your Choice Makeover

2018/07/07

Introduction: SWDChallenge

On her blog and Twitter account, Storytelling With Data, Cole Nussbaumer Knaflic is running a monthly challenge on effective data visualization. Previous challenges have mainly focused on using specific graph types (e.g., square area, bar charts, slopegraphs) or elements (annotation, color). This month the challenge is to remake a less-than-ideal graph. I happen to have one in mind, so I’m jumping in to the challenge!

The less-than-ideal graph

The graph I’m going to remake comes from the Kansas City, MO, citizen satisfaction survey for 2012-2013. It shows the composite citizen satisfaction for Kansas City compared against the National score.

I happened to come across this graph as part of my day job and it immediately stuck out to me because it’s so hostile to visualization rules of thumb and best practices.

The issues I see with the graph are:

On the arbitrary zero-point, it’s unclear whether the 100 value for the baseline survey means the same thing for Kansas City as it does for National. My reading is that the values are normalized so that each series is compared against its own score from 2005. So in effect what’s being shown is the change from the baseline, but expressed as 100 + change, and plotted as a bar. This seems, um, less than ideal.
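To make the concern concrete, here is a small sketch of how two series can both sit at 100 in the base year while their underlying levels differ. Everything in it is an assumption for illustration: the raw ratings are made up (they are not from the survey), the location names are placeholders, and I’m guessing at a ratio-to-baseline definition of the index, which may not be how the survey report computes it.

```r
# Hypothetical raw mean ratings on a 1-5 scale (NOT taken from the survey).
raw <- data.frame(
  location = rep(c("City", "Benchmark"), each = 2),
  year     = rep(c(2005, 2012), times = 2),
  rating   = c(3.0, 3.3, 4.0, 3.6),
  stringsAsFactors = FALSE
)

# Index each location to its own first (2005) rating, so both start at 100.
# Assumes the rows for each location are ordered with 2005 first.
raw$index <- round(ave(raw$rating, raw$location,
                       FUN = function(r) 100 * r / r[1]))
raw
```

In this made-up case the City index ends at 110 and the Benchmark index at 90, even though the Benchmark’s raw rating (3.6) is still higher than the City’s (3.3). Two bars that both start at 100 say nothing about which series is actually higher.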

Below I will approximately reproduce this chart in R and then suggest an alternative.

R libraries

Here are the libraries I need:

library(dplyr)
library(ggplot2)
library(tidyr)

data

For the data, I read the values off the chart, put them in a simple CSV file, and read that in. The file looks like,

year,location,score
2005,Kansas City,100
2010,Kansas City,108
2011,Kansas City,109
2012,Kansas City,111
2005,National,100
2010,National,92
2011,National,91
2012,National,92

and once I read it in, the data frame looks like,

kc_df
#> # A tibble: 8 x 3
#>    year location    score
#>   <int> <chr>       <int>
#> 1  2005 Kansas City   100
#> 2  2010 Kansas City   108
#> 3  2011 Kansas City   109
#> 4  2012 Kansas City   111
#> 5  2005 National      100
#> 6  2010 National       92
#> 7  2011 National       91
#> 8  2012 National       92
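The read step itself isn’t shown. A minimal sketch of one way to do it, assuming base `read.csv()` and an inline `csv_text` string standing in for the file (the actual file name isn’t given, so I’m not guessing at one):

```r
library(dplyr)

# Inline copy of the CSV contents listed above, so the sketch needs no file.
csv_text <- "year,location,score
2005,Kansas City,100
2010,Kansas City,108
2011,Kansas City,109
2012,Kansas City,111
2005,National,100
2010,National,92
2011,National,91
2012,National,92
"

# read.csv() parses year and score as integers; as_tibble() gives the
# tibble-style printing seen above.
kc_df <- read.csv(text = csv_text, stringsAsFactors = FALSE) %>%
  as_tibble()
kc_df
```

With a real file on disk, `read.csv("path/to/file.csv", stringsAsFactors = FALSE)` would replace the `text =` argument.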

The Original Graph

Here’s my attempt at reproducing the original.

First I make a dummy x variable that lets me encode both location and year as position along the x-axis.

plot_df = kc_df %>% 
  mutate(i1 = as.integer(as.factor(location)), 
         i2 = as.integer(as.factor(year))) %>% 
  mutate(dummy_x = i1 + i2 / 4 + ifelse(i1==2, 0.25, 0))
plot_df
#> # A tibble: 8 x 6
#>    year location    score    i1    i2 dummy_x
#>   <int> <chr>       <int> <int> <int>   <dbl>
#> 1  2005 Kansas City   100     1     1    1.25
#> 2  2010 Kansas City   108     1     2    1.5 
#> 3  2011 Kansas City   109     1     3    1.75
#> 4  2012 Kansas City   111     1     4    2   
#> 5  2005 National      100     2     1    2.5 
#> 6  2010 National       92     2     2    2.75
#> 7  2011 National       91     2     3    3   
#> 8  2012 National       92     2     4    3.25
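Since `dummy_x` is continuous, the axis labels for the two locations have to be placed by hand. One way to get candidate tick positions is to take the center of each group of bars. This is a sketch under the assumption that the group mean is a reasonable label position; the `plot_df` here is a cut-down copy with only the columns the calculation needs, and the hand-tuned breaks in the plotting code need not match these values exactly.

```r
library(dplyr)

# Cut-down copy of plot_df: only location and dummy_x are needed here.
plot_df <- data.frame(
  location = rep(c("Kansas City", "National"), each = 4),
  dummy_x  = c(1.25, 1.5, 1.75, 2, 2.5, 2.75, 3, 3.25),
  stringsAsFactors = FALSE
)

# Center of each group of bars along the dummy x-axis.
centers <- plot_df %>%
  group_by(location) %>%
  summarise(center = mean(dummy_x))
centers
```

This gives 1.625 for Kansas City and 2.875 for National.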

Now plot the data with ggplot, fiddling the settings to approximate the original.

long_title = "Overall Composite Customer Satisfaction Index \n for 2005, 2010-11, 2011-12, and 2012-13 \n derived from the mean overall satisfaction rating for the major categories of city services that were \n assessed on the survey (base year 2005 = 100)"

lab_vec = c("2005 Survey", 
            "2010 to 2011 Survey", 
            "2011 to 2012 Survey", 
            "2012 to 2013 Survey")
p = plot_df %>% 
  mutate(ic=as.factor(i2)) %>% 
  ggplot(aes(x=dummy_x, y=score, fill=ic)) + 
  geom_bar(stat='identity',width = 0.25, color='black') + 
  geom_text(aes(label=score), nudge_y = 1.5, color='black') + 
  coord_cartesian(ylim=c(80, 120)) + 
  scale_fill_manual(values = c("orange", "pink", "yellow", "blue"),
                    labels = lab_vec)  +
  scale_x_continuous(breaks = c(1.75, 2.875), 
                     labels = c("Kansas City, MO", "National")) + 
  theme_bw(base_size = 14)  + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(size = 18)) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.major.y= element_line(color = 'gray')) + 
  labs(title=long_title, x='', y='') 

print(p)

[Figure: reproduction of the original bar chart]

The result isn’t a perfect match but is close enough for my purposes.

updating the bar graph

As I’ll describe below, I don’t think a bar graph is the best way to represent this data. However, if we want to stick with bars there are a few updates we could make.

less_long_title = "Composite Customer Satisfaction Index \n compared to 2005 baseline"

lab_vec = c("2005 Survey", 
            "2010 to 2011 Survey", 
            "2011 to 2012 Survey", 
            "2012 to 2013 Survey")
p = plot_df %>% 
  mutate(ic=as.factor(year)) %>% 
  ggplot(aes(x=ic, y=score)) + 
  geom_bar(stat='identity',width = 0.25, color='black') + 
  coord_cartesian(ylim=c(80, 120)) + 
  scale_fill_manual(values = c("orange", "pink", "yellow", "blue"),
                    labels = lab_vec)  +
  theme_bw(base_size = 14)  + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(size = 18)) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.major.y= element_line(color = 'gray')) + 
  labs(title=less_long_title, x='Survey Year', y='Index') + 
  facet_wrap(~location, ncol=1)

print(p)

[Figure: bar chart faceted by location, one panel each for Kansas City and National]

An alternative is to use position dodge to put the locations side by side.

p = plot_df %>% 
  mutate(ic=as.factor(year)) %>% 
  ggplot(aes(x=ic, y=score, color=location, fill=location)) + 
  geom_bar(stat='identity', position = 'dodge') +
  theme_minimal(base_size = 18) + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(size = 18)) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.major.y= element_line(color = 'gray')) + 
  labs(title=less_long_title, x='Survey Year', y='Index') + 
  coord_cartesian(ylim=c(80, 120)) + 
  scale_fill_brewer(type='qual', palette = 2)

print(p)

[Figure: dodged bar chart by survey year, y-axis from 80 to 120]

We should probably start the y-axis at 0 as well.

p = plot_df %>% 
  mutate(ic=as.factor(year)) %>% 
  ggplot(aes(x=ic, y=score, color=location, fill=location)) + 
  geom_bar(stat='identity', position = 'dodge') +
  theme_minimal(base_size = 18) + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(size = 18)) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.major.y= element_line(color = 'gray')) + 
  labs(title=less_long_title, x='Survey Year', y='Index') + 
  coord_cartesian(ylim=c(0, 120)) + 
  scale_fill_brewer(type='qual', palette = 2)

print(p)

[Figure: dodged bar chart by survey year, y-axis starting at 0]
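A quick bit of arithmetic shows why the axis floor matters for bars. Using the 2012 values from the data, this sketch compares the true ratio of the two scores with the ratio of the bar heights as drawn when the axis starts at 80. It ignores the small amount of padding ggplot adds around the axis limits, so treat the drawn ratio as approximate.

```r
# 2012 index values from the data above.
kc         <- 111
national   <- 92
axis_floor <- 80   # lower y limit used in the truncated bar charts

# Ratio of the values themselves vs. ratio of the visible bar heights.
true_ratio  <- kc / national
drawn_ratio <- (kc - axis_floor) / (national - axis_floor)

round(c(true = true_ratio, drawn = drawn_ratio), 2)
```

The scores differ by a factor of about 1.21, but with the axis cut at 80 the Kansas City bar is drawn about 2.58 times as tall as the National bar.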

remake as a line chart

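To go back to the properties of the data we’re working with: it’s an index tracked over several survey years, for two locations, and (on my reading) each location is expressed relative to its own 2005 baseline.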

My “best-practices” intuition tells me we should use a line chart, group by location, and subtract the baseline.

lab_df = data.frame(x=2012, y=c(8, -6), location=c('Kansas City', 'National'), 
                    stringsAsFactors = FALSE)

p = kc_df %>% 
  filter(year>2005) %>% 
  ggplot(aes(x=year, y=score-100)) + 
  geom_line(aes(group=location, color=location), size=1.5) + 
  geom_point(aes(group=location, color=location), size=2) + 
  theme_minimal(base_size = 18) + 
  geom_hline(yintercept = 0, linetype = 2) + 
  scale_color_manual(values = c("#1F78B4", "#d95f02")) + 
  labs(x='Year', y=expression(Delta*' Index'), 
       title='Composite Customer Satisfaction Index: \n Change vs 2005') + 
  scale_x_continuous(breaks = c(2010, 2011, 2012)) + 
  theme(legend.position="bottom") + 
  theme(legend.title = element_blank(), 
        plot.title = element_text(size=18, hjust = 0.5))
print(p)

[Figure: line chart of change in index vs. 2005, one line per location]
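The `lab_df` data frame is defined above but isn’t used in the plot as written. One possible use, and this is my assumption rather than something the code above does, is to label the lines directly and drop the legend. A minimal sketch, with `kc_df` rebuilt inline so the block stands alone and `p_direct` as a made-up name:

```r
library(dplyr)
library(ggplot2)

# Same values as kc_df above, rebuilt inline so this block runs on its own.
kc_df <- data.frame(
  year     = rep(c(2005L, 2010L, 2011L, 2012L), times = 2),
  location = rep(c("Kansas City", "National"), each = 4),
  score    = c(100L, 108L, 109L, 111L, 100L, 92L, 91L, 92L),
  stringsAsFactors = FALSE
)

# Label positions, as in lab_df above: near the right-hand end of each line.
lab_df <- data.frame(x = 2012, y = c(8, -6),
                     location = c("Kansas City", "National"),
                     stringsAsFactors = FALSE)

p_direct <- kc_df %>%
  filter(year > 2005) %>%
  ggplot(aes(x = year, y = score - 100, color = location)) +
  geom_line(aes(group = location)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  # Direct labels replace the legend; color is inherited via location.
  geom_text(data = lab_df, aes(x = x, y = y, label = location), hjust = 1) +
  scale_color_manual(values = c("#1F78B4", "#d95f02")) +
  scale_x_continuous(breaks = c(2010, 2011, 2012)) +
  theme_minimal(base_size = 18) +
  theme(legend.position = "none")

print(p_direct)
```

The label coordinates are hand-placed, so they would need adjusting if the data or the plot size changed.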

remake as a table

This data is so simple that I’m not convinced a graph helps to get insight. An additional alternative is to simply use a table for the data, for example,

kc_df  %>% spread(year, score) %>% knitr::kable()
|location    | 2005| 2010| 2011| 2012|
|:-----------|----:|----:|----:|----:|
|Kansas City |  100|  108|  109|  111|
|National    |  100|   92|   91|   92|
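If the change from the baseline is what matters, the table could show that directly instead of the 100-based index. A sketch along the same lines as the `spread()` call above, with `kc_df` rebuilt inline so the block stands alone and `change_tbl` as a made-up name:

```r
library(dplyr)
library(tidyr)

# Same values as kc_df above, rebuilt inline so this block runs on its own.
kc_df <- data.frame(
  year     = rep(c(2005L, 2010L, 2011L, 2012L), times = 2),
  location = rep(c("Kansas City", "National"), each = 4),
  score    = c(100L, 108L, 109L, 111L, 100L, 92L, 91L, 92L),
  stringsAsFactors = FALSE
)

# Subtract the baseline of 100, then put one column per survey year.
change_tbl <- kc_df %>%
  mutate(change = score - 100L) %>%
  select(-score) %>%
  spread(year, change)
change_tbl
```

This gives 0, 8, 9, 11 for Kansas City and 0, -8, -9, -8 for National across 2005, 2010, 2011, and 2012.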