-
Notifications
You must be signed in to change notification settings - Fork 1
/
00_04_ggplot_practical.qmd
221 lines (162 loc) · 7.02 KB
/
00_04_ggplot_practical.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
title: "P03. Introduction to `ggplot`"
---
## Part 1: single outbreak
Consider an SIR model. A CSV file has been provided for each of the three populations in the model: S - susceptible, I - infectious, R - recovered. The model used to simulate the disease is
```{r}
f_sir <- function(time, state, parameters) {
with(as.list(c(state, parameters)), {
infections <- beta*S*I
deaths <- gamma*I
dS <- -infections
dI <- infections - deaths
dR <- deaths
return(list(c(dS, dI, dR)))
})
}
```
where beta is the transmission rate (per person, per day) and gamma is the recovery rate (by day). This practical is only for plotting - so you don't necessarily need to fully understand this function in order to move on.
### Preparing data for plotting
#### a) Load the relevant packages and read in the CSV file
The CSV file may be downloaded from [here](data/beta_1.56756_gamma_0.36508.csv).
```{r eval = F}
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
all_dat <- read_csv("beta_1.56756_gamma_0.36508.csv")
all_dat
```
```{r echo = F, eval = F}
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
all_dat <- read_csv("data/beta_1.56756_gamma_0.36508.csv")
all_dat
```
should return four columns: \| time \| S \| I \| R \|
```{r eval = F}
all_dat_long <- pivot_longer(all_dat,
cols = c(S, I, R),
names_to = "state",
values_to = "proportion")
all_dat_long
```
should return three columns: \| time \| state (key) \| proportion (value) \|
### Plotting reshaped data
#### b) With your long data frame, make a plot that shows how the proportion of the population in each state changes over time.
Use the line geometry and color each line by state. You might find the following link useful <https://ggplot2.tidyverse.org/reference/index.html>
```{r eval = F}
ggplot(data = all_dat_long,
aes(x = time,
y = proportion)) +
geom_line(aes(color = <YOUR CODE HERE>))
```
#### c) Modify the levels of the state variable to be in S,I,R order rather than in alphabetical order.
Copy and paste the code from the previous plot and re-run it to see how the ordering of the states has changed
```{r eval = F}
all_dat_long$state <- factor(all_dat_long$state,
levels = c("S", "I", "R"))
plot_SIR <- ggplot(data = all_dat_long,
aes(<YOUR CODE HERE>) +
geom_line(<YOUR CODE HERE>)
plot_SIR
```
#### d) Add labels to the x and y axes, change to theme_bw() and set the position of the legend to be at the bottom of the plot
```{r eval = F}
plot_SIR +
theme_bw() +
xlab("<YOUR CODE HERE>") +
ylab("<YOUR CODE HERE>") +
theme(legend.position = "bottom")
```
#### e) Instead of colour, use small multiples to show how each state changes over time
```{r eval = F}
plot_SIR_faceted <- ggplot(data = all_dat_long,
aes(x = time,
y = proportion)) +
geom_line() +
theme_bw() +
xlab("Time (days)") +
ylab("Population proportion") +
facet_wrap(facets = vars(state))
plot_SIR_faceted
```
## Part 2: 100 outbreaks
We are still considering an outbreak of a disease with S, I, and R. But this time, we are working with 100 simulations. This time, the file you are going to work with can be downloaded [here](data/100_simulations_wide.csv).
### Preparing data for plotting
#### a) Read in the 100 simulation data set
```{r eval = F}
#
all_dat_100 <- read_csv("100_simulations_wide.csv")
all_dat_100
```
should return five columns: \| sim \| time \| S \| I \| R \|
```{r eval = F}
all_dat_100_long <- pivot_longer(all_dat_100,
cols = <YOUR CODE HERE>,
names_to = <YOUR CODE HERE>,
values_to = <YOUR CODE HERE>)
all_dat_100_long$state <- <YOUR CODE HERE>
all_dat_100_long
```
should return four columns: \| sim \| time \| state (key) \| proportion (value) \|
#### b) Plot all 100 simulations from the SIR model
Hint: you will need to use the group aesthetic with your line and may choose to set the lines to be semi-transparent. Use faceting (small multiples) to show each state in its own subplot; faceting will be easier in all plots from this point on than colouring by state.
```{r eval = F}
ggplot(data = all_dat_100_long,
aes(x = time,
y = proportion)) +
geom_line(aes(group = sim), alpha = 0.05) +
theme_bw() +
facet_grid(<YOUR CODE HERE>) +
xlab("Time(days)") +
ylab("Proportion of population")
```
### Calculating summary statistics
#### c) Use the `group_by()` function to tell R that we want to calculate summary statistics for each state at each time
```{r eval = F}
all_dat_100_long_grouped <- group_by(all_dat_100_long, state, time)
all_dat_100_long_grouped
```
should return four columns: \| sim \| time \| state (key) \| proportion (value) \|
should also let you know there are 303 groups
#### d) Use the summarise function to calculate the median and 95% interval for each state at each time point.
You will need to calculate all three summary statistics separately, and will do so within the same `summarise()`
```{r eval = F}
all_dat_100_long_summarised <-
summarise(all_dat_100_long_grouped,
q0.025 = quantile(proportion, probs = 0.025),
q0.500 = quantile(proportion, probs = 0.5),
q0.975 = quantile(proportion, probs = 0.975))
```
#### e) Use `geom_ribbon` to plot the 95% interval and geom_line to plot the median for each state.
For its aesthetics, geom_ribbon requires a ymin and ymax and can be coloured and filled and made semi-transparent. Ensure you label the axes appropriately.
```{r eval = F}
ggplot(data = all_dat_100_long_summarised,
aes(x = time)) +
geom_ribbon(aes(ymin = q0.025, # lower edge of ribbon
ymax = q0.975), # upper edge of ribbon
alpha = 0.5, # make semi-transparent
fill = "lightskyblue", # fill blue
color = NA) + # no border color
geom_line(aes(y = q0.500)) + # line for median
facet_grid(cols = vars(state)) +
theme_bw() + # nicer theme
xlab("Time (days)") + # human friendly axis label
ylab("Population") # human friendly axis label
```
#### f) If you have time left, you may wish to investigate visualising all 100 simulations, colouring by state, as before, and faceting by simulation
```{r eval = F}
ggplot(data = all_dat_100_long,
aes(x = time)) +
geom_line(aes(y = proportion, color = state)) +
facet_wrap(facets = vars(sim)) +
theme_bw() + # nicer theme
xlab("Time (days)") + # human friendly axis label
ylab("Population") + # human friendly axis label
theme(legend.position = "bottom")
```
#### g) Discuss whether you think faceting by state or simulation gives a clearer understanding of how the simulations vary
Solutions can be accessed [here](00_04_ggplot_solutions.qmd).