bit.ly/njt-jsm21
nj_tierney
How to make better spaghetti 🍝
How to explore longitudinal data effectively
Individuals repeatedly measured through time
country | year | height_cm |
---|---|---|
Australia | 1910 | 173 |
Australia | 1920 | 173 |
Australia | 1960 | 176 |
Australia | 1970 | 178 |
🎉 Individuals repeatedly measured through time
Overplotting
🙈 We don't see the individuals
🤷 Looking at 144 plots doesn't really help
Problem #1: How do I look at some of the data?
Problem #2: How do I find interesting observations?
Problem #3: How do I understand a model?
brolgar (brolgar.njtierney.com): browsing over longitudinal data graphically and analytically in R
Individuals repeatedly measured through time
Anything that is observed sequentially over time is a time series
heights <- as_tsibble(heights, index = year, key = country, regular = FALSE)
Together, the index (1.) and the key (2.) determine distinct rows in a tsibble.
(From Dr. Earo Wang's talk: Melt the clock)
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##   country      year height_cm
##   <chr>       <dbl>     <dbl>
## 1 Afghanistan  1870      168.
## 2 Afghanistan  1880      166.
## 3 Afghanistan  1930      167.
## 4 Afghanistan  1990      167.
## 5 Afghanistan  2000      161.
## 6 Albania      1880      170.
## # … with 1,484 more rows
We add information about index + key:
📐 Index = Year
🔑 Key = Country
Record important time series information once
Use it many times in other places
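A minimal sketch of "record once, use many times", built from the four Australia rows shown earlier. The object names `aus` and `aus_ts` are made up for illustration; `key_vars()`, `index_var()`, and `n_keys()` are tsibble helpers that read back the stored information.

```r
library(tsibble)
library(tibble)

# Rows taken from the Australia example table earlier in the slides
aus <- tibble(
  country   = rep("Australia", 4),
  year      = c(1910, 1920, 1960, 1970),
  height_cm = c(173, 173, 176, 178)
)

# Record the index and key once; regular = FALSE because the years are unevenly spaced
aus_ts <- as_tsibble(aus, index = year, key = country, regular = FALSE)

# Other functions can now look this information up instead of asking for it again
key_vars(aus_ts)   # name of the key column
index_var(aus_ts)  # name of the index column
n_keys(aus_ts)     # number of distinct keys
```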
data indigestion
How many keys are there?
How many facets do I want?
How many keys per facet?
How to keep the same number of keys per plot?
What is rep, rep.int, and rep_len?
Do I want length.out or times?
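For reference, a small base R sketch of the differences these questions are about. The `keys` vector and the 144-keys-into-12-facets split are illustrative assumptions, not code from the talk.

```r
keys <- c("a", "b", "c")

# times: repeat the whole vector a fixed number of times
by_times <- rep(keys, times = 2)

# length.out: recycle the vector until the result has exactly this length
by_length <- rep(keys, length.out = 5)

# rep_len() and rep.int() are simplified versions of the two calls above
by_rep_len <- rep_len(keys, 5)
by_rep_int <- rep.int(keys, 2)

# Hypothetical hand-rolled facet assignment: 144 keys spread evenly over 12 facets
facet_id <- rep_len(1:12, 144)
table(facet_id)
```

This is the bookkeeping that gets in the way of the actual goal, which is looking at the data.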
(Something I made up)
If solving a problem requires solving 3+ smaller problems, your focus shifts from the current goal to something else.
You are distracted.
Task one
Task one being overshadowed slightly by minor task 1
When we are distracted, we can blame ourselves for not being better.
It's not that we should be better; rather, with better tools we could be more efficient.
We need to make things as easy as reasonable, with the least amount of distraction.
facet_sample(): See more individuals 👀
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line()
ggplot(heights, aes(x = year, y = height_cm, group = country)) + geom_line() + facet_sample()
How many keys per facet?
How many plots do I want to look at?
gg_heights + facet_sample(n_per_facet = 3, n_facets = 9)
facet_strata(): See all individuals
gg_heights + facet_strata()
In asking these questions we can solve something else interesting
gg_heights + facet_strata(along = -year)
facet_sample()
"How many lines per facet?"
"How many facets?"
gg_heights + facet_sample(n_per_facet = 10, n_facets = 12)
facet_strata()
"How many facets / strata?"
"What to arrange plots along?"
gg_heights + facet_strata(n_strata = 10, along = -year)
facet_strata() & facet_sample(), with sample_n_keys() & stratify_keys()
You can still get at data and do manipulations
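A rough sketch of getting at the data directly with these two helpers, using the `heights` data that ships with brolgar. The seed, the object names, and the choice of `along = height_cm` are arbitrary assumptions for illustration.

```r
library(brolgar)
library(tsibble)
library(dplyr)

set.seed(2021)  # arbitrary seed so the sample is reproducible

# Keep every row for 9 randomly chosen countries
heights_nine <- heights %>% sample_n_keys(size = 9)
n_keys(heights_nine)

# Assign each country to one of 10 strata, ordered along its mean height;
# the result gains a `.strata` column that can be used like any other variable
heights_strata <- heights %>% stratify_keys(n_strata = 10, along = height_cm)

# Count how many countries landed in each stratum
heights_strata %>%
  as_tibble() %>%
  distinct(country, .strata) %>%
  count(.strata)
```

Because the result is still a data frame, the usual dplyr verbs apply before any plotting happens.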
as_tsibble(): Store useful information
facet_sample(): View many subsamples
facet_strata(): View all subsamples
A workflow
Define what is interesting:
maximum height
Let's see that one more time, but with the data
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##   country      year height_cm
##   <chr>       <dbl>     <dbl>
## 1 Afghanistan  1870      168.
## 2 Afghanistan  1880      166.
## 3 Afghanistan  1930      167.
## 4 Afghanistan  1990      167.
## 5 Afghanistan  2000      161.
## 6 Albania      1880      170.
## # … with 1,484 more rows
## # A tibble: 144 x 2
##   country       max
##   <chr>       <dbl>
## 1 Afghanistan  168.
## 2 Albania      170.
## 3 Algeria      171.
## 4 Angola       169.
## 5 Argentina    174.
## 6 Armenia      172.
## # … with 138 more rows
heights_five %>% filter(max == max(max) | max == min(max))
## # A tibble: 2 x 2
##   country            max
##   <chr>            <dbl>
## 1 Denmark           183.
## 2 Papua New Guinea  161.
heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
## # A tibble: 21 x 4
##   country   max  year height_cm
##   <chr>   <dbl> <dbl>     <dbl>
## 1 Denmark  183.  1820      167.
## 2 Denmark  183.  1830      165.
## 3 Denmark  183.  1850      167.
## 4 Denmark  183.  1860      168.
## 5 Denmark  183.  1870      168.
## 6 Denmark  183.  1880      170.
## # … with 15 more rows
But Nick, how did you create those summaries?
heights %>% features(height_cm, feat_five_num)
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
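The whole workflow can be sketched end to end as below. The slides never show how `heights_five` was created, so building it from `feat_five_num` and keeping only `country` and `max` is an assumption; `extremes` and `extremes_full` are made-up names.

```r
library(brolgar)
library(dplyr)

# Assumption: `heights_five` holds one row per country with its maximum height
heights_five <- heights %>%
  features(height_cm, feat_five_num) %>%
  select(country, max)

# Define what is interesting: the countries with the largest and smallest maximum
extremes <- heights_five %>%
  filter(max == max(max) | max == min(max))

# Join back to the full data to recover every observation for those countries
extremes_full <- extremes %>%
  left_join(heights, by = "country")

extremes_full
```

Summarising down to one row per key, filtering, then joining back is the pattern; any other feature could stand in for `max`.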
feat_ranges
heights %>% features(height_cm, feat_ranges)
## # A tibble: 144 x 5
##   country       min   max range_diff   iqr
##   <chr>       <dbl> <dbl>      <dbl> <dbl>
## 1 Afghanistan  161.  168.       7     3.27
## 2 Albania      168.  170.       2.20  1.53
## 3 Algeria      166.  171.       5.06  2.15
## 4 Angola       159.  169.      10.5   7.87
## 5 Argentina    167.  174.       7     2.21
## 6 Armenia      164.  172.       8.82  5.30
## # … with 138 more rows
feat_monotonic
heights %>% features(height_cm, feat_monotonic)
## # A tibble: 144 x 5
##   country     increase decrease unvary monotonic
##   <chr>       <lgl>    <lgl>    <lgl>  <lgl>
## 1 Afghanistan FALSE    FALSE    FALSE  FALSE
## 2 Albania     FALSE    TRUE     FALSE  TRUE
## 3 Algeria     FALSE    FALSE    FALSE  FALSE
## 4 Angola      FALSE    FALSE    FALSE  FALSE
## 5 Argentina   FALSE    FALSE    FALSE  FALSE
## 6 Armenia     FALSE    FALSE    FALSE  FALSE
## # … with 138 more rows
feat_spread
heights %>% features(height_cm, feat_spread)
## # A tibble: 144 x 5
##   country        var    sd   mad   iqr
##   <chr>        <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  7.20  2.68  1.65   3.27
## 2 Albania      0.950 0.975 0.667  1.53
## 3 Algeria      3.30  1.82  0.741  2.15
## 4 Angola      16.9   4.12  3.11   7.87
## 5 Argentina    2.89  1.70  1.36   2.21
## 6 Armenia     10.6   3.26  3.60   5.30
## # … with 138 more rows
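Any of the feature sets above can feed the same "find the interesting keys" step. As one possible sketch, this keeps only the countries whose heights always increase; the object names `increasing` and `heights_increasing` are made up for illustration.

```r
library(brolgar)
library(tsibble)
library(dplyr)

# One row per country, with logical columns increase / decrease / unvary / monotonic
increasing <- heights %>%
  features(height_cm, feat_monotonic) %>%
  filter(increase)

# Keep the full series for just those countries
heights_increasing <- heights %>%
  filter(country %in% increasing$country)

n_keys(heights_increasing)
```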
feasts
feat_acf: autocorrelation-based features
feat_stl: STL (Seasonal, Trend, and Remainder by LOESS) decomposition
🤔
Let's fit a mixed effects model.
Fixed effect of year + Random intercept for country
library(lme4)    # lmer()
library(modelr)  # add_predictions(), add_residuals()
heights_fit <- lmer(height_cm ~ year + (1|country), heights)
heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")
## # A tsibble: 1,490 x 5 [!]
## # Key:       country [144]
##   country      year height_cm  pred    res
##   <chr>       <dbl>     <dbl> <dbl>  <dbl>
## 1 Afghanistan  1870      168.  164.  4.59
## 2 Afghanistan  1880      166.  164.  1.52
## 3 Afghanistan  1930      167.  166.  0.823
## 4 Afghanistan  1990      167.  168. -1.04
## 5 Afghanistan  2000      161.  169. -7.10
## 6 Albania      1880      170.  168.  2.39
## # … with 1,484 more rows
facet_sample()
gg_heights_fit + facet_sample()
facet_strata()
gg_heights_fit + facet_strata()
gg_heights_fit + facet_strata(along = -res)
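The slides do not show how `gg_heights_fit` was defined. A plausible sketch, assuming it draws one line of predicted height per country from the augmented data, might look like this:

```r
library(brolgar)
library(dplyr)
library(ggplot2)
library(lme4)
library(modelr)

# Same model as above: fixed effect of year, random intercept for country
heights_fit <- lmer(height_cm ~ year + (1 | country), heights)

heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")

# Assumption: gg_heights_fit plots the model predictions, one line per country
gg_heights_fit <- ggplot(heights_aug,
                         aes(x = year, y = pred, group = country)) +
  geom_line()

# Arrange the strata along the residuals, so similarly fitting countries share a facet
gg_heights_fit + facet_strata(along = -res)
```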
set.seed(2020-01-21)
heights_sample <- heights_aug %>%
  sample_n_keys(size = 9) %>%
  ggplot(aes(x = year, y = pred, group = country)) +
  geom_line() +
  facet_wrap(~country)
heights_sample
heights_sample + geom_point(aes(y = height_cm))
facet_sample() / facet_strata()
More features (summaries)
Generalise beyond longitudinal data
Explore stratification process
Work with dplyr::across() & dplyr::pick()
Slides made using xaringan
Extended with xaringanthemer
Colours modified from ochRe::lorikeet
Header font is Josefin Sans
Body text font is Montserrat
Code font is Fira Mono
BONUS ROUND
"If you make a tooltip or rollover, assume no one will ever see it" -- Archie Tse, NYT
summary(heights_aug$res)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -8.1707 -1.6202 -0.1558  0.0000  1.3545 12.1729
Which countries are nearest to these statistics?
keys_near()
keys_near(heights_aug, var = res)
## # A tibble: 6 x 5
##   country       res stat  stat_value stat_diff
##   <chr>       <dbl> <fct>      <dbl>     <dbl>
## 1 Ireland    -8.17  min       -8.17   0
## 2 Azerbaijan -1.62  q_25      -1.62   0.000269
## 3 Laos       -0.157 med       -0.156  0.00125
## 4 Mongolia   -0.155 med       -0.156  0.00125
## 5 Egypt       1.35  q_75       1.35   0.000302
## 6 Poland     12.2   max       12.2    0
🔑 🔑 🔑 that best match the 5 number summary.
End.