--- title: "A real world example" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{a-real-world-example} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(chronicler) ``` This is practically the same code you can find on this blog post of mine: https://www.brodrigues.co/blog/2018-11-14-luxairport/ but with some minor updates to reflect the current state of the `{tidyverse}` packages as well as logging using `{chronicler}`. Let's first load the required packages, and the `avia` dataset included in the `{chronicler}` package: ```{r parsermd-chunk-1} library(chronicler) library(dplyr) library(tidyr) library(stringr) library(lubridate) # Ensure chronicler version of `pick()` is being used pick <- chronicler::pick data("avia") ``` Now I need to define the needed functions for the analysis. To improve logging, I add the `dim()` function as the `.g` argument of each function below. This will make it possible to see how the dimensions of the data change inside the pipeline: ```{r parsermd-chunk-2} # Define required functions # You can use `record_many()` to avoid having to write everything r_select <- record(select, .g = dim) r_pivot_longer <- record(pivot_longer, .g = dim) r_filter <- record(filter, .g = dim) r_separate <- record(separate, .g = dim) r_group_by <- record(group_by, .g = dim) r_summarise <- record(summarise, .g = dim) ``` We can now start by preparing the data: ```{r parsermd-chunk-3} avia_clean <- avia %>% r_select(1, contains("20")) %>% # select the first column and every column starting with 20 bind_record(r_pivot_longer, -starts_with("freq"), names_to = "date", values_to = "passengers") %>% bind_record(r_separate, col = 1, into = c("freq", "unit", "tra_meas", "air_pr\\time"), sep = ",") ``` Let's take a look at the data: ```{r parsermd-chunk-4} avia_clean ``` The passengers column contains `":"` characters instead of `NA`s, and it's a character column. Let's convert this column to numbers: ```{r parsermd-chunk-5} r_mutate <- record(mutate, .g = dim) avia_clean2 <- avia_clean %>% bind_record(r_mutate, passengers = as.numeric(passengers)) ``` Let's look at the data: ```{r parsermd-chunk-6} avia_clean2 ``` What happened? Let's read the log to find out! ```{r parsermd-chunk-7} read_log(avia_clean2) ``` So what happened is that `as.numeric()` introduced `NA`s by coercion. This is what happens when trying to convert a character to a number, for example `as.numeric(":")` will result in an `NA`. Because `mutate()` was recorded with the default value for its `strict` argument (which is `2`), warnings get promoted to errors. This can be quite useful to avoid problems with silent conversions. But in this case, we want to ignore the warning: let's record `mutate()` with `strict = 1`, so that only errors can stop the pipeline: ```{r parsermd-chunk-8} r_mutate_lenient <- record(mutate, .g = dim, strict = 1) avia_clean2 <- avia_clean %>% bind_record(r_mutate_lenient, passengers = as.numeric(passengers) ) ``` As you can see, the warnings get printed, they're not captured. We can now take a look at the data and see that `":"` characters where successfully replaced by `NA`s: ```{r parsermd-chunk-9} avia_clean2 ``` Let’s continue and focus on monthly data: ```{r parsermd-chunk-10} avia_monthly <- avia_clean2 %>% bind_record(r_filter, freq == "M", tra_meas == "PAS_BRD_ARR", !is.na(passengers)) %>% bind_record(r_mutate, date = paste0(date, "01"), date = ymd(date)) %>% bind_record(r_select, destination = "air_pr\\time", date, passengers) ``` To make sure I only have monthly data, I can count the values of the `date` column using `dplyr::count()`. But because `avia_monthly` is not a data frame, but a `chronicle` I need to `record()` the `dplyr::count()` function. But because I only need it this once, I could instead use `fmap_record()`, which makes it possible to apply an undecorated function to a `chronicle` object: ```{r parsermd-chunk-11} fmap_record(avia_monthly, count, date) ``` `avia_monthly` is an object of class `chronicle`, but in essence, it is just a list, with its own print method: ```{r parsermd-chunk-12} avia_monthly ``` Now that the data is clean, we can read the log: ```{r parsermd-chunk-13} read_log(avia_monthly) ``` This is especially useful if the object `avia_monthly` gets saved using `saveRDS()`. People can then read this object, can read the log to know what happened and reproduce the steps if necessary. Let's take a look at the final data set: ```{r parsermd-chunk-14} avia_monthly %>% pick("value") ``` It is also possible to take a look at the underlying `.log_df` object that contains more details, and see the output of the `.g` argument (which was defined in the beginning as the `dim()` function): ```{r parsermd-chunk-15} check_g(avia_monthly) ``` ```{r parsermd-chunk-16, include = FALSE} hu <- check_g(avia_monthly)$g ``` After `select()` the data has `hu[[1]][1]` rows and `hu[[1]][2]` columns, after the call to `pivot_longer()`, `hu[[2]][1]` rows and `hu[[2]][2]` columns, `separate()` adds three columns, after `filter()` only `hu[[5]][1]` rows remain (`mutate()` does not change the dimensions) and then `select()` is used to remove three columns.