tidyr::fill()
Jessica Tran
In this document, I will introduce the fill function and show why it’s useful.
#load tidyverse up
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'purrr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
What is it for?
The Problem:
Missing values.
What Fill does:
Fills missing values with the previous entry. It’s like putting in values for a "" ditto mark.
Why would I need this function?
Fixes redundant info. Sometimes in Excel, people don’t fill in the data if it’s redundant, which needs to be fixed before working the the data in R.
Arguments
Data: A data frame
Command: Fill(name of column of interest)
.direction: Direction to fill missing values. “Down” is default. Other options: “up”, “downup” (down, then up), or “updown” (up, then down).
Use context to identify which direction.
Examples
#load the palmerpenguins library
library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.0.3
#load the dataset using data()
data(penguins)
penguins
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge~ 39.1 18.7 181 3750
## 2 Adelie Torge~ 39.5 17.4 186 3800
## 3 Adelie Torge~ 40.3 18 195 3250
## 4 Adelie Torge~ NA NA NA NA
## 5 Adelie Torge~ 36.7 19.3 193 3450
## 6 Adelie Torge~ 39.3 20.6 190 3650
## 7 Adelie Torge~ 38.9 17.8 181 3625
## 8 Adelie Torge~ 39.2 19.6 195 4675
## 9 Adelie Torge~ 34.1 18.1 193 3475
## 10 Adelie Torge~ 42 20.2 190 4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
We will use a different dataset since penguins doesn’t have many missing values.
Continuous Variables
Note: Tibble is a tidy data frame. It makes data easier to view.
sales <- tibble::tribble(
~quarter, ~year, ~sales,
"Q1", 2000, 66013,
"Q2", NA, 69182,
"Q3", NA, 53175,
"Q4", NA, 21001,
"Q1", 2001, 46036,
"Q2", NA, 58842,
"Q3", NA, 44568,
"Q4", NA, 50197,
"Q1", 2002, 39113,
"Q2", NA, 41668,
"Q3", NA, 30144,
"Q4", NA, 52897,
"Q1", 2004, 32129,
"Q2", NA, 67686,
"Q3", NA, 31768,
"Q4", NA, 49094
)
In the above dataset, the ‘year’ is missing in the subsequent quarters, so we will use the default setting for fill().
# 'fill()' defaults to replacing missing data from top to bottom
# set the column of interest as the argument
sales %>% fill(year)
## # A tibble: 16 x 3
## quarter year sales
## <chr> <dbl> <dbl>
## 1 Q1 2000 66013
## 2 Q2 2000 69182
## 3 Q3 2000 53175
## 4 Q4 2000 21001
## 5 Q1 2001 46036
## 6 Q2 2001 58842
## 7 Q3 2001 44568
## 8 Q4 2001 50197
## 9 Q1 2002 39113
## 10 Q2 2002 41668
## 11 Q3 2002 30144
## 12 Q4 2002 52897
## 13 Q1 2004 32129
## 14 Q2 2004 67686
## 15 Q3 2004 31768
## 16 Q4 2004 49094
Categorical Variables
# Value (pet_type) is missing above
tidy_pets <- tibble::tribble(
~rank, ~pet_type, ~breed,
1L, NA, "Boston Terrier",
2L, NA, "Retrievers (Labrador)",
3L, NA, "Retrievers (Golden)",
4L, NA, "French Bulldogs",
5L, NA, "Bulldogs",
6L, "Dog", "Beagles",
1L, NA, "Persian",
2L, NA, "Maine Coon",
3L, NA, "Ragdoll",
4L, NA, "Exotic",
5L, NA, "Siamese",
6L, "Cat", "American Short"
)
In the above dataset, the ‘pet_type’ is missing values. From the values in the ‘breed’ column we see that we must fill from bottom to the top.
# For values that are missing above you can use `.direction = "up"`
tidy_pets %>%
fill(pet_type, .direction = "up")
## # A tibble: 12 x 3
## rank pet_type breed
## <int> <chr> <chr>
## 1 1 Dog Boston Terrier
## 2 2 Dog Retrievers (Labrador)
## 3 3 Dog Retrievers (Golden)
## 4 4 Dog French Bulldogs
## 5 5 Dog Bulldogs
## 6 6 Dog Beagles
## 7 1 Cat Persian
## 8 2 Cat Maine Coon
## 9 3 Cat Ragdoll
## 10 4 Cat Exotic
## 11 5 Cat Siamese
## 12 6 Cat American Short
Is it helpful?
Yes! Addressing missing data is important for cleaning data before analysis. And we don’t want to delete missing data because it leads to information loss and can lower statistical power.