tidyr::fill()

Jessica Tran

In this document, I will introduce the fill function and show why it’s useful.

#load tidyverse up
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'purrr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

What is it for?

The Problem:
Missing values.

What Fill does:
Fills missing values with the previous entry. It’s like putting in values for a "" ditto mark.

Why would I need this function?
Fixes redundant info. Sometimes in Excel, people don’t fill in the data if it’s redundant, which needs to be fixed before working the the data in R.

Arguments

Data: A data frame
Command: Fill(name of column of interest)
.direction: Direction to fill missing values. “Down” is default. Other options: “up”, “downup” (down, then up), or “updown” (up, then down).
Use context to identify which direction.

Examples

#load the palmerpenguins library
library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.0.3
#load the dataset using data()
data(penguins)
penguins
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge~           39.1          18.7              181        3750
##  2 Adelie  Torge~           39.5          17.4              186        3800
##  3 Adelie  Torge~           40.3          18                195        3250
##  4 Adelie  Torge~           NA            NA                 NA          NA
##  5 Adelie  Torge~           36.7          19.3              193        3450
##  6 Adelie  Torge~           39.3          20.6              190        3650
##  7 Adelie  Torge~           38.9          17.8              181        3625
##  8 Adelie  Torge~           39.2          19.6              195        4675
##  9 Adelie  Torge~           34.1          18.1              193        3475
## 10 Adelie  Torge~           42            20.2              190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

We will use a different dataset since penguins doesn’t have many missing values.

Continuous Variables

Note: Tibble is a tidy data frame. It makes data easier to view.

sales <- tibble::tribble(
  ~quarter, ~year, ~sales,
  "Q1",    2000,    66013,
  "Q2",      NA,    69182,
  "Q3",      NA,    53175,
  "Q4",      NA,    21001,
  "Q1",    2001,    46036,
  "Q2",      NA,    58842,
  "Q3",      NA,    44568,
  "Q4",      NA,    50197,
  "Q1",    2002,    39113,
  "Q2",      NA,    41668,
  "Q3",      NA,    30144,
  "Q4",      NA,    52897,
  "Q1",    2004,    32129,
  "Q2",      NA,    67686,
  "Q3",      NA,    31768,
  "Q4",      NA,    49094
)

In the above dataset, the ‘year’ is missing in the subsequent quarters, so we will use the default setting for fill().

# 'fill()' defaults to replacing missing data from top to bottom
# set the column of interest as the argument 
sales %>% fill(year)
## # A tibble: 16 x 3
##    quarter  year sales
##    <chr>   <dbl> <dbl>
##  1 Q1       2000 66013
##  2 Q2       2000 69182
##  3 Q3       2000 53175
##  4 Q4       2000 21001
##  5 Q1       2001 46036
##  6 Q2       2001 58842
##  7 Q3       2001 44568
##  8 Q4       2001 50197
##  9 Q1       2002 39113
## 10 Q2       2002 41668
## 11 Q3       2002 30144
## 12 Q4       2002 52897
## 13 Q1       2004 32129
## 14 Q2       2004 67686
## 15 Q3       2004 31768
## 16 Q4       2004 49094

Categorical Variables

# Value (pet_type) is missing above
tidy_pets <- tibble::tribble(
  ~rank, ~pet_type, ~breed,
  1L,        NA,    "Boston Terrier",
  2L,        NA,    "Retrievers (Labrador)",
  3L,        NA,    "Retrievers (Golden)",
  4L,        NA,    "French Bulldogs",
  5L,        NA,    "Bulldogs",
  6L,     "Dog",    "Beagles",
  1L,        NA,    "Persian",
  2L,        NA,    "Maine Coon",
  3L,        NA,    "Ragdoll",
  4L,        NA,    "Exotic",
  5L,        NA,    "Siamese",
  6L,     "Cat",    "American Short"
)

In the above dataset, the ‘pet_type’ is missing values. From the values in the ‘breed’ column we see that we must fill from bottom to the top.

# For values that are missing above you can use `.direction = "up"`
tidy_pets %>%
  fill(pet_type, .direction = "up")
## # A tibble: 12 x 3
##     rank pet_type breed                
##    <int> <chr>    <chr>                
##  1     1 Dog      Boston Terrier       
##  2     2 Dog      Retrievers (Labrador)
##  3     3 Dog      Retrievers (Golden)  
##  4     4 Dog      French Bulldogs      
##  5     5 Dog      Bulldogs             
##  6     6 Dog      Beagles              
##  7     1 Cat      Persian              
##  8     2 Cat      Maine Coon           
##  9     3 Cat      Ragdoll              
## 10     4 Cat      Exotic               
## 11     5 Cat      Siamese              
## 12     6 Cat      American Short

Is it helpful?

Yes! Addressing missing data is important for cleaning data before analysis. And we don’t want to delete missing data because it leads to information loss and can lower statistical power.