tidyr::drop_na()
Amy Olyaei
In this document, I will introduce the drop_na() function and show what it is for.
# Load the tidyverse, which attaches tidyr and dplyr
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr)
library(dplyr)
#example dataset
library(palmerpenguins)
data(penguins)
What is it for?
The drop_na() function takes two arguments. The first is the dataset. The second, ..., names the columns you want to inspect for missing values; if it is left empty, every column is inspected. Because the dataset comes first, the function fits naturally into a piped tidy workflow, and the second argument is what lets you restrict the check to a specific column.
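As a small illustration of the two arguments, here is a sketch on a made-up tibble, separate from the penguins examples. The names `toy`, `id`, `x`, and `y` are hypothetical and exist only for this illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, made up for illustration
toy <- tibble(
  id = 1:4,
  x  = c(1, NA, 3, 4),
  y  = c("a", "b", NA, "d")
)

# No columns given: a row is dropped if ANY column holds an NA
all_cols <- drop_na(toy)

# One column given: only NAs in `x` cause a row to be dropped
only_x <- drop_na(toy, x)

# Several columns given: a row is dropped if `x` or `y` holds an NA
x_and_y <- drop_na(toy, x, y)

all_cols$id  # ids of the rows with no NA at all
only_x$id    # ids of the rows where `x` is present
```

Naming a column keeps rows that are incomplete elsewhere, which is the main difference from calling drop_na() with no columns.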
The first penguins example shows how to use the function within a tidy workflow. Here, drop_na() removes rows from the data set based on the NA values in the sex column. The original tibble contains 12 rows; after applying the function, it contains only 7.
# Slicing the data
penguins %>%
  slice(1:12)
## # A tibble: 12 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge~ 39.1 18.7 181 3750
## 2 Adelie Torge~ 39.5 17.4 186 3800
## 3 Adelie Torge~ 40.3 18 195 3250
## 4 Adelie Torge~ NA NA NA NA
## 5 Adelie Torge~ 36.7 19.3 193 3450
## 6 Adelie Torge~ 39.3 20.6 190 3650
## 7 Adelie Torge~ 38.9 17.8 181 3625
## 8 Adelie Torge~ 39.2 19.6 195 4675
## 9 Adelie Torge~ 34.1 18.1 193 3475
## 10 Adelie Torge~ 42 20.2 190 4250
## 11 Adelie Torge~ 37.8 17.1 186 3300
## 12 Adelie Torge~ 37.8 17.3 180 3700
## # ... with 2 more variables: sex <fct>, year <int>
# Dropping rows based on NA values in sex
penguins %>%
  slice(1:12) %>%
  drop_na(sex)
## # A tibble: 7 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 39.1 18.7 181 3750 male
## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~
## 3 Adelie Torge~ 40.3 18 195 3250 fema~
## 4 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 5 Adelie Torge~ 39.3 20.6 190 3650 male
## 6 Adelie Torge~ 38.9 17.8 181 3625 fema~
## 7 Adelie Torge~ 39.2 19.6 195 4675 male
## # ... with 1 more variable: year <int>
In the second example, no column is specified, so every row containing an NA in any column is removed from the data set. For these 12 rows the result is the same 7 rows as before, because every row with a missing value is also missing sex. I chose a tidy workflow here so that the removal of rows is easy to see; however, if the goal is simply to remove every row with a missing value, one can call drop_na(dataset) directly.
# Dropping rows with an NA in any column
penguins %>%
  slice(1:12) %>%
  drop_na()
## # A tibble: 7 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 39.1 18.7 181 3750 male
## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~
## 3 Adelie Torge~ 40.3 18 195 3250 fema~
## 4 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 5 Adelie Torge~ 39.3 20.6 190 3650 male
## 6 Adelie Torge~ 38.9 17.8 181 3625 fema~
## 7 Adelie Torge~ 39.2 19.6 195 4675 male
## # ... with 1 more variable: year <int>
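The direct call mentioned above can be sketched as follows, assuming the full penguins data rather than the 12-row slice. The object names `penguins_complete`, `penguins_piped`, and `rows_dropped` are hypothetical. Base R's complete.cases() is used only as a cross-check on how many rows are dropped.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Direct call, no pipe: drop every row with at least one NA
penguins_complete <- drop_na(penguins)

# The same operation written with the pipe
penguins_piped <- penguins %>%
  drop_na()

# Number of rows removed, and the base R count of incomplete rows
rows_dropped <- nrow(penguins) - nrow(penguins_complete)
rows_dropped
sum(!complete.cases(penguins))
```

The two forms return the same tibble, so the choice between them is a matter of style.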
Additionally, the drop_na() function is useful when plotting data. When drop_na() is left out of the tidy workflow, the NA values are treated as a category of the sex column and get their own facet. When drop_na() is included, those rows are removed, which gives a cleaner display of the data. Note that drop_na() with no columns named removes rows with an NA in any column, not only in sex.
# Using drop_na() in a plotting pipe
library(ggplot2)
# Keeping the NA values of sex within the graphical display
penguins %>%
  count(sex, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  facet_wrap(~sex)
# Removing the NA values of sex within the graphical display
penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  facet_wrap(~sex)
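To see why the first plot gains an extra panel, it can help to look at the counts that feed it. The sketch below is a variant of the code above, not the author's exact call: it uses drop_na(sex) so that only rows missing sex are dropped. The names `with_na` and `without_na` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Without drop_na(): NA appears as its own level of sex in the counts
with_na <- penguins %>%
  count(sex)

# With drop_na(sex): rows missing sex are removed before counting
without_na <- penguins %>%
  drop_na(sex) %>%
  count(sex)

with_na
without_na
```

Naming the column keeps penguins that are missing some other measurement but have a recorded sex, which may matter for plots that do not use those measurements.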
Is it helpful?
Yes. I believe this function is helpful, especially when you want to remove rows with missing values in only a specific column.