ggplot2::geom_jitter()

Function of the Week:

geom_jitter()

In this document, I will introduce the geom_jitter() function and show what it is for. It is one of the geoms provided by the ggplot2 package.

What is it for?

  • Display every point of discrete data when the focus is on comparing groups.

  • Handle overlapping points (example: Titanic data set).

It is a useful way of handling overplotting caused by discreteness in small datasets. When many points sit on top of each other, geom_jitter() adds a small random shift to each point so that they can be told apart.
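As a rough illustration of those random shifts, the sketch below uses base R's jitter() on a toy vector (the vector, the seed and the amount of 0.1 are made up for illustration). With a positive amount, jitter() adds uniform noise between -amount and +amount, which is comparable to the shift that geom_jitter() applies through its width and height arguments.

```r
set.seed(42)                       # make the random shifts reproducible

x <- rep(c(1, 2), each = 5)        # ten points piled on two positions
x_jittered <- jitter(x, amount = 0.1)

length(unique(x))                  # 2 distinct positions before jittering
max(abs(x_jittered - x)) <= 0.1    # TRUE: no point moves more than 0.1
```

After jittering, the ten values are all different, so none of them hides another when plotted.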

We are going to discuss what the geom_jitter() function does by working through an example that uses the titanic_train data from the titanic R package. The package contains data sets on the fate of passengers on the fatal maiden voyage of the ocean liner “Titanic”, with variables such as economic status (class), sex, age and survival.

We learned that we can use barplots to display the data with the focus of categorizing groups. However, what if we want to make a more informative plot and show all the points of the data in one plot?

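library(titanic)      # titanic_train
library(tidyverse)    # dplyr verbs and ggplot2
library(janitor)      # tabyl(), adorn_pct_formatting()
library(wesanderson)  # wes_palette()
library(gridExtra)    # grid.arrange()

titanic <- titanic_train %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass),
         Sex = factor(Sex))

titanic %>% tabyl(Survived) %>% adorn_pct_formatting()

# Without geom_jitter (zero fares are dropped because of the log scale)
plot1 <- titanic %>%
  filter(Fare != 0) %>%
  ggplot(aes(Survived, Fare, group = Survived, color = Survived)) +
  geom_point() + scale_y_continuous(trans = "log2") +
  scale_color_manual(values = wes_palette("IsleofDogs1"))
plot1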

  • The plain plot is only useful for showing the range of the data

The plot without geom_jitter() gives an idea of the range of the data; however, we cannot see the individual passengers (549 who did not survive and 342 who survived in the full training set), because many points overlap and hide each other. To illustrate the data better, we use geom_jitter(), which reduces the number of points that fall on top of each other and therefore gives a better sense of the distribution of the data.

# With geom_jitter (it replaces geom_point, otherwise every point is drawn twice)
plot2 <- titanic %>%
  filter(Fare != 0) %>%
  ggplot(aes(Survived, Fare, group = Survived, color = Survived)) +
  geom_jitter(width = 0.1) +
  scale_y_continuous(trans = "log2") +
  scale_color_manual(values = wes_palette("IsleofDogs1"))
plot2

  • Alpha blending

To visualize the data even better, in addition to using geom_jitter() we can use the alpha argument to make the points partly transparent. The more points fall on top of each other, the darker that region appears. This improves our sense of how the points are distributed. Here is the same plot with alpha blending:

# With geom_jitter and alpha (geom_point removed so points are not drawn twice)
titanic %>%
  filter(Fare != 0) %>%
  ggplot(aes(Survived, Fare, group = Survived, color = Survived)) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  scale_y_continuous(trans = "log2") +
  scale_color_manual(values = wes_palette("IsleofDogs1"))

  • Adding a boxplot to geom_jitter

# With geom_jitter and boxplot
jitter1 <- titanic %>%
  filter(Fare != 0) %>%
  ggplot(aes(Survived, Fare, group = Survived, color = Survived)) +
  geom_boxplot(outlier.shape = NA) +  # outliers are already drawn by geom_jitter
  scale_y_continuous(trans = "log2") +
  geom_jitter(width = 0.1, alpha = 0.7) +
  scale_color_manual(values = wes_palette("IsleofDogs1"))
jitter1

  • Now we can interpret the plot by getting a sense of the distribution of the data

With the boxplot added, we can see that the median fare for those who did not survive was lower than for those who survived. This suggests that passengers who survived generally paid higher fares than those who did not.
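The medians that the boxplot draws can also be computed directly. A small sketch, assuming the titanic and dplyr packages are installed and using the same Fare != 0 filter as the plots (the names fare_medians and median_fare are my own choice):

```r
library(titanic)
library(dplyr)

fare_medians <- titanic_train %>%
  filter(Fare != 0) %>%
  group_by(Survived) %>%            # 0 = did not survive, 1 = survived
  summarise(median_fare = median(Fare), n = n())

fare_medians
```

The row for Survived = 1 should show the higher median, matching what the boxplot suggests.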

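  • Changing the amount of jitter with the width argument

The width argument controls how far the points are spread horizontally: each point is shifted by a random amount of up to width in either direction. It does not change the size of the points themselves.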

# Change the width (with 0.5 the two groups spread until they touch)
titanic %>%
  filter(Fare != 0) %>%
  ggplot(aes(Survived, Fare, group = Survived, color = Survived)) +
  geom_jitter(width = 0.5, alpha = 0.7) +
  scale_y_continuous(trans = "log2") +
  scale_color_manual(values = wes_palette("IsleofDogs1"))
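Two details are easy to miss here. geom_jitter() also jitters vertically unless height = 0, and the shifts change on every run unless a seed is fixed. A minimal sketch on made-up toy data (the data frame toy and the seed 123 are assumptions for illustration), using position_jitter() so that the seed can be set:

```r
library(ggplot2)

# Toy data (hypothetical, not from the Titanic set): two groups, many ties in y
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 50)),
  value = rep(c(1, 2, 3, 4, 5), times = 20)
)

p_toy <- ggplot(toy, aes(group, value)) +
  geom_point(
    # horizontal jitter only, reproducible because of the seed
    position = position_jitter(width = 0.2, height = 0, seed = 123),
    alpha = 0.5
  )

# layer_data() returns the coordinates that ggplot2 will actually draw
drawn <- layer_data(p_toy)
head(drawn[, c("x", "y")])
```

In drawn, the x values are scattered around the group positions 1 and 2, while the y values are left exactly as they were in toy.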

Is it helpful?

  • Comparing this plot with a histogram and a barplot of the same data set

Of these three plots, the geom_jitter plot was the most useful for describing the distribution. We cannot get the same information from the barplot or the histogram for this data: a barplot shows only one number per group (here, the total fare). In this case geom_jitter was more informative.

# Compare the three plots using the same data set
histogram <- titanic %>%
  ggplot(aes(Fare, after_stat(density))) +  # after_stat() replaces the deprecated ..density..
  geom_histogram(binwidth = 20, fill = "blue", color = "black") +
  facet_grid(Survived ~ .)

barplot <- titanic %>%
  ggplot(aes(Survived, Fare)) +
  geom_col(fill = "darkgreen")  # bar height is the sum of fares per group

grid.arrange(barplot, histogram, jitter1, ncol = 3)

  • Using geom_jitter() for two discrete variables (mpg data set)

# Without jitter: cars with the same cty and hwy values hide each other
ggplot(mpg, aes(cty, hwy)) +
  geom_point(color = "purple") +
  ylab("highway miles per gallon") +
  xlab("city miles per gallon")

# With jitter: overlapping cars become visible
ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 2, color = "purple") +
  ylab("highway miles per gallon") +
  xlab("city miles per gallon")
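To check how much overplotting the unjittered mpg plot hides, one can count how many cars share an exact (cty, hwy) pair with another car. A short sketch in base R (the variable names are my own):

```r
library(ggplot2)  # provides the mpg data set

n_points   <- nrow(mpg)                              # cars in the data set
n_distinct <- nrow(unique(mpg[, c("cty", "hwy")]))   # distinct positions on the plot
n_hidden   <- n_points - n_distinct                  # points drawn on top of another

c(points = n_points, distinct = n_distinct, hidden = n_hidden)
```

A large hidden count relative to the number of points is a sign that geom_jitter() (or transparency) is worth using.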

Reference:

  1. Harvard course: https://www.edx.org/bio/rafael-irizarry

  2. Introduction to Data Science, Rafael A. Irizarry