The package ggplot2 is “an open source implementation of the grammar of graphics for R” (Wickham, The Layered Grammar of Graphics) and it is often the go-to tool for graphics in R. There are a number of resources on how to use ggplot2. (One place to start might be Chapter 3 of R for Data Science, co-written by the author of ggplot2; Chapter 3 of S. Holmes & W. Huber, “Modern Statistics for Modern Biology”, gives an introduction to R graphics and ggplot2, reaching some fairly advanced places.) To be totally honest with you, I do not think that ggplot2 is perfect, but it is powerful and once you have a good command of it, having learned how to play with its functions, you can make really high quality graphs. The trick, as with any respectable tool, is to learn how to use it well, so that it enables you to do what you want, rather than being limited by its constraints. (Laurel Stell’s webpage hosts materials from two short courses on graphics in R: Intro to ggplot2 and Advanced graphics in R. You will find it useful that the .Rmd files used to create these presentations are available. These are good examples of what it means to master a tool.)
ggplot2 is part of the Tidyverse (https://www.tidyverse.org), “an opinionated collection of R packages designed for data science.” The packages are very useful and very popular (H. Wickham received the COPSS award in 2019), but they have rather difficult naming conventions and, in my opinion, not enough examples in the help entries. ggplot2, with its layered structure, where different elements of the displays are combined with “+”, allows you to obtain sophisticated graphics with remarkably little effort. This is both a good and a bad thing. It is good to be able to get to the finish line quickly. But “professional looking” graphics are not necessarily good graphics. You still have to think carefully about what variables you are plotting and which display you are using: a “good look” might make us think we have arrived, when we are still at the beginning. Another downside is the price one has to pay for having a program that “understands so quickly what you want”: if what you want is not what the program is written to understand, it might take a little arm-wrestling to redirect it successfully.
I personally love the magic that happens when you add + facet_grid(...) to the current plot: ggplot2 is great for small multiples. In contrast, I do not like the default color scale: unless you have very few items, it becomes very difficult to separate them out. (This website gives a useful display of color options in R.) The default theme, which equips every graph with a gray background, might be appropriate for screen viewing, but it is not effective for printing or slide displays, and it definitely lowers the data-ink ratio.
The following section is a quick intro to ggplot2 to get you started.
In order to understand how the plotting commands work, we are going to use one dataset that is available in R: mtcars.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). If you run the following chunk, a more precise description will appear in your help window.
help(mtcars)
Let’s take a look at the summaries of the data.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
As you can see, the names of the variables are heavily abbreviated. More importantly, all variables are stored as numeric, even though some of them are really categorical. To avoid confusion, let’s change the variables that should not be interpreted as real numbers to factors or ordered factors (the number of cylinders is a number, but it does not make sense to talk about a car with 6.188 cylinders).
Let’s make things a little more meaningful
mmtcars <- within(mtcars, {
vs <- factor(vs, labels = c("V-shaped", "Straight"))
am <- factor(am, labels = c("Automatic", "Manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
summary(mmtcars)
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 V-shaped:18 Automatic:19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 Straight:14 Manual :13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
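To see what the conversion does, here is a minimal check in base R. The names `am_fac` and `cyl_ord` are chosen only for this illustration. The sketch cross-tabulates the original 0/1 codes against the new labels and shows that an ordered factor still supports comparisons between levels.

```r
# factor() assigns the labels to the sorted unique values of the input,
# so the code 0 becomes "Automatic" and the code 1 becomes "Manual".
am_fac <- factor(mtcars$am, labels = c("Automatic", "Manual"))
print(table(raw = mtcars$am, labelled = am_fac))

# An ordered factor remembers the order of its levels, so comparisons
# are meaningful even though arithmetic summaries such as mean() are not.
cyl_ord <- ordered(mtcars$cyl)
print(levels(cyl_ord))

# Number of cars with fewer than 8 cylinders.
n_below_8 <- sum(cyl_ord < "8")
print(n_below_8)
```

The cross-table is a quick way to confirm that the labels were attached to the intended codes, which is easy to get wrong when labels are listed by hand.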
Let’s also note that we have information on the car names (in the row names), but it is a bit hidden. To make this more visible, let’s turn it into a column of the data frame.
mmtcars<-data.frame(row.names(mtcars),mmtcars)
names(mmtcars)[1]<-"Model"
mmtcars
## Model mpg cyl disp hp drat wt qsec
## Mazda RX4 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02
## Datsun 710 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61
## Hornet 4 Drive Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44
## Hornet Sportabout Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02
## Valiant Valiant 18.1 6 225.0 105 2.76 3.460 20.22
## Duster 360 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84
## Merc 240D Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00
## Merc 230 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90
## Merc 280 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30
## Merc 280C Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90
## Merc 450SE Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40
## Merc 450SL Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60
## Merc 450SLC Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00
## Cadillac Fleetwood Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98
## Lincoln Continental Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82
## Chrysler Imperial Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42
## Fiat 128 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47
## Honda Civic Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52
## Toyota Corolla Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90
## Toyota Corona Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01
## Dodge Challenger Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87
## AMC Javelin AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30
## Camaro Z28 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41
## Pontiac Firebird Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05
## Fiat X1-9 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90
## Porsche 914-2 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70
## Lotus Europa Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90
## Ford Pantera L Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50
## Ferrari Dino Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50
## Maserati Bora Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60
## Volvo 142E Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60
## vs am gear carb
## Mazda RX4 V-shaped Manual 4 4
## Mazda RX4 Wag V-shaped Manual 4 4
## Datsun 710 Straight Manual 4 1
## Hornet 4 Drive Straight Automatic 3 1
## Hornet Sportabout V-shaped Automatic 3 2
## Valiant Straight Automatic 3 1
## Duster 360 V-shaped Automatic 3 4
## Merc 240D Straight Automatic 4 2
## Merc 230 Straight Automatic 4 2
## Merc 280 Straight Automatic 4 4
## Merc 280C Straight Automatic 4 4
## Merc 450SE V-shaped Automatic 3 3
## Merc 450SL V-shaped Automatic 3 3
## Merc 450SLC V-shaped Automatic 3 3
## Cadillac Fleetwood V-shaped Automatic 3 4
## Lincoln Continental V-shaped Automatic 3 4
## Chrysler Imperial V-shaped Automatic 3 4
## Fiat 128 Straight Manual 4 1
## Honda Civic Straight Manual 4 2
## Toyota Corolla Straight Manual 4 1
## Toyota Corona Straight Automatic 3 1
## Dodge Challenger V-shaped Automatic 3 2
## AMC Javelin V-shaped Automatic 3 2
## Camaro Z28 V-shaped Automatic 3 4
## Pontiac Firebird V-shaped Automatic 3 2
## Fiat X1-9 Straight Manual 4 1
## Porsche 914-2 V-shaped Manual 5 2
## Lotus Europa Straight Manual 5 2
## Ford Pantera L V-shaped Manual 5 4
## Ferrari Dino V-shaped Manual 5 6
## Maserati Bora V-shaped Manual 5 8
## Volvo 142E Straight Manual 4 2
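Since R 4.0, data.frame() keeps strings as character vectors by default, so whether the new column is a factor depends on the R version. The sketch below is one way to make the type explicit and to drop the now redundant row names. The object name `cars` is hypothetical and is not used elsewhere in these notes.

```r
# Create the Model column explicitly as a factor, and ask data.frame()
# to discard the row names so the car names are stored only once.
cars <- data.frame(Model = factor(row.names(mtcars)), mtcars, row.names = NULL)

str(cars$Model)
print(head(cars[, c("Model", "mpg", "cyl")], 3))
```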
The function ggplot() is used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object.
This is advantageous in that the code is explicit about which layers are added and the order in which they are added. The layers that we add are geometric objects, or “geoms”.
You can think of the call ggplot as taking out a piece of paper and of the call geom_something as drawing something on the piece of paper. For example, geom_bar draws a bar plot.
The first argument in ggplot is the data that is going to be visualized. We also need to describe which among the variables in the dataset are used to create the visualization. We do this with aes(), which identifies the variables and on which axes they will be displayed.
This can be included in the initialization ggplot(data=..., aes(...)) or in the geometric object geom(aes(...)).
Once this basic plot is created, many layers of refinement can be added on top of it, using the operator + followed by appropriate commands.
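The two placements of aes() can be compared side by side. This is a small sketch on the original mtcars data, with the illustrative object names `p_global` and `p_local`; it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Option 1: aesthetics declared once in ggplot(); every layer inherits them.
p_global <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Option 2: aesthetics declared inside the geom; they apply to that layer only.
p_local <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Printing either object draws the same scatter plot.
print(p_global)
```

Declaring the aesthetics in ggplot() saves typing when several layers share the same variables, while declaring them in the geom keeps each layer self-contained.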
So, let’s start by telling R that we want to create a plot using the data in mmtcars.
library(ggplot2)  # provides ggplot() and the geoms used below
ppp <- ggplot(mmtcars)
Note that nothing happens. Let’s try to look at the object ppp
ppp
Indeed, not much different from pulling out a blank page with the intention of drawing on it.
We are now going to look at some of the geometries available. Note that there are often options within these geometries that we will not explore. We will, however, often introduce some modifications to get a sense of the flexibility we have.
ppp + geom_histogram(aes(x = wt))
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
We obtain a histogram. The bins are quite narrow and, indeed, if you run this chunk you get a message suggesting that you pick a better bin width. Here is an attempt.
ppp + geom_histogram(aes(x = wt),binwidth=0.5)
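As a rough guide to what binwidth=0.5 means here, the sketch below divides the range of wt by the bin width to see roughly how many bins are needed, and shows that geom_histogram() also accepts a number of bins directly through its bins argument. The object names are illustrative only.

```r
library(ggplot2)

# Range of the weights divided by the bin width: roughly how many bins
# are needed to cover the data with bins of width 0.5.
wt_range <- range(mtcars$wt)
n_bins_approx <- diff(wt_range) / 0.5
print(n_bins_approx)

# Two ways of controlling the binning: fix the width, or fix the count.
p_binwidth <- ggplot(mtcars) + geom_histogram(aes(x = wt), binwidth = 0.5)
p_bins     <- ggplot(mtcars) + geom_histogram(aes(x = wt), bins = 8)
```

Fixing the width is usually easier to interpret in the units of the data, whereas fixing the count is convenient when comparing variables measured on different scales.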
Here is another way of displaying the data relative to one quantitative variable.
ppp + geom_density(aes(x = wt))
The information displayed is similar: we can think of the density plot as what we would get from the histogram as the bin widths grow small (and the sample grows large), assuming that the underlying density is smooth.
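One way to see the connection is to draw both displays on the same scale. The sketch below is an illustration rather than part of the original notes: it rescales the histogram to density units with after_stat(density), available in ggplot2 3.3.0 and later, and overlays the density curve.

```r
library(ggplot2)

# Histogram rescaled so that the bar areas sum to 1, with the kernel
# density estimate drawn on top of it.
p_overlay <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 fill = "grey80", colour = "white") +
  geom_density() +
  labs(x = "Weight", y = "Density") +
  theme_bw()

print(p_overlay)
```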
In the spirit of learning how to add layers to our plot, let’s modify the axis labels.
ppp + geom_density(aes(x = wt)) + labs(x="Weight")
Here is yet another plotting choice for the same type of information. In addition to the label, we also choose a different overall “style” for the plot, selecting a black and white theme that is much more appropriate for printing and projecting. We are going to try to add meaningful modifications to all the remaining plots. While to save space we are going to introduce them directly, the best way to see what they do is to comment them out in the R chunk (which you can do by putting a # in front of the + option).
ppp + geom_dotplot(aes(x = wt)) + labs(x="Weight") + theme_bw()
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.
The box plot, drawn with the geometry geom_boxplot, is a very handy way of displaying the same information. Note that this time we add a title. Notice that we also modify the R chunk options to change the size of the figure. The effect is best viewed in the knitted file but, roughly, fig.width=1.5,fig.height=3 act on the ratio of height to width and on the size of plotting characters and fonts, while out.width=200 specifies the actual size of the image.
ppp + geom_boxplot(aes(y = wt)) + labs(y="Weight",title="Boxplot") + theme_bw()
Let’s now think about how to display a qualitative variable, such as the number of cylinders.
ppp + geom_bar(aes(x=cyl)) + labs(x="# cylinders",title="Barplot") + theme_bw()
Now, let’s look at some extra features we can add. We can ask for the coloring of the bars to reflect the proportions of the cars with manual vs automatic transmission. To do this, we modify the aesthetics, indicating that the bars have to be ‘filled’ with colors that represent the variable ‘am’. Note that this will automatically create a legend. We can further modify the labs() elements to give the legend a clearer title (try removing fill="Transmission" from labs() to see what happens).
ppp + geom_bar(aes(x=cyl, fill = am)) +labs(x="# cylinders",title="Barplot",fill="Transmission") + theme_bw()
Note that you can also change the colors that are used in the display (incidentally, the choice of red and green is not a great one, given its lack of visibility for color blind people).
To do this you need to specify a different scale for the fill.
ppp + geom_bar(aes(x=cyl, fill = am)) +labs(x="# cylinders",title="Barplot",fill="Transmission") + theme_bw() +scale_fill_manual(values=c("dodgerblue2","gold"))
In other graphs, where the color displayed is determined by the col variable in an aes(), you will need to use scale_color_manual (see below for an example).
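When colors are chosen by hand, it can help to name the entries of the values vector after the factor levels, so that each color is tied to a level rather than to a position. The sketch below is self-contained and rebuilds the two factors it needs. The names `cars` and `transmission_cols` are illustrative assumptions.

```r
library(ggplot2)

# Recreate the two recoded variables used in the bar plot.
cars <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

# A named vector: each color is matched to a level of `am` by name.
transmission_cols <- c(Automatic = "dodgerblue2", Manual = "gold")

p_named <- ggplot(cars) +
  geom_bar(aes(x = cyl, fill = am)) +
  scale_fill_manual(values = transmission_cols) +
  labs(x = "# cylinders", title = "Barplot", fill = "Transmission") +
  theme_bw()

print(p_named)
```

With a named vector, reordering the factor levels later does not silently swap the colors.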
Sometimes it might be useful to display a quantitative variable with a bar plot as well. Typically this is the case when you want to emphasize the identity of the points. In the plot below we look at mpg per car model. Note that we rotate the labels for the names, so that they are legible. We also reduce the size of the fonts a bit (with the option base_size=8 in theme_bw()).
ppp+geom_col(aes(x=Model,y=mpg,fill=am))+theme_bw(base_size=8)+labs(x="Car Model",y="Miles per gallon",title="Column plot",fill="Transmission")+ theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.5))
In cases like these, however, an often better option is to simply rotate the plot:
ppp+geom_col(aes(x=Model,y=mpg,fill=am))+theme_bw(base_size=8)+labs(x="Car Model",y="Miles per gallon",title="Column plot",fill="Transmission")+ coord_flip()
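A column plot of this kind is often easier to read when the bars are sorted by the plotted value rather than alphabetically. This is a suggestion beyond the original example: the sketch uses reorder() from base R to order the car models by mpg, and the names `cars` and `Model_by_mpg` are hypothetical.

```r
library(ggplot2)

# Car names as a column, then a copy whose levels are sorted by mpg.
cars <- data.frame(Model = row.names(mtcars), mtcars)
cars$Model_by_mpg <- reorder(cars$Model, cars$mpg)

p_sorted <- ggplot(cars) +
  geom_col(aes(x = Model_by_mpg, y = mpg)) +
  coord_flip() +
  theme_bw(base_size = 8) +
  labs(x = "Car Model", y = "Miles per gallon", title = "Column plot, sorted")

print(p_sorted)
```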
The plot we most commonly use to assess the relation between two quantitative variables is the scatter plot. To do this, we use the geometry geom_point.
ppp + geom_point(aes(x = wt,y=mpg)) + labs(x="Weight",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()
Note that we can change the color for the points or plotting symbols (without associating them to any variables). And let’s also note how you can put a line or a smooth function through the data.
ppp + geom_point(aes(x = wt,y=mpg),shape=4,col="red") + labs(x="Weight",y="Miles per gallon", title="Fuel Efficiency") + theme_bw() + geom_smooth(aes(x = wt,y=mpg), method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
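With method = "lm", the straight line drawn by geom_smooth() is the least squares fit of y on x, and the shaded band is a confidence band around it. The sketch below, on the original mtcars data, fits the same regression with lm() so that the intercept and slope of the line can be read off. Declaring the aesthetics once in ggplot() also avoids repeating them in each layer.

```r
library(ggplot2)

# The regression that geom_smooth(method = "lm") fits behind the scenes.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Same plot as above, with x and y declared once and shared by both layers.
p_fit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 4, colour = "red") +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Weight", y = "Miles per gallon", title = "Fuel Efficiency") +
  theme_bw()

print(p_fit)
```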
If we want the shape and/or color of the points to represent an additional variable, we need to place this specification inside the aes() function, which identifies all the important variables for the plot. In the following plot, we use color to track the number of cylinders, and size to represent the horse power. We also make the points a bit transparent, so that readability is preserved when the points overlap (this is done with alpha). Finally, notice how in the R chunk specification we have elements that describe the aspect ratio and size of the graph. You should play with all of these and see how they change your outcome.
ppp + geom_point(aes(x=wt, y=mpg, size=hp, col=cyl),alpha = 0.7) +
labs(x="Weight", col="# Cylinders", size="Horse Power",
title="Fuel Efficiency", subtitle="Source: mtcars data set") + theme_bw()
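The difference between setting an aesthetic to a constant and mapping it to a variable is a common source of confusion. The sketch below contrasts the two on the original mtcars data, with illustrative object names.

```r
library(ggplot2)

# Setting: colour is given outside aes(), so every point is drawn in red
# and no legend is created.
p_set <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), colour = "red")

# Mapping: inside aes(), "red" is treated as data (a constant category).
# ggplot2 picks a color from its default scale and adds a legend entry
# labelled "red", which is rarely what was intended.
p_mapped <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = "red"))

print(names(p_set$layers[[1]]$mapping))
print(names(p_mapped$layers[[1]]$mapping))
```

The rule of thumb is that anything inside aes() should refer to a variable in the data, and anything outside it is a fixed visual property.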
Note that you can also use geom_point when one of the variables is a factor.
ppp+geom_point(aes(x=Model,y=mpg,col=am,size=wt,shape=cyl))+theme_bw(base_size=8)+labs(x="Car Model",y="Miles per gallon",col="Transmission",size="Weight",shape="# cylinders")+ coord_flip()+ scale_color_manual(values=c("dodgerblue2","gold"))
## Warning: Using shapes for an ordinal variable is not advised
We are now going to see how box plots come in handy to compare multiple distributions. We use geom_boxplot() and we are going to specify an x in the aesthetics as well.
ppp + geom_boxplot(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()
Violin plots are a similar display, which is essentially a density plot turned on its side, and reflected about the vertical axis.
ppp + geom_violin(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()
Sometimes, we might want to change direction of display. We can use coord_flip() for this.
ppp + geom_violin(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw() +coord_flip()
One of the most powerful features of ggplot2 is the facility to create “small multiples” plots. To do this we use facets.
In facet_grid, the argument we use is a formula with the rows (of the tabular display) on the LHS and the columns (of the tabular display) on the RHS. The two sides of a formula in R are separated by the tilde character ~. A dot in the formula is used to indicate there should be no faceting on this dimension (either row or column).
ppp + geom_histogram(aes(x = hp),binwidth=10)+ facet_grid(cyl ~ .) + labs(x="Horse Power", title="Experimenting with Facets") + theme_bw()
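To make the role of the two sides of the formula concrete, here is a small sketch that moves the same variable from the rows to the columns. It uses the original mtcars data, where cyl is numeric (facet_grid() accepts that as well), and the object names are illustrative.

```r
library(ggplot2)

base_hist <- ggplot(mtcars) +
  geom_histogram(aes(x = hp), binwidth = 10) +
  labs(x = "Horse Power") +
  theme_bw()

# cyl on the left of ~ : one row of the display per number of cylinders.
p_rows <- base_hist + facet_grid(cyl ~ .)

# cyl on the right of ~ : one column of the display per number of cylinders.
p_cols <- base_hist + facet_grid(. ~ cyl)

print(p_rows)
print(p_cols)
```

Stacking the panels in rows keeps the horizontal axis aligned, which makes it easier to compare where the distributions sit; placing them in columns aligns the counts instead.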
Note how the following graph implements the idea of “small multiples”, maximizing data ink and taking advantage of the gestalt principle of symmetry.
ppp + geom_point(aes(x = hp,y=mpg))+ facet_grid(cyl ~ as.factor(am)) + labs(x="Horse Power",y="Miles per gallon", title="Fuel Efficiency, Facets version") + theme_bw()
To add to this overview of tools available in R to look at the relations between variables in a small multiple fashion, it is useful to look at ggpairs, from the GGally package. Now, the figure that the command below creates might be a little overwhelming, but you do not necessarily have to use it for all the variables at the same time.
It is useful to see how it treats differently quantitative vs qualitative variables and how you can represent their dependence. Note that some variables, like cylinders are now coded as ordered, but one could argue for them to be quantitative, at least for this display.
library(GGally)  # provides ggpairs()
ggpairs(mmtcars[,c(-1)])  # drop the first column (Model)
For the sake of comparison, let’s take a look at how the plot changes if we revert back to the original variable codings.
ggpairs(mtcars)
To have a little more visibility, we can look at a subset of variables
ggpairs(mmtcars[,c("mpg","cyl","disp","vs")])
Finally, you want to be aware that, while facets have some advantages in optimizing scales and symmetries, they are not the only way of putting multiple plots in the same graphical window.
You can use grid.arrange, from the gridExtra package, to place multiple plots within the same window. For a very basic example, let’s put together a couple of graphs we already created. And you can find more examples here
library(gridExtra)  # provides grid.arrange()
p1<- ppp + geom_boxplot(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()
p2<-ppp + geom_histogram(aes(x = hp),binwidth=10)+ facet_grid(cyl ~ .) + labs(x="Horse Power", title="Experimenting with Facets") + theme_bw()
grid.arrange(p1,p2,nrow=1)
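The panels do not have to share the space equally. As a sketch, assuming the gridExtra package is installed, arrangeGrob() builds the same layout as grid.arrange() without drawing it, and its widths argument sets the relative width of each column. The object names are illustrative, and the plots are rebuilt from the original mtcars data so that the example runs on its own.

```r
library(ggplot2)
library(gridExtra)

p_left <- ggplot(mtcars) +
  geom_boxplot(aes(x = factor(cyl), y = mpg)) +
  labs(x = "# cylinders", y = "Miles per gallon") +
  theme_bw()

p_right <- ggplot(mtcars) +
  geom_histogram(aes(x = hp), binwidth = 10) +
  labs(x = "Horse Power") +
  theme_bw()

# One row, with the second panel twice as wide as the first.
combined <- arrangeGrob(p_left, p_right, nrow = 1, widths = c(1, 2))

# Draw the combined layout on a fresh page.
grid::grid.newpage()
grid::grid.draw(combined)
```

Keeping the arranged object in a variable is convenient when the same layout has to be drawn several times or saved to a file.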