Exploring ggplot

To understand how the plotting commands work, we are going to use a dataset that ships with R: mtcars. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). If you run the following chunk, a more detailed description will appear in your help window.

help(mtcars)

Let’s take a look at the summaries of the data.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

As you can see, the variable names are heavily abbreviated. More importantly, all variables are stored as numeric, even though some of them are really categorical. To avoid confusion, let’s convert the variables that should not be interpreted as real numbers to factors or ordered factors (the number of cylinders is a number, but it does not make sense to talk about a car with 6.188 cylinders).
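A quick way to confirm how R stores each column is to ask for its class. The sketch below uses only base R; the object name `col_classes` is an illustrative choice, not something defined elsewhere in this tutorial.

```r
# Class of every column in the original data: all are "numeric"
col_classes <- sapply(mtcars, class)
print(col_classes)

# Several "numeric" columns take only a handful of distinct values,
# which suggests they are really categories or ordered levels
sapply(mtcars[, c("cyl", "vs", "am", "gear", "carb")],
       function(x) length(unique(x)))
```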

Let’s make things a little more meaningful

mmtcars <- within(mtcars, {
   vs <- factor(vs, labels = c("V-shaped", "Straight"))
   am <- factor(am, labels = c("Automatic", "Manual"))
   cyl  <- ordered(cyl)
   gear <- ordered(gear)
   carb <- ordered(carb)
})

summary(mmtcars)

##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec              vs             am     gear   carb  
##  Min.   :1.513   Min.   :14.50   V-shaped:18   Automatic:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   Straight:14   Manual   :13   4:12   2:10  
##  Median :3.325   Median :17.71                                5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                                       4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                                       6: 1  
##  Max.   :5.424   Max.   :22.90                                       8: 1
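To double-check the recoding, one can inspect the classes and levels directly. This is a minimal sketch that rebuilds the recoded data under the hypothetical name `cars_f`, so that it runs on its own.

```r
# Rebuild the recoded data so this snippet is self-contained
cars_f <- within(mtcars, {
  vs   <- factor(vs, labels = c("V-shaped", "Straight"))
  am   <- factor(am, labels = c("Automatic", "Manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

# First class of each column: "numeric", "factor" or "ordered"
sapply(cars_f, function(x) class(x)[1])

# Labels are assigned in level order: 0 -> first label, 1 -> second
table(original = mtcars$am, recoded = cars_f$am)

# Ordered factors support comparisons between levels
sum(cars_f$cyl > "4")   # number of cars with more than 4 cylinders
```

The cross-tabulation is a cheap safeguard against attaching labels in the wrong order, which would silently swap the categories.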

Let’s also note that we have information on the car names (in the row names), but it is a bit hidden. To make it more visible, let’s store the names in a column of the data frame.

mmtcars <- data.frame(row.names(mtcars), mmtcars)
names(mmtcars)[1] <- "Model"
mmtcars

##                                   Model  mpg cyl  disp  hp drat    wt  qsec
## Mazda RX4                     Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag             Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02
## Datsun 710                   Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61
## Hornet 4 Drive           Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44
## Hornet Sportabout     Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02
## Valiant                         Valiant 18.1   6 225.0 105 2.76 3.460 20.22
## Duster 360                   Duster 360 14.3   8 360.0 245 3.21 3.570 15.84
## Merc 240D                     Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00
## Merc 230                       Merc 230 22.8   4 140.8  95 3.92 3.150 22.90
## Merc 280                       Merc 280 19.2   6 167.6 123 3.92 3.440 18.30
## Merc 280C                     Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90
## Merc 450SE                   Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40
## Merc 450SL                   Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60
## Merc 450SLC                 Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00
## Cadillac Fleetwood   Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98
## Lincoln Continental Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82
## Chrysler Imperial     Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42
## Fiat 128                       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47
## Honda Civic                 Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52
## Toyota Corolla           Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90
## Toyota Corona             Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01
## Dodge Challenger       Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87
## AMC Javelin                 AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30
## Camaro Z28                   Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41
## Pontiac Firebird       Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05
## Fiat X1-9                     Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90
## Porsche 914-2             Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70
## Lotus Europa               Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90
## Ford Pantera L           Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50
## Ferrari Dino               Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50
## Maserati Bora             Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60
## Volvo 142E                   Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60
##                           vs        am gear carb
## Mazda RX4           V-shaped    Manual    4    4
## Mazda RX4 Wag       V-shaped    Manual    4    4
## Datsun 710          Straight    Manual    4    1
## Hornet 4 Drive      Straight Automatic    3    1
## Hornet Sportabout   V-shaped Automatic    3    2
## Valiant             Straight Automatic    3    1
## Duster 360          V-shaped Automatic    3    4
## Merc 240D           Straight Automatic    4    2
## Merc 230            Straight Automatic    4    2
## Merc 280            Straight Automatic    4    4
## Merc 280C           Straight Automatic    4    4
## Merc 450SE          V-shaped Automatic    3    3
## Merc 450SL          V-shaped Automatic    3    3
## Merc 450SLC         V-shaped Automatic    3    3
## Cadillac Fleetwood  V-shaped Automatic    3    4
## Lincoln Continental V-shaped Automatic    3    4
## Chrysler Imperial   V-shaped Automatic    3    4
## Fiat 128            Straight    Manual    4    1
## Honda Civic         Straight    Manual    4    2
## Toyota Corolla      Straight    Manual    4    1
## Toyota Corona       Straight Automatic    3    1
## Dodge Challenger    V-shaped Automatic    3    2
## AMC Javelin         V-shaped Automatic    3    2
## Camaro Z28          V-shaped Automatic    3    4
## Pontiac Firebird    V-shaped Automatic    3    2
## Fiat X1-9           Straight    Manual    4    1
## Porsche 914-2       V-shaped    Manual    5    2
## Lotus Europa        Straight    Manual    5    2
## Ford Pantera L      V-shaped    Manual    5    4
## Ferrari Dino        V-shaped    Manual    5    6
## Maserati Bora       V-shaped    Manual    5    8
## Volvo 142E          Straight    Manual    4    2

Plot creation and geometries in the grammar of graphics

The function ggplot() is used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object.

This is advantageous in that the code is explicit about which layers are added and the order in which they are added. The layers that we add are geometric objects, or “geoms”.

You can think of the call ggplot as taking out a piece of paper and of the call geom_something as drawing something on the piece of paper. For example, geom_bar draws a bar plot.

The first argument in ggplot is the data that is going to be visualized. We also need to describe which among the variables in the dataset are used to create the visualization. We do this with aes(), which identifies the variables and on which axes they will be displayed. This can be included in the initialization ggplot(data=..., aes(...)) or in the geometric object geom(aes(...)).
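As a minimal sketch of the two placements of aes(), the snippet below builds the same histogram twice. It uses the original mtcars data so that it runs on its own; the names `p_global` and `p_local` are illustrative.

```r
library(ggplot2)

# Mapping declared once in ggplot(): inherited by every layer added later
p_global <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(binwidth = 0.5)

# Mapping declared inside the geom: applies to that layer only
p_local <- ggplot(mtcars) +
  geom_histogram(aes(x = wt), binwidth = 0.5)

# Both objects describe the same histogram
all.equal(layer_data(p_global)$count, layer_data(p_local)$count)
```

Declaring the mapping in ggplot() is convenient when several layers share the same variables; declaring it in the geom keeps each layer independent.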

Once this basic plot is created, many layers of refinement can be added on top of it, using the operator + followed by appropriate commands.

So, let’s start by telling R that we want to create a plot using the data in mmtcars.

  library(ggplot2)  # load ggplot2 if it is not already attached
  ppp <- ggplot(mmtcars)

Note that nothing happens. Let’s look at the object ppp.

ppp

Indeed, this is not much different from pulling out a blank page with the intention of drawing on it.
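One way to see the "blank page" idea in code is to count the layers stored in the plot object before and after adding a geom. This sketch assumes the plot object exposes its layers through `$layers`, which is an implementation detail of ggplot2 rather than part of this tutorial; the names `blank` and `drawn` are illustrative.

```r
library(ggplot2)

blank <- ggplot(mtcars)                                        # the "blank page"
drawn <- blank + geom_histogram(aes(x = wt), binwidth = 0.5)   # page with one drawing

length(blank$layers)   # no layers yet
length(drawn$layers)   # one layer after adding a geom
```

Note that `blank` itself is unchanged by the addition: the + operator returns a new plot object.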

We are now going to look at some of the geometries available. Note that there are often options within these geometries that we will not explore. We will, however, often introduce some modifications to get a sense of the flexibility we have.

  ppp + geom_histogram(aes(x = wt))

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

We obtain a histogram. The bins are quite narrow and, indeed, if you run this chunk you get a message suggesting that you pick a better bin width. Here is an attempt.

  ppp + geom_histogram(aes(x = wt),binwidth=0.5)
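The bin choice can be controlled either through the width of the bins or through their number. The sketch below compares the two on the original mtcars data; the value bins = 10 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Fix the width of each bin (in the units of wt)
h_width <- ggplot(mtcars) + geom_histogram(aes(x = wt), binwidth = 0.5)

# Or fix the number of bins and let ggplot2 work out the width
h_bins <- ggplot(mtcars) + geom_histogram(aes(x = wt), bins = 10)

# Either way, every car falls in exactly one bin
sum(layer_data(h_width)$count)
sum(layer_data(h_bins)$count)
```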

Here is another way of displaying the data relative to one quantitative variable.

  ppp + geom_density(aes(x = wt))

The information displayed is similar: we can think of the density plot as what we get from the histogram as the bin widths grow small, assuming that the underlying density function is smooth. In the spirit of learning how to add layers to our plot, let’s modify the axis labels.

  ppp + geom_density(aes(x = wt)) + labs(x="Weight")
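To make the link between the histogram and the density curve concrete, one option is to draw both on the same scale. This is a sketch on the original mtcars data; the grey fill and white outline are arbitrary styling choices.

```r
library(ggplot2)

p_both <- ggplot(mtcars, aes(x = wt)) +
  # Histogram rescaled so that the bar areas sum to 1
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 fill = "grey80", colour = "white") +
  # Kernel density estimate on the same scale
  geom_density() +
  labs(x = "Weight", y = "Density")

p_both
```

With after_stat(density) the bar heights are divided by the total count and the bin width, which is what puts the histogram on the same vertical scale as the curve.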

Here is yet another plotting choice for the same type of information. In addition to the label, we also choose a different overall “style” for the plot, selecting a black and white theme that is much more appropriate for printing and projecting. We are going to add meaningful modifications to all the remaining plots. To save space we introduce them directly, but the best way to see what they do is to comment them out of the R chunk (which you can do by putting a # in front of the + option) and run it again.

  ppp + geom_dotplot(aes(x = wt)) + labs(x="Weight") + theme_bw()

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

The box plot, drawn with the geometry geom_boxplot, is a very handy way of displaying the same information. Note that this time we add a title. Notice that we also modify the R chunk options to change the size of the figure. The effect is best viewed in the knitted file but, roughly, fig.width=1.5,fig.height=3 act on the ratio of height to width and on the size of plotting characters and fonts, while out.width=200 specifies the actual size of the image.

  ppp + geom_boxplot(aes(y = wt)) + labs(y="Weight",title="Boxplot") + theme_bw()

Let’s now think about how to display a qualitative variable, such as the number of cylinders.

  ppp + geom_bar(aes(x=cyl)) + labs(x="# cylinders",title="Barplot") + theme_bw()

Now, let’s look at some extra features we can add. We can ask for the coloring of the bars to reflect the proportions of cars with manual vs automatic transmission. To do this, we modify the aesthetics, indicating that the bars have to be ‘filled’ with colors that represent the variable ‘am’. Note that this automatically creates a legend. We can further modify the labs() elements to give the legend a clearer title (try removing fill="Transmission" from labs() to see what happens).

  ppp + geom_bar(aes(x=cyl, fill = am)) +labs(x="# cylinders",title="Barplot",fill="Transmission")  + theme_bw()

Note that you can also change the colors that are used in the display (incidentally, the red and green of the previous plot are not a great choice, since they are hard to tell apart for color-blind people). To do this you need to specify a different scale for the fill.

  ppp + geom_bar(aes(x=cyl, fill = am)) +labs(x="# cylinders",title="Barplot",fill="Transmission")  + theme_bw() +scale_fill_manual(values=c("dodgerblue2","gold"))
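When colors are given as an unnamed vector, they are matched to the factor levels by position. A slightly more defensive variant, sketched below, names each color after the level it belongs to. The data frame `cars_f` is a self-contained stand-in for the recoded data.

```r
library(ggplot2)

cars_f <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

# Naming the colours ties each one to a factor level explicitly,
# instead of relying on the order of the levels
p_named <- ggplot(cars_f) +
  geom_bar(aes(x = cyl, fill = am)) +
  scale_fill_manual(values = c(Manual = "gold", Automatic = "dodgerblue2")) +
  labs(x = "# cylinders", title = "Barplot", fill = "Transmission") +
  theme_bw()

p_named
```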

In other graphs, where the color displayed is determined by the col variable in an aes(), you will need to use scale_color_manual (see below for an example).

Sometimes it is useful to also display a quantitative variable with a bar plot. Typically this is the case when you want to emphasize the identity of the points. In the plot below we look at mpg per car model. Note that we rotate the labels for the names, so that they are legible. We also slightly reduce the size of the fonts (with the option base_size=8 in theme_bw()).

ppp + geom_col(aes(x=Model, y=mpg, fill=am)) + theme_bw(base_size=8) +
  labs(x="Car Model", y="Miles per gallon", title="Column plot", fill="Transmission") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

In cases like these, however, a better option is often to simply rotate the plot.

ppp + geom_col(aes(x=Model, y=mpg, fill=am)) + theme_bw(base_size=8) +
  labs(x="Car Model", y="Miles per gallon", title="Column plot", fill="Transmission") +
  coord_flip()
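The difference between geom_bar and geom_col is where the bar heights come from: geom_bar counts the rows in each category, while geom_col draws heights that are already in the data. The sketch below illustrates this by tabulating the counts by hand and passing them to geom_col; the names `cars_c` and `counts` are illustrative.

```r
library(ggplot2)

cars_c <- transform(mtcars, cyl = factor(cyl))

# geom_bar() counts the rows in each category itself
p_bar <- ggplot(cars_c) + geom_bar(aes(x = cyl))

# geom_col() expects the heights to be supplied: here, counts tabulated by hand
counts <- as.data.frame(table(cyl = cars_c$cyl))   # columns: cyl, Freq
p_col  <- ggplot(counts) + geom_col(aes(x = cyl, y = Freq))

# The two plots have bars of the same height
layer_data(p_bar)$y
layer_data(p_col)$y
```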

The plot we most commonly use to assess the relation between two quantitative variables is the scatter plot. To do this, we use the geometry geom_point.

  ppp + geom_point(aes(x = wt,y=mpg)) + labs(x="Weight",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()

Note that we can change the color or the plotting symbol of the points (without associating them with any variable). Let’s also see how to put a line or a smooth function through the data.

  ppp + geom_point(aes(x = wt,y=mpg),shape=4,col="red") + labs(x="Weight",y="Miles per gallon", title="Fuel Efficiency") + theme_bw() + geom_smooth(aes(x = wt,y=mpg), method = "lm")

## `geom_smooth()` using formula = 'y ~ x'
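With method = "lm", the line drawn by geom_smooth is an ordinary least-squares fit, so its coefficients can be recovered with lm(). The sketch below also passes formula = y ~ x explicitly, which should avoid the message shown above; the object name `fit` is illustrative.

```r
library(ggplot2)

# Least-squares fit of mpg on wt, the model behind method = "lm"
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)   # intercept about 37.29, slope about -5.34

# Declaring the mapping once in ggplot() lets both layers inherit it
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 4, colour = "red") +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Weight", y = "Miles per gallon", title = "Fuel Efficiency") +
  theme_bw()
```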

If we want the shape and/or color of the points to represent an additional variable, we need to include this specification in the aes() function that identifies all the important variables for the plot. In the following plot, we use color to track the number of cylinders and size to represent the horsepower. We also make the points a bit transparent so that the plot stays readable when points overlap (this is done with alpha). Finally, notice how the R chunk specification has elements that describe the aspect ratio and size of the graph. You should play with all of these and see how they change your outcome.

  ppp + geom_point(aes(x=wt, y=mpg, size=hp, col=cyl),alpha = 0.7) +
  labs(x="Weight", col="# Cylinders", size="Horse Power", 
       title="Fuel Efficiency", subtitle="Source: mtcars data set") + theme_bw()
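The distinction between setting a constant and mapping a variable is worth seeing side by side. In this sketch on the original mtcars data, `p_set` and `p_map` are illustrative names.

```r
library(ggplot2)

# Constant set outside aes(): every point is drawn in red, no legend
p_set <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), colour = "red")

# Variable mapped inside aes(): colour depends on the data, legend added
p_map <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = factor(cyl)))

unique(layer_data(p_set)$colour)           # the single constant colour
length(unique(layer_data(p_map)$colour))   # one colour per cylinder level
```

A common slip is to write a constant such as "red" inside aes(); ggplot2 then treats it as a one-level variable and picks its own color for it.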

Note that you can also use geom_point when one of the variables is a factor.

ppp + geom_point(aes(x=Model, y=mpg, col=am, size=wt, shape=cyl)) + theme_bw(base_size=8) +
  labs(x="Car Model", y="Miles per gallon", col="Transmission", size="Weight", shape="# cylinders") +
  coord_flip() + scale_color_manual(values=c("dodgerblue2","gold"))

## Warning: Using shapes for an ordinal variable is not advised
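The warning appears because cyl was recoded as an ordered factor, and plotting shapes carry no natural order. One possible way around it, sketched here under the assumption that the ordering is not needed for this plot, is to map shape to an unordered copy of the variable. The column name `cyl_unordered` is hypothetical.

```r
library(ggplot2)

cars_f <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

# Same levels, but without the ordering: suitable for the shape aesthetic
cars_f$cyl_unordered <- factor(cars_f$cyl, ordered = FALSE)

ggplot(cars_f) +
  geom_point(aes(x = wt, y = mpg, col = am, shape = cyl_unordered)) +
  labs(x = "Weight", y = "Miles per gallon",
       col = "Transmission", shape = "# cylinders") +
  theme_bw()
```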

We are now going to see how box plots come in handy to compare multiple distributions. We use geom_boxplot() and this time specify an x in the aesthetics as well.

  ppp + geom_boxplot(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()

Violin plots are a similar display: each one is essentially a density plot turned on its side and reflected about the vertical axis.

  ppp + geom_violin(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw()

Sometimes we might want to change the direction of the display. We can use coord_flip() for this.

  ppp + geom_violin(aes(x = cyl,y=mpg))+ labs(x="# cylinders",y="Miles per gallon", title="Fuel Efficiency") + theme_bw() +coord_flip()

One of the most powerful features of ggplot is the ability to create “small multiples” plots. To do this we use facets.

In facet_grid, the argument we use is a formula with the rows (of the tabular display) on the LHS and the columns (of the tabular display) on the RHS. The two sides of a formula in R are separated by the tilde character ~. A dot in the formula indicates that there should be no faceting on that dimension (either row or column).

  ppp + geom_histogram(aes(x = hp),binwidth=10)+ facet_grid(cyl ~ .) + labs(x="Horse Power", title="Experimenting with Facets") + theme_bw()
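To get a feel for the formula, it helps to build the same histogram with the faceting variable on each side of the tilde. This sketch rebuilds the recoded variables under the hypothetical name `cars_f` and counts the resulting panels.

```r
library(ggplot2)

cars_f <- within(mtcars, {
  am  <- factor(am, labels = c("Automatic", "Manual"))
  cyl <- ordered(cyl)
})

base <- ggplot(cars_f) +
  geom_histogram(aes(x = hp), binwidth = 10) +
  theme_bw()

p_rows <- base + facet_grid(cyl ~ .)    # one row of the display per cyl level
p_cols <- base + facet_grid(. ~ cyl)    # one column of the display per cyl level
p_grid <- base + facet_grid(cyl ~ am)   # rows by cyl, columns by am

# Number of panels that receive data in each layout
length(unique(layer_data(p_rows)$PANEL))
length(unique(layer_data(p_grid)$PANEL))
```

Stacking the panels in rows keeps the horsepower axis aligned across groups, which makes shifts in location easy to compare.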

Note how the following graph implements the idea of “small multiples”, maximizing data ink and taking advantage of the gestalt principle of symmetry.

  ppp + geom_point(aes(x = hp,y=mpg))+ facet_grid(cyl ~ as.factor(am)) + labs(x="Horse Power",y="Miles per gallon", title="Fuel Efficiency, Facets version") + theme_bw()

To round off this overview of the tools available in R for looking at relations between variables in a small-multiples fashion, it is useful to look at ggpairs, a function from the GGally package. The figure that the command below creates might be a little overwhelming, but you do not necessarily have to use it for all the variables at the same time.

It is useful to see how it treats quantitative and qualitative variables differently and how it represents their dependence. Note that some variables, like the number of cylinders, are now coded as ordered factors, but one could argue for treating them as quantitative, at least for this display.

  ggpairs(mmtcars[,c(-1)])

For the sake of comparison, let’s take a look at how the plot changes if we revert to the original variable codings.

  ggpairs(mtcars)

To get a little more visibility, we can look at a subset of variables.

  ggpairs(mmtcars[,c("mpg","cyl","disp","vs")])
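An alternative to subsetting the data frame is to pass the variable names through the columns argument of ggpairs. The sketch below assumes the GGally package is installed and rebuilds the recoded variables under the hypothetical name `cars_f`.

```r
library(ggplot2)
library(GGally)   # provides ggpairs()

cars_f <- within(mtcars, {
  vs  <- factor(vs, labels = c("V-shaped", "Straight"))
  cyl <- ordered(cyl)
})

# Select the variables with the columns argument instead of subsetting
pairs_plot <- ggpairs(cars_f, columns = c("mpg", "cyl", "disp", "vs"))
pairs_plot
```

Keeping the full data frame available can be handy if you later want to map another variable, such as the transmission type, to an aesthetic.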

Finally, you want to be aware that, while facets have some advantages in optimizing scales and symmetries, they are not the only way of putting multiple plots in the same graphical window.

You can use grid.arrange, from the gridExtra package, to place multiple plots within the same window. For a very basic example, let’s put together a couple of graphs we already created. You can find more examples here.

p1 <- ppp + geom_boxplot(aes(x = cyl, y = mpg)) + labs(x="# cylinders", y="Miles per gallon", title="Fuel Efficiency") + theme_bw()
p2 <- ppp + geom_histogram(aes(x = hp), binwidth=10) + facet_grid(cyl ~ .) + labs(x="Horse Power", title="Experimenting with Facets") + theme_bw()
grid.arrange(p1, p2, nrow=1)
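The relative space given to each plot can also be adjusted. This is a minimal sketch assuming the gridExtra package is installed; the plots `p_box` and `p_hist` are simplified stand-ins built from the original mtcars data, and the 2:1 split is an arbitrary choice.

```r
library(ggplot2)
library(gridExtra)   # provides grid.arrange() and arrangeGrob()

p_box  <- ggplot(mtcars) + geom_boxplot(aes(x = factor(cyl), y = mpg)) + theme_bw()
p_hist <- ggplot(mtcars) + geom_histogram(aes(x = hp), binwidth = 10) + theme_bw()

# widths gives the first plot twice the horizontal space of the second
grid.arrange(p_box, p_hist, nrow = 1, widths = c(2, 1))

# arrangeGrob() builds the same layout without drawing it,
# which is useful if the combined figure is to be saved or reused
combined <- arrangeGrob(p_box, p_hist, nrow = 1, widths = c(2, 1))
```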
