Introduction to R studio: Part II

DATASCI 120 Teaching Team

2026-03-27

Displaying maps

We now consider in detail one particular type of graphics that is often not covered in standard courses: maps.

Each region on a map is represented as a polygon, which is essentially a shape made up of multiple points connected by lines to form a closed boundary. These polygons are crucial for mapping as they define the spatial boundaries of different geographic areas.

One popular package in R for working with maps is maps. The maps package provides a simple interface to access a variety of map databases, offering detailed map outlines for countries, states, and world regions. When these map databases are loaded, each region (like a country) is broken down into its polygonal components, each described by a series of longitude and latitude points. The ggplot2 package can then take these points and visually render them as polygons, allowing for the addition of aesthetic mappings such as colors to represent different data variables (like population, GDP, or health statistics).1 Before running the code, make sure that the required packages ggplot2 and maps are installed.

We will explore examples of two distinct types of maps: dot maps and choropleths.

Dot maps

There is a built-in dataset called quakes that contains information on earthquake occurrences, detailing attributes like magnitude, depth, and location coordinates for seismic events in the Fiji area. Our goal is to produce a dot mat to visualize the magnitude and depth of seismic events. We shall approach this in three steps.

data("quakes")  # Load the earthquakes data
head(quakes)    # Display the first few rows
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

Step 1: World Map

First, let’s use the map_data() function to retrieve data for the world map. The columns of the retrieved data set store the coordinates (longitude and latitude), along with identifiers that help in grouping and ordering these points to form the shapes of countries or regions on a map.

world_map <- map_data("world")
head(world_map)
##        long      lat group order region subregion
## 1 -69.89912 12.45200     1     1  Aruba      <NA>
## 2 -69.89571 12.42300     1     2  Aruba      <NA>
## 3 -69.94219 12.43853     1     3  Aruba      <NA>
## 4 -70.00415 12.50049     1     4  Aruba      <NA>
## 5 -70.06612 12.54697     1     5  Aruba      <NA>
## 6 -70.05088 12.59707     1     6  Aruba      <NA>

Now, we are ready to use ggplot to produce the map.

map1= ggplot() +
  geom_polygon(data = world_map, 
               aes(x = long, y = lat, group = group), 
               fill = "lightgray", color = "black") +
  theme_minimal()
map1+labs(x = "Longitude", y = "Latitude")

Simple world map.

Simple world map.

Step 2: Zooming in on a region.

Next, let’s use coord_fixed() to zoom in on the Fiji area.

World Map with Fiji earthquakes data. World Map with Fiji earthquakes data.

map2 = map1 +
  coord_fixed(ratio = 1, xlim = c(160, 190), ylim = c(-50, -10))
map2 + labs(x = "Longitude", y = "Latitude", 
             color = "Magnitude")
## Ignoring unknown labels:
## • colour : "Magnitude"

Step 3: Finally. we are ready to overlay the map with dots such that a deeper color indicates greater depth and a larger size indicates greater magnitude.

map2+
  geom_point(data = quakes, 
             aes(x = long, y = lat, size =depth , color = mag),
             alpha = 0.5) +
  scale_size(range = c(0.1, 2)) +
  scale_color_gradient(low = "blue", high = "red") +
  labs( x = "Longitude", y = "Latitude",
        color = "Magnitude",size="Depth") +
  theme_minimal()

Earthquakes around Fiji.

Earthquakes around Fiji.

Choropleths

Next, let’s make a choropleth of the United States with each state colored according the the percentage of the population living in urban areas in 1973. We will work with another built-in dataset called USArrests, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the U.S. states in 1973, along with the percentage of the population living in urban areas.

data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# Get map data for US states (this is in the ggplot package, but you can also download your own lat, long coordinates)
states_map <- map_data("state")

# Merge map data with USArrests data
USArrests$state <- tolower(rownames(USArrests))
arrest_map <- states_map %>%
  left_join(USArrests, by = c("region" = "state"))

# Plot with ggplot2
ggplot(arrest_map, aes(long, lat, group = group, fill = UrbanPop)) +
  geom_polygon(color = "white", linewidth = 0.3) +
  coord_fixed(1.3) +
  scale_fill_gradientn(
    colors = colorRampPalette(c("lightblue", "darkblue"))(100),
    name = "Urban Population (%)"
  ) +
  labs(title = "Urban Population by US State") +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5)
  )

Urban population by state

Urban population by state

An alternative to making choropleths is using the maps() package, instead of using ggplot().

Working with Categorical Variables

Categorical variables in R are stored in a data type called a factor. It stores both the values of the variable and its unique categories, known as the levels of a factor. Even if they are not explicitly specified, R can determine a factor’s levels. We will discuss how to work with qualitative and ordinal factors through two example datasets.

Renaming Variables and Levels

Let’s first take a look at the iris dataset, which contains 150 total measurements of sepal and petal lengths and widths for three species of Iris flowers. Species is a qualitative variable stored as a factor, so we can print the levels and see that the species included are Setosa, Versicolor, and Virginica. We can make plots to visually observe differences between the species.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  theme_minimal()

Petal width and length for Iris species.

Petal width and length for Iris species.

Even though the variable name Species is informative, someone who is unfamiliar with the dataset or flowers may not know that these are referring to Iris species. We can either rename the variable or the levels to clarify exactly what flowers these correspond to. To rename the variable, we modify the vector returned by names() which contains all the column names of iris. Alternatively, tidyverse provides a function rename that gives the same result.

# Rename variable
iris_copy <- data.frame(iris)
names(iris_copy)[names(iris_copy) == "Species"] <- "Iris Species"
head(iris_copy)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Iris Species
## 1          5.1         3.5          1.4         0.2       setosa
## 2          4.9         3.0          1.4         0.2       setosa
## 3          4.7         3.2          1.3         0.2       setosa
## 4          4.6         3.1          1.5         0.2       setosa
## 5          5.0         3.6          1.4         0.2       setosa
## 6          5.4         3.9          1.7         0.4       setosa
# Rename variable using tidyverse
library(tidyverse)
iris_copy <- iris %>% rename("Iris Species" = Species)

We could also rename the levels instead of the variable name.

iris_copy <- data.frame(iris)
levels(iris_copy$Species)
## [1] "setosa"     "versicolor" "virginica"
# To rename specific levels
levels(iris_copy$Species)[levels(iris_copy$Species) == "versicolor"] <- "Iris versicolor"
levels(iris_copy$Species)
## [1] "setosa"          "Iris versicolor" "virginica"
# To rename all levels
levels(iris_copy$Species) <- c("Iris setosa", "Iris versicolor", "Iris virginica")
levels(iris_copy$Species)
## [1] "Iris setosa"     "Iris versicolor" "Iris virginica"
ggplot(iris_copy, aes(x = Petal.Length, y = Petal.Width, col = Species)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  theme_minimal()

Petal width and length for Iris species.

Petal width and length for Iris species.

Creating factors

The next example will show how to create a factor, whether for qualitative or ordinal variables. In some datasets, categorical variables may be stored as another data type and should be converted to a factor first.

Ordinal variables are categorical variables where the categories have an inherent order. One example is the months of a year (January, February, etc.), which are ordered and often represented numerically (1, 2, etc.). In the airquality dataset, there are daily air quality measurements from May to September 1973 in New York. We demonstrate how to encode the ordinal variable Month as a factor.

data("airquality")
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Currently, Month is stored as integers in the dataframe. To encode a qualitative variable as a factor, we can use the function factor or as.factor.

airquality$Month <- as.factor(airquality$Month)
airquality$Month
##   [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
##  [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
##  [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9
## Levels: 5 6 7 8 9

For an ordinal variable, we want to also specify that the levels are ordered. This allows us to use comparison operators which would not be defined for the unordered factor above.

airquality$Month <- factor(airquality$Month, ordered = TRUE,
                           levels = c("5", "6", "7", "8", "9"))
airquality$Month
##   [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
##  [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
##  [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9
## Levels: 5 < 6 < 7 < 8 < 9

Finally, let’s rename the levels to be more informative than its original numerical values. Now we can make plots using Month as a categorical variable.

levels(airquality$Month) <- c("May", "June", "July", "August", "September")
ggplot(airquality, aes(x = Month, y = Temp)) +
  geom_boxplot() +
  labs(y = "Temperature") +
  theme_minimal()

Temperature in New York 1973.

Temperature in New York 1973.

Facets

Facets are a powerful tool for visualizing how the relationship between two or more variables depends on other variables. For instance, let’s say we’re interested in understanding the relationship between petal length and petal width for flowers. Depending on the species, the nature of this relationship might differ, so it can be helpful to have separate plots for each flower type in our data. The facet_grid method makes this easy:

plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  theme_minimal()
plot + facet_grid(. ~ Species)

Petal length vs width by species

Petal length vs width by species

Note that the above generated horizontally organized subplots. If vertically organized subplots are desired, then instead of . ~ [variable], you should use [variable] ~ . as an input to facet_grid.

plot + facet_grid(Species ~ .)

Vertical Subplots

Vertical Subplots

The facet_grid method can also be used to generate subplots for each combination of two categorical variables. To see this in action, let’s start by creating a new variable that indicates whether a flower has sepal width greater than half its sepal length.

iris <- mutate(iris, wide_sepals = ifelse(Sepal.Width > 0.5 * Sepal.Length,
               "Wide", "Thin"))
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species wide_sepals
## 1          5.1         3.5          1.4         0.2  setosa        Wide
## 2          4.9         3.0          1.4         0.2  setosa        Wide
## 3          4.7         3.2          1.3         0.2  setosa        Wide
## 4          4.6         3.1          1.5         0.2  setosa        Wide
## 5          5.0         3.6          1.4         0.2  setosa        Wide
## 6          5.4         3.9          1.7         0.4  setosa        Wide

Next, using the input wide_sepals ~ Species, we generate subplots that are vertically divided by sepal width and horizontally divided by species.

plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  theme_minimal()
plot + facet_grid(wide_sepals ~ Species)

Example of Two-Variable Facet Grid

Example of Two-Variable Facet Grid

The above plot makes it visually clear that the ratio between sepal width and sepal length doesn’t seem to affect the relationship between petal width and petal length for any of the three species in the iris data. Depending on your data, the best method for visualizing specific relationships might involve a combination of color-coding and facets. An example is below:

plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  theme_minimal()
plot + facet_grid(. ~ wide_sepals)

Example of Combining Facets and Color-Coding

Example of Combining Facets and Color-Coding

Changing colors

We can also easily change the colors if we wanted to using scale_color_manual. There are a variety of colors, some of which you can see here: https://sape.inf.usi.ch/quick-reference/ggplot2/colour.

Generally, it is a good idea to use colors consistently throughout your document. For example, if you always display points or lines for setosa in orange, then in any figure, the data related to setosa should be colored orange.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  facet_grid(. ~ wide_sepals) + 
  theme_minimal() + 
  scale_color_manual(values = c("darkorange", "cyan4", "lightcoral"), 
                     labels = c("Setosa", "Versicolor", "Virgincia"))

Example of Changing Colors

Example of Changing Colors

Facet Wrap

By default, calling facet_grid([Variable 1] ~ [Variable 2]) will create an grid of subplots, where are the number of categories in Variable 1 and Variable 2 respectively. To manually control the dimensions of your facet grid, you can use the facet_wrap() method, which takes a number of columns ncol as an argument. In the following example, we set the number of columns to 2:

plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + 
  geom_point() + 
  labs(x = "Petal Length", y = "Petal Width") + 
  theme_minimal()
plot + facet_wrap( ~ Species, ncol = 2)

Example of Facet Wrap

Example of Facet Wrap

Pie charts

Below we make a pie chart for species in the Iris dataset. Note that here there is the same number observations for each species, but we do include a pie chart below for illustration purposes. We also include labeling by percentages, just to illustrate the approach. You can modify the plot below for your own purposes.

# Count species
species_counts <- as.data.frame(table(iris$Species))
colnames(species_counts) <- c("Species", "Count")

# Calculate percentages and labels
species_counts$Percent <- round(100 * species_counts$Count / sum(species_counts$Count), 1)
species_counts$Label <- paste0(species_counts$Species, "\n", species_counts$Percent, "%")

# Plot with ggplot
ggplot(species_counts, aes(x = "", y = Count, fill = Species)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +  # turns bar into pie
  theme_void() +  # remove background
  geom_text(aes(label = Label), 
            position = position_stack(vjust = 0.5)) +
  labs(title = "Distribution of Iris Species")

Example of Pie Charts

Example of Pie Charts

Timeseries

The lubridate package, which is part of tidyverse is designed to handle dates in R. Here is an example:

ymd(c("20160927", "20160230"))
## Warning: 1 failed to parse.
## [1] "2016-09-27" NA
mdy(c("6/12/16", "2/9/16"))
## [1] "2016-06-12" "2016-02-09"
dmy(c("1/9/2016", "26/9/16"))
## [1] "2016-09-01" "2016-09-26"

We will use data from New York’s Department of Sanitation below. The data was downloaded from here https://data.cityofnewyork.us/City-Government/DSNY-Monthly-Tonnage-Data/ebb7-mvp5.

You can find the dataset on Canvas in the Rintro folder: https://canvas.stanford.edu/courses/206498/files/folder/Rintro. Download the file and put it into your data directory.

NY_waste <- read.csv("NY waste.csv")
head(NY_waste)
##         date   borough district              type  value
## 1 2005-01-01 Manhattan       09            refuse 3006.0
## 2 2005-01-01 Manhattan       09             paper  276.9
## 3 2005-01-01 Manhattan       09               mgp  180.3
## 4 2005-01-01 Manhattan       09 resident_organics    0.0
## 5 2005-01-01 Manhattan       09   school_organics    0.0
## 6 2005-01-01 Manhattan       09            leaves    0.0

This data set contains information about the amount of waste collected in different locations in New York city. The first column specifies the date of collection. The next two columns specify the borough and district (location). The next column specifies the type of rubbish collection ranging from refuse (general waste) to Christmas trees. Let’s use the waste dataset to look at trends over time across all of New York city. To do this we should group by date and type and use summarize() to add up the amount of waste across all districts.

aggregate_waste <- NY_waste %>% 
  group_by(date, type) %>% 
  summarize(amount = sum(value))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
aggregate_waste
## # A tibble: 1,498 × 3
## # Groups:   date [214]
##    date       type               amount
##    <chr>      <chr>               <dbl>
##  1 2005-01-01 leaves                 0 
##  2 2005-01-01 mgp                18156 
##  3 2005-01-01 paper              28658.
##  4 2005-01-01 refuse            208304.
##  5 2005-01-01 resident_organics      0 
##  6 2005-01-01 school_organics        0 
##  7 2005-01-01 xmas_trees          1376.
##  8 2005-02-01 leaves                 0 
##  9 2005-02-01 mgp                18071.
## 10 2005-02-01 paper              29118 
## # ℹ 1,488 more rows

Let’s first look at the overall proportions of waste across time. Let’s consider a pie-chart for this:

# Summarize total waste by type
waste_summary <- aggregate_waste %>%
  group_by(type) %>%
  summarise(total_amount = sum(amount)) %>%
  ungroup()

# Create pie chart
ggplot(waste_summary, aes(x = "", y = total_amount, fill = type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Total Waste by Type", fill = "Waste Type") +
  theme_void()

Pie chart of waste by type

Pie chart of waste by type

We see that most waste comes from paper, mgp and refuse. Next, we are interested in exploring trends over time, for different types:

# Make sure that date is date type 
aggregate_waste <- aggregate_waste %>%
  mutate(date = as.Date(date))

# Plot
ggplot(aggregate_waste, aes(x = date, y = amount)) +
  geom_line(color = "steelblue") +
  facet_wrap(~ type, scales = "free") + 
  labs(title = "Time Series of Waste by Type",
       x = "Date",
       y = "Amount of waste") +
  theme_minimal()

Time series waste by type

Time series waste by type

Based on the above, we see some interesting patterns: For example, christmas trees spike at certain times, and school organics are zero during certain times. Let’s calculate the total by year and month, and filter out refuse, mgp and paper:

monthly_waste <- aggregate_waste %>% 
  mutate(month = month(date),
         year = year(date)) %>%
  group_by(month, type) %>% 
  summarize(amount = mean(amount)) %>% 
  ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
monthly_waste_filtered = monthly_waste %>% 
  filter(!(type %in% c("refuse", "mgp", "paper")))
monthly_waste_filtered %>%
  ggplot(aes(x = month, y = amount, color = type)) +
  geom_line() +
  scale_x_continuous(breaks = (1:12))  + 
  labs(title = "Waste by Month",
       x = "Month",
       y = "Amount of waste", 
       col = "Waste Type")

Waste by month and type

Waste by month and type

We can see a lot of seasonal effects in the last two graphics. Three that stand out are:

Let’s say we are interested in seeing whether there are any patter by whether it was the Covid year (2020) or not (previous).

refuse_2020_b <- NY_waste %>% 
  mutate(year = year(date),
         month = month(date),
         is_2020 = (year == 2020)) %>% 
  filter(year <= 2020,
         type == "refuse") %>% 
  group_by(borough, month, is_2020) %>% 
  summarize(amount = mean(value)) 
## `summarise()` has grouped output by 'borough', 'month'. You can override using
## the `.groups` argument.
ggplot(refuse_2020_b) +
  geom_line(aes(x = month, y = amount, color = is_2020)) +
  facet_wrap(~borough) + 
  labs(x = "Month", y = "Amount", color = "2020?") + 
  scale_x_continuous(breaks = seq(1, 12))

Comparing waste in covid versus non-covid year

Comparing waste in covid versus non-covid year

We see that during Covid, a business district such as Manhattan had lower waste. However, Bronx, Brooklyn, Queens and Staten Island had higher waste in the second half of the year.