Resources

The content of this introduction borrows material from the course R for Undergraduates (Stats 32).

Installation

To install R go to: https://cran.r-project.org/. Click the link at the top of the page that says “Download R for”, and follow the instructions on that page. For Mac you will want to download the latest .pkg file compatible with your operating system. For Windows you can directly follow the link to the base distribution.

Next, you should install RStudio Desktop from https://www.rstudio.com/products/rstudio/download/, following instructions according to your operating system.

RStudio

When you open RStudio, you will see 3 different windows along with a number of tabs:

R Markdown

What is R Markdown?

R Markdown is an authoring format that combines text with embedded R code to create dynamic, reproducible documents. It leverages the simplicity of Markdown for formatting text alongside R for performing statistical analyses and generating visualizations. This integration allows users to seamlessly weave narrative and computational content into a single cohesive document.

The blue text with the hashes ## are headings. The number of hashes corresponds to whether you want a title, subtitle, etc.

To include R code you need to write a code “chunk”.

Creating a new R Markdown and saving to pdf

If you want to create a new R Markdown, click File (at the top left), then New File and then R Markdown.

You can “compile” your R Markdown in to html, pdf and other documents. This process is called “knitting”. You can knit the document via the “knit” button at the top of the file (blue yarn with knitting needle).

For your first homework, to produce a pdf, you can follow one of these steps:

1. Knit to html, then print as pdf.

This is described in the homework itself: “To convert these to .pdf files, open them in your browser and PRINT them: instead of selecting a physical printer, look for options like Save as PDF, Microsoft Print to PDF, or Adobe PDF. This will create a pdf file that spans multiple pages.”

Specifically: 1. Knit your file 2. Click open in browser (top right of the output generated) or locate the file in your directory and open it directly. 3. Print the webpage (Cmd + ‘P’ on Mac or Ctrl + P on windows). 4. Select “Save as PDF”

2. Knit directly to pdf.

This is what the ideal process would be, but we do not require it for the first homework. If you would like to create PDF documents from R Markdown, you will need to have a LaTeX distribution installed. Although there are several traditional options including MiKTeX, MacTeX, and TeX Live,you can also install TinyTeX.

install.packages('tinytex')
tinytex::install_tinytex()

After you have installed the package, you should restart Rstudio. To restart RStudio, go to the Session tab and select “Restart R”. After you’ve restarted, create a new R Markdown document by selecting the new file button at the top right. After you’ve selected that button, choose the R Markdown option, select “PDF” as your output format, and use the default settings of the R Markdown Document.

Note: You will get a .pdf popup. Do not print this popup. In the same folder where your Rmarkdown file lives, there will be a .pdf document that you can use.

Some aspects of using R for data science

We will now go over some aspects of using R.

Basic R

Use R as a calculator:

1 + 2
## [1] 3
# 567*7
# 5/2

Variable assignment

Often, we want to store the result of a computation so that we can use it later. R allows us to do this by variable assignment. Variable names must start with a letter and can only contain letters, numbers, _ and ..

x <- 2

Notice that no output was printed. This is because the act of variable assignment doesn’t produce any output. If we want to see what x contains, simply key its name into the console:

x
## [1] 2

Note that x has been added to the environment (top right).

You can use rm() to remove variables from the environment

rm(x)

Types of variables

The most common data types in R are numeric, character, logical and factor. You can check the type using typeof() function.

typeof("Hi")
## [1] "character"
typeof(1)
## [1] "double"
typeof(TRUE)
## [1] "logical"

Using variables

We can use numeric variables in computations:

a <- 3
b <- 2
a^2 + 3*b
## [1] 15

We can reassign a to a different value. This new value may be defined using a.

a <- a + 1
a
## [1] 4

Vectors

For data analysis, we often have to work with multiple values at the same time. There are a number of different R data structures which allow us to do this. The vector is a 1-dimensional array whose entries are the same type. For example, the following code produces a vector containing the character data “Stanford”, “Harvard” and “Columbia”:

universities <- c("Stanford", "Harvard", "Columbia")
universities
## [1] "Stanford" "Harvard"  "Columbia"

Certain vectors can be produced using short cuts. To produce a vector with the numbers 1,2,3,4,5 we can use 1:5 instead of c(1,2,3,4,5).

numbers <- 1:5
numbers
## [1] 1 2 3 4 5

You produce very large vectors quickly using this.

1:20
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Indexing

R allows us to access individual elements in a vector. Unlike many other programming languages, indexing begins at 1, not 0. For example, to return the first entry in the vector universities, I would use the following code:

universities[1]
## [1] "Stanford"

To get more than one entry from a vector you cannot just write more than one number inside the square brackets. This will return an error

universities[1,3]

To get more than one entry, you need to but a vector inside the square brackets. So you need to write c(1,3) instead of 1,3.

universities[c(1,3)]
## [1] "Stanford" "Columbia"

Lists

In vectors, the elements have to be of the same type. To have elements with different types in one data structure, we can use a list.

Lists are made in the same way as vectors but you write list() instead of c(). You also have the option of naming each element in the list. Below we make a named list called person.

celebrity <- list(person = "Ryan Reynolds", age = 45)
celebrity
## $person
## [1] "Ryan Reynolds"
## 
## $age
## [1] 45

To access the person element in celebrity, write celebrity$person

celebrity$person
## [1] "Ryan Reynolds"

This is the quickest way to do indexing in a list. If the lists are not named you can also do

celebrity[[1]]
## [1] "Ryan Reynolds"

You can also use $ to add new elements to a list. We can add Ryan Reynolds’ job to the list `celebrity’ like so

celebrity$job <- "Actor"
celebrity
## $person
## [1] "Ryan Reynolds"
## 
## $age
## [1] 45
## 
## $job
## [1] "Actor"

The elements of a list can be anything, even another data structure (vector, matrix, etc.)! We can add a vector with the names of Ryan Reynolds’ children to the list celebrity.

celebrity$children <- c("James", "Inez", "Betty")
celebrity
## $person
## [1] "Ryan Reynolds"
## 
## $age
## [1] 45
## 
## $job
## [1] "Actor"
## 
## $children
## [1] "James" "Inez"  "Betty"

Our list is getting quite long. The str() function (short for structure) provides a useful summary of the overall structure of a complex data structure. Running str(celebrity) gives a more condensed view of our list

str(celebrity)
## List of 4
##  $ person  : chr "Ryan Reynolds"
##  $ age     : num 45
##  $ job     : chr "Actor"
##  $ children: chr [1:3] "James" "Inez" "Betty"

To see the names associated with a list, use the names() function:

names(celebrity)
## [1] "person"   "age"      "job"      "children"

Matrices

As with vectors, elements of matrices have to be of the same type.

We use the matrix() command to change the vector c(1,2,3,4,5,6) into a matrix:

A <- matrix(c(1,2,3,4,5,6), nrow = 2)
A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Notice that R takes the elements in the vector you give it and fills in the matrix column by column. If we want the elements to be filled in by row instead, we have to put in a byrow = TRUE argument:

B <- matrix(c(1,2,3,4,5,6), nrow = 2, byrow = TRUE)
B
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

To get the dimensions of the matrix, we can use the dim(), nrow() and ncol() functions.

dim(B)
## [1] 2 3
nrow(B)
## [1] 2
ncol(B)
## [1] 3

To access the element in the ith row and jth column for the matrix B, use the index i,j:

B[1, 2]
## [1] 2

Installing packages

You install packages using install.packages(). Here, we will use the package tidyverse and palmerpenguins (for penguin data). To install packages you use install.packages().

install.packages(c("tidyverse", "palmerpenguins"))

Once you have installed a package, every time you want to use it you need to load the package:

Plotting

Datset overview

We use `ggplot() for plotting. This is included in the tidyverse() package.

Let’s first look at the penguins data. Summary allows you to get a summary of each variable:

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Using head shows you the first few rows of the dataset:

head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

To find out the dimensions of the data, use:

dim(penguins)
## [1] 344   8
ncol(penguins)
## [1] 8
nrow(penguins)
## [1] 344

If we are interested in a particular variable, we can use:

summary(penguins$sex)
## female   male   NA's 
##    165    168     11
summary(penguins$bill_length_mm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32.10   39.23   44.45   43.92   48.50   59.60       2

Plotting

Let’s plot body mass as a function of flipper length:

ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Note also that a warning is printed to let us know that there are NAs in our data set. A warning is different to an error. A warning lets us know that something might have gone wrong but our code still runs. An error means that our code failed to run.

Let’s now create a scatter plot with flipper length on the x axis, body mass on the y axis and points colored by sex. To do this, we can use the same code as before but add the mapping color = sex inside aes().

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

By default, ggplot will use the column names to name the axis and legend in plots. By default the plots do not have titles. Both of these can be changed by using the command labs() (short for labels).

ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species),
             size = 2.5, alpha = 0.7) +
  labs(title = "Penguin size", x = "Flipper length (mm)", y = "Body mass (g)", color = "Penguin species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Histograms

We want to make a histogram of flipper length, colored by species. To change the color on the inside of the bars, we need to use the aesthetic fill.

ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species)) +
  labs(
    x = "Flipper length (mm)", 
    y = "Count", 
    fill = "Species"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

Note that when we changed fill, the bars stayed the same height. The Adelie bars where stacked on top of the Chinstrap bars and both were on top of the Gentoo bars. If we do not want the bars stacked on top of each other, we can an optional argument of geom_histogram() called position. If we set position = “identity”, then instead of a stacked histogram we will get an individual histogram for each species.

ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species), 
                 position = "identity") +
  labs(
    x = "Flipper length (mm)", 
    y = "Count", 
    fill = "Species"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

Facets

Another way we can compare the distribution of flipper_length_mm across the different species is by using facets. We can split the above plot into three smaller plots called facets. To do this we can use facet_wrap(). We have to use a tilde ~ and then the name of the variable we want to facet by. Here we add a facet to the histogram of flipper_length_mm.

ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species), show.legend = FALSE) +
  facet_wrap(~ species)  +
  labs(
    x = "Flipper length (mm)", 
    y = "Count"
  ) 
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

Bar graph

A bar graph can be used to summarize a categorical variable. Here is a bar graph of the variable island.

ggplot(data = penguins) +
  geom_bar(mapping = aes(x = island)) +
  labs(
    x = "Island", 
    y = "Count"
  ) 

We can also use bar plots to show the interaction between two categorical variables. One way to do this is with faceting. Let’s add a facet based on species and also fill by species.

ggplot(data = penguins) + 
  geom_bar(mapping = aes(x = island, fill = species),
           show.legend = FALSE) +
  facet_wrap(~species) +
  labs(
    x = "Island", 
    y = "Count"
  ) 

Many interactions

We want to show the interaction between how the interaction between species, sex, flipper_length_mm and body_mass_g.

ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species, shape = sex),
             size = 2.5, alpha = 0.7)
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

Unfortunately, the different shapes are still a bit confusing. We might get a clearer result if we facet by species and color by sex.

ggplot(data = penguins) +
    geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species),
             size = 2.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~ species) +
  labs(
    x = "Flipper Length (mm)", 
    y = "Body Mass (g)"
  ) 
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

### Changing the theme

All of our plots have been using the default ggplot theme. By adding a new layer we change the theme. Here are a few examples:

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex)) + 
  facet_wrap(~ species) +
  theme_minimal()  +
  labs(
    x = "Flipper Length (mm)", 
    y = "Body Mass (g)"
  ) 
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex)) + 
  facet_wrap(~ species) +
  theme_linedraw()  +
  labs(
    x = "Flipper Length (mm)", 
    y = "Body Mass (g)"
  ) 
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Changing the colors

You can manually set the colors by adding the layer scale_color_manual().

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex)) + 
  facet_wrap(~ species) +
  scale_color_manual(values = c("darkorange", "cyan4"), labels = c("Female", "Male", "Unknown")) +
  theme_minimal()  +
  labs(
    x = "Flipper Length (mm)", 
    y = "Body Mass (g)"
  ) 
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Changing points

You can change the shape of the points, see here: https://r4ds.had.co.nz/data-visualisation.html

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, fill = sex, shape = sex), size = 3) + 
  facet_wrap(~ species) +
  scale_fill_manual(values = c("darkorange", "cyan4")) +
  scale_shape_manual(values = c(21, 22)) +
  theme_minimal()  +
  labs(
    x = "Flipper Length (mm)", 
    y = "Body Mass (g)"
  ) 
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

Tidyverse

Importing data

One easy way to import data is clicking “File > Import Dataset” and follow the suggestions from there. Functions you can use are, for example, read_csv(). This will import the data and paste the code for it in the console. You can copy that code from the console.

You can see which directory R is using by default to import data by looking at the bottom right and the Files pane, or you can run getwd() to see the working directory or setwd() to set it.

We will use data from New York’s Department of Sanitation below. The data was downloaded from here https://data.cityofnewyork.us/City-Government/DSNY-Monthly-Tonnage-Data/ebb7-mvp5.

NY_waste <- read.csv("NY waste.csv")
head(NY_waste)
##         date   borough district              type  value
## 1 2005-01-01 Manhattan       09            refuse 3006.0
## 2 2005-01-01 Manhattan       09             paper  276.9
## 3 2005-01-01 Manhattan       09               mgp  180.3
## 4 2005-01-01 Manhattan       09 resident_organics    0.0
## 5 2005-01-01 Manhattan       09   school_organics    0.0
## 6 2005-01-01 Manhattan       09            leaves    0.0

This data set contains information about the amount of waste collected in different locations in New York city. The first column specifies the date of collection. The next two columns specify the borough and district (location). The next column specifies the type of rubbish collection ranging from refuse (general waste) to Christmas trees.

Filtering and selecting

We use filter() to subset rows and select() to choose specific columns. Here, to demonstrate we use the penguins data again.

# Filter for "Adelie" species and select specific columns
adelie_penguins <- penguins %>%
  filter(species == "Adelie") %>%
  select(species, island, bill_length_mm, bill_depth_mm)

head(adelie_penguins)
## # A tibble: 6 × 4
##   species island    bill_length_mm bill_depth_mm
##   <fct>   <fct>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## 6 Adelie  Torgersen           39.3          20.6

Create new variables

The mutate() function is used to create or transform columns. Here, we create a new column called bill_ratio (bill length divided by bill depth).

penguins <- penguins %>%
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

head(penguins)
## # A tibble: 6 × 9
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>

Summarizing data

We can use summarize() to summarize data. We can summarize by group for example, if we use group_by() as well.

penguin_summary <- penguins %>%
  group_by(species) %>%
  summarise(
    avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
    avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
    count = n()
  )

penguin_summary
## # A tibble: 3 × 4
##   species   avg_bill_length avg_bill_depth count
##   <fct>               <dbl>          <dbl> <int>
## 1 Adelie               38.8           18.3   152
## 2 Chinstrap            48.8           18.4    68
## 3 Gentoo               47.5           15.0   124

Data reshaping

The pivot_longer() function converts the dataset from wide to long format. Here, gathering bill measurements into a single column.

penguins_long <- penguins %>%
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm),
    names_to = "measurement",
    values_to = "value"
  )

head(penguins_long)
## # A tibble: 6 × 9
##   species island    flipper_length_mm body_mass_g sex     year bill_ratio
##   <fct>   <fct>                 <int>       <int> <fct>  <int>      <dbl>
## 1 Adelie  Torgersen               181        3750 male    2007       2.09
## 2 Adelie  Torgersen               181        3750 male    2007       2.09
## 3 Adelie  Torgersen               186        3800 female  2007       2.27
## 4 Adelie  Torgersen               186        3800 female  2007       2.27
## 5 Adelie  Torgersen               195        3250 female  2007       2.24
## 6 Adelie  Torgersen               195        3250 female  2007       2.24
## # ℹ 2 more variables: measurement <chr>, value <dbl>

Conversely, use pivot_wider() to convert from long to wide.

Penguins linear model and correlation

You can calculate correlations using cor(). You can fit a linear model using lm().

# Fit a linear model: body_mass_g ~ flipper_length_mm
lm_model <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(lm_model)
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1058.80  -259.27   -26.88   247.33  1288.69 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5780.831    305.815  -18.90   <2e-16 ***
## flipper_length_mm    49.686      1.518   32.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 394.3 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.759,  Adjusted R-squared:  0.7583 
## F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

We also plot the linear model on a scatter plot:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(
    title = "Linear Model: Body Mass vs. Flipper Length",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).