Book R for Data Science by Hadley Wickham and Garret Grolemund. This is an excellent resource to learn more about how to use the data in R, how to transform it, model it and visualize it.
Introduction to R, Rstudio. This is the introduction linked in your homework 1. We will go over parts of this document in section and summarize some parts below.
The content of this introduction borrows material from the course R for Undergraduates (Stats 32).
To install R go to: https://cran.r-project.org/. Click the link at the top of the page that says “Download R for”, and follow the instructions on that page. For Mac you will want to download the latest .pkg file compatible with your operating system. For Windows you can directly follow the link to the base distribution.
Next, you should install RStudio Desktop from https://www.rstudio.com/products/rstudio/download/, following instructions according to your operating system.
When you open RStudio, you will see 3 different windows along with a number of tabs:
Bottom-left This is the R console, where you key in commands to be run in an interactive fashion. Type in your command and hit the Enter key. Once you hit the Enter key, R executes your command and prints the result, if any.
Top-right Environment: List of objects that we have created or have access to. We can also see the list of objects using the command ls(). History: List of commands that we have entered into the console.
Bottom-right Files: Allows you to navigate the directory structure on your computer. Plots: Any graphical output you make will be displayed here. Packages: List of packages that have been installed on your computer. Help: Documentation for functionName appears here when you type ?functionName in the console. Viewer: For displaying local web content.
Top-left This space will will be your Rscript / Rmarkdown file.
R Markdown is an authoring format that combines text with embedded R code to create dynamic, reproducible documents. It leverages the simplicity of Markdown for formatting text alongside R for performing statistical analyses and generating visualizations. This integration allows users to seamlessly weave narrative and computational content into a single cohesive document.
The blue text with the hashes ## are headings. The
number of hashes corresponds to whether you want a title, subtitle,
etc.
To include R code you need to write a code “chunk”.
If you want to create a new R Markdown, click File (at the top left), then New File and then R Markdown.
You can “compile” your R Markdown in to html, pdf and other documents. This process is called “knitting”. You can knit the document via the “knit” button at the top of the file (blue yarn with knitting needle).
For your first homework, to produce a pdf, you can follow one of these steps:
This is described in the homework itself: “To convert these to .pdf files, open them in your browser and PRINT them: instead of selecting a physical printer, look for options like Save as PDF, Microsoft Print to PDF, or Adobe PDF. This will create a pdf file that spans multiple pages.”
Specifically: 1. Knit your file 2. Click open in browser (top right of the output generated) or locate the file in your directory and open it directly. 3. Print the webpage (Cmd + ‘P’ on Mac or Ctrl + P on windows). 4. Select “Save as PDF”
This is what the ideal process would be, but we do not require it for the first homework. If you would like to create PDF documents from R Markdown, you will need to have a LaTeX distribution installed. Although there are several traditional options including MiKTeX, MacTeX, and TeX Live,you can also install TinyTeX.
install.packages('tinytex')
tinytex::install_tinytex()
After you have installed the package, you should restart Rstudio. To restart RStudio, go to the Session tab and select “Restart R”. After you’ve restarted, create a new R Markdown document by selecting the new file button at the top right. After you’ve selected that button, choose the R Markdown option, select “PDF” as your output format, and use the default settings of the R Markdown Document.
Note: You will get a .pdf popup. Do not print this popup. In the same folder where your Rmarkdown file lives, there will be a .pdf document that you can use.
We will now go over some aspects of using R.
Use R as a calculator:
1 + 2
## [1] 3
# 567*7
# 5/2
Often, we want to store the result of a computation so that we can use it later. R allows us to do this by variable assignment. Variable names must start with a letter and can only contain letters, numbers, _ and ..
x <- 2
Notice that no output was printed. This is because the act of variable assignment doesn’t produce any output. If we want to see what x contains, simply key its name into the console:
x
## [1] 2
Note that x has been added to the environment (top
right).
You can use rm() to remove variables from the
environment
rm(x)
The most common data types in R are numeric, character, logical and
factor. You can check the type using typeof() function.
typeof("Hi")
## [1] "character"
typeof(1)
## [1] "double"
typeof(TRUE)
## [1] "logical"
We can use numeric variables in computations:
a <- 3
b <- 2
a^2 + 3*b
## [1] 15
We can reassign a to a different value. This new value
may be defined using a.
a <- a + 1
a
## [1] 4
For data analysis, we often have to work with multiple values at the same time. There are a number of different R data structures which allow us to do this. The vector is a 1-dimensional array whose entries are the same type. For example, the following code produces a vector containing the character data “Stanford”, “Harvard” and “Columbia”:
universities <- c("Stanford", "Harvard", "Columbia")
universities
## [1] "Stanford" "Harvard" "Columbia"
Certain vectors can be produced using short cuts. To produce a vector with the numbers 1,2,3,4,5 we can use 1:5 instead of c(1,2,3,4,5).
numbers <- 1:5
numbers
## [1] 1 2 3 4 5
You produce very large vectors quickly using this.
1:20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
R allows us to access individual elements in a vector. Unlike many other programming languages, indexing begins at 1, not 0. For example, to return the first entry in the vector universities, I would use the following code:
universities[1]
## [1] "Stanford"
To get more than one entry from a vector you cannot just write more than one number inside the square brackets. This will return an error
universities[1,3]
To get more than one entry, you need to but a vector inside the square brackets. So you need to write c(1,3) instead of 1,3.
universities[c(1,3)]
## [1] "Stanford" "Columbia"
In vectors, the elements have to be of the same type. To have elements with different types in one data structure, we can use a list.
Lists are made in the same way as vectors but you write list() instead of c(). You also have the option of naming each element in the list. Below we make a named list called person.
celebrity <- list(person = "Ryan Reynolds", age = 45)
celebrity
## $person
## [1] "Ryan Reynolds"
##
## $age
## [1] 45
To access the person element in celebrity, write celebrity$person
celebrity$person
## [1] "Ryan Reynolds"
This is the quickest way to do indexing in a list. If the lists are not named you can also do
celebrity[[1]]
## [1] "Ryan Reynolds"
You can also use $ to add new elements to a list. We can add Ryan Reynolds’ job to the list `celebrity’ like so
celebrity$job <- "Actor"
celebrity
## $person
## [1] "Ryan Reynolds"
##
## $age
## [1] 45
##
## $job
## [1] "Actor"
The elements of a list can be anything, even another data structure (vector, matrix, etc.)! We can add a vector with the names of Ryan Reynolds’ children to the list celebrity.
celebrity$children <- c("James", "Inez", "Betty")
celebrity
## $person
## [1] "Ryan Reynolds"
##
## $age
## [1] 45
##
## $job
## [1] "Actor"
##
## $children
## [1] "James" "Inez" "Betty"
Our list is getting quite long. The str() function (short for structure) provides a useful summary of the overall structure of a complex data structure. Running str(celebrity) gives a more condensed view of our list
str(celebrity)
## List of 4
## $ person : chr "Ryan Reynolds"
## $ age : num 45
## $ job : chr "Actor"
## $ children: chr [1:3] "James" "Inez" "Betty"
To see the names associated with a list, use the names() function:
names(celebrity)
## [1] "person" "age" "job" "children"
As with vectors, elements of matrices have to be of the same type.
We use the matrix() command to change the vector c(1,2,3,4,5,6) into a matrix:
A <- matrix(c(1,2,3,4,5,6), nrow = 2)
A
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Notice that R takes the elements in the vector you give it and fills
in the matrix column by column. If we want the elements to be filled in
by row instead, we have to put in a byrow = TRUE
argument:
B <- matrix(c(1,2,3,4,5,6), nrow = 2, byrow = TRUE)
B
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
To get the dimensions of the matrix, we can use the dim(), nrow() and ncol() functions.
dim(B)
## [1] 2 3
nrow(B)
## [1] 2
ncol(B)
## [1] 3
To access the element in the ith row and
jth column for the matrix B, use the index
i,j:
B[1, 2]
## [1] 2
You install packages using install.packages(). Here, we
will use the package tidyverse and
palmerpenguins (for penguin data). To install packages you
use install.packages().
install.packages(c("tidyverse", "palmerpenguins"))
Once you have installed a package, every time you want to use it you need to load the package:
We use `ggplot() for plotting. This is included in the
tidyverse() package.
Let’s first look at the penguins data. Summary allows you to get a summary of each variable:
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Using head shows you the first few rows of the dataset:
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
To find out the dimensions of the data, use:
dim(penguins)
## [1] 344 8
ncol(penguins)
## [1] 8
nrow(penguins)
## [1] 344
If we are interested in a particular variable, we can use:
summary(penguins$sex)
## female male NA's
## 165 168 11
summary(penguins$bill_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.10 39.23 44.45 43.92 48.50 59.60 2
Let’s plot body mass as a function of flipper length:
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Note also that a warning is printed to let us know that there are NAs in our data set. A warning is different to an error. A warning lets us know that something might have gone wrong but our code still runs. An error means that our code failed to run.
Let’s now create a scatter plot with flipper length on the x axis,
body mass on the y axis and points colored by sex. To do this, we can
use the same code as before but add the mapping color = sex
inside aes().
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
By default, ggplot will use the column names to name the
axis and legend in plots. By default the plots do not have titles. Both
of these can be changed by using the command labs() (short
for labels).
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species),
size = 2.5, alpha = 0.7) +
labs(title = "Penguin size", x = "Flipper length (mm)", y = "Body mass (g)", color = "Penguin species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
We want to make a histogram of flipper length, colored by species. To
change the color on the inside of the bars, we need to use the aesthetic
fill.
ggplot(data = penguins) +
geom_histogram(mapping = aes(x = flipper_length_mm, fill = species)) +
labs(
x = "Flipper length (mm)",
y = "Count",
fill = "Species"
)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
Note that when we changed fill, the bars stayed the same
height. The Adelie bars where stacked on top of the Chinstrap bars and
both were on top of the Gentoo bars. If we do not want the bars stacked
on top of each other, we can an optional argument of
geom_histogram() called position. If we set
position = “identity”, then instead of a stacked histogram we will get
an individual histogram for each species.
ggplot(data = penguins) +
geom_histogram(mapping = aes(x = flipper_length_mm, fill = species),
position = "identity") +
labs(
x = "Flipper length (mm)",
y = "Count",
fill = "Species"
)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
Another way we can compare the distribution of
flipper_length_mm across the different species is by using
facets. We can split the above plot into three smaller plots called
facets. To do this we can use facet_wrap(). We have to use
a tilde ~ and then the name of the variable we want to facet by. Here we
add a facet to the histogram of flipper_length_mm.
ggplot(data = penguins) +
geom_histogram(mapping = aes(x = flipper_length_mm, fill = species), show.legend = FALSE) +
facet_wrap(~ species) +
labs(
x = "Flipper length (mm)",
y = "Count"
)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
A bar graph can be used to summarize a categorical variable. Here is
a bar graph of the variable island.
ggplot(data = penguins) +
geom_bar(mapping = aes(x = island)) +
labs(
x = "Island",
y = "Count"
)
We can also use bar plots to show the interaction between two
categorical variables. One way to do this is with faceting. Let’s add a
facet based on species and also fill by
species.
ggplot(data = penguins) +
geom_bar(mapping = aes(x = island, fill = species),
show.legend = FALSE) +
facet_wrap(~species) +
labs(
x = "Island",
y = "Count"
)
We want to show the interaction between how the interaction between
species, sex, flipper_length_mm
and body_mass_g.
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species, shape = sex),
size = 2.5, alpha = 0.7)
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
Unfortunately, the different shapes are still a bit confusing. We might get a clearer result if we facet by species and color by sex.
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species),
size = 2.5, alpha = 0.7, show.legend = FALSE) +
facet_wrap(~ species) +
labs(
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
### Changing the theme
All of our plots have been using the default ggplot theme. By adding a new layer we change the theme. Here are a few examples:
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex)) +
facet_wrap(~ species) +
theme_minimal() +
labs(
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex)) +
facet_wrap(~ species) +
theme_linedraw() +
labs(
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
You can manually set the colors by adding the layer
scale_color_manual().
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = sex)) +
facet_wrap(~ species) +
scale_color_manual(values = c("darkorange", "cyan4"), labels = c("Female", "Male", "Unknown")) +
theme_minimal() +
labs(
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
You can change the shape of the points, see here: https://r4ds.had.co.nz/data-visualisation.html
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, fill = sex, shape = sex), size = 3) +
facet_wrap(~ species) +
scale_fill_manual(values = c("darkorange", "cyan4")) +
scale_shape_manual(values = c(21, 22)) +
theme_minimal() +
labs(
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).
One easy way to import data is clicking “File > Import Dataset”
and follow the suggestions from there. Functions you can use are, for
example, read_csv(). This will import the data and paste
the code for it in the console. You can copy that code from the
console.
You can see which directory R is using by default to import data by
looking at the bottom right and the Files pane, or you can run
getwd() to see the working directory or
setwd() to set it.
We will use data from New York’s Department of Sanitation below. The data was downloaded from here https://data.cityofnewyork.us/City-Government/DSNY-Monthly-Tonnage-Data/ebb7-mvp5.
NY_waste <- read.csv("NY waste.csv")
head(NY_waste)
## date borough district type value
## 1 2005-01-01 Manhattan 09 refuse 3006.0
## 2 2005-01-01 Manhattan 09 paper 276.9
## 3 2005-01-01 Manhattan 09 mgp 180.3
## 4 2005-01-01 Manhattan 09 resident_organics 0.0
## 5 2005-01-01 Manhattan 09 school_organics 0.0
## 6 2005-01-01 Manhattan 09 leaves 0.0
This data set contains information about the amount of waste collected in different locations in New York city. The first column specifies the date of collection. The next two columns specify the borough and district (location). The next column specifies the type of rubbish collection ranging from refuse (general waste) to Christmas trees.
We use filter() to subset rows and select()
to choose specific columns. Here, to demonstrate we use the penguins
data again.
# Filter for "Adelie" species and select specific columns
adelie_penguins <- penguins %>%
filter(species == "Adelie") %>%
select(species, island, bill_length_mm, bill_depth_mm)
head(adelie_penguins)
## # A tibble: 6 × 4
## species island bill_length_mm bill_depth_mm
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7
## 2 Adelie Torgersen 39.5 17.4
## 3 Adelie Torgersen 40.3 18
## 4 Adelie Torgersen NA NA
## 5 Adelie Torgersen 36.7 19.3
## 6 Adelie Torgersen 39.3 20.6
The mutate() function is used to create or transform
columns. Here, we create a new column called bill_ratio (bill length
divided by bill depth).
penguins <- penguins %>%
mutate(bill_ratio = bill_length_mm / bill_depth_mm)
head(penguins)
## # A tibble: 6 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>
We can use summarize() to summarize data. We can
summarize by group for example, if we use group_by() as
well.
penguin_summary <- penguins %>%
group_by(species) %>%
summarise(
avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
count = n()
)
penguin_summary
## # A tibble: 3 × 4
## species avg_bill_length avg_bill_depth count
## <fct> <dbl> <dbl> <int>
## 1 Adelie 38.8 18.3 152
## 2 Chinstrap 48.8 18.4 68
## 3 Gentoo 47.5 15.0 124
The pivot_longer() function converts the dataset from
wide to long format. Here, gathering bill measurements into a single
column.
penguins_long <- penguins %>%
pivot_longer(
cols = c(bill_length_mm, bill_depth_mm),
names_to = "measurement",
values_to = "value"
)
head(penguins_long)
## # A tibble: 6 × 9
## species island flipper_length_mm body_mass_g sex year bill_ratio
## <fct> <fct> <int> <int> <fct> <int> <dbl>
## 1 Adelie Torgersen 181 3750 male 2007 2.09
## 2 Adelie Torgersen 181 3750 male 2007 2.09
## 3 Adelie Torgersen 186 3800 female 2007 2.27
## 4 Adelie Torgersen 186 3800 female 2007 2.27
## 5 Adelie Torgersen 195 3250 female 2007 2.24
## 6 Adelie Torgersen 195 3250 female 2007 2.24
## # ℹ 2 more variables: measurement <chr>, value <dbl>
Conversely, use pivot_wider() to convert from long to
wide.
The lubridate package, which is part of
tidyverse is designed to handle dates in R. Here is an
example:
ymd(c("20160927", "20160230"))
## Warning: 1 failed to parse.
## [1] "2016-09-27" NA
mdy(c("6/12/16", "2/9/16"))
## [1] "2016-06-12" "2016-02-09"
dmy(c("1/9/2016", "26/9/16"))
## [1] "2016-09-01" "2016-09-26"
Let’s use the waste dataset to look at trends over time across all of
New York city.To do this we should group by date and type and use
summarize() to add up the amount of waste across all
districts.
aggregate_waste <- NY_waste %>%
group_by(date, type) %>%
summarize(amount = sum(value))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
aggregate_waste
## # A tibble: 1,498 × 3
## # Groups: date [214]
## date type amount
## <chr> <chr> <dbl>
## 1 2005-01-01 leaves 0
## 2 2005-01-01 mgp 18156
## 3 2005-01-01 paper 28658.
## 4 2005-01-01 refuse 208304.
## 5 2005-01-01 resident_organics 0
## 6 2005-01-01 school_organics 0
## 7 2005-01-01 xmas_trees 1376.
## 8 2005-02-01 leaves 0
## 9 2005-02-01 mgp 18071.
## 10 2005-02-01 paper 29118
## # ℹ 1,488 more rows
Let’s calculate the year by month:
monthly_waste <- aggregate_waste %>%
mutate(month = month(date),
year = year(date)) %>%
group_by(month, type) %>%
summarize(amount = mean(amount)) %>%
ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
ggplot(monthly_waste) +
geom_line(aes(x = month, y = amount, color = type)) +
scale_x_continuous(breaks = (1:12))
Nearly all of the waste collected in our data set is “refuse”, “mgp” (metal, glass, plastic) or “paper”. This makes it hard to see what’s going on with the other values.
monthly_waste %>%
filter(!(type %in% c("refuse", "mgp", "paper"))) %>%
ggplot(aes(x = month, y = amount, color = type)) +
geom_line() +
scale_x_continuous(breaks = (1:12))
We can see a lot of seasonal affects in the last two graphics. Three that stand out are:
Let’s say we are interested in seeing whether there are any patter by whether it was the covid year (2020) or not (previous).
refuse_2020_b <- NY_waste %>%
mutate(year = year(date),
month = month(date),
is_2020 = (year == 2020)) %>%
filter(year <= 2020,
type == "refuse") %>%
group_by(borough, month, is_2020) %>%
summarize(amount = mean(value))
## `summarise()` has grouped output by 'borough', 'month'. You can override using
## the `.groups` argument.
ggplot(refuse_2020_b) +
geom_line(aes(x = month, y = amount, color = is_2020)) +
facet_wrap(~borough) +
labs(x = "Month", y = "Amount", color = "2020?") +
scale_x_continuous(breaks = seq(1, 12))
You can calculate correlations using cor(). You can fit
a linear model using lm().
# Fit a linear model: body_mass_g ~ flipper_length_mm
lm_model <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
summary(lm_model)
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1058.80 -259.27 -26.88 247.33 1288.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5780.831 305.815 -18.90 <2e-16 ***
## flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 394.3 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
We also plot the linear model on a scatter plot:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(
title = "Linear Model: Body Mass vs. Flipper Length",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).