STAT 19000: Project 12 — Fall 2021
Motivation: In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. lubridate is a package within the famous tidyverse that greatly simplifies some of the most common tasks one needs to perform with date data.
Context: We’ve been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a tidyverse package that adds great stand-alone value when working with dates.
Scope: r
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
For this project, when launching your Jupyter Lab instance, please select 5000 as the amount of memory to allocate.
Read the dataset into a dataframe called liquor.
We are interested in exploring time-related trends in Iowa liquor sales. What is the data type for the column Date?
Try to run the following code to get the time between the first and second sale.
liquor$Date[1] - liquor$Date[2]
As you may have expected, we cannot use the standard operators (like + and -) on this type.
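A minimal sketch of why the subtraction fails, assuming the column was read in as character strings. The two values below are made up for illustration and are not taken from the dataset.

```r
# Made-up character values standing in for two entries of liquor$Date.
d <- c("11/20/2015", "11/18/2015")

# Arithmetic on character strings raises an error; tryCatch() captures it
# so the script keeps running.
result <- tryCatch(d[1] - d[2], error = function(e) "error")
result
class(d)  # "character"
```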
Create a new column named date that contains the Date column converted to date format using the function as.Date().
From this point on, you will have 2 "date" columns: the original Date and the new date. This allows us to see different ways to work with the data.
You may need to define the date format in the as.Date() function using the argument format.
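As a small sketch of the format argument, the toy vector below stands in for a character date column. The month/day/year layout is an assumption made for illustration, so check which layout your data actually uses before copying the format string.

```r
# Toy stand-in for a character date column (layout assumed: month/day/year).
raw_dates <- c("11/20/2015", "11/18/2015")

# %m = month, %d = day of month, %Y = four-digit year.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

class(parsed)          # "Date"
parsed[1] - parsed[2]  # Time difference of 2 days
```

If the format string does not match the text, as.Date() returns NA rather than raising an error, so it is worth inspecting a few converted values.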
Try running the following code now.
liquor$date[1] - liquor$date[2]
Much better! This is just 1 reason why it is important to have the data in your dataframe be of the correct type.
Double check that the date got converted properly. The year for |
Relevant topics: read.csv, fread, as.Date, str
- Code used to solve this problem.
- Output from running the code.
Question 2
Create two new columns in the dataset called year and month based on the Date column.
Which years are covered in this dataset regarding Iowa liquor sales? Do all years have all months represented?
Use the as.Date function again, and set the format to contain only the information wanted. See an example below.
Update: It came to our attention that the |
myDate <- as.Date('2021-11-01')
day <- as.numeric(format(myDate,'%d'))
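The same pattern extends to other date components. The sketch below uses made-up dates (not values from the dataset) and shows one possible way to pull out the year and month and to cross-tabulate them.

```r
myDate <- as.Date("2021-11-01")

# %Y = four-digit year, %m = month number.
yr <- as.numeric(format(myDate, "%Y"))
mo <- as.numeric(format(myDate, "%m"))
yr  # 2021
mo  # 11

# Made-up dates: a year-by-month table shows which combinations occur.
dates <- as.Date(c("2020-01-15", "2020-02-03", "2021-01-09"))
counts <- table(year = format(dates, "%Y"), month = format(dates, "%m"))
counts
```

A zero in the table marks a year and month combination with no rows.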
Relevant topics: substr, as.numeric, format, unique, table
- Code used to solve this problem.
- Output from running the code.
Question 3
A useful package for dealing with dates is called lubridate. The package is part of the famous tidyverse suite of packages. Run the code below to load it.
library(lubridate)
Re-do questions 1 and 2 using the lubridate package. Make sure to name the columns differently, for example date_lb, year_lb and month_lb.
Do you have a preferred method for solving the questions? Why or why not?
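For comparison, here is a rough sketch of the lubridate style on made-up values. It assumes the lubridate package is installed and that the text is laid out as month/day/year; mdy() is one of several parsers (ymd(), dmy(), and so on) named after the order of the components.

```r
library(lubridate)

# Made-up character dates (layout assumed: month/day/year).
raw_dates <- c("11/20/2015", "11/18/2015")

date_lb  <- mdy(raw_dates)   # parse month-day-year text into Date values
year_lb  <- year(date_lb)    # numeric year
month_lb <- month(date_lb)   # numeric month, 1 through 12

date_lb[1] - date_lb[2]      # Time difference of 2 days
```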
Relevant topics: Lubridate Cheat Sheet
- Code used to solve this problem.
- Output from running the code.
- Sentence explaining which method you prefer and why.
Question 4
Now that we have the columns year and month, let’s explore the data for time trends.
What is the average volume (gallons) of liquor sold per month? Which month has the lowest average volume? Does that surprise you?
You can change the labels in the x-axis to be months by having the argument |
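A minimal sketch of a per-month average with tapply, using a toy data frame. The column names month and volume and all of the numbers are hypothetical. One way (not necessarily the one intended here) to relabel the x-axis is to suppress the default axis with xaxt = "n" and draw a custom one with axis().

```r
# Toy data; column names and values are hypothetical.
sales <- data.frame(
  month  = c(1, 1, 2, 2, 3, 3),
  volume = c(2.0, 4.0, 1.0, 3.0, 5.0, 7.0)
)

# Average volume within each month.
avg_by_month <- tapply(sales$volume, sales$month, mean)
avg_by_month

# Month with the lowest average.
names(avg_by_month)[which.min(avg_by_month)]  # "2"

# Plot without the default x-axis, then add month abbreviations as labels.
plot(as.numeric(names(avg_by_month)), avg_by_month, type = "b",
     xaxt = "n", xlab = "Month", ylab = "Average volume (gallons)",
     main = "Average volume sold per month (toy data)")
axis(1, at = 1:3, labels = month.abb[1:3])
```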
Relevant topics: tapply, plot
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences describing your findings.
Question 5
Make a line plot of the average volume sold per month for the years 2012 to 2015. Your plot should contain 4 lines, one for each year.
Make sure you specify a title, and label your axes.
Write 1-2 sentences analyzing your plot.
There are many ways to get an average per month. You can use |
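As one illustration, tapply accepts a list of grouping factors and returns a matrix of averages, which can then be drawn one column at a time. The sketch below uses a toy data frame with hypothetical column names and only two years and three months; it is one possible approach, not the required one.

```r
# Toy data; column names and values are hypothetical.
sales <- data.frame(
  year   = rep(c(2012, 2013), each = 6),
  month  = rep(rep(1:3, each = 2), times = 2),
  volume = c(2, 4, 1, 3, 5, 7, 3, 5, 2, 4, 6, 8)
)

# Matrix of averages: rows are months, columns are years.
avg <- tapply(sales$volume, list(sales$month, sales$year), mean)
avg

years <- colnames(avg)

# Draw the first year, then add one line per remaining year.
plot(1:3, avg[, 1], type = "l", col = 1, ylim = range(avg),
     xaxt = "n", xlab = "Month", ylab = "Average volume (gallons)",
     main = "Average volume sold per month, by year (toy data)")
axis(1, at = 1:3, labels = month.abb[1:3])
for (j in seq_along(years)[-1]) {
  lines(1:3, avg[, j], col = j)
}
legend("topleft", legend = years, col = seq_along(years), lty = 1)
```

Setting ylim from the whole matrix keeps every line inside the plotting region, which would not be guaranteed if the limits came from the first year alone.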
Relevant topics: plot, line, subset, mean, sapply, tapply
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences analyzing your plot.
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.