cut
Basics
cut divides a vector into intervals, specified by the breaks argument, and returns the result as a factor. cut is particularly useful for breaking Date data into quarters (Q1, Q2), years (1999, 2000, 2001), and so on.
The usefulness of this function depends on the interval types that breaks accepts. You can see a list of your options by running ?cut.POSIXt.
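As a minimal sketch of the idea, the snippet below applies cut to a handful of dates. The dates are hypothetical and chosen only for illustration; the point is that each level is labeled with the first day of its interval.

```r
# Hypothetical dates, chosen only for illustration.
dates <- as.Date(c("1999-02-14", "1999-07-04", "2000-01-01", "2001-11-30"))

# One level per calendar year; each level is labeled with the first day of that year.
by_year <- cut(dates, breaks = "year")
print(levels(by_year))

# One level per calendar quarter, labeled with the first day of the quarter.
by_quarter <- cut(dates, breaks = "quarter")
print(data.frame(dates, by_year, by_quarter))
```

Quarters or years with no observations still appear as (empty) levels, because cut builds a continuous sequence of intervals between the earliest and latest value.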
Examples
How can I create a new column in a data.frame df that is a factor based on the year?
df$year <- cut(df$times, breaks="year")
str(df)
'data.frame': 24 obs. of 3 variables:
 $ times: POSIXct, format: "2020-06-01 06:00:00" "2020-07-01 06:00:00" ...
 $ value: int 48 62 55 4 83 77 5 53 68 46 ...
 $ year : Factor w/ 3 levels "2020-01-01","2021-01-01",..: 1 1 1 1 1 1 1 2 2 2 ...
How can I create a new column in a data.frame df that is a factor based on the quarter?
df$quarter <- cut(df$times, breaks="quarter")
str(df)
'data.frame': 24 obs. of 4 variables:
 $ times  : POSIXct, format: "2020-06-01 06:00:00" "2020-07-01 06:00:00" ...
 $ value  : int 48 62 55 4 83 77 5 53 68 46 ...
 $ year   : Factor w/ 3 levels "2020-01-01","2021-01-01",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ quarter: Factor w/ 9 levels "2020-04-01","2020-07-01",..: 1 2 2 2 3 3 3 4 4 4 ...
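The data.frame df used in these examples is not defined on this page. To try the two solutions yourself, here is a sketch that builds a stand-in, assuming 24 monthly timestamps starting at 2020-06-01 06:00:00 (which is consistent with the str output above). The value column is random filler and the UTC time zone is an assumption.

```r
# Assumed reconstruction of df: 24 monthly timestamps starting June 2020.
# The value column is random filler, not the data shown above.
set.seed(1)
df <- data.frame(
  times = seq(as.POSIXct("2020-06-01 06:00:00", tz = "UTC"),
              by = "month", length.out = 24),
  value = sample(1:100, 24, replace = TRUE)
)

# Same calls as in the two solutions above.
df$year <- cut(df$times, breaks = "year")
df$quarter <- cut(df$times, breaks = "quarter")

# Count the rows per year and the number of quarter levels.
print(table(df$year))
print(nlevels(df$quarter))
```

Under these assumptions the factor structure matches the output above: three year levels and nine quarter levels, each labeled with the first day of its interval.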
Video Example: fars
6-hour intervals
Let’s load the 7581 data set and look at the HOUR column:
myDF <- read.csv("/depot/datamine/data/fars/7581.csv")
table(myDF$HOUR)
    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
17704 18671 17262  9908  6438  5463  6749  7088  6308  6275  7311  8401  8929  9872 12066 14138
We can break these values into 6-hour intervals using cut:
table( cut(myDF$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=TRUE) )
  [0,6]  (6,12] (12,18] (18,24] (24,99]
  82195   44312   85388   86567    1597
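This effectively gives us 5 categories: roughly midnight to 6 A.M., 6 A.M. to noon, noon to 6 P.M., 6 P.M. to midnight, and unknown (99 indicates the hour of day was not included in the entry). Because HOUR holds whole hours and the intervals are closed on the right, the categories contain the HOUR values 0 to 6, 7 to 12, 13 to 18, 19 to 24, and 99. The first interval holds seven hour values rather than six, because include.lowest=TRUE adds 0 to it.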
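If you would rather have each bin start on its first hour (0 to 5, 6 to 11, and so on), one option is right=FALSE, which closes the intervals on the left instead. The sketch below compares the two on a small hypothetical vector of hours, not the FARS data. The final break of 100 is an assumption chosen so that 99 still lands in the last bin.

```r
# Hypothetical hour values (not taken from the FARS file); 99 stands for "unknown".
hours <- c(0, 5, 6, 11, 12, 17, 18, 23, 99)

# Same style as the call above: intervals are closed on the right,
# so hour 6 lands in [0,6] and hour 12 lands in (6,12].
right_closed <- cut(hours, breaks = c(0, 6, 12, 18, 24, 99), include.lowest = TRUE)

# Alternative: intervals closed on the left, so hour 6 lands in [6,12).
# The last break is 100 (an assumption) so that 99 falls inside [24,100).
left_closed <- cut(hours, breaks = c(0, 6, 12, 18, 24, 100), right = FALSE)

print(data.frame(hours, right_closed, left_closed))
```

Note that with right=FALSE an HOUR value of 24, if the data contains any, would fall into the last bin together with 99, so check table(myDF$HOUR) before switching.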
With the help of tapply, we can find the total number of PERSONS who are involved in accidents during each 6-hour interval:
tapply( myDF$PERSONS, cut(myDF$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=TRUE), sum )
  [0,6]  (6,12] (12,18] (18,24] (24,99]
 187397  119261  238193  230289    2269
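To see what tapply is doing here without access to the FARS file, the following sketch repeats the same pattern on a tiny made-up data frame. The name toyDF and its values are hypothetical; only the column names HOUR and PERSONS are borrowed from the example above.

```r
# Made-up stand-in for myDF; the values are illustrative only.
toyDF <- data.frame(
  HOUR    = c(1, 5, 8, 13, 20, 99),
  PERSONS = c(2, 3, 1, 4, 2, 1)
)

# Assign each row to a 6-hour interval, then sum PERSONS within each interval.
bins <- cut(toyDF$HOUR, breaks = c(0, 6, 12, 18, 24, 99), include.lowest = TRUE)
totals <- tapply(toyDF$PERSONS, bins, sum)
print(totals)
```

The first two rows (hours 1 and 5) share the [0,6] interval, so their PERSONS values are added together; every other interval receives a single row.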