TDM 20100: Project 3 — 2023
Motivation: The need to search files and datasets based on text is common during various parts of the data wrangling process. As an example, grep is a powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated. (Even professionals can make critical mistakes.) With that being said, learning some of the basics will come in handy, regardless of the language in which you are using regular expressions.
Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less interested in UNIX tools, you should definitely take the time to learn regular expressions.
Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.
Scope: grep, regular expression basics, utilizing regular expression tools in R and Python
Dataset(s)
The following questions will use the files in this directory:
-
/anvil/projects/tdm/data/consumer_complaints/
and, in particular, several questions will focus on the data in this file:
-
/anvil/projects/tdm/data/consumer_complaints/processed.csv
Let’s assume for a second that we didn’t provide you with the location of this project’s dataset, and you didn’t know the name of the file either. With all of that being said, you do know that it is the only dataset with the text "That’s the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. You may use the grep command to search for the dataset. You can start in the
Just know that you’d eventually find the file without using the wildcard, but we don’t want to waste your time.
Use
When you search for this sentence in the file, make sure that you type the single quote in "That’s" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence. Or, just use a unique part of the sentence that will likely not exist in another file.
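As a rough illustration of this kind of search, the sketch below builds a small throwaway directory tree and looks for a unique fragment of a sentence in every directory whose name starts with a given prefix. The directory names, file names, and file contents are all made up for the example; only the grep options (-R, -i, -n) are real. Searching for a fragment that contains no apostrophe sidesteps the ASCII versus curly quote problem entirely.

```shell
# Build a small throwaway directory tree (hypothetical names) to search.
workdir=$(mktemp -d)
mkdir -p "$workdir/consumer_a" "$workdir/consumer_b" "$workdir/other"
printf 'id,text\n1,nothing to see here\n' > "$workdir/consumer_a/notes.csv"
printf 'id,text\n1,fine\n2,That is some Fraudy Fraudulent Fraud\n' > "$workdir/consumer_b/data.csv"
printf 'id,text\n1,fraudy fraudulent fraud\n' > "$workdir/other/skip.csv"

# -R: recurse into directories, -i: ignore case, -n: show line numbers.
# The wildcard consumer* limits the search to directories starting with "consumer",
# so the match inside "other" is never examined.
matches=$(grep -Rin "fraudy fraudulent fraud" "$workdir"/consumer*)
echo "$matches"

# Clean up the temporary tree.
rm -rf "$workdir"
```

Each result has the form path:line-number:matching-line, so the file name and the location of the text come back together.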
Question 1 (1 pt)
-
Write a grep command that finds the dataset containing the text "朝阳区" in all directories that start with air inside the /anvil/projects/tdm/data directory. As with the example given above, your search should be case-insensitive, and your command needs to display the line numbers for the location of the text.
Question 2 (1.5 pts)
-
Use the head command to print out the first line only from the file /anvil/projects/tdm/data/consumer_complaints/processed.csv.
Using the head command, we can (in general) quickly print out the first n lines of a file. A CSV file typically has a header row to explain what data each column holds.
head -n numberoflines filename
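A minimal sketch of the head usage above, run against a tiny made-up CSV rather than the project file. The column names and values here are invented for illustration. The second command is an optional extra, not something the question asks for: it numbers the header fields, which can make it easier to pick field numbers later.

```shell
# Create a tiny CSV (made-up contents) with a header row and two data rows.
sample=$(mktemp)
printf 'Date received,Product,Issue\n2023-01-05,Mortgage,Late fee\n2023-01-06,Credit card,Billing dispute\n' > "$sample"

# head -n 1 prints only the first line, which here is the header row.
header=$(head -n 1 "$sample")
echo "$header"

# Optional: print each header field next to its position (1-based).
fields=$(head -n 1 "$sample" | awk -F',' '{ for (i = 1; i <= NF; i++) print i, $i }')
echo "$fields"

rm -f "$sample"
```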
-
Print out the first 5 lines of 3 columns, namely Date Received, Issue, and Company response to consumer, from the file /anvil/projects/tdm/data/consumer_complaints/processed.csv.
Use the cat command to view all file contents, the head command to control the rows, and the cut command to select columns.
cat filename | head -n rowNumbers | cut -d 'delimiterhere' -f field1,field2,...
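The pipeline pattern above can be tried on a small made-up file first. In this sketch the file contents, the number of rows, and the field numbers are assumptions chosen for the example; the real file has different columns in different positions.

```shell
# A tiny CSV with made-up contents: one header row and three data rows.
sample=$(mktemp)
printf '%s\n' \
  'Date received,Product,Issue,State' \
  '2023-01-05,Mortgage,Late fee,IN' \
  '2023-01-06,Credit card,Billing dispute,OH' \
  '2023-01-07,Mortgage,Escrow,IL' > "$sample"

# Keep the first 3 lines (header plus 2 rows), then fields 1 and 3 of each line.
subset=$(cat "$sample" | head -n 3 | cut -d ',' -f 1,3)
echo "$subset"

rm -f "$sample"
```

Note that cut splits on every delimiter character, including commas that appear inside quoted free-text fields, so field numbers can shift on rows whose text contains commas.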
-
For the single line where we heard about the "That’s the sort of fraudy fraudulent fraud", print out these 4 columns: Date Received, Issue, Consumer complaint narrative, and Company response to consumer.
You can find the exact line from the file where the "fraudy fraudulent fraud" occurs, by using the
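One possible way to isolate a single matching line and then select columns from it is sketched below on made-up data. This is only an illustration under the assumption that exactly one line matches; it is not necessarily the approach the hint has in mind, and the field numbers are invented for the example.

```shell
# A tiny CSV with made-up contents; exactly one line contains the phrase.
sample=$(mktemp)
printf '%s\n' \
  'Date received,Issue,Narrative' \
  '2023-01-05,Late fee,Charged twice' \
  '2023-01-06,Fraud,fraudy fraudulent fraud here' \
  '2023-01-07,Escrow,Wrong balance' > "$sample"

# grep -n prefixes each match with "line-number:"; cut keeps just the number.
line_no=$(grep -n "fraudy fraudulent fraud" "$sample" | cut -d ':' -f 1)
echo "$line_no"

# Print exactly that line: take the first line_no lines, then keep the last one.
row=$(head -n "$line_no" "$sample" | tail -n 1)

# Select columns 1 and 2 from the isolated line.
picked=$(echo "$row" | cut -d ',' -f 1,2)
echo "$picked"

rm -f "$sample"
```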
Question 3 (2 pts)
-
From the file /anvil/projects/tdm/data/consumer_complaints/processed.csv, use a one-line statement to create a new dataset called midwest.csv that meets the following requirements:
-
it will contain only the data for these five states: Indiana (IN), Ohio (OH), Illinois (IL), Wisconsin (WI), and Michigan (MI)
-
it will contain only these five columns: Date Received, Issue, Consumer complaint narrative, Company response to consumer, and state
-
Be careful that you don’t accidentally get lines with a word like "AGILE" in them (IL is the state code of Illinois and is present in the word "AGILE").
-
Use the '>' redirection operator to create the new file, e.g.,
createthefile > midwest.csv
-
Please describe how many rows of data are in the new file, and find the size of the new file in megabytes.
520953
520953 /home/x-nzhou1/midwest.csv
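The filtering idea in this question can be sketched on a small made-up file. The column layout, the field numbers, and the choice of 1 MB = 1,000,000 bytes are all assumptions for the example. The pattern matches a state code only when it fills a whole comma-separated field, which is one way (among several) to avoid false hits such as "AGILE".

```shell
# Made-up input: the State column is field 3, and one row contains "AGILE".
sample=$(mktemp)
out=$(mktemp)
printf '%s\n' \
  'Date received,Issue,State,Response' \
  '2023-01-05,Late fee,IN,Closed' \
  '2023-01-06,AGILE billing error,CA,Closed' \
  '2023-01-07,Escrow,IL,In progress' \
  '2023-01-08,Fraud,TX,Closed' > "$sample"

# Match a state code only when it sits between two commas, so "IL" inside
# "AGILE" does not count. Then keep 3 columns and redirect into a new file.
grep -E ',(IN|OH|IL|WI|MI),' "$sample" | cut -d ',' -f 1,2,3 > "$out"
result=$(cat "$out")
echo "$result"

# Count rows, then convert the byte count to megabytes
# (assumption: 1 MB = 1,000,000 bytes).
rows=$(wc -l < "$out" | tr -d ' ')
bytes=$(wc -c < "$out" | tr -d ' ')
megabytes=$(awk -v b="$bytes" 'BEGIN { printf "%.6f", b / 1000000 }')
echo "rows: $rows, size: $megabytes MB"

rm -f "$sample" "$out"
```

The comma-delimited pattern assumes the state code is never the last field on a line and that no free-text field contains a bare state code between commas; both assumptions would need checking against the real data.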
Question 4 (1.5 pts)
-
Use the grep command on the new dataset midwest.csv to find the number of rows that contain one (or more) of the following words (the search is case-insensitive): "improper", "struggling", or "incorrect".
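A small sketch of counting lines that match any of several words, using made-up text rather than midwest.csv. It also shows how the count changes when whole-word matching is requested; which behavior is wanted for the question is left open here.

```shell
# Made-up lines of text to search.
sample=$(mktemp)
printf '%s\n' \
  'The fee was Improper and incorrect' \
  'No issue here' \
  'I am STRUGGLING to pay' \
  'Incorrectly reported balance' > "$sample"

# -i: ignore case, -E: extended regex so | means "or", -c: count matching lines.
# A line with two of the words is still counted once.
count=$(grep -i -c -E 'improper|struggling|incorrect' "$sample")
echo "$count"

# Adding -w requires whole-word matches, so "Incorrectly" no longer counts.
count_whole_words=$(grep -i -c -w -E 'improper|struggling|incorrect' "$sample")
echo "$count_whole_words"

rm -f "$sample"
```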
Question 5 (2 pts)
-
In the file /anvil/projects/tdm/data/consumer_complaints/processed.csv, which date appears the most in the Date received column?
-
In the file /anvil/projects/tdm/data/consumer_complaints/processed.csv, for each category of Product, how many times does that type of product appear in the dataset?
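Frequency questions like these are commonly answered with a cut, sort, uniq -c pipeline. The sketch below uses a made-up two-column file, so the field numbers are assumptions and will differ for the real dataset; it is one common pattern, not the only way to do it.

```shell
# Made-up data: column 1 is a date, column 2 is a product category.
sample=$(mktemp)
printf '%s\n' \
  'Date received,Product' \
  '2023-01-05,Mortgage' \
  '2023-01-05,Credit card' \
  '2023-01-06,Mortgage' \
  '2023-01-05,Mortgage' > "$sample"

# Skip the header (tail -n +2), keep column 1, sort so equal values are adjacent,
# count each run with uniq -c, sort counts in descending numeric order,
# and keep the top entry.
top_date=$(tail -n +2 "$sample" | cut -d ',' -f 1 | sort | uniq -c | sort -rn | head -n 1)
echo "$top_date"

# Same idea for column 2, but keep every category with its count.
product_counts=$(tail -n +2 "$sample" | cut -d ',' -f 2 | sort | uniq -c | sort -rn)
echo "$product_counts"

rm -f "$sample"
```

The sort before uniq -c matters because uniq only collapses adjacent duplicate lines.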
Project 03 Assignment Checklist
-
Code used to solve questions 1 to 5
-
Output from running the code
-
Copy the code and outputs to a new Python file
-
firstname-lastname-project03.ipynb
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.