TDM 40100: Project 2 — 2022

Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3 database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, sqlite3, with the opportunity to use other skills you’ve built throughout the previous years.

Context: In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3 — a powerful database engine. In this project (and following projects), we will branch into SQL and sqlite3-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.

Scope: sqlite3, lmod, SQL

Learning Objectives
  • Create your own sqlite3 database file.

  • Analyze a large dataset and formulate CREATE TABLE statements designed to store the data.

  • Insert data into your database.

  • Run one or more queries to test out the end result.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json

Questions

Question 1

The goodreads dataset has a variety of files: /anvil/projects/tdm/data/goodreads/original. With that being said there are 4 files which hold the bulk of the data. The rest is mostly derivitives of those 4 files.

  • /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json

Take a look at the 4 files included in this dataset. How many bytes of data total do the 4 files take up on the filesystem?

You can use du in a bash cell to get this information.

Approximately how many books and how many reviews are included in the datasets?

Finally, take a look at the first book.

{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin's Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "title_without_series": "W.C. Fields: A Life on Film"}

As you can see, there is an image_url included for each book. Use bash tools to download one of the images to $HOME/p02output. How much space does it take up (in bytes)?

Use wget to download the image. Rather than using cd to first navigate to $HOME/p02output before using wget to download the image, instead, use a wget option to specify the directory to download to.

It is okay to manually copy/paste the link from the json.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

We decided we want to download more than 1 image in order to approximate how much space the images take on average.

In the previous question we said it was okay to manually copy/paste the image_url — this time, you probably won’t want to do that. You can use a bash tool called jq to extract the links automatically. jq is located /anvil/projects/tdm/bin/jq.

The --raw-output option to jq will be useful as well.

Use bash tools (and only bash tools, from within a bash cell) to download 25 random book images to $HOME/p02output, and calculate the average amount of space that each image takes up. Use that information to estimate how much space it would take to store the images for all of the book in the dataset.

Take a look at the shuf command in bash: man shuf.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Okay, so roughly, in total, we are looking at around 34 gb of data. With that size it will definitely be useful for us to create a database. After all, answering questions like:

  • What is the average rating of Brandon Sandersons books?

  • What are the titles of the 5 books with the most number of ratings?

  • Etc.

Is not very straightforward if we handed you this data and said "get that info please", but, if we had a nice sqlite database — it would be easy! So let’s start planning this out.

First, before we do that, it would make sense to get a sample of each of the datasets. Working with samples just makes it a lot easier to load the data up and parse through it.

Use shuf to get a random sample of the goodreads_books.json and goodreads_reviews_dedup.json datasets. Approximate how many rows you’d need in order to get the datasets down to around 100 mb each, and do so. Put the samples, and copies of goodreads_book_authors.json and goodreads_book_series.json in $HOME/goodreads_samples.

It just needs to be approximately 100mb — no need to fuss, as long as it is within 50mb we are good.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Check out the 5 storage classes (which you can think of as types) that sqlite3 uses: www.sqlite.org/datatype3.html

In a markdown cell, write out each of the keys in each of the json files, and list the appropriate storage class to use. For example, I’ve provided you an example of what we are looking for, for the goodreads_reviews_dedup.json.

  • user_id: TEXT

  • book_id: INTEGER

  • review_id: TEXT

  • rating: INTEGER

  • review_text: TEXT

  • date_added: TEXT

  • date_updated: TEXT

  • read_at: TEXT

  • started_at: TEXT

  • n_votes: INTEGER

  • n_comments: INTEGER

You don’t need to copy/paste the solution for goodreads_reviews_dedup.json since we provided it for you.

You do not need to assign a type to the following keys in goodreads_books.json: series, popular_shelves, similar_books, and authors.

  • Assume isbn, asin, kindle_asin, isbn13 columns could start with a leading 0.

  • Assume any column ending in id could _not start with a leading 0.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Please include the CREATE TABLE statements in code cells for this question, but realize that you will have to pop open a terminal and launch sqlite3 to complete this problem.

To do so run the following in the new terminal.

module use /anvil/projects/tdm/opt/core
module load tdm
module load sqlite/3.39.2

sqlite3 my.db # this will create an empty database

You will then be inside a sqlite3 session and able to run sqlite-specific dot functions (which you can see after running .help), or SQL queries.

For now, let’s ignore the "problematic" columns in the goodreads_books.json dataset (series, popular_shelves, similar_books, and authors).

Translate the work you did in the previous question to 4 CREATE TABLE statements that will be used to create your sqlite3 database tables. Check out some examples here. For now, let’s keep it straightforward — ignore primary and foreign keys, and just focus on building the 4 tables with the correct types. Similarly, don’t worry about any of the restrictions like NOT NULL or UNIQUE. Name your tables: reviews, books, series, and authors.

Once you’ve created your CREATE TABLE statements, create a database called my.db in your $HOME directory — so $HOME/my.db. Run your CREATE TABLE statements, and, in your notebook, verify the database has been created properly by running the following.

%sql sqlite:////home/x-kamstut/my.db # change x-kamstut to your username
%%sql

SELECT sql FROM sqlite_master WHERE name='reviews';
%%sql

SELECT sql FROM sqlite_master WHERE name='books';
%%sql

SELECT sql FROM sqlite_master WHERE name='series';
%%sql

SELECT sql FROM sqlite_master WHERE name='authors';
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.