
CS 122D Spring 2021: Beyond SQL Data Management

Course Personnel

Instructor:

TAs:

Reader:

Important Times & Places

Weekly Lectures:
Time: MW 3:30-4:50 PM
Place: Yuja + Piazza
Instructor: Mike

Discussion 1:
Time: W 8:00-8:50 PM
Place: Zoom + Piazza
TA: Kyle

Discussion 2:
Time: Th 7:00-7:50 PM
Place: Zoom + Piazza
TA: Nada

Midterm Exam:
Time: Monday May 3, 3:30-4:50 PM
Place: Gradescope + Piazza

Final Exam:
Time: Monday June 7, 4:00-6:00 PM
Place: Gradescope + Piazza

Course Overview

The relational data model dates back to 1970, and relational database systems have been the dominant commercial technology for data management and data-centric application development for over three decades now. The explosion of the web, social media, and the Internet of Things is now bringing new data with challenging new requirements. This new data is richer and more diverse in structure, and it is arriving in unprecedented quantities. In response, new data management technologies are appearing and being used in practice to handle today’s new data challenges.

This course surveys the new data formats and new data management and data analysis technologies that are being used for post-relational data management. Data formats covered will include those targeting semistructured data (e.g., JSON). Data management approaches surveyed will include various NoSQL technologies such as key-value stores, document database systems, and graph database systems. Surveyed data analysis technologies will include big data engines such as Hadoop, Spark, and SQL-on-Hadoop; big text search and analytics engines may also be included. Notebooks and data frames may be covered as well, time permitting, as may stream data managers and/or time-series databases. (So many systems, so little time!)

The course will be structured as a series of modules on the post-relational technologies, with chosen technologies being experienced through associated hands-on homework assignments. The programming language used to interact with the different systems, where needed, will be Python. Students will gain a comprehensive understanding of the basic concepts and system alternatives for post-relational data management, along with first-hand experience with selected systems from the surveyed technology categories. The course is designed to be a follow-on to CS122A for those who are interested in learning about post-relational data management technologies.

Prerequisites

ICS 46, ICS 51, and CS 122A (or EECS 116) all with a grade of C or above. (B or above is strongly recommended.)

Textbooks

None! At least not the expensive kind. (Just NoSQL Distilled, an inexpensive professional book that can be read online for free through the UCI Library.) Nearly all of the materials will be accessible online for UCI students in some form, as there is no good textbook on this material today. Lecture watching is mandatory; students will be held responsible for any/all information given in the lectures whether or not it appears in the course notes or elsewhere. Because we are still in "pandemic mode" at UCI, and this is a fairly big class (150 students), the lectures will be made available asynchronously in video form and the instructor will meet with the class on Piazza for real-time Q&A during the lecture time slots. The plan will be for the lecture videos and slides to be available online by noon on class days (at least when things are going according to plan :-)) so that they can be watched ahead of the appointed Piazza time if desired.

Lecture Plan

Date | Topic | Relevant Material
M 3/29 | Post-relational “escape attempts” | Ch. 1-2 NoSQL Distilled, Introducing JSON website
W 3/31 | Relational databases and SQL (and “beyond”) | SQL chapters of any DB textbook, Advanced Aggregation excerpt
M 4/05 | Logical DB design and E-R modeling | E-R chapter of any DB textbook
W 4/07 | Scaling RDBMSs through parallelism | Parallel RDBMS paper
M 4/12 | Key-value stores: Architecture & consistency | Baseball paper, Ch. 4-6, 8 NoSQL Distilled, Abadi paper
W 4/14 | Column-family stores: BigTable, Cassandra | Ch. 10 NoSQL Distilled (old!)
M 4/19 | Column-family stores: Cassandra (cont.) | Cassandra materials (as needed)
W 4/21 | Document stores: JSON and MongoDB | Ch. 9 NoSQL Distilled (old!)
M 4/26 | Document stores: MongoDB (cont.) | MongoDB materials (as needed)
W 4/28 | NoSQL DB design principles | Ch. 3 NoSQL Distilled
M 5/03 | Midterm Exam (Checkpoint) | 3:30-4:50 PM -- be there!!!
W 5/05 | Document DBs: Couchbase Server & N1QL | CB Analytics paper
M 5/10 | Document DBs: Couchbase Server (cont.) | CB materials (as needed)
W 5/12 | Graph DBs: Graph modeling & Neo4J | Ch. 11 NoSQL Distilled (old!)
M 5/17 | Graph DBs: Neo4J (cont.) | Neo4J materials (as needed)
W 5/19 | Big Data Analytics: Google, MapReduce, HDFS | Big Data Platforms paper (skim)
M 5/24 | Big Data Analytics: Spark & SparkSQL | Spark Overview paper (skim)
W 5/26 | Big Data Analytics: Spark & DataFrames | Databricks and Spark materials (as needed)
W 6/02 | Data Stream Systems: Spark Structured Streaming | Databricks and Spark materials (as needed)
Overflow | Column stores, search, message streams, timeseries/IoT, ... | Google to find out more...
M 6/07 | Final Exam (Cumulative) | 4:00-6:00 PM -- be there!!!

Homework Plan

HW | Available | Due Date/Time | HW Topic | Setup Info | Details | Solution
HW1 | Mo 4/05 | Th 4/15 (11:59 PM) | SQL Review | HW1 Setup | HW1 Details Template | HW1 Solution
HW2 | Th 4/15 | Mo 4/26 (11:59 PM) | Cassandra | HW2 Setup | HW2 Details Template | HW2 Solution
HW3 | Mo 4/26 | Th 5/06 (11:59 PM) | MongoDB | HW3 Setup | HW3 Details Template | HW3 Solution
HW4 | Th 5/06 | Mo 5/17 (11:59 PM) | CB Server | HW4 Setup | HW4 Details Template | HW4 Solution
HW5 | Mo 5/17 | Th 5/27 (11:59 PM) | Neo4J | HW5 Setup | HW5 Details Template | HW5 Solution
HW6 | Th 5/27 | Mo 6/07 (11:59 PM) | Spark | HW6 Setup | HW6 Details Template | HW6 Solution

(Note: The links above will all be broken by design until each assignment's "Available" date is reached.)

Readings and References

  1. P. Sadalage and M. Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (1st Edition). Addison-Wesley Professional, 2012. (https://martinfowler.com/books/nosql.html)
  2. Introducing JSON, https://www.json.org/json-en.html.
  3. A. Silberschatz, H. Korth, and S. Sudarshan. Database System Concepts, Ch. 5.5. McGraw-Hill, 2020. (Advanced Aggregation in SQL)
  4. D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. CACM 35, 6, June 1992. (Parallel RDBMS)
  5. A. Davoudian, L. Chen, and M. Liu. 2018. A Survey on NoSQL Stores. ACM Comp. Surv. 51, 2, April 2018. (NoSQL Survey) (*Optional*)
  6. D. Terry. Replicated Data Consistency Explained Through Baseball. CACM 58, 12, Dec. 2013. (Baseball Paper)
  7. AsterixDB 101: An ADM and SQL++ Primer, https://ci.apache.org/projects/asterixdb/sqlpp/primer-sqlpp.html.
  8. DataStax Fundamentals, https://www.datastax.com/learn/cassandra-fundamentals.
  9. DataStax Documentation, https://docs.datastax.com/.
  10. MongoDB Manual, https://docs.mongodb.com/manual/.
  11. MongoDB University, https://university.mongodb.com/.
  12. Couchbase Docs, https://docs.couchbase.com/.
  13. Couchbase Academy, https://learn.couchbase.com/.
  14. D. Chamberlin, SQL++ for SQL Users: A Tutorial, Couchbase, Inc., 2018. (SQL++ https://asterixdb.apache.org/files/SQL_Book.pdf)
  15. M. Hubail et al. Couchbase Analytics: NoETL for Scalable NoSQL Data Analysis. PVLDB 12, 12, Aug. 2019. (CB Analytics)
  16. Graph Academy, https://neo4j.com/graphacademy/.
  17. Neo4J Documentation, https://neo4j.com/docs/.
  18. S. Babu and H. Herodotou. Massively Parallel Databases and MapReduce Systems. Foundations and Trends in Databases 5, 1, Nov. 2013. (Big Data Platforms) (*Optional*)
  19. M. Zaharia et al. Apache Spark: A Unified Engine for Big Data Processing. CACM 59, 11, Nov. 2016. (Spark Overview)
  20. Databricks Documentation, https://docs.databricks.com/.
  21. Databricks Academy, https://academy.databricks.com/.
  22. Databricks, Inc., The Data Engineer’s Guide to Apache Spark, 2019.
  23. Spark Structured Streaming Programming Guide, https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
  24. D. Abadi, Consistency Tradeoffs in Modern Distributed Database System Design. IEEE Computer 45, 2, Feb. 2012. (https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf)
  25. Simplilearn, Introduction to Kafka, https://youtu.be/U4y2R3v9tlY?list=WL/. (*Optional*)
  26. Elasticsearch: Getting Started, https://www.elastic.co/webinars/getting-started-elasticsearch/. (*Optional*)
  27. Elasticsearch Reference, https://www.elastic.co/guide/en/elasticsearch/reference/current/. (*Optional*)
  28. Inside the InfluxDB Storage Engine, CMU DB Group Time Series DB Lectures, https://youtu.be/2SUBRE6wGiA?list=WL/. (*Optional*)

Policy Stuff

Grading Criteria and Workload

This course will include some quizzes, one midterm exam, one final exam, and a series of homework assignments appearing every 10 days or so that involve the hands-on use of the surveyed data management technologies. Because this is classified as a project course, grading will be HW-biased and will be based on a combination of the quizzes and two exams (5% + 10% + 20% = 35%), homework assignments (60%), and in-class participation (5%). The workload will likely be comparable to that of CS122A, when this instructor teaches it, or at least that is the intent. (Note: Since this is a first official offering of CS122D, the instructor reserves the right to modify the exact weights once we see how the assignments and exams/quizzes are working. You may even be asked for input on that as the quarter progresses!)
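As a rough illustration of how these weights combine, here is a minimal Python sketch. The mapping of 5%, 10%, and 20% to the quizzes, midterm, and final (in that order) is an assumption based on the order in which they are listed; the names `WEIGHTS_PERCENT` and `course_score` are hypothetical, and the instructor may still adjust the weights as noted above.

```python
# Assumed weights, in percent; the quiz/midterm/final split is inferred from
# the order "5% + 10% + 20%" and is not stated explicitly in the policy.
WEIGHTS_PERCENT = {
    "quizzes": 5,
    "midterm": 10,
    "final": 20,
    "homework": 60,
    "participation": 5,
}


def course_score(scores):
    """Combine per-category scores (each on a 0-100 scale) into a 0-100 total."""
    # Every category must be present so that no weight is silently dropped.
    missing = set(WEIGHTS_PERCENT) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    weighted_sum = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return weighted_sum / 100


example = {"quizzes": 80, "midterm": 70, "final": 90, "homework": 95, "participation": 100}
print(course_score(example))  # prints 91.0
```

Note how the homework category dominates: in this example it contributes 57 of the 91 points.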

Homework and Participation

Homework assignments are to be turned in by the assigned due dates/times. Details of how to turn in a given assignment will be included in each assignment's handout. The goal for the course will be for students to leave with hands-on experience with each of a variety of post-SQL data management technologies, and the homework assignments will be designed for that purpose.

In terms of the portion of the grade based on class participation, your participation points will be based on participating in some "How are things going?" surveys as the quarter progresses and on your participation on Piazza.

Collaboration Policy

Homework assignments are to be turned in individually, but you are encouraged to pair up with a fellow student -- someone to serve as your brainstorming buddy -- for the duration of the quarter. It is okay to discuss assignments with other peers as well, e.g., to clarify the interpretation of a question or to compare thoughts on very rough approaches, but discussions of the details of your work are to stay within your team of two. You should pick this brainstorming partner at the start of the term and then stay with that same person for the remainder of the quarter. The quizzes and exams are to be done solo, but will be open book, open notes, and open manual. (That's how you will work in the "real world", and prevention of such work during exams is impractical IMO in pandemic times.)

Academic Honesty Policy

Cheating is an area where the instructor for this course has zero patience or sympathy. You are at UCI to learn, and in this class to learn its material, not to get a grade; cheating defeats those purposes. All students will be bound by the UCI and ICS Academic Honesty policies (see https://aisc.uci.edu/policies/academic-integrity/index.php and http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty). Any student found to be involved in cheating or aiding others in cheating may be academically prosecuted to the maximum extent possible: that means you could even fail the course in its entirety. Just say no to cheating!!!

That being said, the instructor has been pleasantly surprised by upper division UCI students' behavior in pandemic times, at least in CS122A. As a result, he will largely be trusting students to behave with integrity and to be honest and self-policing. Again, you are presumably in this class because of the knowledge you want to gain -- we will offer it, as well as assessments of your performance -- so utilize them fully to learn! There is nothing of value to be gained by cheating, only a lost learning opportunity.

Grade Change Policy

For all of the graded assignments as well as the midterm exam, if you disagree with the grading, you may discuss your concerns with the relevant instructor (professor or TA) within two weeks after the graded work is returned. After that, all grades will be considered final. Gradescope's regrade request feature should be used to initiate activity in such cases.

Late Homework Policy

Due dates will be indicated on all HW assignments. Assignments will be accepted up to two days (48 hours) after the due date, but you will lose 5 points per day (out of a 100-point total) for lateness. We will not accept assignments after that time; we need to keep moving in order to make it to the end of the planned material! "Stuff happens" sometimes, so you should anticipate that reality and avoid working until the very last minute. Assignments MUST be turned in per the assignment's instructions by the indicated deadline in order to get full credit.
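The late policy above can be sketched in a few lines of Python. This is only an illustration: the function name `late_penalty` is hypothetical, and the sketch assumes that any started late day counts as a full day (the policy says "5 points per day" without spelling out partial days).

```python
import math
from datetime import datetime, timedelta


def late_penalty(due, submitted, points_per_day=5, max_days_late=2):
    """Return the points deducted, or None if the submission is too late to be accepted."""
    if submitted <= due:
        return 0
    # Assumption: a partially used late day is charged as a full day.
    days_late = math.ceil((submitted - due) / timedelta(days=1))
    if days_late > max_days_late:
        return None  # more than 48 hours late: not accepted
    return points_per_day * days_late


due = datetime(2021, 4, 15, 23, 59)  # HW1 due date/time from the Homework Plan
print(late_penalty(due, due + timedelta(hours=30)))  # prints 10
```

Under these assumptions, a submission 30 hours late loses 10 of 100 points, and anything more than 48 hours late is not accepted at all.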

To ensure that you are able to do the HW assignments on time, you are strongly encouraged to have each HW assignment's setup instructions completed no later than 7 days before its due date! You will be wrestling with a number of software packages and cloud accounts, and it will simply be "your bad" if you wait longer than that to ensure that you have the infrastructure for each HW assignment in place! (The TAs will be available to help with setup problems early on, but they will not be expected to rescue students with last-minute setup glitches.)

Exam Attendance Policy

The exam dates have been provided (see above!) at the start of the term. These dates will not be flexible and makeup exams will not be offered, so you should avoid scheduling interviews, mini-vacations, or any other activities in ways that interfere with exam-taking. All students will be taking the exams on the same dates in the same time windows in order to ensure that we can give, grade, and fairly curve the entire class on each exam.

Discussion Forums for All Things CS122D

We will use Piazza (heavily!) for online class discussions. Piazza aims to get you the help you need fast and efficiently -- from classmates, the TAs, and the instructor. Instead of emailing lecture or HW questions to the teaching staff, post them on Piazza. Piazza participation will contribute to your overall grade -- and misbehavior on Piazza will lead to a loss of points. You can sign up for Piazza at http://piazza.com/uci/spring2021/cs122d. (You should sign up and monitor Piazza even if you are on the waiting list for the class -- in fact, if you are on the waiting list, you should do all of the work as though you are in the class in order to be in a position to be admitted if some students drop.)

When using Piazza, there are a few things to keep in mind. First, despite the national trend towards the acceptance of nastiness in social media, hurtful Piazza behavior will not be tolerated -- be kind to your classmates. Second, Piazza is a great resource for asking and answering questions when used thoughtfully. You are to avoid re-asking questions that have already been asked and answered -- so you are responsible for reading others' questions and not re-asking them. With many eyes on each message, lazy question-asking on Piazza is costly, inconsiderate, and can lead to a loss of class participation points. Third, Piazza is not a place to discuss the details of your answers to HW problems -- so it is not a place to post, request, or compare answers until after the due date plus 48 hours have passed. (Doing so risks violating the Academic Honesty Policy, and you have your brainstorming buddy for such discussions.) Last but not least, Piazza is a great place to post any additional info that you happen to come across on the Web that you think other students may find helpful -- so feel free to post any gems that you find in your web travels for this class.

Lecture and Discussion Videos

Lecture videos (and discussion session recordings as well) can be accessed through Canvas or by going to the Spring 2021 CS122D class video channel on Yuja at https://uci.yuja.com/V/PlayList?node=9933576&a=1583628376&autoplay=1. The lecture videos follow the schedule in the Lecture Plan table above, and the channel link will take you to the latest video (with the whole collection of videos being available in the panel on the right-hand side of the screen).

Last modified on May 27, 2021 8:41:30 PM
