
STAT 170AB Winter/Spring 2019: Project in Data Science

Instructors

Mike Carey
E-Mail: mjcarey@ics.uci.edu
Office: DBH 2091

Vladimir Minin
E-Mail: vminin@uci.edu
Office: DBH 2068

We encourage you to use the Piazza site (http://piazza.com/uci/winter2019/stats170a/home) for any questions or comments that you have outside of class hours. We are likely to respond more quickly to questions on Piazza than to email.

Reader

Wail Alkowaileet
E-Mail: walkowai@ics.uci.edu
Office: DBH 2059
Office hours: Friday 12-2 PM (also available after each class)

Meeting Times & Places

Time: Mon/Wed 12:30-1:50 PM
Place: ICS 180


Course Overview

This two-course sequence is intended to be the "grand finale" for Data Science majors. Its goal is to tie together many of the topics that are independently covered in the first 3+ years of Data Science requirements and electives; it also aims to fill in some of the potential gaps in what is required to solve an end-to-end problem. In addition to a brief review of some of the required skills, the course will cover problem definition and analysis, data representation, algorithm selection, solution validation, and results presentation. Students will do projects, possibly in small teams, while the lecture periods will cover analysis alternatives, project planning, and data analysis issues. The first quarter will emphasize tools and techniques, approach selection, project planning, and experimental design. The second quarter will focus on project execution, data analysis, and presentation of results. Project planning and execution will include the setting of weekly or bi-weekly milestones as well as weekly progress tracking and reporting. The final course deliverables will include both a written project report and an oral presentation of the project results.

Prerequisites

Senior standing and completion of the following courses: STATS 68, STATS 111, IN4MATX 43, COMPSCI 122A, COMPSCI 161, and COMPSCI 178.

Textbooks

Data Wrangling with Python: Tips and Tools to Make Your Life Easier
By Jacqueline Kazil and Katharine Jarmul, O'Reilly Media, 2016.

Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd Edition)
By William McKinney, O'Reilly Media, 2017.

Principles of Data Wrangling: Practical Techniques for Data Preparation
By Joseph Hellerstein, Jeffrey Heer, Tye Rattenbury, Sean Kandel, and Connor Carreras, O'Reilly Media, 2017.

Mining the Social Web (2nd Edition, Chapters 1 and 9 in particular)
By Matthew Russell, O'Reilly Media, 2014.

Hands-On Machine Learning with Scikit-Learn and TensorFlow (Chapters 1 through 4 in particular)
By Aurelien Geron, O'Reilly Media, 2017.

(Note: All of these titles are available for free online via the UCI Library's subscription to Safari Books Online, http://proquest.safaribooksonline.com/.)

"Fundamentals of Data Visualization" by Claus O. Wilke, 2019. https://serialmentor.com/dataviz/

Project Resources

Predicting Depression from Social Media Updates -- A project sample from W18.

Project Ideas v2 -- A sample of project ideas.

https://www.data.gov/ -- Root page for a variety of US government data.

Final Presentation Guidelines -- Guidelines for your presentations

Final Report Template -- What to include in your final report

Other Resources

https://chrisalbon.com/#articles -- A great collection of relevant how-to and tutorial articles.

AsterixDB Overview -- An overview of Apache AsterixDB (a homegrown NoSQL DB system: http://asterixdb.apache.org/).

SQL++ for SQL Users: A Tutorial -- AsterixDB query language guide book by SQL inventor Don Chamberlin.

https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets -- Twitter search API documentation.

Notes on machine learning from Prof Alex Ihler's CS 178 course: http://sli.ics.uci.edu/Classes/2016W-178.

PostgreSQL Installation Guide using Docker -- A guide to installing containerized PostgreSQL 11.1

Wisconsin Sailing Club loadable dataset for PostgreSQL -- PostgreSQL loadable dataset that appeared in Lecture 2

Anaconda3 Installation Guide using Docker -- A guide to installing containerized Anaconda3 + Jupyter

(And the Attachments section at the bottom of this page in general...)


Topic Coverage and Work Schedule

Winter Quarter Syllabus

Date | Lecture Topic | Relevant Resources
Mon, 1/07 | Course overview and plan | This web page & Slides from Lecture 1
Wed, 1/09 | Dusting off your databases | Your CS122a textbook & Slides from Lecture 2
Mon, 1/14 | Data wrangling concepts and issues | Hellerstein et al book & Slides from Lecture 3
Wed, 1/16 | Wrangling with Pandas and Dataframes I | McKinney book (Ch. 4, 5) & Notebook from Lecture 4
Mon, 1/21 | (No Class) | (No Class)
Wed, 1/23 | Wrangling with Pandas and Dataframes II | McKinney book (Ch. 7, 8, 10) & Notebook from Lecture 5
Mon, 1/28 | XML, JSON, Twitter, and Tweepy | McKinney book (Ch. 6) & Twitter Search API docs & PostgreSQL Notebook from Lecture 5 & Twitter Notebook from Lecture 6
Wed, 1/30 | Semistructured data and SQL vs. NoSQL databases | AsterixDB overview paper & Chamberlin book & Slides from Lecture 7
Mon, 2/04 | Exploratory data analysis and data visualization | Chapters 1-12 of Fundamentals of Data Visualization & Slides from Lecture 8
Wed, 2/06 | Exploratory data analysis and data visualization | Chapters 13-26 of Fundamentals of Data Visualization & Slides from Lecture 9
Mon, 2/11 | Clustering | ISLR_unsupervized_learning.pdf & clustering_demo.ipynb
Wed, 2/13 | Clustering and PCA | ISLR_unsupervized_learning.pdf & pca_demo.ipynb
Mon, 2/18 | (No Class) | (No Class)
Wed, 2/20 | Supervised learning and regression | ISLR_regression_classification.pdf
Mon, 2/25 | Resampling methods | ISLR_resampling.pdf
Wed, 2/27 | Project idea meetings |
Mon, 3/04 | Project planning meetings |
Wed, 3/06 | Project planning meetings (cont.) |
Mon, 3/11 | Oral project proposal meetings |
Wed, 3/13 | Oral project proposal presentations |

Spring Quarter Timetable

Week | Monday | Wednesday
1, April 1 | Short intro lecture | Joint office hours (review of project feedback), DBH 2091
2, April 8 | Project Kick-starter Day | Guest speaker 1
3, April 15 | Student progress presentations | Student progress presentations
4, April 22 | Office hours, DBH 2068/2091 | Office hours, DBH 2068/2091
5, April 29 | Office hours, DBH 2068/2091 | Office hours, DBH 2068/2091
6, May 6 | Student progress presentations | Student progress presentations
7, May 13 | Office hours, DBH 2068/2091 | Office hours, DBH 2068/2091
8, May 20 | Student progress presentations | Student progress presentations
9, May 27 | Memorial Day, no class | Guest speaker 2
10, June 3 | Final student presentations I | Final student presentations II
11, June 10 | Final project reports due (11:45 PM) | Final student presentations III (4:00-6:00 PM)

Assignments, Projects, and Grading

Winter Grading Criteria (for 170A)

Homework: 40%
Project proposal: 50%
Class participation: 10%

Late homework will not be graded, so please submit whatever you have completed by the homework deadline. Each homework's due date is listed in the table of homework assignments at the end of this page, and each homework will be due in EEE by 11:45 PM on the indicated date.

Spring Grading Criteria (for 170B)

Attendance: 10%
Weekly status updates: 18%
In-class presentations: 40%
Final written report: 32%

A single grade will be assigned at the end of Spring quarter for this class, with 50% weight on the Winter grade and 50% on the Spring grade.
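
As a rough illustration of how these weights combine, the following Python sketch computes a single combined score from the component scores. The function names, the dictionary keys, and the 0-100 scale for each component are assumptions made for this example; how the instructors actually convert scores into a letter grade is not specified here.

```python
# Illustrative sketch only: the key names and the 0-100 scale are assumptions,
# not an official grading tool. The weights are the ones listed above.
WINTER_WEIGHTS = {
    "homework": 0.40,
    "project_proposal": 0.50,
    "participation": 0.10,
}
SPRING_WEIGHTS = {
    "attendance": 0.10,
    "status_updates": 0.18,
    "presentations": 0.40,
    "final_report": 0.32,
}


def weighted_score(scores, weights):
    """Return the weighted sum of component scores (each on a 0-100 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def combined_score(winter_scores, spring_scores):
    """Combine the two quarters with 50% weight on each."""
    winter = weighted_score(winter_scores, WINTER_WEIGHTS)
    spring = weighted_score(spring_scores, SPRING_WEIGHTS)
    return 0.5 * winter + 0.5 * spring


if __name__ == "__main__":
    # Made-up scores for a hypothetical student.
    winter = {"homework": 90, "project_proposal": 80, "participation": 100}
    spring = {"attendance": 100, "status_updates": 100,
              "presentations": 85, "final_report": 90}
    print(round(combined_score(winter, spring), 1))
```

Because each quarter counts for half of the final grade, a single Spring component such as the final written report contributes 0.5 x 32% = 16% of the overall grade.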

Winter Homework and Class Participation

The first quarter will involve a mix of lectures and homework assignments intended to dust off, sharpen, or introduce the skills, tools, and techniques that you will need to successfully execute your course project. Since you are now seniors, and this is your Data Science grand finale, individual initiative and engagement will be expected of all students. The homework assignments may be "looser" than what you are used to -- you will have to seek out some of the information needed to complete the assignments and to make choices about how to attack some of the challenges -- i.e., spoon feeding will be kept to a minimum. The lectures will aim for interactivity, and class participation will be encouraged (and in fact expected).

Spring Project Tracking and Reporting

The second quarter will be a time to focus on your projects. In terms of grading, 10% will be given for attending the Invited DS Case Study talks every other week (5%) and making at least a brief appearance at office hours during the off weeks (5%). 18% of your grade will be based on weekly progress reporting via your "project diary" in Google Docs; you will earn 2 points per week (for the first 9 weeks, 18 points in total) for your "last week/this week" progress bullets. 40% will be for the in-class presentations that you'll do (oral progress reports/demos plus a final oral presentation). Last but not least is your final written report, which will account for 32% of your Spring grade.

Academic Honesty Policy

Students will be expected to adhere to the UCI and ICS Academic Honesty policies (see http://www.editor.uci.edu/catalogue/appx/appx.2.htm#academic and http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty to read their details). Any student found to somehow be involved in cheating or aiding others in doing so will be academically prosecuted to the maximum extent possible: that means that you could fail this course in its entirety. (Ask around - it's happened.) Just say no to cheating!


Software Platform(s)

This course will make use of the Python ecosystem, including the Python language, various Python packages/tools for data analysis and machine learning, Jupyter notebooks, and open source databases (PostgreSQL). For convenience and package completeness, students are advised to download the most recent Anaconda distribution of Python and friends (https://www.anaconda.com/download/) and the most recent EDB distribution of PostgreSQL (https://www.enterprisedb.com/downloads/postgres-postgresql-downloads).
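
Once the tools are installed, a quick sanity check along the following lines can confirm that the Python packages you expect are visible to your interpreter. The package list is an assumption based on the topics in the syllabus (pandas, NumPy, scikit-learn, Tweepy), plus matplotlib and psycopg2 as commonly used plotting and PostgreSQL driver packages; it is not an official course requirement, so adjust it to what your project needs.

```python
import importlib.util

# Assumed package list for illustration; edit to match your own project.
PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "tweepy", "psycopg2"]


def check_packages(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

The check only looks up each module without importing it, so it runs quickly; a package reported as missing can usually be added with conda or pip inside the Anaconda environment.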


(Be sure to monitor this space carefully as the days and weeks go by!)

Assignment | Due Date (and Time) | Topic | Details
HW 1 | Wed, 1/16 (11:45 PM) | Schemas and SQL Refresher | HW1 (revised)
HW 2 | Wed, 1/23 (11:45 PM) | Data Wrangling Principles | HW2 (revised)
HW 3 | Wed, 1/30 (11:45 PM) | Data Wrangling with Python and Pandas | HW3
HW 4 | Wed, 2/06 (11:45 PM) | Capturing and Wrangling Twitter Data | HW4
HW 5 | Wed, 2/13 (11:45 PM) | Exploratory Data Analysis and Visualization | HW5
HW 6 | Mon, 2/25 (11:45 PM) | Clustering and PCA | HW6
HW 7 | Mon, 3/04 (11:45 PM) | Prediction and Resampling | HW7 (revised)
Proj | Mon, 3/11 (12:00 PM) | Written Project Proposal & Presentation Slides | Project Proposal
Proj | Mon, 3/11 (in class) | Oral Project Proposal |
Proj | Wed, 3/13 (in class) | Oral Project Proposal |
Proj | Fri, 3/22 (11:45 PM) | Revised Written Proposal |
Last modified on Jun 1, 2019, 9:07:12 AM

Attachments (41)
