STATS 170AB Winter/Spring 2019: Project in Data Science
Instructors
Mike Carey
E-Mail: mjcarey@ics.uci.edu
Office: DBH 2091
Vladimir Minin
E-Mail: vminin@uci.edu
Office: DBH 2068
We encourage you to use the Piazza site http://piazza.com/uci/winter2019/stats170a/home for any questions, comments, etc., that you have outside of class hours. We are likely to respond more quickly to questions on Piazza than to questions sent by email.
Reader
Wail Alkowaileet
E-Mail: walkowai@ics.uci.edu
Office: DBH 2059
Office hours: Friday 12-2pm (Also available after each class)
Meeting Times & Places
Time: Mon/Wed 12:30-1:50 PM
Place: ICS 180
Course Overview
This two-course sequence is intended to be the "grand finale" for Data Science majors. Its goal is to tie together many of the topics that are independently covered in the first 3+ years of Data Science requirements and electives; it also aims to fill in some of the potential gaps required to solve an end-to-end problem. In addition to a brief review of some of the required skills, the course will cover problem definition and analysis, data representation, algorithm selection, solution validation, and results presentation. Students will do projects, possibly in small teams, while the lecture periods will cover analysis alternatives, project planning, and data analysis issues. The first quarter will emphasize tools and techniques, approach selection, project planning, and experimental design. The second quarter will focus on project execution, data analysis, and presentation of results. Project planning and execution will include the setting of weekly or bi-weekly milestones as well as weekly progress tracking and reporting. The final course deliverables will include both a written project report and an oral presentation of the project results.
Prerequisites
Senior standing and completion of the following courses: STATS 68, STATS 111, IN4MATX 43, COMPSCI 122A, COMPSCI 161, and COMPSCI 178.
Textbooks
Data Wrangling with Python: Tips and Tools to Make Your Life Easier
By Jacqueline Kazil and Katharine Jarmul, O'Reilly Media, 2016.
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd Edition)
By William McKinney, O'Reilly Media, 2017.
Principles of Data Wrangling: Practical Techniques for Data Preparation
By Joseph Hellerstein, Jeffrey Heer, Tye Rattenbury, Sean Kandel, and Connor Carreras, O'Reilly Media, 2017.
Mining the Social Web (2nd Edition, Chapters 1 and 9 in particular)
By Matthew Russell, O'Reilly Media, 2014.
Hands-On Machine Learning with Scikit-Learn and TensorFlow (Chapters 1 through 4 in particular)
By Aurelien Geron, O'Reilly Media, 2017.
(Note: All of these titles are available for free online via the UCI Library's subscription to Safari Books Online (http://proquest.safaribooksonline.com/).)
"Fundamentals of Data Visualization" by Claus O. Wilke, 2019. https://serialmentor.com/dataviz/
Project Resources
Predicting Depression from Social Media Updates -- A project sample from W18.
Project Ideas v2 -- A sample of project ideas.
https://www.data.gov/ -- Root page for a variety of US government data.
Final Presentation Guidelines -- Guidelines for your presentations
Final Report Template -- What to include in your final report
Other Resources
https://chrisalbon.com/#articles -- A great collection of relevant how-to and tutorial articles.
AsterixDB Overview -- An overview of Apache AsterixDB (a homegrown NoSQL DB system: http://asterixdb.apache.org/).
SQL++ for SQL Users: A Tutorial -- AsterixDB query language guide book by SQL inventor Don Chamberlin.
https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets -- Twitter search API documentation.
Notes on machine learning from Prof Alex Ihler's CS 178 course: http://sli.ics.uci.edu/Classes/2016W-178.
PostgreSQL Installation Guide using Docker -- A guide to installing containerized PostgreSQL 11.1
Wisconsin Sailing Club loadable dataset for PostgreSQL -- PostgreSQL loadable dataset that appeared in Lecture 2
Anaconda3 Installation Guide using Docker -- A guide to installing containerized Anaconda3 + Jupyter
(And the Attachments section at the bottom of this page in general...)
Topic Coverage and Work Schedule
Winter Quarter Syllabus
| Date | Lecture Topic | Relevant Resources |
| Mon, 1/07 | Course overview and plan | This web page & Slides from Lecture 1 |
| Wed, 1/09 | Dusting off your databases | Your CS122a textbook & Slides from Lecture 2 |
| Mon, 1/14 | Data wrangling concepts and issues | Hellerstein et al book & Slides from Lecture 3 |
| Wed, 1/16 | Wrangling with Pandas and Dataframes I | McKinney book (Ch. 4, 5) & Notebook from Lecture 4 |
| Mon, 1/21 | (No Class) | (No Class) |
| Wed, 1/23 | Wrangling with Pandas and Dataframes II | McKinney book (Ch. 7, 8, 10 ) & Notebook from Lecture 5 |
| Mon, 1/28 | XML, JSON, Twitter, and Tweepy | McKinney book (Ch. 6) & Twitter Search API docs & PostgreSQL Notebook from Lecture 6 & Twitter Notebook from Lecture 6 |
| Wed, 1/30 | Semistructured data and SQL vs. NoSQL databases | AsterixDB overview paper & Chamberlin book & Slides from Lecture 7 |
| Mon, 2/04 | Exploratory data analysis and data visualization | Chapters 1-12 of Fundamentals of Data Visualization & Slides from Lecture 8 |
| Wed, 2/06 | Exploratory data analysis and data visualization | Chapters 13-26 of Fundamentals of Data Visualization & Slides from Lecture 9 |
| Mon, 2/11 | Clustering | ISLR_unsupervized_learning.pdf & clustering_demo.ipynb |
| Wed, 2/13 | Clustering and PCA | ISLR_unsupervized_learning.pdf & pca_demo.ipynb |
| Mon, 2/18 | (No Class) | (No Class) |
| Wed, 2/20 | Supervised learning and regression | ISLR_regression_classification.pdf |
| Mon, 2/25 | Resampling methods | ISLR_resampling.pdf |
| Wed, 2/27 | Project idea meetings | |
| Mon, 3/04 | Project planning meetings | |
| Wed, 3/06 | Project planning meetings (cont.) | |
| Mon, 3/11 | Oral project proposal meetings | |
| Wed, 3/13 | Oral project proposal presentations | |
Spring Quarter Timetable
| Week | Monday | Wednesday |
| 1, April 1 | Short intro lecture | Joint office hours (review of project feedback), DBH 2091 |
| 2, April 8 | Project Kick-starter Day | Guest speaker 1 |
| 3, April 15 | Student progress presentations | Student progress presentations |
| 4, April 22 | Office hours, DBH 2068/2091 | Office hours, DBH 2068/2091 |
| 5, April 29 | Office hours, DBH 2068/2091 | Office hours, DBH 2068/2091 |
| 6, May 6 | Student progress presentations | Student progress presentations |
| 7, May 13 | Office hours, DBH 2068/2091 | Office hours, DBH 2068/2091 |
| 8, May 20 | Student progress presentations | Student progress presentations |
| 9, May 27 | Memorial Day, no class | Guest speaker 2 |
| 10, June 3 | Final student presentations I | Final student presentations II |
| 11, June 10 | Final project reports due (11:45 PM) | Final student presentations III (from 4:00-6:00pm) |
Assignments, Projects, and Grading
Winter Grading Criteria (for 170A)
Homework: 40%
Project proposal: 50%
Class participation: 10%
Late homeworks will not be graded - please submit whatever you have completed by the homework deadline. Each homework's due date is listed in the table of homework assignments at the end of this page, and each homework will be due in EEE by 11:45 PM on the indicated date.
Spring Grading Criteria (for 170B)
Attendance: 10%
Weekly status updates: 18%
In-class presentations: 40%
Final written report: 32%
A single grade will be assigned at the end of Spring quarter for this class, with 50% weight on the Winter grade and 50% on the Spring grade.
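As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and that the combined score is a plain weighted average; the function and dictionary names are hypothetical, and the instructors' actual grade computation may differ.

```python
# Component weights as listed in the Winter and Spring grading criteria.
WINTER_WEIGHTS = {
    "homework": 0.40,
    "project_proposal": 0.50,
    "participation": 0.10,
}
SPRING_WEIGHTS = {
    "attendance": 0.10,
    "status_updates": 0.18,
    "presentations": 0.40,
    "final_report": 0.32,
}


def weighted_score(scores, weights):
    """Return the weighted average of per-component scores (assumed 0-100)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


def overall_score(winter_scores, spring_scores):
    """Combine the two quarters with 50% weight each."""
    winter = weighted_score(winter_scores, WINTER_WEIGHTS)
    spring = weighted_score(spring_scores, SPRING_WEIGHTS)
    return 0.5 * winter + 0.5 * spring


if __name__ == "__main__":
    # Made-up scores, used only to exercise the functions.
    winter = {"homework": 90, "project_proposal": 80, "participation": 100}
    spring = {
        "attendance": 100,
        "status_updates": 100,
        "presentations": 85,
        "final_report": 90,
    }
    print(round(overall_score(winter, spring), 1))
```

The sample scores are invented for the example and do not come from the course.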
Winter Homework and Class Participation
The first quarter will involve a mix of lectures and homework assignments intended to dust off, sharpen, or introduce the skills, tools, and techniques that you will need to successfully execute your course project. Since you are now seniors, and this is your Data Science grand finale, individual initiative and engagement will be expected of all students. The homework assignments may be "looser" than what you are used to -- you will have to seek out some of the information needed to complete the assignments and to make choices about how to attack some of the challenges -- i.e., spoon feeding will be kept to a minimum. The lectures will aim for interactivity, and class participation will be encouraged (and in fact expected).
Spring Project Tracking and Reporting
The second quarter will be a time to focus on your projects. In terms of grading, 10% will be given for attending the Invited DS Case Study talks every other week (5%) and making at least a brief appearance at office hours during the off weeks (5%). 18% of your grade will be based on weekly progress reporting via your "project diary" in Google Docs; you will get 2 points/week (first 9 weeks) for your last week/this week progress bullets. 40% will be for the in-class presentations that you'll do (oral progress reports/demos plus a final oral presentation). Last but not least, of course, is your final written report, which will account for 32% of your Spring grade.
Academic Honesty Policy
Students will be expected to adhere to the UCI and ICS Academic Honesty policies (see http://www.editor.uci.edu/catalogue/appx/appx.2.htm#academic and http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty to read their details). Any student found to somehow be involved in cheating or aiding others in doing so will be academically prosecuted to the maximum extent possible: that means that you could fail this course in its entirety. (Ask around - it's happened.) Just say no to cheating!
Software Platform(s)
This course will make use of the Python ecosystem, including the Python language, various Python packages/tools for data analysis and machine learning, Jupyter notebooks, and open source databases (PostgreSQL). For convenience and package completeness, students are advised to download the most recent Anaconda distribution of Python and friends (https://www.anaconda.com/download/) and the most recent EDB distribution of PostgreSQL (https://www.enterprisedb.com/downloads/postgres-postgresql-downloads).
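After installing, one quick way to confirm that the Python side of the platform is in place is to check which packages can be imported. The sketch below is only an illustration: the package list is an assumption based on the tools named on this page (pandas, NumPy, scikit-learn, Tweepy) plus psycopg2 as one common PostgreSQL driver, and the helper name is hypothetical.

```python
import importlib.util

# Assumed package list; adjust it to match what your assignments actually need.
COURSE_PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "tweepy", "psycopg2"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(COURSE_PACKAGES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any package reported as missing can typically be added with conda or pip inside the Anaconda environment.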
Winter Assignment Due Dates & Links
(Be sure to monitor this space carefully as the days and weeks go by!)
| Assignment | Due Date (and Time) | Topic | Details |
| HW 1 | Wed, 1/16 (11:45 PM) | Schemas and SQL Refresher | HW1 (revised) |
| HW 2 | Wed, 1/23 (11:45 PM) | Data Wrangling Principles | HW2 (revised) |
| HW 3 | Wed, 1/30 (11:45 PM) | Data Wrangling with Python and Pandas | HW3 |
| HW 4 | Wed, 2/06 (11:45 PM) | Capturing and Wrangling Twitter Data | HW4 |
| HW 5 | Wed, 2/13 (11:45 PM) | Exploratory Data Analysis and Visualization | HW5 |
| HW 6 | Mon, 2/25 (11:45 PM) | Clustering and PCA | HW6 |
| HW 7 | Mon, 3/04 (11:45 PM) | Prediction and Resampling | HW7 (revised) |
| Proj | Mon, 3/11 (12:00 PM) | Written Project Proposal & Presentation Slides | Project Proposal |
| Proj | Mon, 3/11 (in class) | Oral Project Proposal | |
| Proj | Wed, 3/13 (in class) | Oral Project Proposal | |
| Proj | Fri, 3/22 (11:45 PM) | Revised Written Proposal | |
Attachments (41)
- AsterixDBOverview.pdf (286.9 KB) - added 7 years ago.
- lecture1_introduction.pdf (3.3 MB) - added 7 years ago. lecture 1
- HW1.pdf (113.3 KB) - added 7 years ago. HW1
- R_Installation_guide.pdf (169.2 KB) - added 7 years ago. PostgreSQL installation guide using Docker
- CS122A-In-A-Nutshell-W19.pdf (5.8 MB) - added 7 years ago. CS122A in a Nutshell lecture (Lecture #2)
- R_Hoofersdb.sql (2.7 KB) - added 7 years ago. Wisconsin Sailing Club Example
- HW1_revised.pdf (113.4 KB) - added 7 years ago. Revised HW1 Description
- STATS170-Data-Wrangling-Principles-W19.pdf (5.8 MB) - added 7 years ago. Data Wrangling Principles lecture (Lecture #3)
- HW2.pdf (129.3 KB) - added 7 years ago. HW2: added what files to focus on
- Lecture4-W19.ipynb (110.5 KB) - added 7 years ago. Jupyter notebook for Lecture 4
- Lecture5-W19.ipynb (166.7 KB) - added 7 years ago. Jupyter notebook for Lecture 5
- anaconda-starter-package.zip (32.5 KB) - added 7 years ago. Anaconda3 starter package
- R_Installation_guide_anaconda3.pdf (190.8 KB) - added 7 years ago. Anaconda3 installation guide using Docker
- HW3.pdf (130.4 KB) - added 7 years ago.
- Lecture6-PostgreSQL-W19.ipynb (8.7 KB) - added 7 years ago. Jupyter notebook for Lecture 6 - PostgreSQL part
- Lecture6-Twitter-W19.ipynb (6.0 KB) - added 7 years ago.
- hoofersdb.sql (1.4 KB) - added 7 years ago.
- HW4.pdf (137.6 KB) - added 7 years ago. HW4
- STATS170-BigNoSQLData-W19.pdf (4.0 MB) - added 7 years ago. Big NoSQL Data Lecture
- HW5.pdf (48.6 KB) - added 7 years ago. HW5
- lecture8_dataviz.pdf (5.4 MB) - added 7 years ago.
- lecture9_dataviz.pdf (5.2 MB) - added 7 years ago.
- ISLR_unsupervized_learning.pdf (906.1 KB) - added 7 years ago.
- clustering_demo.ipynb (914.0 KB) - added 7 years ago.
- pca_demo.ipynb (129.1 KB) - added 7 years ago.
- USArrests.csv (1.4 KB) - added 7 years ago.
- iris.csv (3.6 KB) - added 7 years ago.
- housing_data.csv (449.9 KB) - added 7 years ago.
- homework6.pdf (54.3 KB) - added 7 years ago.
- Sharon_Babu_Honors_thesis_2018.pdf (1.2 MB) - added 7 years ago. Sharon Babu's Project (W2019)
- ISLR_regression_classification.pdf (6.7 MB) - added 7 years ago.
- ISLR_resampling.pdf (732.3 KB) - added 7 years ago.
- homework7.pdf (64.6 KB) - added 7 years ago.
- Default.xlsx (451.5 KB) - added 7 years ago.
- Default.csv (346.8 KB) - added 7 years ago. Default Dataset
- project_proposal.docx (9.7 KB) - added 7 years ago.
- project_ideas.pdf (74.2 KB) - added 7 years ago. Project Ideas
- project_ideas_v2.pdf (83.9 KB) - added 7 years ago.
- Project_Kick-starter_day.pdf (96.0 KB) - added 7 years ago.
- Project Final Presentation (STATS170AB - Spring 2019).pdf (72.5 KB) - added 7 years ago.
- final_reprt_template.pdf (217.9 KB) - added 7 years ago. Final report template
