STAT 170AB Winter/Spring 2019: Project in Data Science
Instructors
Mike Carey
E-Mail: mjcarey@ics.uci.edu
Office: DBH 2091
Vladimir Minin
E-Mail: vminin@uci.edu
Office: DBH 2068
We encourage you to use the Piazza site http://piazza.com/uci/winter2019/stats170a/home for any questions, comments, etc. that you have outside of class hours. We are likely to respond more quickly to questions on Piazza than to questions sent by email.
Reader
Wail Alkowaileet
E-Mail: walkowai@ics.uci.edu
Office: DBH 2059
Meeting Times & Places
Time: Mon/Wed 12:30-1:50 PM
Place: ICS 180
Course Overview
This two-course sequence is intended to be the "grand finale" for Data Science majors. Its goal is to tie together many of the topics that are independently covered in the first 3+ years of Data Science requirements and electives; it also aims to fill in some of the potential gaps in the skills required to solve an end-to-end problem. In addition to a brief review of some of the required skills, the course will cover problem definition and analysis, data representation, algorithm selection, solution validation, and results presentation. Students will do projects, possibly in small teams, while the lecture periods will cover analysis alternatives, project planning, and data analysis issues. The first quarter will emphasize tools and techniques, approach selection, project planning, and experimental design. The second quarter will focus on project execution, data analysis, and presentation of results. Project planning and execution will include the setting of weekly or bi-weekly milestones as well as weekly progress tracking and reporting. The final course deliverables will include both a written project report and an oral presentation of the project results.
Prerequisites
Senior standing and completion of the following courses: STATS 68, STATS 111, IN4MATX 43, COMPSCI 122A, COMPSCI 161, and COMPSCI 178.
Textbooks
Data Wrangling with Python: Tips and Tools to Make Your Life Easier
By Jacqueline Kazil and Katharine Jarmul, O'Reilly Media, 2016.
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd Edition)
By William McKinney, O'Reilly Media, 2017.
Principles of Data Wrangling: Practical Techniques for Data Preparation
By Joseph Hellerstein, Jeffrey Heer, Tye Rattenbury, Sean Kandel, and Connor Carreras, O'Reilly Media, 2017.
Mining the Social Web (2nd Edition, Chapters 1 and 9 in particular)
By Matthew Russell, O'Reilly Media, 2014.
Hands-On Machine Learning with Scikit-Learn and TensorFlow (Chapters 1 through 4 in particular)
By Aurelien Geron, O'Reilly Media, 2017.
(Note: All of these titles are available for free online via the UCI Library's subscription to Safari Books Online, http://proquest.safaribooksonline.com/.)
Project Resources
Links to potential data sets:
Links to reference articles for projects:
Links to reference texts for projects:
Links to examples of software and demos:
Other Resources
https://chrisalbon.com/#articles -- A great collection of relevant how-to and tutorial articles.
AsterixDB Overview -- An overview of Apache AsterixDB (a homegrown NoSQL DB system: http://asterixdb.apache.org/).
SQL++ for SQL Users: A Tutorial -- AsterixDB query language guide book by SQL inventor Don Chamberlin.
https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets -- Twitter search API documentation.
Notes on machine learning from Prof Alex Ihler's CS 178 course: http://sli.ics.uci.edu/Classes/2016W-178.
Topic Coverage and Work Schedule
Winter Quarter Syllabus
| Date | Lecture Topic | Relevant Resources |
| Mon, 1/07 | Course overview and plan | Slides from Lecture 1 |
| Wed, 1/09 | Dusting off your databases | Your CS122a textbook |
| Mon, 1/14 | Data wrangling concepts and issues | Hellerstein et al book |
| Wed, 1/16 | Wrangling with Pandas and Dataframes I | McKinney book (Ch. 4, 5) |
| Mon, 1/21 | (No Class) | (No Class) |
| Wed, 1/23 | Wrangling with Pandas and Dataframes II | McKinney book (Ch. 7, 8, 10) |
| Mon, 1/28 | XML, JSON, Twitter, and Tweepy | McKinney book (Ch. 6) & Twitter Search API docs |
| Wed, 1/30 | Semistructured data and SQL vs. NoSQL databases | AsterixDB overview paper & Chamberlin book |
| Mon, 2/04 | Exploratory data analysis and data visualization | Class lecture slides |
| Wed, 2/06 | Cluster analysis algorithms | Unsupervised learning notes from 178 above |
| Mon, 2/11 | Predictive modeling: regression | Chapters 2 and 4 in Geron text |
| Wed, 2/13 | Predictive modeling: classification | Chapter 3 in Geron text |
| Mon, 2/18 | (No Class) | (No Class) |
| Wed, 2/20 | Text analysis methods | |
| Mon, 2/25 | Project planning, proposals and guidelines | |
| Wed, 2/27 | Project idea meetings | |
| Mon, 3/04 | Project planning meetings | |
| Wed, 3/06 | Project planning meetings (cont.) | |
| Mon, 3/11 | Oral project proposal meetings | |
| Wed, 3/13 | Oral project proposal presentations | |
Assignments, Projects, and Grading
Winter Grading Criteria (for 170A)
Homework: 40%
Project proposal: 50%
Class participation: 10%
Late homework will not be graded, so please submit whatever you have completed by the homework deadline. Each homework's due date is listed in the table of homework assignments at the end of this page, and each homework is due in EEE by 11:45 PM on the indicated date.
Spring Grading Criteria (for 170B)
Attendance: 10%
Weekly status updates: 18%
In-class presentations: 40%
Final written report: 32%
A single grade will be assigned at the end of Spring quarter for this class, with 50% weight on the Winter grade and 50% on the Spring grade.
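As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes every component is scored on a 0-100 scale; the function names and the sample scores are hypothetical and are not part of the official grading procedure.

```python
# Weights (in percent) as listed in the grading criteria above.
WINTER_WEIGHTS = {"homework": 40, "project_proposal": 50, "participation": 10}
SPRING_WEIGHTS = {
    "attendance": 10,
    "status_updates": 18,
    "presentations": 40,
    "final_report": 32,
}


def quarter_score(scores, weights):
    """Weighted average of component scores (each assumed to be 0-100)."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(scores[name] * weight for name, weight in weights.items()) / 100


def overall_score(winter, spring):
    """Combine the two quarters with equal (50/50) weight."""
    return 0.5 * winter + 0.5 * spring


# Hypothetical component scores, for illustration only.
winter = quarter_score(
    {"homework": 90, "project_proposal": 80, "participation": 100},
    WINTER_WEIGHTS,
)
spring = quarter_score(
    {"attendance": 100, "status_updates": 100, "presentations": 85, "final_report": 50},
    SPRING_WEIGHTS,
)
print(winter, spring, overall_score(winter, spring))
```

How the instructors map a combined score to a letter grade is not stated here, so the sketch stops at the numeric score.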
Winter Homework and Class Participation
The first quarter will involve a mix of lectures and homework assignments intended to dust off, sharpen, or introduce the skills, tools, and techniques that you will need to successfully execute your course project. Since you are now seniors, and this is your Data Science grand finale, individual initiative and engagement will be expected of all students. The homework assignments may be "looser" than what you are used to -- you will have to seek out some of the information needed to complete the assignments and to make choices about how to attack some of the challenges -- i.e., spoon feeding will be kept to a minimum. The lectures will aim for interactivity, and class participation will be encouraged (and in fact expected).
Spring Project Tracking and Reporting
The second quarter will be a time to focus on your projects. In terms of grading, 10% will be given for attending the Invited DS Case Study talks every other week (5%) and making at least a brief appearance at office hours during the off weeks (5%). 18% of your grade will be based on weekly progress reporting via your "project diary" in Google Docs; you will get 2 points per week (for the first 9 weeks) for your "last week/this week" progress bullets. 40% will be for the in-class presentations that you'll do (oral progress reports/demos plus a final oral presentation). Last but not least, of course, is your final written report, which will account for 32% of your Spring grade.
Academic Honesty Policy
Students will be expected to adhere to the UCI and ICS Academic Honesty policies (see http://www.editor.uci.edu/catalogue/appx/appx.2.htm#academic and http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty to read their details). Any student found to somehow be involved in cheating or aiding others in doing so will be academically prosecuted to the maximum extent possible: that means that you could fail this course in its entirety. (Ask around - it's happened.) Just say no to cheating!
Software Platform(s)
This course will make use of the Python ecosystem, including the Python language, various Python packages/tools for data analysis and machine learning, Jupyter notebooks, and open source databases (PostgreSQL). For convenience and package completeness, students are advised to download the most recent Anaconda distribution of Python and friends (https://www.anaconda.com/download/) and the most recent EDB distribution of PostgreSQL (https://www.enterprisedb.com/downloads/postgres-postgresql-downloads).
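After installing, one quick way to confirm that the main packages are visible to your Python interpreter is a check like the sketch below. The package list is an assumption based on the tools named on this page (pandas, NumPy, scikit-learn), and `psycopg2` is only one possible PostgreSQL driver; adjust the list to whatever your setup actually uses.

```python
from importlib.util import find_spec

# Assumed import names; this list is illustrative, not a course requirement.
# psycopg2 is a guess at the PostgreSQL driver and may differ in your setup.
ASSUMED_PACKAGES = ["pandas", "numpy", "sklearn", "psycopg2"]


def missing_packages(names):
    """Return the names that cannot be located for import in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only confirms that each package can be found, not that it is the most recent version or that a PostgreSQL server is running.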
Winter Assignment Due Dates & Links
(Be sure to monitor this space carefully as the days and weeks go by!)
| Assignment | Due Date (and Time) | Topic | Details |
| HW 1 | Wed, 1/16 (11:45 PM) | Schemas and SQL Refresher | HW1 |
| HW 2 | Wed, 1/23 (11:45 PM) | Data Wrangling Principles | HW2 |
| HW 3 | Wed, 1/30 (11:45 PM) | Data Wrangling with Python and Pandas | HW3 |
| HW 4 | Wed, 2/06 (11:45 PM) | Capturing and Wrangling Twitter Data | HW4 |
| HW 5 | Wed, 2/13 (11:45 PM) | Exploratory Data Analysis and Visualization | HW5 |
| HW 6 | Wed, 2/20 (11:45 PM) | Predictive Regression Models | HW6 |
| HW 7 | Wed, 2/27 (11:45 PM) | Classification with Text Data | HW7 |
| Proj 1 | Wed, 3/06 (11:45 PM) | Written Project Proposal, Take 1 | |
| Proj 2 | Mon, 3/11 (11:45 PM) | Written Project Proposal, Take 2 | |
| Proj 3 | Wed, 3/13 (in class) | Oral Project Proposal | |
Attachments (41)
- AsterixDBOverview.pdf (286.9 KB)
- lecture1_introduction.pdf (3.3 MB): lecture 1
- HW1.pdf (113.3 KB): HW1
- R_Installation_guide.pdf (169.2 KB): PostgreSQL installation guide using Docker
- CS122A-In-A-Nutshell-W19.pdf (5.8 MB): CS122A in a Nutshell lecture (Lecture #2)
- R_Hoofersdb.sql (2.7 KB): Wisconsin Sailing Club Example
- HW1_revised.pdf (113.4 KB): Revised HW1 Description
- STATS170-Data-Wrangling-Principles-W19.pdf (5.8 MB): Data Wrangling Principles lecture (Lecture #3)
- HW2.pdf (129.3 KB): HW2: added what files to focus on
- Lecture4-W19.ipynb (110.5 KB): Jupyter notebook for Lecture 4
- Lecture5-W19.ipynb (166.7 KB): Jupyter notebook for Lecture 5
- anaconda-starter-package.zip (32.5 KB): Anaconda3 starter package
- R_Installation_guide_anaconda3.pdf (190.8 KB): Anaconda3 installation guide using Docker
- HW3.pdf (130.4 KB)
- Lecture6-PostgreSQL-W19.ipynb (8.7 KB): Jupyter notebook for Lecture 6 - PostgreSQL part
- Lecture6-Twitter-W19.ipynb (6.0 KB)
- hoofersdb.sql (1.4 KB)
- HW4.pdf (137.6 KB): HW4
- STATS170-BigNoSQLData-W19.pdf (4.0 MB): Big NoSQL Data Lecture
- HW5.pdf (48.6 KB): HW5
- lecture8_dataviz.pdf (5.4 MB)
- lecture9_dataviz.pdf (5.2 MB)
- ISLR_unsupervized_learning.pdf (906.1 KB)
- clustering_demo.ipynb (914.0 KB)
- pca_demo.ipynb (129.1 KB)
- USArrests.csv (1.4 KB)
- iris.csv (3.6 KB)
- housing_data.csv (449.9 KB)
- homework6.pdf (54.3 KB)
- Sharon_Babu_Honors_thesis_2018.pdf (1.2 MB): Sharon Babu's Project (W2019)
- ISLR_regression_classification.pdf (6.7 MB)
- ISLR_resampling.pdf (732.3 KB)
- homework7.pdf (64.6 KB)
- Default.xlsx (451.5 KB)
- Default.csv (346.8 KB): Default Dataset
- project_proposal.docx (9.7 KB)
- project_ideas.pdf (74.2 KB): Project Ideas
- project_ideas_v2.pdf (83.9 KB)
- Project_Kick-starter_day.pdf (96.0 KB)
- Project Final Presentation (STATS170AB - Spring 2019).pdf (72.5 KB)
- final_reprt_template.pdf (217.9 KB): Final report template
