CS221 Spring 2019: Information Retrieval
Time: Tu/Th, 2:00 - 3:20 pm (check the UCI Calendar for holiday information)
Location: PCB 1200.
| Staff | Office Hours | Place | Email |
| Instructor: Chen Li | Tu, 12:50 - 1:50 | DBH 2092 | chenli AT ics.uci.edu |
| Zuozhi Wang | Wed 1-2 | ICS 464C | zuozhiw AT uci.edu |
Final
The final will be on Thursday, Jun 13, 1:30 - 3:30 p.m., in the same classroom. It will be closed-book and closed-notes. You can bring an A4-sized "cheat sheet." The exam will cover all material taught this quarter. Prof. Li will hold office hours on Monday, 3 - 5 pm, in his office (DBH 2092).
Syllabus and Lecture Notes
| No. | Date | Content | Notes | Student Contributors |
| 01 | 4/02/19, Tu | Course overview | Notes01 | |
| 02 | 4/04/19, Th | IR overview, Text processing, Tokenization, Word-Breaking Video | Notes02 | Rui Guo, Shreyas Badiger Mahadev, and Sadeem Saleh |
| 03 | 4/09/19, Tu | Text processing, Tokenization, Stop words, Normalization, Stemming | Notes03 | Fuyao Li and Zhuang Liu |
| 04 | 4/11/19, Th | Text Processing, Stemming, Phrases, Character Encoding | Notes04 | Xuan Liu and Ned Beigiparast |
| 05 | 4/16/19, Tu | Character Encoding, Inverted Indexing Overview | Notes05 | Zhetai Shao and Yuyang Chen |
| 06 | 4/18/19, Th | Inverted Indexing, LSM | Notes06 | Haixiang Yan and Hanlu Chen |
| 07 | 4/23/19, Tu | Project 2, LSM delete | Notes07 | Chenghao Zhang and Heming Sha |
| 08 | 4/25/19, Th | LSM Merge, Compression | Notes08 | Ziyuan Cui and Zhongwei Wang |
| 09 | 4/30/19, Tu | Dictionary and Posting List Compression | Notes09 | Yang Xing and Liang Tang |
| 10 | 5/02/19, Th | Skip List, Boolean Model | Notes10 | Kaifu Jiang and Ziyang Zhang |
| 11 | 5/07/19, Tu | Project 2 review, Project 3, Positional indexing | Notes11 | Kevin Omidvaran and Mikhail Lychagin |
| 12 | 5/09/19, Th | Distributed Indexing and Ranked Retrieval | Notes12 | Wei-Che Chen and Meng-Hsuang Chiang |
| 13 | 5/14/19, Tu | BiWords (Notes 11), Ranked Retrieval, Vector Space Model | Notes13 | Danni Xiong and Yidan Zhu |
| 14 | 5/16/19, Th | Vector Space Model, Efficient Scoring | Notes14 | Satish Kotti and Mengqi Wang |
| 15 | 5/21/19, Tu | IR Evaluation | Notes15 | Zixu Wang and Jiaren Cai |
| 16 | 5/23/19, Th | Project 4 (Task 1), Web Crawling 1 | Notes16 | Pritha Dawn and Nevedha Ravi |
| 17 | 5/28/19, Tu | PageRank, Project 4 (Task 2), Web Crawling 2 | Notes17 | Yiheng Xu and Tianyi Wei |
| 18 | 5/30/19, Th | MinHash | Notes18 | Yueh Wu and Chiyu Cheng |
| 19 | 6/04/19, Tu | LSH and Web Search Business | Notes19 | |
| 20 | 6/06/19, Th | Enterprise Search, The SRCH2 story, Wrap up | Notes20 | |
Online Resources
- github repository
- Team signup: Google Spreadsheet (Use your UCI email to open the sheet)
- Discussion: Slack. Invitation to Slack will be sent to your UCI email address. Here's the protocol for us to use Slack.
- Gradescope: Sign up on Gradescope as a student using your UCI email ID. You will be added automatically.
Projects
We are going to develop a search engine called "Peterman", where "Peter" stands for Peter the Anteater. The engine consists of several modules, and the "Document store" module is already provided by us. Throughout the quarter you will develop the remaining modules one by one. For each of them, we will define the API, and you are required to provide an implementation. Each student team will also submit its own test cases to be shared with the class.
The development and submission of each module proceed in the following phases:
- Phase 0: Project released. We provide the API.
- Phase 1 (5 days): Students will design, write, and submit at least 2 test cases. Test cases should be submitted via a GitHub pull request.
- Phase 2 (3 days): Students will provide peer reviews of the submitted test cases to make sure they are all correct and of good quality. Each team should review 2 other teams' test cases and leave comments under the pull requests.
- Phase 3 (before the deadline): The goal is to pass all the test cases before submitting the project. Note that development of the module should start early in Phase 1 in order to have enough time to finish the work before the deadline.
| Project | Topic | Days | Weight | Test Cases Due | Review Due | Final Due Date |
| 1 | TA: Text analyzers (stemming and tokenization) | 17 days | 19 out of 76 | Week 1, Sun. (Apr 7) | Week 2, Wed. (Apr 10) | Week 3, Wed. (Apr 17) |
| 2 | II: Inverted index, boolean search | 17 days | 19 out of 76 | Week 4, Tue. (Apr 23) | Week 4, Fri. (Apr 26) | Week 5, Sun. (May 5) |
| 3 | PI: Positional index, phrase search | 17 days | 19 out of 76 | Week 6, Fri. (May 10) | Week 7, Mon. (May 13) | Week 8, Wed. (May 22) |
| 4 | RA: Ranking | 17 days | 19 out of 76 | Week 9, Tue. (May 28) | Week 9, Fri. (May 31) | Week 10, Sun. (Jun 9) |
Course Information
Overview
This course exposes students to the principal concepts of information retrieval, including text analyzers (e.g., stemming and tokenization), text indexing, inverted indexes, search (boolean expressions, phrase search), and ranking. It will also cover topics in Web search and vertical search, with related techniques such as crawling and search engine optimization. A significant part of the course is implementing a search engine in Java.
Prerequisites
Java programming; data structures and algorithms.
Grading Breakdown
Projects: 76%
Final: 20% (No midterm)
Class participation: 3%
EEE Class Evaluation: 1%
For all graded projects and exams, if you disagree with the grading, you can discuss it with us within two weeks after they are returned. After that, all grades will be finalized.
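As a rough illustration of how the weights above combine, here is a minimal Java sketch. The class name `OverallGrade`, the method `overall`, and the idea of expressing each component as a fraction between 0 and 1 are assumptions made for this example, not part of the course's grading tools. The per-project weight of 19 comes from the project table (4 projects, 19 out of 76 each).

```java
class OverallGrade {
    // Weights in percentage points, taken from the grading breakdown.
    static final double PROJECT_WEIGHT = 19.0;       // each of the 4 projects (4 * 19 = 76)
    static final double FINAL_WEIGHT = 20.0;
    static final double PARTICIPATION_WEIGHT = 3.0;
    static final double EVALUATION_WEIGHT = 1.0;

    /**
     * Combines component scores into an overall percentage.
     * Every argument is a fraction in [0, 1] of that component's maximum
     * (a hypothetical convention chosen for this sketch).
     */
    static double overall(double[] projectFractions, double finalFraction,
                          double participationFraction, double evaluationFraction) {
        double total = 0.0;
        for (double fraction : projectFractions) {
            total += PROJECT_WEIGHT * fraction;
        }
        total += FINAL_WEIGHT * finalFraction;
        total += PARTICIPATION_WEIGHT * participationFraction;
        total += EVALUATION_WEIGHT * evaluationFraction;
        return total;
    }

    public static void main(String[] args) {
        // Full credit everywhere except half credit on project 2 and on the final.
        double[] projects = {1.0, 0.5, 1.0, 1.0};
        System.out.println(overall(projects, 0.5, 1.0, 1.0));
    }
}
```

The sketch ignores late penalties, which are applied to individual project scores before they are combined.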
Class participation
Students need to actively participate in the lectures. Before each lecture, the instructor will prepare initial slides using Google Slides. We will assign a group of students to each lecture (except the first one) who will be responsible for taking notes during the lecture. If needed, they can take pictures, but no photos that include the lecturer, please :-) Within two days after the lecture, the assigned students will revise the initial slides and add more content. After that, the whole class, including the lecturer, will give comments and suggestions for improvement within one week, which the group should address. 3% of the overall grade of the assigned students will be based on the quality of the final slides.
Textbooks
Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press. 2008, Book Site, PDF.
Using GitHub
You are required to use GitHub to manage your source code. Here are the instructions.
Working in Teams
Working together on projects is strongly encouraged. You need to form teams of 2 (two) students and submit one project per team, making sure that the names of all the team members appear on the first page. Teamwork will be graded on a per-team basis. A single-member team needs approval from the instructor and TA.
Students may leave their existing team during the quarter, but they cannot join a new team after the "divorce." For each team split, the team members should tell the instructor at least two weeks before the corresponding project/homework deadline.
Project Late Policy
- The official due date for each assignment is listed here on the wiki, and it is expected that students will turn the work in on or before that deadline.
- We offer a 24-hour grace period for each assignment automatically, and therefore accept submissions turned in within 24 hours of the due date, with a penalty of 10 percent of the total grade of the project. For example, if a project is worth 20 points and your late project earns 15 points, then your real score will be 15 - 2 = 13 points.
- Late assignments will NOT be accepted beyond the grace period, so do always aim to be on time! Please don't even ask, as this is what the 1-day grace period is intended for.
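The grace-period arithmetic above can be sketched in a few lines of Java. The class name `LatePenalty` and the method `scoreAfterLatePenalty` are hypothetical names chosen for this illustration. The sketch covers only submissions made within the 24-hour grace period, and it does not clamp the result at zero, since the policy does not say how that case is handled.

```java
class LatePenalty {
    // Penalty stated in the policy: 10 percent of the project's total points.
    static final int PENALTY_PERCENT = 10;

    /**
     * Score for a submission turned in during the 24-hour grace period.
     *
     * @param totalPoints  maximum points the project is worth
     * @param earnedPoints points the submission earned before the penalty
     */
    static double scoreAfterLatePenalty(double totalPoints, double earnedPoints) {
        // The penalty is based on the project's total, not on the earned points.
        double penalty = totalPoints * PENALTY_PERCENT / 100.0;
        return earnedPoints - penalty;
    }

    public static void main(String[] args) {
        // The example from the policy: a 20-point project that earned 15 points.
        System.out.println(scoreAfterLatePenalty(20, 15)); // prints 13.0
    }
}
```

Note that the deduction depends on what the project is worth: a late submission earning 10 points on the same 20-point project would also lose 2 points.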
Policy on Academic Honesty
All students will be expected to adhere to the UCI and ICS Academic Honesty policies (see http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty to read their details). Any student found to be involved in cheating or aiding others in doing so will be academically prosecuted to the maximum extent possible: that means you will fail this course. Just say no to cheating!
