CS221 Spring 2019: Information Retrieval

Time: Tu/Th, 2:00 - 3:20pm (check the UCI Calendar for holiday information)
Location: PCB 1200.

Staff | Office Hours | Place | Email
Instructor: Chen Li | Tu, 12:50 - 1:50 | DBH 2092 | chenli AT ics.uci.edu
Zuozhi Wang | Wed, 1 - 2 | ICS 464C | zuozhiw AT uci.edu

Final

The final will be on Thursday, Jun 13, 1:30 - 3:30 p.m., in the same classroom. It will be closed book and closed notes. You can bring an A4-sized "cheat sheet." The exam will cover all material taught this quarter. Prof. Li will hold an office hour on Monday, 3 - 5 pm, in his office (DBH 2092).

Syllabus and Lecture Notes

No. | Date | Content | Notes | Student Contributors
01 | 4/02/19, Tu | Course overview | Notes01 |
02 | 4/04/19, Th | IR overview, Text processing, Tokenization, Word-Breaking Video | Notes02 | Rui Guo, Shreyas Badiger Mahadev, and Sadeem Saleh
03 | 4/09/19, Tu | Text processing, Tokenization, Stop words, Normalization, Stemming | Notes03 | Fuyao Li and Zhuang Liu
04 | 4/11/19, Th | Text Processing, Stemming, Phrases, Character Encoding | Notes04 | Xuan Liu and Ned Beigiparast
05 | 4/16/19, Tu | Character Encoding, Inverted Indexing Overview | Notes05 | Zhetai Shao and Yuyang Chen
06 | 4/18/19, Th | Inverted Indexing, LSM | Notes06 | Haixiang Yan and Hanlu Chen
07 | 4/23/19, Tu | Project 2, LSM delete | Notes07 | Chenghao Zhang and Heming Sha
08 | 4/25/19, Th | LSM Merge, Compression | Notes08 | Ziyuan Cui and Zhongwei Wang
09 | 4/30/19, Tu | Dictionary and Posting List Compression | Notes09 | Yang Xing and Liang Tang
10 | 5/02/19, Th | Skip List, Boolean Model | Notes10 | Kaifu Jiang and Ziyang Zhang
11 | 5/07/19, Tu | Project 2 review, Project 3, Positional indexing | Notes11 | Kevin Omidvaran and Mikhail Lychagin
12 | 5/09/19, Th | Distributed Indexing and Ranked Retrieval | Notes12 | Wei-Che Chen and Meng-Hsuang Chiang
13 | 5/14/19, Tu | BiWords (Notes 11), Ranked Retrieval, Vector Space Model | Notes13 | Danni Xiong and Yidan Zhu
14 | 5/16/19, Th | Vector Space Model, Efficient Scoring | Notes14 | Satish Kotti and Mengqi Wang
15 | 5/21/19, Tu | IR Evaluation | Notes15 | Zixu Wang and Jiaren Cai
16 | 5/23/19, Th | Project 4 (Task 1), Web Crawling 1 | Notes16 | Pritha Dawn and Nevedha Ravi
17 | 5/28/19, Tu | PageRank, Project 4 (Task 2), Web Crawling 2 | Notes17 | Yiheng Xu and Tianyi Wei
18 | 5/30/19, Th | MinHash | Notes18 | Yueh Wu and Chiyu Cheng
19 | 6/04/19, Tu | LSH and Web Search Business | Notes19 |
20 | 6/06/19, Th | Enterprise Search, The SRCH2 story, Wrap up | Notes20 |

Online Resources

Projects

We are going to develop a search engine called "Peterman", where "Peter" stands for Peter the Anteater. The engine consists of several modules, and the "Document store" module is already provided by us. Throughout the quarter you will develop the remaining modules one by one. For each of them, we will define the API, and you are required to provide an implementation. Each student team will also submit its own test cases to be shared with the class.

The development and submission of each module consist of three phases, following the project release (Phase 0):

  • Phase 0: Project released. We provide the API.
  • Phase 1 (5 days): Students will design, write, and submit at least 2 test cases. Test cases should be submitted via a GitHub pull request.
  • Phase 2 (3 days): Students will provide peer reviews of the submitted test cases to make sure they are all correct and of good quality. Each team should review 2 other teams' test cases and leave comments under the pull requests.
  • Phase 3 (before the deadline): The goal is to pass all the test cases before submitting the project. Note that development of the module should start early in Phase 1 in order to have enough time to finish the work before the deadline.

Project | Topic | Days | Weight | Test Cases Due | Review Due | Final Due Date
1 | TA: Text analyzers (stemming and tokenization) | 17 days | 19 out of 76 | Week 1, Sun. (Apr 7) | Week 2, Wed. (Apr 10) | Week 3, Wed. (Apr 17)
2 | II: Inverted index, boolean search | 17 days | 19 out of 76 | Week 4, Tue. (Apr 23) | Week 4, Fri. (Apr 26) | Week 5, Sun. (May 5)
3 | PI: Positional index, phrase search | 17 days | 19 out of 76 | Week 6, Fri. (May 10) | Week 7, Mon. (May 13) | Week 8, Wed. (May 22)
4 | RA: Ranking | 17 days | 19 out of 76 | Week 9, Tue. (May 28) | Week 9, Fri. (May 31) | Week 10, Sun. (June 9)

Course Information

Overview

This course introduces the principal concepts of information retrieval, including text analyzers (e.g., stemming and tokenization), text indexing, inverted indexes, search (boolean expressions, phrase search), and ranking. It also covers Web search and vertical search, with related techniques such as crawling and search engine optimization. A significant part of the course is implementing a search engine in Java.

Prerequisites

Java programming; data structures and algorithms.

Grading Breakdown

Projects: 76%
Final: 20% (No midterm)
Class participation: 3%
EEE Class Evaluation: 1%
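
As a rough illustration of how these weights add up, the sketch below combines the four components into an overall score out of 100. It assumes each of the four projects counts for 19 points (as listed in the project table) and that the components combine linearly; the class and method names are hypothetical and are not part of any course-provided code.

```java
public class CourseGrade {
    // Weights taken from the grading breakdown, in points out of 100.
    static final double PROJECT_WEIGHT = 19.0;       // each of the 4 projects (19 out of 76)
    static final double FINAL_WEIGHT = 20.0;
    static final double PARTICIPATION_WEIGHT = 3.0;
    static final double EVALUATION_WEIGHT = 1.0;

    // Each argument is the fraction earned (0.0 to 1.0) in that component.
    // Assumption: the components are combined as a simple weighted sum.
    public static double overall(double[] projectFractions, double finalFraction,
                                 double participationFraction, double evaluationFraction) {
        double total = 0.0;
        for (double fraction : projectFractions) {
            total += PROJECT_WEIGHT * fraction;
        }
        total += FINAL_WEIGHT * finalFraction;
        total += PARTICIPATION_WEIGHT * participationFraction;
        total += EVALUATION_WEIGHT * evaluationFraction;
        return total;
    }

    public static void main(String[] args) {
        // Full marks everywhere: 4 * 19 + 20 + 3 + 1 = 100.
        System.out.println(overall(new double[] {1.0, 1.0, 1.0, 1.0}, 1.0, 1.0, 1.0));
    }
}
```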

For all graded projects and exams, if you disagree with the grading, you can discuss it with us within two weeks after they are returned. After that, all the grades will be finalized.

Class participation

Students need to actively participate in the lectures. Before each lecture, the instructor will prepare initial slides using Google Slides. We will assign a group of students to each lecture (except the first one) who will be responsible for taking notes during the lecture. If needed, they can take pictures, but please take no photos that include the lecturer :-) Within two days after the lecture, the assigned students will revise the initial slides and add more content. After that, the whole class, including the lecturer, will give comments and suggestions for improvement within one week, which the group should address. 3% of the overall grade of the assigned students will be based on the quality of the final slides.

Textbooks

Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. Book Site, PDF.

Using GitHub

You are required to use GitHub to manage your source code. Here are the instructions.

Working in Teams

Working together on projects is strongly encouraged. You need to form teams of 2 (two) students and submit one project per team, making sure that the names of all the team members appear on the first page. Team work will be graded on a per-team basis. A single-member team needs approval from the instructor and the TA.

Students may leave their existing team during the quarter, but they cannot join any new team after the "divorce." For each team split, the team members should tell the instructor at least two weeks before the corresponding project/homework deadline.

Project Late Policy

  • The official due date for each assignment is listed here on the wiki, and it is expected that students will turn the work in on or before that deadline.
  • We automatically offer a 24-hour grace period for each assignment, and therefore accept submissions turned in within 24 hours of the due date, with a penalty of 10 percent of the total grade of the project. For example, if a project is worth 20 points and your late project earns 15 points, then your real score will be 15 - 2 = 13 points.
  • Late assignments will NOT be accepted beyond the grace period, so do always aim to be on time! Please don't even ask, as this is what the 1-day grace period is intended for.
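
The grace-period arithmetic above can be written as a small Java sketch. The class and method names are hypothetical and are not part of any course-provided code. The sketch assumes the penalty is always 10 percent of the project's total points (not of the earned points), as in the example, and it does not clamp the result at zero because the policy does not say how that case is handled.

```java
public class LatePenalty {
    // Penalty stated in the policy: 10 percent of the project's total points.
    static final int PENALTY_PERCENT = 10;

    // Returns the score after applying the grace-period penalty, if any.
    // Submissions beyond the grace period are not accepted and are not modeled here.
    public static double adjustedScore(double totalPoints, double earnedPoints,
                                       boolean usedGracePeriod) {
        if (!usedGracePeriod) {
            return earnedPoints;
        }
        double penalty = totalPoints * PENALTY_PERCENT / 100.0;
        return earnedPoints - penalty;
    }

    public static void main(String[] args) {
        // The example from the policy: a 20-point project, 15 points earned, submitted late.
        System.out.println(adjustedScore(20, 15, true)); // 15 - 2 = 13.0
    }
}
```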

Policy on Academic Honesty

All students will be expected to adhere to the UCI and ICS Academic Honesty policies (see http://www.ics.uci.edu/ugrad/policies/index.php#academic_honesty to read their details). Any student found to be involved in cheating or aiding others in doing so will be academically prosecuted to the maximum extent possible: that means you will fail this course. Just say no to cheating!

Last modified on Jun 5, 2019 8:23:13 AM