CS290 Spring 2016 - Texera/texera GitHub Wiki
CS290: Text Analytics in the Big Data Era
Spring 2016, Department of Computer Science, UC Irvine
- Instructor: Prof. Chen Li
- Lecture time: Mondays 4-5:30 pm, DBH 4011
- Office Hours: Mondays 3-4 pm, DBH 2092 (Email confirmation needed)
Goal:
- Gain hands-on experience building a system that manages large amounts of text information
- Study research challenges related to text and data management
- Form teams to do a group project; learn tools and skills to manage a software project.
Schedule
No. | Date | Topics | Todos |
---|---|---|---|
01 | 03/28/2016 | Introduction, SystemT Overview (by Instructor and Zuozhi) | Bid on tasks, form teams, GitHub warm-up |
02 | 04/04/2016 | Task assignments, [Lucene Overview](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit?usp=sharing) (by Team 1) | Lucene sample program, design phase |
03 | 04/11/2016 | ScanOperator (Team 1), Data Store (Team 1), Development environment (Team 2), progress report (all teams) | Design phase, operator interface, test cases |
04 | 04/18/2016 | Token-based fuzzy operator (Team 5), progress report (all teams) | Operator interface, test cases |
05 | 04/25/2016 | [Stanford NLP](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing) (Team 7), progress report (all teams) | Test cases, Implementation |
06 | 05/02/2016 | [Regex Matching](https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit#slide=id.g12e478a39d_0_10) (Team 3), progress report (all teams) | Implementation |
07 | 05/09/2016 | [Fuzzy Tokenizer] (Foobar) (Team 2), progress report (all teams) | Implementation, Documentation |
08 | 05/16/2016 | Progress report (all teams) | Finishing Implementation, Starting Documentation |
Course schedule:
- Meet weekly with talks and project discussions;
- Form teams to work together;
- Evaluate existing software packages;
- Design and implement a text-centric data-management system.
Prerequisites:
- Hands-on system-building experience;
- Familiarity with Java and C/C++;
- Desire to learn, read existing software, and build systems;
- Eager to solve open problems;
- (Optional but a big plus) Have taken CS222 or CS221.
Commitment: 10 hours per week, 2 units
Software Tools:
- Java
- Maven
- Git
- Wiki
- Issue tracking
- Jenkins
Tasks (Welcome to propose your own):
- Support dictionary-based search on documents (using Lucene)
- Build gram-based inverted index (using Lucene)
- Support fuzzy search with gram index (using Lucene)
- Support regex search with gram index (using Lucene)
- Develop a query processor
- Write a parser and translator from a SystemT query to a Texera query
- (Optional) Design a declarative query language TextSQL and write a parser
- (Optional) Include an embedded DB (Derby) and store query results
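
To make the gram-index tasks above more concrete, here is a minimal in-memory sketch in plain Java. It deliberately does not use Lucene, which the actual tasks call for. The class name `GramIndexSketch`, its method names, and the gram length of 3 are all assumptions made for illustration. The sketch maps each character gram to the IDs of the documents that contain it, intersects the posting lists of the query's grams to get candidates, and then verifies each candidate with a plain substring check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not the Texera or Lucene API.
class GramIndexSketch {
    private static final int GRAM_LENGTH = 3; // assumed gram size
    private final List<String> documents = new ArrayList<>();
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Splits text into overlapping, lower-cased character grams.
    static List<String> grams(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + GRAM_LENGTH <= lower.length(); i++) {
            result.add(lower.substring(i, i + GRAM_LENGTH));
        }
        return result;
    }

    // Adds a document to the index and returns its ID.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String gram : grams(text)) {
            postings.computeIfAbsent(gram, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Returns the IDs of documents that contain the query as a
    // case-insensitive substring, in ascending order.
    List<Integer> search(String query) {
        List<String> queryGrams = grams(query);
        TreeSet<Integer> candidates = new TreeSet<>();
        if (queryGrams.isEmpty()) {
            // Query is shorter than one gram: fall back to scanning all documents.
            for (int id = 0; id < documents.size(); id++) {
                candidates.add(id);
            }
        } else {
            candidates.addAll(postings.getOrDefault(queryGrams.get(0), new TreeSet<>()));
            for (String gram : queryGrams) {
                candidates.retainAll(postings.getOrDefault(gram, new TreeSet<>()));
            }
        }
        // Verify candidates: sharing every gram does not guarantee a match.
        String needle = query.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int id : candidates) {
            if (documents.get(id).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        GramIndexSketch index = new GramIndexSketch();
        index.add("Text analytics with Lucene");
        index.add("Lucene keyword search");
        System.out.println(index.search("lucene"));
    }
}
```

The final verification step is what keeps the candidate filter safe: the index only narrows down which documents need to be checked. A fuzzy or regex variant would presumably change how candidates are generated and verified, but could keep the same two-phase shape.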
Related Projects:
- Lucene on keyword search (Java)
- Flamingo (UCI) on fuzzy search (C++)
- RE2 on index-based regex (C++)
- SystemT (IBM) on information extraction (Java)
- Stanford NLP on natural language processing (Java)
Project Management:
- Form teams to do tasks. Each team has 1 or 2 members;
- Write test cases first;
- If possible, start with the simplest solution (even if it is scan-based), then develop a more advanced one;
- Be prepared to make adjustments during the course of the project.
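
As a rough illustration of the "simplest solution first" advice above, the following Java sketch shows what a scan-based dictionary matcher might look like before any index is built. The class name `ScanDictionaryMatcherSketch`, the `match` method, and the `docId:entry` output format are hypothetical and are not the project's operator interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the Texera operator interface.
class ScanDictionaryMatcherSketch {
    // Scans every document against every dictionary entry and returns one
    // "docId:entry" string per pair where the document contains the entry,
    // ignoring case. Cost is O(documents x entries), which is the price of
    // having no index.
    static List<String> match(List<String> documents, List<String> dictionary) {
        List<String> matches = new ArrayList<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            String text = documents.get(docId).toLowerCase(Locale.ROOT);
            for (String entry : dictionary) {
                if (text.contains(entry.toLowerCase(Locale.ROOT))) {
                    matches.add(docId + ":" + entry);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("Irvine is in California", "Lucene indexes text");
        List<String> dictionary = Arrays.asList("irvine", "text", "java");
        System.out.println(match(documents, dictionary));
    }
}
```

A scan-based version like this is slow on large collections, but its output can serve as the expected result in test cases for a later index-based implementation.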
Project Protocol:
- Do not add large files to git. Check GitHub's guidance for details.
- Write high-quality code.
- Do high-quality peer reviews.
- Write good documentation using the GitHub wiki. Each wiki page lists its authors and reviewers with their email addresses.
- Drawing diagrams: Use Google Drawings. Add diagram source files to Google Drive and change the ownership to "texeraproject AT gmail.com". Add authors to each diagram, and include the source file link on the wiki. Here is an example.
- Use the "sandbox/" folder on git for your own experiments. Use the format of "[firstname]-[lastname]" (all lower case) for the name of your folder under "sandbox/".
- Use GitHub Issues to manage tasks and bugs.
Project Lead:
Tasks:
Dictionary Matcher Operator
Sandeep Reddy Madugala | Rajesh Yarlagadda | Sudeep Meduri |
Query-Rewriter Operator
Kishore Narendran | Shiladitya Sen |
Regex Matcher Operator
Zuozhi Wang | Shuying |
Keyword Matcher Operator
Akshay Jain | Prakul Agarwal |
Token-based Fuzzy Matcher
Varun Bharill | Parag Sarogi |
SystemT comparison
Jinggang Diao | Flavio Bayer | Qing Tang |
Integrating Stanford NLP
Feng Hong | Yang Jiao |