CS290 Spring 2016 - Texera/texera GitHub Wiki

CS290: Text Analytics in the Big Data Era

Spring 2016, Department of Computer Science, UC Irvine

  • Instructor: Prof. Chen Li
  • Lecture time: Mondays 4-5:30 pm, DBH 4011
  • Office Hours: Mondays 3-4 pm, DBH 2092 (Email confirmation needed)

Goal:

  • Gain hands-on experience building a system that manages large amounts of text information
  • Study research challenges related to text and data management
  • Form teams to do a group project; learn tools and skills to manage a software project.

Poster Presentation

System Overview

Email list (Google Groups)

Management Sheet

Google Drive

Schedule

| No. | Date | Topics | Todos |
|-----|------|--------|-------|
| 01 | 03/28/2016 | Introduction, SystemT Overview (by Instructor and Zuozhi) | Bid on tasks, form teams, github warmup |
| 02 | 04/04/2016 | Task assignments, [Lucene Overview](https://docs.google.com/presentation/d/1P9HUFFW72ogqdEZf07r5Y7_gM9JK6Wu8UVgH0bGNkF0/edit?usp=sharing) (by team 1) | Lucene sample program, design phase |
| 03 | 04/11/2016 | ScanOperator (team 1), Data Store (team 1), Development environment (team 2), progress report (all teams) | Design phase, operator interface, test cases |
| 04 | 04/18/2016 | Token-based fuzzy operator (Team 5), progress report (all teams) | Operator interface, test cases |
| 05 | 04/25/2016 | [Stanford NLP](https://docs.google.com/presentation/d/1ek18Zr0OqQ0RONj8D7W2aSGs9sz1etnf9bEnWTEA2ag/edit?usp=sharing) (Team 7), progress report (all teams) | Test cases, Implementation |
| 06 | 05/02/2016 | [Regex Matching](https://docs.google.com/presentation/d/1F3Xboeb_azHSjWbJ2Cl36kGHpIeo_6-lI24XwXjq_hA/edit#slide=id.g12e478a39d_0_10) (Team 3), progress report (all teams) | Implementation |
| 07 | 05/09/2016 | [Fuzzy Tokenizer] (Foobar) (Team 2), progress report (all teams) | Implementation, Documentation |
| 08 | 05/16/2016 | Progress report (all teams) | Finishing Implementation, Starting Documentation |

Course schedule:

  • Meet weekly with talks and project discussions;
  • Form teams to work together;
  • Evaluate existing software packages;
  • Design and implement a text-centric data-management system.

Prerequisites:

  • Hands-on system-building experience;
  • Familiarity with Java and C/C++;
  • Desire to learn, read existing software, and build systems;
  • Eager to solve open problems;
  • (Optional but a big plus) Have taken CS222 or CS221.

Commitment: 10 hours per week, 2 units

Software Tools:

  • Java
  • Maven
  • Git
  • Wiki
  • Issue tracking
  • Jenkins

Tasks (Welcome to propose your own):

  • Support dictionary-based search on documents (using Lucene)
  • Build gram-based inverted index (using Lucene)
  • Support fuzzy search with gram index (using Lucene)
  • Support regex search with gram index (using Lucene)
  • Develop a query processor
  • Write a parser and translator from a SystemT query to a Texera query
  • (Optional) Design a declarative query language TextSQL and write a parser
  • (Optional) Include an embedded DB (Derby) and store query results
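To make the gram-index tasks above more concrete, here is a minimal sketch of a gram-based inverted index in plain Java. It is an illustration only, not the project's design: the class name `GramIndex` and its methods are hypothetical, and it uses in-memory maps rather than Lucene, which the tasks call for. The sketch assumes a fixed gram length and lower-cases all text. For fuzzy search it only produces candidates (documents sharing enough grams with the query); a real operator would still need to verify each candidate, for example with an edit-distance check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class: an in-memory stand-in for a Lucene-backed gram index.
class GramIndex {
    private final int n;                                   // gram length (assumed fixed)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    GramIndex(int n) {
        this.n = n;
    }

    // Split text into its distinct, overlapping, lower-cased n-grams.
    static Set<String> grams(String text, int n) {
        Set<String> result = new LinkedHashSet<>();
        String s = text.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    // Index a document and return its id.
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String g : grams(text, n)) {
            postings.computeIfAbsent(g, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Ids of documents that share at least minShared distinct grams with the query.
    // These are candidates only; they still need verification against the query.
    Set<Integer> candidates(String query, int minShared) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String g : grams(query, n)) {
            for (int id : postings.getOrDefault(g, Collections.<Integer>emptySet())) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minShared) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With 3-grams, indexing "text analytics" and "data management" and then querying the misspelling "analitycs" with `minShared = 2` keeps only the first document, because it shares the grams "ana" and "nal" while the second shares only "ana".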

Related Projects:

Project Management:

  • Form teams to do tasks. Each team has 1 or 2 members;
  • Write test cases first;
  • If possible, start with the simplest solution (even if it's scan-based), then develop a more advanced solution;
  • Be prepared to make adjustments during the course of the project.
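As an illustration of the "simplest solution first" advice above, a scan-based dictionary matcher might look like the following sketch. The class and method names are hypothetical and are not the project's operator interface. The sketch assumes case-insensitive substring matching, which is only one possible matching rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scan-based baseline; not the project's operator API.
class ScanDictionaryMatcher {
    // Return every dictionary entry that occurs in the document (case-insensitive),
    // in dictionary order, by scanning the text once per entry.
    static List<String> match(String document, List<String> dictionary) {
        String haystack = document.toLowerCase();
        List<String> found = new ArrayList<>();
        for (String entry : dictionary) {
            if (!entry.isEmpty() && haystack.contains(entry.toLowerCase())) {
                found.add(entry);
            }
        }
        return found;
    }
}
```

A baseline like this gives the test cases something to run against from day one. An index-based version can later replace it and be checked against the same tests.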

Project Protocol:

  • Do not add large files to git. Check the GitHub guidance for details.
  • Write high-quality code.
  • Do high-quality peer reviews.
  • Write good documentation on the GitHub wiki. Each wiki page lists its authors and reviewers with their email addresses.
  • Drawing diagrams: Use Google Drawings. Add diagram source files to Google Drive and change the ownership to "texeraproject AT gmail.com". Add authors to each diagram, and include the source file link on the wiki. Here is an example.
  • Use the "sandbox/" folder on git for your own experiments. Name your folder under "sandbox/" in the format "[firstname]-[lastname]" (all lower case).
  • Use GitHub Issues to manage tasks and bugs.

Project Lead:

Chen Li

Tasks:

Dictionary Matcher Operator

Sandeep Reddy Madugula, Rajesh Yarlagadda, Sudeep Meduri

Query-Rewriter Operator

Kishore Narendran, Shiladitya Sen

Regex Matcher Operator

Zuozhi Wang, Shuying

Keyword Matcher Operator

Akshay Jain, Prakul Agarwal

Token-based Fuzzy Matcher

Varun Bharill, Parag Sarogi

SystemT comparison

Jinggang Diao, Flavio Bayer, Qing Tang

Integrating Stanford NLP

Feng Hong, Yang Jiao