Enron Email Database UseCase Doc - lambda-land/VDBMS GitHub Wiki
Enron Email Database UseCase Documents
Use case queries and schemas
- Queries and Schemas in Haskell is located in: Link: https://github.com/lambda-land/VDBMS/tree/VDBMSCaseStudy/src/VDBMS/UseCases/EnronUseCase
Steps take to build VDB for Enron Case Study
- Import the enron data into a new database. eg.
mysql -u #username# -p #database# < #dump_file#
- Run Query
VDBMS/databases/enronEmailDB/step1_build_enron_variants.sql
(to create view for db variants of different product) - Run Query
VDBMS/databases/enronEmailDB/step2_build_enron_vdb.sql
(to create v-db) - Get a mysql dump by
mysqldump --user=root --no-create-db --no-create-info --password --skip-set-charset --skip-add-locks --skip-disable-keys enron v_alias v_auto_msg v_employee v_filter_msg v_forward_msg v_mailhost v_message v_recipientinfo v_remail_msg > enron_mysql.sql
- Adjust the dump for postgres by:
- removing the set argument commands at the beginning and end of the file by:
- printing command files to make sure you have the right lines:
sed -n '6,11p' enron_mysql.sql
- then deleting them:
sed -n '6,11d' enron_mysql.sql
- getting the number of lines:
wc -l enron_mysql.sql
- finding out where the commands start:
tail -n 8 enron_mysql.sql
- make sure you have the range right:
sed -n '213,220p' enron_mysql.sql
- and finally deleting them:
sed -n '213,220d' enron_mysql.sql
- printing command files to make sure you have the right lines:
- adjust the name of the table to postgres by running the following for all tables:
find . -name enron_mysql.sql -exec sed -i '' -e 's/INSERT INTO 'v_alias' /INSERT INTO public.v_alias /g' {} +
- note that in the above command 'v_alias' is actually wrapped in two of `
- removing the set argument commands at the beginning and end of the file by:
- Create an enron database in postgres and use create queries to create tables, here
- Import the postgres dump to postgres by:
psql -U postgres enron < enron_mysql.sql
- After importing the database connect to postgres and fix the type of is_something attributes in table v_message by:
ALTER TABLE v_message ALTER is_forward_msg TYPE boolean USING CASE WHEN is_forward_msg=0 THEN FALSE ELSE TRUE END;
- If you like to have a dump of the postgres database do:
pg_dump -U postgres enron > enron_postgres.sql
Original Enron Email Database
Data Set
Enron Data is From here
Schema
-
employee(eid, firstname, lastname, email_id, email2, email3, email4, folder, status)
- size: 149 tuples
-
message(mid, sender, date, message_id, subject, body, folder)
- size: 252759 tuples
-
recipientinfo(rid, mid, rtype, rvalue, dater)
- size: 2064442 tuples
-
referenceinfo(rfid, mid, reference)
- size: 54778
Enron Email Variational Database
1. Schema
-
v_employee(eid, firstname, lastname, email_id, folder, status, verification_key, public_key, presCond)
- verification_key is responded to signature
- public_key is responded to encrypt
-
v_message(mid, sender, date, message_id, subject, body, folder, is_system_notification, is_signed, is_encrypted, is_autoresponse, is_forward_msg, presCond)
- is_signed is responded to signature
- is_encrypted is responded to encrypt
-
v_recipientinfo(rid, mid, rtype, rvalue, presCond)
-
v_referenceinfo(rfid, mid, reference,presCond)
-
v_auto_msg(eid, subject, body, presCond)
- whole table will depend on autoresponder
-
v_forward_msg(eid, forwardaddr, presCond)
- whole table is responded to forwardmessage
-
v_remail_msg(eid, pseudonym, presCond)
- whole table is responded to remailmessage
-
v_filter_msg(eid, suffix, presCond)
- whole table is responded to filtermessages
-
v_mailhost(eid, username, mailhost, presCond)
- whole table is responded to mailhost
-
v_alias(eid, email, nickname, presCond)
- whole table is responded to addressbook
2. Features in Enron Email V-DB
- signature
- addressbook
- filtermsg
- autoresponder
- forwardmsg
- mailhost
- encrypt
- remailmsg
3. Variants Designed For Enron case study
-
- Products for daily use include features:
- Features:
- forwardmsg
- filtermsg
- Presence Condition(p1):
- forwardmsg AND filtermsg AND NOT (addressbook OR encrypt OR remailmsg OR autoresponder OR signature OR mailhost)
- Features:
- Products for daily use include features:
-
- Products that focus on Privacy
- Features:
- signature
- encrypt
- remailmsg
- Presence Condition(p2):
- signature AND encrypt AND remailmsg AND NOT (addressbook OR filtermsg OR autoresponder OR forwardmsg OR mailhost)
- Features:
- Products that focus on Privacy
-
- Products that focus on Group Usage
- features:
- addressbook
- autoresponder
- mailhost
- signature
- encrpt
- Presence Condition(p3):
- addressbook AND autoresponder AND mailhost AND encrypt AND signature AND (NOT (forwardmsg OR remailmsg OR filtermsg))
- features:
- Products that focus on Group Usage
-
- Products that enables all features
- Presence Condition(p4):
- signature AND addressbook AND filtermsg AND autoresponder AND forwardmsg AND mailhost AND encrypt AND remailmsg
- Presence Condition(p4):
- Products that enables all features
-
- Products that disables all features
- Presence Condition(p5):
- (NOT signature) AND (NOT addressbook) AND (NOT filtermsg) AND (NOT autoresponder) AND (NOT forwardmsg) AND (NOT mailhost) AND (NOT encrypt) AND (NOT remailmsg)
- Presence Condition(p5):
- Products that disables all features
Populate the Database
1. Break employee into 5 groups
We have 150 employees in Enron database and 5 product variants for email system, we therefore decide to break 150 employees into 5 groups (A,B,C,D,E), and let those 5 groups people be user of the corresponding 5 different product variants. Assign employee to different group based on employee number (eid) as follows:
- Product for daily usage:
0 < eid <= 30
- Product for privacy:
30 < eid <= 60
- Product for group usage:
60 < eid <= 90
- Product for all enabled features:
90 < eid <= 120
- Product for all disabled features:
120 < eid <= 150
v_employee
:
2. For v_employee(eid, firstname, lastname, email_id, folder, status, sign, public_key, presCond)
- ---- | eid | firstname | lastname | email_id | folder | status | verification_key | public_key | presCond |
---|---|---|---|---|---|---|---|---|---|
Daily | xx | xx | xx | xx | xx | xx | NULL | NULL | p1 |
Privacy | xx | xx | xx | xx | xx | xx | xx | xx | p2 |
Group | xx | xx | xx | xx | xx | xx | NULL | NULL | p3 |
Enable | xx | xx | xx | xx | xx | xx | xx | xx | p4 |
Disable | xx | xx | xx | xx | xx | xx | NULL | NULL | p5 |
v_message
3. For v_message(mid, sender, date, message_id, subject, body, folder, is_system_notification, is_signed, is_encrypted, is_from_remailer, is_autoresponse, is_forward_msg, presCond)
mid | sender | date | message_id | subject | body | folde | is_system_notification | is_signed | is_encrypted | is_autoresponse | is_forward_msg | presCond | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
xx | xx | xx | xx | xx | xx | xx | xx | NULL | NULL | NULL | xx | p1 | |
xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | NULL | NULL | p2 | |
xx | xx | xx | xx | xx | xx | xx | xx | NULL | NULL | xx | NULL | p3 | |
xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | p4 | |
xx | xx | xx | xx | xx | xx | xx | xx | NULL | NULL | NULL | NULL | p5 |
v_recipientinfo
:
4. For v_recipientinfo(rid, mid, rtype, rvalue, presCond)
rid | mid | rtype | rtype | rvalue | presCond | |
---|---|---|---|---|---|---|
xx | xx | xx | xx | xx | p1 | |
xx | xx | xx | xx | xx | p2 | |
xx | xx | xx | xx | xx | p3 | |
xx | xx | xx | xx | xx | p4 | |
xx | xx | xx | xx | xx | p5 |
v_referenceinfo
:
5. For v_referenceinfo(rfid, mid, reference,presCond)
Just like the table v_recipientinfo
, we insert presCond for corresponding product data.
6. For other tables:
For that rest of tables, since the whole table depends on a specific features, we will populate the table with user's email data based on which product group they belong to.