Dataset business Emails (Enron corpus) - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Proposer: Aser Abdelrahman - @aserhany
Votes:
1. @chaoran-chen
2. @ssoima

Summary

Around 215.3 billion emails [1] are sent and received every day, leaving many users, especially in businesses, with the question of which emails they should read first, or read at all. While searching for a large, real-world collection of business emails suitable for the prediction goals listed below, I found the Enron email dataset to be a very good candidate. It contains data from around 150 users, mostly senior management of Enron, organized into folders. The dataset comprises around 0.5 million emails (1.7 GB). The data was made public by the Federal Energy Regulatory Commission in the United States during its investigation, after Enron went bankrupt due to corruption.

Prediction Goals

How important is a particular business email?
Would the sender expect a reply? If yes, how long would it take?
Discover interesting patterns in business mails.
Email classification: e.g. importance, escalation, alerts.
...
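
The reply-time goal needs a label for each message. One way to derive it is sketched below, assuming no reliable reply header is available and that a reply can be recognised by a matching subject (after stripping prefixes such as "RE:") sent later by one of the original recipients back to the original sender. The `Mail` record, `normalize_subject` and `time_to_reply` are hypothetical names introduced for illustration; the matching rule is a heuristic, not an established method for this dataset.

```python
import re
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional, Tuple


class Mail(NamedTuple):
    """Hypothetical minimal record for one message."""
    sender: str
    recipients: Tuple[str, ...]
    subject: str
    sent: datetime


# Leading "Re:", "Fw:" or "Fwd:" prefixes, possibly repeated.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes so a reply matches its original."""
    return _PREFIX.sub("", subject).strip().lower()


def time_to_reply(original: Mail, mailbox: List[Mail]) -> Optional[timedelta]:
    """Return the delay until the earliest apparent reply, or None."""
    key = normalize_subject(original.subject)
    delays = [
        m.sent - original.sent
        for m in mailbox
        if m.sender in original.recipients       # sent by a recipient
        and original.sender in m.recipients      # back to the original sender
        and m.sent > original.sent               # after the original
        and normalize_subject(m.subject) == key  # same thread by subject
    ]
    return min(delays) if delays else None
```

A message for which `time_to_reply` returns `None` can be treated as "no reply observed", which is not the same as "no reply expected"; that distinction would have to be handled when building the training labels.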

Long Description

1 - Dataset Description:

Emails in text format, including standard email metadata
Email folders for each of the roughly 150 users
Email example:

-----Original Message-----
From: Dunton, Heather
Sent: Tuesday, December 04, 2001 3:12 PM
To: Belden, Tim; Allen, Phillip K.
Cc: Driscoll, Michael M.
Subject: West Position

Attached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24

<< File: west_delta_pos.xls >>

Let me know if you have any questions.

Heather

2 - Possible features:

Email's length
Subject's length
Number of attachments
Number of recipients (To, CC, BCC)
Folders (e.g. deleted mails)
Sending time, day (e.g. consider holidays)
In case of replies: time to reply
Check for particular keywords in subject and body (e.g. “important”, “urgent”, names mentioned)
Making use of threads
Tags: e.g. highly important
Specific characters in mails (e.g. “!”, “?”, “@”)
Mail fonts
Smiley faces
Frequency of received emails from particular mail address
Frequency of sent emails from particular mail address
...
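
Several of the features above can be read directly from a raw message with Python's standard `email` package. The following is a minimal sketch, assuming each message is available as RFC 822-style text with `Date`, `To`, `Cc` and `Subject` headers. The function name `extract_features`, the `KEYWORDS` tuple and the `example.com` addresses in the sample are hypothetical; attachments, fonts, folders and sender frequencies are not covered.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Assumed keyword list, taken from the examples in the feature list.
KEYWORDS = ("important", "urgent")


def extract_features(raw_message: str) -> dict:
    """Compute a few simple per-message features from raw email text."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "") or ""
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart messages are ignored here
        body = ""
    recipients = getaddresses(
        msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    )
    sent = parsedate_to_datetime(msg["Date"]) if msg["Date"] else None
    text = (subject + " " + body).lower()
    return {
        "body_length": len(body),
        "subject_length": len(subject),
        "n_recipients": len(recipients),
        "weekday": sent.weekday() if sent else None,  # Monday == 0
        "hour": sent.hour if sent else None,
        "n_exclamation": body.count("!"),
        "n_question": body.count("?"),
        "has_keyword": any(k in text for k in KEYWORDS),
    }


# Sample modelled on the email example above; addresses are placeholders.
SAMPLE = (
    "Date: Tue, 4 Dec 2001 15:12:00 -0800\n"
    "From: heather.dunton@example.com\n"
    "To: tim.belden@example.com, phillip.allen@example.com\n"
    "Cc: michael.driscoll@example.com\n"
    "Subject: West Position\n"
    "\n"
    "Attached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24\n"
    "\n"
    "Let me know if you have any questions.\n"
)

features = extract_features(SAMPLE)
```

Returning a plain dictionary keeps the sketch independent of any particular data mining library; the rows can later be collected into whatever table format the project settles on.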

Data source:

Freely available to the public for research and educational purposes at http://www.cs.cmu.edu/~./enron/
MySQL database available for querying the dataset at http://www.ahschulz.de/enron-email-data/
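
As a first look at the folder structure, the messages per user folder can be counted directly on disk. This is a minimal sketch, assuming the archive has been extracted into a directory tree of the form `<root>/<user>/<folder>/<message file>`; that layout and the function name `count_messages_per_folder` are assumptions and should be checked against the actual download.

```python
from collections import Counter
from pathlib import Path


def count_messages_per_folder(root: Path) -> Counter:
    """Count message files per (user, folder) under an extracted mail tree."""
    counts = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) < 3:  # expect at least user/folder/file
            continue
        user = parts[0]
        folder = "/".join(parts[1:-1])  # keeps nested folders distinct
        counts[(user, folder)] += 1
    return counts
```

Calling `count_messages_per_folder(Path("maildir"))`, where `maildir` is a placeholder for the extraction directory, and then `.most_common(10)` on the result would list the ten largest folders, which gives a quick impression of how users organized their mail.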

Research uses of the dataset [2]:

Carnegie Mellon University (Language Technologies Institute):
Folder classification: exploring how to classify messages as organized by a human
Email folder prediction: Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams
Learning to classify email into “Speech Acts” – text mining
University of Southern California:
MySQL database, minor cleaning (e.g. deletion of duplicate mails, substitution of invalid email addresses) and statistical analysis of the data.
Discovering important nodes through Graph entropy, the case of Enron Email Database
University of Pennsylvania: a query dataset for email search as well as a tool for generating spelling errors
University of Texas at Austin: Crowdsourcing evaluations of classifier interpretability
University of Illinois at Chicago: Meaningful selection of temporal resolution for dynamic networks
Computational & Mathematical Organization Theory - structure in the Enron email dataset:
Using singular value decomposition and semidiscrete decomposition
Messages fall into two groups: short messages and rare words versus long messages and common words
Relationships among individuals based on their patterns of word use in email
Word use is correlated to function within the organization

Note: Despite the interesting research conducted on the Enron email dataset, I did not find any work or data mining approach addressing the prediction goals listed above. Nevertheless, it would be useful to look into some of these papers as related literature if the dataset is chosen. This could improve our understanding of the dataset and thereby contribute to the quality of the project.

Sources

[1] http://www.radicati.com/wp/wp-content/uploads/2015/02/Email-Statistics-Report-2015-2019-Executive-Summary.pdf
[2] Research papers are attached in the respective zip file here