Dataset business Emails (Enron corpus) - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
- Proposer: Aser Abdelrahman - @aserhany
- Votes:
- @chaoran-chen
- @ssoima
Summary
Around 215.3 billion emails [1] are sent and received every day, leaving many users, especially in businesses, with the question of which emails to read first, or whether to read them at all. While searching for a large, real-world collection of business emails suitable for the prediction goals listed below, I found the Enron email dataset to be a very good candidate. It contains data from around 150 users, mostly senior management of Enron, organized into folders. The dataset comprises around 0.5 million emails (1.7 GB). The data was made public by the Federal Energy Regulatory Commission in the United States during its investigation, after Enron went bankrupt amid an accounting fraud scandal.
Prediction Goals
- How important is a particular business email?
- Does the sender expect a reply? If so, how long would the reply take?
- Discover interesting patterns in business mails.
- Email classification, e.g. by importance, escalation, alerts.
- ...
Long Description
1 - Dataset Description:
- Emails in text format, including standard email metadata
- Email folders for each of the roughly 150 users
- Email example:
-----Original Message-----
From: Dunton, Heather
Sent: Tuesday, December 04, 2001 3:12 PM
To: Belden, Tim; Allen, Phillip K.
Cc: Driscoll, Michael M.
Subject: West Position
Attached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24
<< File: west_delta_pos.xls >>
Let me know if you have any questions.
Heather
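As a rough illustration of how a quoted message like the one above could be turned into structured fields, here is a minimal Python sketch. The function name `parse_quoted_message`, the regular expressions, and the assumption that recipients are separated by semicolons are mine, based only on this single example; other messages in the corpus may be laid out differently.

```python
import re
from datetime import datetime

# The example message from above (a blank line separates headers and body).
SAMPLE = """-----Original Message-----
From: Dunton, Heather
Sent: Tuesday, December 04, 2001 3:12 PM
To: Belden, Tim; Allen, Phillip K.
Cc: Driscoll, Michael M.
Subject: West Position

Attached is the Delta position for 1/18, 1/31, 6/20, 7/16, 9/24
<< File: west_delta_pos.xls >>
Let me know if you have any questions.
Heather
"""

HEADER_RE = re.compile(r"^(From|Sent|To|Cc|Bcc|Subject):\s*(.*)$")
ATTACHMENT_RE = re.compile(r"<<\s*File:\s*(.+?)\s*>>")


def split_names(value):
    """Split a 'Last, First; Last, First' list on semicolons."""
    return [name.strip() for name in value.split(";") if name.strip()]


def parse_quoted_message(text):
    """Parse one quoted message block into a dict of fields (hypothetical helper)."""
    headers, body_lines = {}, []
    in_headers = True
    for line in text.splitlines():
        if line.strip().startswith("-----Original Message"):
            continue  # skip the separator line
        if in_headers:
            match = HEADER_RE.match(line)
            if match:
                headers[match.group(1).lower()] = match.group(2).strip()
                continue
            if not line.strip():
                # A blank line ends the header block once it has started.
                in_headers = not headers
                continue
            in_headers = False  # first non-header line starts the body
        body_lines.append(line)

    body = "\n".join(body_lines).strip()
    try:
        # Assumes English day and month names, as in the example.
        sent = datetime.strptime(headers.get("sent", ""), "%A, %B %d, %Y %I:%M %p")
    except ValueError:
        sent = None
    return {
        "from": headers.get("from", ""),
        "to": split_names(headers.get("to", "")),
        "cc": split_names(headers.get("cc", "")),
        "subject": headers.get("subject", ""),
        "sent": sent,
        "body": body,
        "attachments": ATTACHMENT_RE.findall(body),
    }


message = parse_quoted_message(SAMPLE)
print(message["to"], message["sent"], message["attachments"])
```

The resulting dictionary already covers several of the raw ingredients for the features discussed in this proposal, such as the recipient lists, the sending time, and the attachment names.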
2 - Possible features:
- Email's length
- Subject's length
- Number of attachments
- Number of recipients (To, CC, BCC)
- Folders (e.g. deleted mails)
- Sending time and day (e.g. taking holidays into account)
- For replies: time to reply
- Check for particular keywords in subject and email (e.g. “important”, “urgent”, names mentioned)
- Making use of threads
- Tags: e.g. highly important
- Specific characters in mails (e.g. “!”, “?”, “@”)
- Mail fonts
- Smiley faces
- Frequency of emails received from a particular address
- Frequency of emails sent from a particular address
- ...
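Several of the features listed above can be sketched with Python's standard `email` module. The snippet below is only an illustration under stated assumptions: the message in `RAW` is invented for this example (it is not taken from the corpus), the keyword list is limited to the two words mentioned above, the function name `extract_features` is hypothetical, and the `X-Folder` header is assumed to carry the folder path as it appears in the raw message files.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

KEYWORDS = re.compile(r"\b(urgent|important)\b", re.IGNORECASE)

# Hypothetical message, invented for illustration (not taken from the corpus).
RAW = """Message-ID: <1.hypothetical@example.com>
Date: Tue, 4 Dec 2001 15:12:00 -0800 (PST)
From: sender@example.com
To: first@example.com, second@example.com
Cc: third@example.com
Subject: URGENT: West Position
X-Folder: \\Example_User\\Inbox

Attached is the position file. Please review!
Any questions?
"""


def extract_features(raw):
    """Compute a few simple per-message features (hypothetical helper)."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    body = msg.get_payload()  # plain string for a non-multipart message
    text = subject + "\n" + body

    # Count every address in To, Cc and Bcc.
    recipients = getaddresses(
        msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    )
    sent = parsedate_to_datetime(msg["Date"]) if msg["Date"] else None

    return {
        "body_length": len(body),
        "subject_length": len(subject),
        "n_recipients": len(recipients),
        "hour": sent.hour if sent else None,
        "weekday": sent.weekday() if sent else None,  # Monday == 0
        "has_keyword": bool(KEYWORDS.search(text)),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "is_reply": subject.lower().startswith("re:"),
        # Last component of the folder path, e.g. "inbox".
        "folder": msg.get("X-Folder", "").split("\\")[-1].lower(),
    }


features = extract_features(RAW)
for name, value in features.items():
    print(name, value)
```

Features that depend on more than one message, such as time to reply, thread structure, or per-sender frequencies, would need an additional pass over the whole mailbox and are not covered by this sketch.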
Data source:
- Freely available to the public for research and educational purposes at http://www.cs.cmu.edu/~./enron/
- MySQL database available for querying the dataset at http://www.ahschulz.de/enron-email-data/
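For working with the raw files instead of the MySQL database, a loader could look roughly like the sketch below. It assumes that the extracted archive is laid out as `<root>/<user>/<folder>/<message file>` with one message per file; this layout, the `latin-1` decoding, and the function names `iter_messages` and `count_by_folder` are assumptions for illustration and should be checked against the actual download.

```python
import os
from collections import Counter
from email import message_from_file


def iter_messages(root):
    """Yield (user, folder, message) for each file under root/<user>/<folder>/."""
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) < 2:
            continue  # skip the root itself and the per-user top level
        user, folder = parts[0], "/".join(parts[1:])
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to a character, so decoding cannot fail.
            with open(path, encoding="latin-1") as handle:
                yield user, folder, message_from_file(handle)


def count_by_folder(root):
    """Count messages per (user, folder) pair."""
    return Counter((user, folder) for user, folder, _msg in iter_messages(root))
```

Calling `count_by_folder` on the extracted directory would give a first overview of how each user organized their mail, which is relevant for using folders as a feature or as a label.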
Research uses of the dataset [2]:
- Carnegie Mellon University (Language Technologies Institute):
- Folder classification: exploring how to classify messages as organized by a human
- Email folder prediction: automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams
- Learning to classify email into “Speech Acts” (text mining)
- University of Southern California:
- MySQL database, minor cleaning (e.g. deletion of duplicate mails, substitution of invalid email addresses) and statistical analysis of the data.
- Discovering important nodes through graph entropy: the case of the Enron email database
- University of Pennsylvania: a query dataset for email search as well as a tool for generating spelling errors
- University of Texas at Austin: Crowdsourcing evaluations of classifier interpretability
- University of Illinois at Chicago: Meaningful selection of temporal resolution for dynamic networks
- Computational & Mathematical Organization Theory (structure in the Enron email dataset):
- Using singular value decomposition and semidiscrete decomposition
- Messages fall into two groups: short messages and rare words versus long messages and common words
- Relationships among individuals based on their patterns of word use in email
- Word use is correlated to function within the organization
Note: Despite the interesting research conducted on the Enron email dataset, I did not find any work or data mining approach that addresses the prediction goals listed above. Nevertheless, it may be useful to look into some of these papers as related literature if the dataset is chosen. Doing so could improve our understanding of the dataset and thereby the quality of the project.
Sources
- [1] http://www.radicati.com/wp/wp-content/uploads/2015/02/Email-Statistics-Report-2015-2019-Executive-Summary.pdf
- [2] Research papers are attached in the respective zip file here