Email aggregation - Orodan/Hilary GitHub Wiki

In order to cut-down on the amount of e-mails Hilary sends the following is proposed:

The user can declare an e-mail preference. His options are:

Don't send me e-mails
Send me a daily aggregate overview (if there is anything to send)
Immediate

The first two are fairly self-explanatory, but the immediate one might not be super obvious. When this option is selected, the user will receive an e-mail shortly after the activity occurs, but a slight delay might be introduced.

Each activity type is able to declare how long the email aggregator should hold off on sending e-mails via the ActivityAPI.registerActivityType.

Rules

The aggregation rules are as follows:

The first activity determines when the next e-mail will be sent out
All activities between the first activity and the moment the mails gets sent, should aggregate
An e-mail can contain multiple activity types

Example

In all the examples, there are 2 configured activity types

comments (green)
content-shares with a longer aggregation time (blue)

Example 1

For example, imagine the following situation:

7 activities occur
- 3 comments
- 4 content shares

situation

Each increment on the t-axis is a possible point where Hilary will send e-mails. The lines above the axis are activities. The width of the activity determines how long it can aggregate with other activities. In the above example, you can see that Hilary will send 2 e-mails.

At t = 4
- One aggregate containing 4 content-shares
- One comment activity
At t = 6.
- One aggregate containing 2 comments

Example 2

4 activities occur
- 1 comments
- 3 content shares

situation

Because it's the first activity that determines when the next mail point will be, 2 e-mails will be sent out:

At t = 3
- One comment
- One content-share
At t = 6
- One aggregate containing 2 content-shares

Potential implementation

Much like activity aggregation, e-mail aggregation would run on a time-based interval. This would be a relatively small interval, e.g.: every 5 minutes

Data access / storage

The data needs to be stored in such a way that multiple nodes can perform email aggregation/sending. If we consider loosing e-mails when a machine drops out acceptable, we can probably get away with storing this information in Redis.

Who we need to send an e-mail at point t:

oae-activity:mail:#bucket:#t = set(users who we need to mail at t = #t)

What an e-mail for a user should contain:

oae-activity:mail:#bucket:#t:#userIdA = set(IDs of activities that need to go into this mail)
oae-activity:mail:#bucket:#t:#userIdB = set(IDs of activities that need to go into this mail)
..

What the next e-mail point is for a user

oae-activity:mail:#bucket:next = { userIdA => 5, userIdB => 7, ... }

On activity delivery

When an activity gets delivered to an activity stream that requires an e-mail we need to do the following (in pseudocode)

queueMailActivity = function(userId, activity):
    // Get the mail preferences for this user as it might not be necessary to queue anything
    mailPreferences = MailPreferencesDAO.getPreferences(userId);
    if mailPreferences.never:
         // We're done here
         return

    // A user always goes in the same bucket
    bucket = _getBucket(userId);

    // Determine if we've already scheduled a delivery for this user
    var nextScheduledDelivery = Redis.get(oae-activity:mail:#bucket:next[#userId])
    
    // If the user already had an activity scheduled, we can add this one in the set
    if nextScheduledDelivery is defined:
        Redis.insertInSet(oae-activity:mail:#bucket:#nextScheduledDelivery:#userId, activity.id)

    // If the user had no mail scheduled yet, we need to figure out for when we should schedule it.
    // This depends a bit on the mail preferences
    else:
        // If the user wants his mail "immediately" and this is the first activity that triggers an e-mail
        // we need to get the timeout from the registered activity type and schedule delivery in that timeout
        if mailPreference.immediate:
            mailTimeout = ActivityRegistery.getActivtyType(activity.type).mailTimeout
            nextScheduledDelivery = now + mailTimeout

        // If the user prefers daily aggregates, we schedule mail delivery for a fixed point during the day
        elif mailPreference.daily:
            nextScheduledDelivery = 0

        // Schedule the user for mail delivery at that point in time
        Redis.insertInSet(oae-activity:mail:#bucket:#nextScheduledDelivery, userId)

        // Add the activity for that user at that point in time
        Redis.insertInSet(oae-activity:mail:#bucket:#nextScheduledDelivery:#userId, activity.id)

E-mail aggregator ticks

The email aggregator should:

grab an unallocated bucket number
get the set of users that need e-mails at that point in time
Grab all the activity IDs for those users that need e-mail at that point in time
Grab all the activities from the users their activity streams in Cassandra (this could be skipped if we decided to store the full activity in Redis in stead)
Perform aggregation on the collected activities (as per existing rules)
Format the e-mail
Possibly embed any preview images (maybe out-of-scope here)
Clear all the values from Redis which are no longer relevant, maybe we could use a Redis TTL (where appropriate) to avoid this?