Cleaning up Event Queues - mitikov/KeepSitecoreSimple GitHub Wiki
Why is cleanup needed?
The more records there are in the EventQueue table, the more time and resources the database engine needs to process requests.
Since fresh events are frequently fetched from all databases by all instances, keeping obsolete entries in the database is costly.
A symptom of having too many rows is timeout errors being logged:
Message: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Source: Sitecore.Kernel
at Sitecore.Data.DataProviders.Sql.DataProviderCommand.ExecuteReader()
.....
at Sitecore.Eventing.EventQueue.ProcessEvents(Action`2 handler)
at Sitecore.Eventing.EventProvider.RaiseQueuedEvents()
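To confirm that the table has grown out of proportion, a quick row count against the database that produces the timeouts (for example master or web) is enough; the snippet below is just a rough check, with NOLOCK used only because an approximate figure is all we need:

SELECT COUNT(*) AS [EventQueueRows]
FROM [EventQueue] WITH (NOLOCK) -- dirty read is acceptable, we only need a rough figure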
Who is responsible for cleanup?
The Sitecore.Tasks.CleanupEventQueue agent defined in configuration is responsible for removing obsolete data.
A more aggressive cleanup policy was implemented in 7.2 Update-4 (ref. #392673). It allows specifying the number of minutes of data to keep.
In the later 7.2 Update-5 (ref. 448452) the task was refactored to remove records in batches to avoid SQL lock escalations.
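For illustration only, the batching idea boils down to deleting obsolete rows in small chunks instead of one huge statement, so SQL Server does not escalate row locks to a table lock. This is a sketch of the concept, not Sitecore's actual cleanup code; the cutoff and batch size are made-up example values:

DECLARE @cutoff DATETIME = DATEADD(HOUR, -4, GETUTCDATE()) -- illustrative: keep only the last 4 hours
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) FROM [EventQueue] WHERE [Created] < @cutoff -- small batches keep locks at row/page level
    IF @@ROWCOUNT = 0 BREAK -- stop once no obsolete rows remain
END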
How much data should I keep?
The less you keep, the better. The CMS documentation suggests keeping fewer than 1000 records, so that is a good starting point.
The cleanup interval should remove only records that have already been processed, so it must be longer than the longest-running operation.
Example
The OnPublishEndAsync strategy gets the data to index from the EventQueue after a publish is over.
If a publish operation took 3 hours and intervalToKeep equals 2 hours, data published during the first hour would not be updated in the index:
the corresponding records in the EventQueue table would already have been removed by the cleanup agent by the time the publish is over.
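To see how large a time window the table currently covers, and therefore whether a candidate intervalToKeep would cut into not-yet-processed data, the Created column gives a rough answer; this is only a quick sanity check:

SELECT MIN([Created]) AS [OldestEvent],
       MAX([Created]) AS [NewestEvent],
       DATEDIFF(MINUTE, MIN([Created]), MAX([Created])) AS [WindowInMinutes]
FROM [EventQueue]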
Processing delay
Many threads can populate the EventQueue table at the same time:
- A publish operation can run in multiple threads
- Content editing can be performed by many users simultaneously
However, Event Queue processing is done in a single thread, so a processing delay can build up.
The following SQL can be used to see the current delay:
-- One row per known instance: Properties stores the last processed stamp under keys like 'EQStamp_&lt;InstanceName&gt;'.
-- [TODO]            - number of events the instance has not processed yet
-- [ProcessingDelay] - gap between the newest event in the queue and the last event the instance processed
SELECT SUBSTRING(p.[Key], 9, 100) AS [Instance],
       CONVERT(BINARY(8), CAST(CAST(p.[Value] AS NVARCHAR(10)) AS INT)) AS [LastProcessedStamp],
       (SELECT COUNT(*) FROM [EventQueue]
         WHERE [Stamp] > CONVERT(INT, CAST(p.[Value] AS NVARCHAR(8)))) AS [TODO],
       (CASE WHEN q.[Created] IS NULL
             THEN CONVERT(VARCHAR(24), (SELECT MAX([Created]) - MIN([Created]) FROM EventQueue), 20)
             ELSE CONVERT(VARCHAR(24), (SELECT TOP(1) [Created] FROM EventQueue ORDER BY [Stamp] DESC) - q.[Created], 20)
        END) AS [ProcessingDelay],
       SUBSTRING(q.[EventType], 0, CHARINDEX(',', q.[EventType])) AS [LastEventType],
       q.[InstanceName] AS [RaisedByInstance],
       q.[UserName] AS [RaisedByUser],
       q.[Created] AS [RaisedTime],
       q.[InstanceData] AS [LastEventData],
       q.[Id] AS [LastEqID]
FROM Properties p
FULL JOIN EventQueue q
  ON q.[Stamp] = CONVERT(BINARY(8), CAST(CAST(p.[Value] AS NVARCHAR(10)) AS INT))
WHERE p.[Key] LIKE 'EQStamp%'
ORDER BY TODO
The ProcessingDelay column shows roughly how long it takes for just-published content to show up.
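Note that each database (for example core, master, and web) has its own EventQueue and Properties tables, so the query should be run against every database whose events matter for the scenario you are troubleshooting.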
Finding an optimal value for the cleanup interval
On the one hand, the more records you keep, the lower the chance of removing a not-yet-processed event and leaving an instance showing obsolete content until it is restarted.
On the other hand, excessive data brings performance costs that can be avoided.
A good pick is to find the longest-running operation and multiply its duration by a factor of 1.5 to be on the safe side.
Notes
The OnPublishEndAsync strategy has a threshold parameter (100 000 by default) and triggers a full index rebuild when the number of changes is higher than that value.
So it is enough to measure how long it takes in your solution for a publish to generate 100 000 rows, and multiply that value by 1.5 to leave a margin for real-life scenarios.
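As a rough way to estimate that time on an existing solution, the Created and Stamp columns already show how long it took to accumulate the most recent 100 000 rows (this sketch assumes the table still contains at least that many rows):

SELECT DATEDIFF(MINUTE, MIN(recent.[Created]), MAX(recent.[Created])) AS [MinutesForLast100kRows]
FROM (SELECT TOP (100000) [Created]
      FROM [EventQueue]
      ORDER BY [Stamp] DESC) AS recent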