Cleaning up Event Queues - mitikov/KeepSitecoreSimple GitHub Wiki
Why is cleanup needed?
The more records there are in the EventQueue table, the more time and resources the database engine needs to process requests.
Since fresh events are frequently fetched from all databases by all instances, keeping obsolete entries in the database is costly.
A symptom of having too many rows is timeout errors being logged:
Message: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Source: Sitecore.Kernel
at Sitecore.Data.DataProviders.Sql.DataProviderCommand.ExecuteReader()
.....
at Sitecore.Eventing.EventQueue.ProcessEvents(Action`2 handler)
at Sitecore.Eventing.EventProvider.RaiseQueuedEvents()
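To confirm that the table has grown out of proportion, a quick row count against the database that produces the timeouts (for example master or web) is enough; the snippet below is just a rough check, with NOLOCK used only because an approximate figure is all we need:

SELECT COUNT(*) AS [EventQueueRows]
FROM [EventQueue] WITH (NOLOCK) -- dirty read is acceptable, we only need a rough figure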
Who is responsible for cleanup?
The Sitecore.Tasks.CleanupEventQueue agent defined in configuration is responsible for removing obsolete data.
A more aggressive cleanup policy was implemented in 7.2 Update-4 (ref. #392673). It allows specifying the number of minutes of data to keep.
In the later 7.2 Update-5 (ref. 448452) the task was refactored to remove records in batches to avoid SQL lock escalations.
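For illustration only, the batching idea boils down to deleting obsolete rows in small chunks instead of one huge statement, so SQL Server does not escalate row locks to a table lock. This is a sketch of the concept, not Sitecore's actual cleanup code; the cutoff and batch size are made-up example values:

DECLARE @cutoff DATETIME = DATEADD(HOUR, -4, GETUTCDATE()) -- illustrative: keep only the last 4 hours
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) FROM [EventQueue] WHERE [Created] < @cutoff -- small batches keep locks at row/page level
    IF @@ROWCOUNT = 0 BREAK -- stop once no obsolete rows remain
END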
How much data should I keep?
The less you keep, the better. The CMS documentation suggests keeping fewer than 1000 records, so that is a good starting point.
The cleanup interval should remove only records that have already been processed, so it must be longer than the longest-running operation.
Example
The OnPublishEndAsync strategy gets the data to index from the EventQueue after a publish is over.
If a publish operation took 3 hours and intervalToKeep equals 2 hours, data published during the first hour would not be updated in the index:
the corresponding records in the EventQueue table would already have been removed by the cleanup agent by the time the publish is over.
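To see how large a time window the table currently covers, and therefore whether a candidate intervalToKeep would cut into not-yet-processed data, the Created column gives a rough answer; this is only a quick sanity check:

SELECT MIN([Created]) AS [OldestEvent],
       MAX([Created]) AS [NewestEvent],
       DATEDIFF(MINUTE, MIN([Created]), MAX([Created])) AS [WindowInMinutes]
FROM [EventQueue]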
Processing delay
Many threads can populate the EventQueue table at the same time:
- A publish operation can run in multiple threads
- Content editing can be performed by many users simultaneously
However, Event Queue processing is done in a single thread, so a processing delay can build up.
The following SQL can be used to see the current delay:
-- One row per known instance: Properties stores the last processed stamp under keys like 'EQStamp_&lt;InstanceName&gt;'.
-- [TODO]            - number of events the instance has not processed yet
-- [ProcessingDelay] - gap between the newest event in the queue and the last event the instance processed
SELECT SUBSTRING(p.[Key], 9, 100) AS [Instance],
       CONVERT(BINARY(8), CAST(CAST(p.[Value] AS NVARCHAR(10)) AS INT)) AS [LastProcessedStamp],
       (SELECT COUNT(*) FROM [EventQueue]
         WHERE [Stamp] > CONVERT(INT, CAST(p.[Value] AS NVARCHAR(8)))) AS [TODO],
       (CASE WHEN q.[Created] IS NULL
             THEN CONVERT(VARCHAR(24), (SELECT MAX([Created]) - MIN([Created]) FROM EventQueue), 20)
             ELSE CONVERT(VARCHAR(24), (SELECT TOP(1) [Created] FROM EventQueue ORDER BY [Stamp] DESC) - q.[Created], 20)
        END) AS [ProcessingDelay],
       SUBSTRING(q.[EventType], 0, CHARINDEX(',', q.[EventType])) AS [LastEventType],
       q.[InstanceName] AS [RaisedByInstance],
       q.[UserName] AS [RaisedByUser],
       q.[Created] AS [RaisedTime],
       q.[InstanceData] AS [LastEventData],
       q.[Id] AS [LastEqID]
FROM Properties p
FULL JOIN EventQueue q
  ON q.[Stamp] = CONVERT(BINARY(8), CAST(CAST(p.[Value] AS NVARCHAR(10)) AS INT))
WHERE p.[Key] LIKE 'EQStamp%'
ORDER BY TODO
The ProcessingDelay column shows roughly how long it takes for just-published content to show up.
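Note that each database (for example core, master, and web) has its own EventQueue and Properties tables, so the query should be run against every database whose events matter for the scenario you are troubleshooting.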
Finding an optimal value for the cleanup interval
On the one hand, the more records you keep, the lower the chance of removing a not-yet-processed event and leaving an instance showing obsolete content until it is restarted.
On the other hand, excessive data brings performance costs that can be avoided.
A good pick is to find the longest-running operation and multiply its duration by a factor of 1.5 to be on the safe side.
Notes
The OnPublishEndAsync strategy has a threshold parameter (100 000 by default) and triggers a full index rebuild when the number of changes is higher than that value.
So it is enough to measure how long it takes in your solution for a publish to generate 100 000 rows, and multiply that value by 1.5 to leave a margin for real-life scenarios.
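As a rough way to estimate that time on an existing solution, the Created and Stamp columns already show how long it took to accumulate the most recent 100 000 rows (this sketch assumes the table still contains at least that many rows):

SELECT DATEDIFF(MINUTE, MIN(recent.[Created]), MAX(recent.[Created])) AS [MinutesForLast100kRows]
FROM (SELECT TOP (100000) [Created]
      FROM [EventQueue]
      ORDER BY [Stamp] DESC) AS recent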