How to crawl the full SharePoint Online tree using a Java‐based solution - krickert/search-api GitHub Wiki
This guide covers crawling the full SharePoint Online tree with a Java-based solution (Micronaut framework) that integrates with Okta-based authentication and Azure ACLs. The approach includes an initial full crawl for seeding, real-time updates via change tracking, and best practices for tracking crawl history to enable efficient recrawls, along with recommended tools and libraries for each part.
Crawling a SharePoint Online tenant in Java (Micronaut framework) involves using Microsoft’s APIs to enumerate content, then capturing changes continuously. The solution will use a service account (or app principal) with broad read access, authenticating via Okta (as the Identity Provider federated with Azure AD) to obtain the necessary OAuth tokens. The crawler will perform an initial full crawl to seed a search index, then subscribe to real-time changes (via Microsoft Graph change notifications or delta queries) for incremental updates. It will maintain a crawl history state to avoid re-fetching unchanged data, and publish structured content to AWS MSK (Kafka) for downstream indexing in Solr. Below is a breakdown of the approach, including authentication, crawl implementation, Kafka integration, and best practices (incremental updates, rate limiting, etc.), with recommended libraries for each part.
Azure AD App Registration: Start by registering an Azure AD application with the permissions needed to read SharePoint content. For full tenant-wide crawling, grant app-only permissions such as `Sites.Read.All` or `Sites.FullControl.All` (the latter may be needed to discover all sites) ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Sites)) ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Files)). You can also use the granular `Sites.Selected` permission to limit access to specific site collections if full access is a concern ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=)). The app principal (service account) should be given read access across the tenant (either via the app permissions or by being added as a site admin on all sites, though app permissions are the preferred approach).
Okta Integration: If your organization uses Okta as the identity provider for Office 365, you’ll need to obtain an access token for Microsoft Graph via Okta. In many cases, Azure AD is federated with Okta for user authentication – but for app-only authentication, you can often bypass Okta and go directly to Azure AD’s token endpoint using the client credential flow. The typical approach is:
- Client Credential OAuth2 Flow: Use the Azure AD OAuth token URL (e.g. `https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token`) with the app’s client ID and client secret (or certificate) to get a Graph API token. This doesn’t require a user context, so it can run unattended. If Okta is involved in the OAuth flow, you may need to configure Okta as an OAuth authorization server that trusts your Azure AD app, or use Okta’s API to retrieve a SAML assertion/cookie and exchange it for a SharePoint token. However, using direct Azure AD app auth is simpler for a service account.
- Okta to Azure Token Exchange (if needed): In some setups, Okta can be granted admin consent to call Graph on behalf of your tenant ([Provide Microsoft admin consent for Okta | Okta](https://help.okta.com/en-us/content/topics/apps/apps_o365_admin_consent.htm#:~:text=Provide%20consent%20for%20Okta%20to,information%20provided%20by%20Office%20365)). Okta’s API or SDK (e.g. using an Okta OAuth service app) can authenticate the service account and provide a token usable against Azure AD/Graph. One method (if only interactive SAML is available) is to use Okta’s authentication API to get a session token and then follow the redirect to get SharePoint cookies (FedAuth) ([saml - authenticate to SharePoint through OKTA from back-end service - Stack Overflow](https://stackoverflow.com/questions/37140940/authenticate-to-sharepoint-through-okta-from-back-end-service#:~:text=Here%20is%20what%20I%20did,okta%20authorization%20token%20for%20that)) ([saml - authenticate to SharePoint through OKTA from back-end service - Stack Overflow](https://stackoverflow.com/questions/37140940/authenticate-to-sharepoint-through-okta-from-back-end-service#:~:text=5,parse%20an%20html%20file%20again)), but this is complex. Recommended: leverage the Azure AD app-only flow with client credentials, which avoids needing to simulate an Okta login. This can be achieved in Micronaut by using the Microsoft Authentication Library (MSAL) for Java or the Azure Identity SDK (for example, using `ClientSecretCredential` or `ClientCertificateCredential` to acquire tokens), as sketched below.
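For example, here is a minimal sketch of app-only token acquisition using the Azure Identity library (`azure-identity`); the tenant ID, client ID, and secret are placeholders you would supply from Micronaut configuration:

```java
import com.azure.core.credential.TokenRequestContext;
import com.azure.identity.ClientSecretCredential;
import com.azure.identity.ClientSecretCredentialBuilder;

// Sketch: acquire an app-only Microsoft Graph token via the client credential flow.
public final class GraphTokenProvider {

    private final ClientSecretCredential credential;

    public GraphTokenProvider(String tenantId, String clientId, String clientSecret) {
        this.credential = new ClientSecretCredentialBuilder()
                .tenantId(tenantId)        // Azure AD tenant (directory) ID
                .clientId(clientId)        // app registration client ID
                .clientSecret(clientSecret)
                .build();
    }

    /** Returns a bearer token for Graph; the .default scope uses the app's granted permissions. */
    public String getGraphToken() {
        TokenRequestContext ctx = new TokenRequestContext()
                .addScopes("https://graph.microsoft.com/.default");
        return credential.getToken(ctx).block().getToken();
    }
}
```

The raw REST sketches later in this document assume a bearer token obtained this way.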
Microsoft Graph Java SDK: Use the official [Microsoft Graph SDK for Java](https://github.com/microsoftgraph/msgraph-sdk-java) for convenient API access. This SDK can handle authentication via a provider you supply (you’d implement an `AuthenticationProvider` that supplies the OAuth token). The Graph SDK provides high-level classes to enumerate sites, lists, list items, drive items, etc. – which is ideal for crawling. (Alternatively, you can use Micronaut’s HTTP client or Apache HttpClient to call the REST endpoints directly, but the SDK will save time.) Ensure the token has the `Sites.Read.All` (or higher) scope so Graph can read sites, lists, and files. With proper permissions, Graph supports access to SharePoint sites, lists, and drives (document libraries), including reading list items and documents ([Working with SharePoint sites in Microsoft Graph - Microsoft Graph v1.0 | Microsoft Learn](https://learn.microsoft.com/en-us/graph/api/resources/sharepoint?view=graph-rest-1.0#:~:text=The%20SharePoint%20API%20in%20Microsoft,supports%20the%20following%20core%20scenarios)).
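As a sketch of wiring the credential into the Graph client, assuming the 5.x Graph Java SDK (which accepts Azure Identity credentials through `TokenCredentialAuthProvider`; newer SDK versions wire the credential differently):

```java
import com.azure.identity.ClientSecretCredential;
import com.azure.identity.ClientSecretCredentialBuilder;
import com.microsoft.graph.authentication.TokenCredentialAuthProvider;
import com.microsoft.graph.requests.GraphServiceClient;

import java.util.List;

// Sketch: build a Graph client with app-only credentials (Graph Java SDK 5.x style).
public final class GraphClientFactory {

    public static GraphServiceClient<okhttp3.Request> create(String tenantId, String clientId, String clientSecret) {
        ClientSecretCredential credential = new ClientSecretCredentialBuilder()
                .tenantId(tenantId)
                .clientId(clientId)
                .clientSecret(clientSecret)
                .build();

        // The ".default" scope picks up whatever application permissions were consented (e.g. Sites.Read.All).
        TokenCredentialAuthProvider authProvider =
                new TokenCredentialAuthProvider(List.of("https://graph.microsoft.com/.default"), credential);

        return GraphServiceClient.builder()
                .authenticationProvider(authProvider)
                .buildClient();
    }
}
```

Registering the resulting client as a Micronaut singleton lets the crawler components share it.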
Handling Azure ACLs: To enforce security trimming in search results, capture the Access Control Lists (ACLs) for each item. Azure AD/SharePoint permissions can be retrieved via Graph or the SharePoint API. For example, Graph’s drive and list item resources have a `permissions` relationship that lists who has access (e.g., users, Azure AD groups, sharing links) – you can call `/sites/{siteId}/drives/{driveId}/items/{itemId}/permissions` for each file. This may show individual permissions and sharing links. However, for full fidelity (including role inheritance and group membership), you might need to call SharePoint’s REST API (or Microsoft Graph beta) to get role assignments. One approach is to use the SharePoint REST endpoint `/_api/web/GetFileById('<id>')/ListItemAllFields/RoleAssignments`, which returns the principals and roles with access. This extra ACL info can be included in the data sent to Kafka so the indexing process can store it for search trimming. Ensure your service principal has rights to read permissions (the `Sites.FullControl.All` permission will allow reading permissions, whereas read-only might not expose all ACL info ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Therefore%2C%20Glean%20requires%20FullControl%20to,is%20best%20for%20our%20customers))).
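A rough sketch of the Graph side of this, calling the `permissions` endpoint over raw REST and flattening the result into ACL strings (the `User:`/`Group:` prefix convention is just an illustration for the Kafka payload, not something Graph or Solr requires):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Sketch: read the permissions of a single driveItem and flatten them into ACL strings.
public class AclFetcher {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private final HttpClient http = HttpClient.newHttpClient();

    public List<String> fetchAcl(String siteId, String driveId, String itemId, String accessToken) throws Exception {
        String url = "https://graph.microsoft.com/v1.0/sites/" + siteId
                + "/drives/" + driveId + "/items/" + itemId + "/permissions";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + accessToken)
                .GET().build();
        JsonNode body = MAPPER.readTree(http.send(request, HttpResponse.BodyHandlers.ofString()).body());

        List<String> acl = new ArrayList<>();
        for (JsonNode permission : body.path("value")) {
            // Each permission entry may be granted to a user, a group, or a sharing-link audience.
            JsonNode grantedTo = permission.path("grantedToV2");
            if (grantedTo.has("user")) {
                acl.add("User:" + grantedTo.path("user").path("id").asText());
            } else if (grantedTo.has("group")) {
                acl.add("Group:" + grantedTo.path("group").path("id").asText());
            }
        }
        return acl;
    }
}
```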
The initial crawl will traverse the entire SharePoint Online hierarchy accessible to the service account and extract all content and metadata for indexing. Key steps for the full crawl:
- Discover All Site Collections: Use Microsoft Graph to list sites. Graph provides an endpoint to search for sites, `GET /sites?search=*`, which can return all site collections (and perhaps subsites) to which the app has access. Another option is to query the SharePoint Admin API to get all site URLs, but with `Sites.FullControl.All`, Graph can auto-discover new sites ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Sites)). If using `Sites.Selected`, you would maintain a list of site IDs/URLs to crawl (perhaps provided to your app). Retrieve the list of site IDs and URLs as the starting points.
- Enumerate Content in Each Site: For each SharePoint site:
  - Lists and Libraries: List all lists (`GET /sites/{siteId}/lists`). Identify document libraries (list `baseTemplate` == 101) vs. other lists. For document libraries, you can either use the Lists API or the Drives API. Document libraries are exposed as Drives in Graph, which is convenient for files. For example, the default Documents library of a site can be accessed via `GET /sites/{siteId}/drive/root/children` (root folder items) and recursively enumerated. Other document libraries will appear in `GET /sites/{siteId}/drives`. You can iterate through each drive’s folders and files. For each file (driveItem), collect its metadata (name, URL, lastModifiedDateTime, etc.) and, if needed, download the content stream via `GET /sites/{siteId}/drive/items/{itemId}/content`. For list items (non-document lists or the site pages library), use `GET /sites/{siteId}/lists/{listId}/items` with `expand=fields` to get list columns. This returns items with their field values.
  - SharePoint Pages: SharePoint modern pages are typically stored in a Site Pages library (which is a document library). They can be fetched like other drive items in that library. Classic pages might be in the “Pages” library. Be sure to include those libraries in the crawl if present so that site pages (HTML content) can be indexed.
  - Permissions (ACLs): As mentioned, retrieve permissions for each item or each library. You might do this on a second pass or lazily, since pulling ACLs for every item can be heavy. One strategy is to fetch at least the library-level or site-level permission info (which usually applies to all items unless inheritance is broken), and only fetch item-specific permissions if an item has unique ACLs. Microsoft Graph does not directly tell you whether an item has unique permissions, so you may need to call SharePoint REST for `HasUniqueRoleAssignments`. Depending on your indexing needs, you could index security at the library level (most content inherits it) and handle exceptions for unique-ACL items.
  - Data Extraction: For files (like PDFs, Word docs), you may need their text content for full-text search. You can either extract text in the crawler (e.g., download the file and run a text extractor like Apache Tika, then send the text to Kafka) or send the raw file content through Kafka to be parsed later. Often it’s better to do text extraction in the indexing pipeline to offload work from the crawler. At minimum, send the file’s URL or an ID so the indexer can fetch content if needed. If the crawler should handle it, be mindful of binary size when publishing to Kafka. You might convert documents to text and include the text in the message, or store the file in a blob store and send a reference.
- Pagination & Throttling: The Graph API will page results (often 200–1000 items per page for drives and lists). The Graph SDK or REST calls will return an `@odata.nextLink` when more results are available. Continue paging until all items are retrieved (see the paging sketch after this list). Implement throttle handling – if Graph returns HTTP 429 or 503 with a `Retry-After` header, respect that and pause accordingly. During a full crawl of a large tenant, it’s easy to hit Graph’s throughput limits, so you may need to crawl in segments (e.g., one site at a time, with slight delays or concurrency control). Microsoft notes that using a delta approach (explained below) can reduce the data transfer and the risk of throttling ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Delta%20query%2C%20also%20called%20change,with%20a%20local%20data%20store)) ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Using%20delta%20query%20helps%20you,of%20the%20requests%20being%20throttled)). For the initial crawl, you have to fetch everything once, but do so efficiently: use `$select` to only retrieve the fields you need (e.g., file name, URL, lastModified, etc., plus fields for lists) and `$expand=fields` judiciously to avoid extra calls per item.
- Parallelism: Micronaut can run tasks asynchronously – consider crawling multiple sites in parallel threads to speed up the seed crawl, but be cautious not to overrun API limits. A reasonable approach is to process one site (or a few small sites) at a time, and within a site parallelize fetching different document libraries concurrently. The Kafka backend can absorb out-of-order messages, but ensure you don’t overload the network. Monitor memory as well if you store a lot of data before sending to Kafka; it’s better to stream items out as you retrieve them.
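The paging sketch below walks one document library over raw Graph REST (JDK `HttpClient` + Jackson), following `@odata.nextLink` until the collection is exhausted. The site/drive IDs and the bearer token come from the discovery and authentication steps above; the Graph SDK’s collection-page classes accomplish the same thing.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: page through every file in a document library via the raw Graph REST API.
public class DriveCrawler {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private final HttpClient http = HttpClient.newHttpClient();

    public void crawlDrive(String siteId, String driveId, String accessToken) throws Exception {
        String url = "https://graph.microsoft.com/v1.0/sites/" + siteId
                + "/drives/" + driveId + "/root/children"
                + "?$select=id,name,webUrl,lastModifiedDateTime,file,folder";
        while (url != null) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET().build();
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            JsonNode body = MAPPER.readTree(response.body());

            for (JsonNode item : body.path("value")) {
                if (item.has("folder")) {
                    // Recurse into subfolders via /items/{id}/children (omitted for brevity).
                } else {
                    // Emit the file metadata to the Kafka pipeline here.
                    System.out.println(item.path("webUrl").asText());
                }
            }

            // Follow @odata.nextLink until the collection is exhausted.
            JsonNode next = body.get("@odata.nextLink");
            url = next == null ? null : next.asText();
        }
    }
}
```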
Each piece of content retrieved (documents, list items, pages) should be transformed into a structured JSON (or Avro) message including key metadata: e.g., `id`, `siteId`, `path`/`url`, `title`, `lastModified`, `created`, `author`, `fileType`, content (or a content excerpt), and ACLs (the list of allowed users/groups). Then publish it to Kafka (details below). This full crawl will feed the search indexer with an initial complete dataset.
After seeding, the crawler should switch to incremental update mode. This ensures the search index stays up-to-date as users add or modify SharePoint content. There are two complementary strategies for this: Graph change notifications (webhooks) for push-based updates, and delta queries for pull-based incremental sync. A combined approach can offer real-time responsiveness with reliability:
- Microsoft Graph Change Notifications (Webhooks): You can subscribe to changes on SharePoint content via Graph’s webhook mechanism. Graph supports subscriptions on SharePoint list resources – which include document libraries and generic lists. For example, you can subscribe to a specific document library with the resource path `/sites/{site-id}/lists/{list-id}` (this covers any changes to items in that list) ([Set up notifications for changes in resource data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/change-notifications-overview#:~:text=list%20under%20a%20SharePoint%20site,id%7D%2Fconversations%60)). When any file or item in that list is added, updated, or deleted, Graph will send an HTTPS POST notification to your endpoint. In Micronaut, you can create a controller to receive these callbacks (a controller sketch appears after this section). When you receive a change notification, it will include the list or drive and item IDs that changed (depending on the notification type and whether you requested resource data). Use this info to fetch the changed item from Graph and index it, or mark that item for re-crawling. For instance, if a notification indicates a file was modified or created, your crawler can fetch that file’s metadata (and content if needed) immediately and send an update event to Kafka. Graph notifications can also alert on deletions (so you can remove or mark an item in the index as deleted). Setting up Graph webhooks requires a few steps:
- Your Micronaut service must expose an HTTPS endpoint reachable by Graph (if running on-prem, you might need to expose it through a public URL or use an Azure Function/webhook relay). During subscription creation (`POST /subscriptions`), Graph will send a validation token that your endpoint must echo back to verify ownership. Micronaut can easily handle this in a controller route.
- Subscriptions have a limited lifetime and must be renewed. For SharePoint list resources, the maximum is around 42,300 minutes (~29 days) ([Set up notifications for changes in resource data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/change-notifications-overview#:~:text=SharePoint%20list%2042%2C300%20minutes%20,under%20seven%20days)). Many other resources have shorter limits (e.g., 3 days for OneDrive in older versions), so check the latest limits. You should implement a scheduled job to renew subscriptions before they expire.
- You will need the Files.ReadWrite.All permission in Graph to subscribe to drive/list item changes ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=To%20set%20up%20and%20maintain,All)) (this is a quirk of Graph requiring a read-write scope to create the subscription, even if you only read data). Ensure your app has this if using webhooks ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Files)).
- Be mindful of subscription count limits. There may be limits per app or tenant (Graph documentation notes a default limit like 100 subscriptions per app per tenant for certain resources ([Set up notifications for changes in resource data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/change-notifications-overview#:~:text=Note))). If your tenant has hundreds of site libraries, you might not subscribe to all. One approach is to at least subscribe to high-activity sites (e.g., critical libraries), or use a rolling subscription strategy. Alternatively, subscribe at the site level if possible. (Graph doesn’t explicitly let you subscribe to “any change in any site” in one go; it’s usually per list or per drive).
- Processing Notifications: On receiving a notification, you get minimal info (typically the affected item’s ID and the list/site). Immediately use the Graph API to fetch the latest state of that item (for example, call `GET /sites/{siteId}/lists/{listId}/items/{itemId}?expand=fields`, or `.../drive/items/{itemId}` if it’s a file). Then update your index via Kafka. This should be fast – do small lookups per notification. If many changes come in at once, you might batch them or process concurrently, but ensure ordering where necessary (e.g., a delete after an update).
- Microsoft Graph Delta Queries (Change Tracking): Delta queries allow you to query a resource and get only the changes since the last query (using a token). This is useful to periodically sync changes without having to rescan everything. For SharePoint, Graph now supports delta on driveItems (document library files) and (in beta/v1.0) on listItems ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=match%20at%20L34%20,is%20approved%20and%20recommended%20with)) ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=Get%20initial%20delta)). The typical pattern is:
  - Perform an initial `GET /sites/{siteId}/lists/{listId}/items/delta` query. The first call, if no delta token is provided, will return all items in that list (much like a full read), plus an `@odata.deltaLink` URL ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=To%20get%20the%20initial%20delta,to%20above%E2%80%99s%20query)). The deltaLink is essentially a URL with a token that represents the state of the data at the time of the query ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=match%20at%20L161%20The%20token,would%20simply%20reset%20and%20return)).
  - Save that delta link (or the token within it). On the next run, call the same delta URL. If nothing changed, the response will be empty (no new or modified items) ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=previous%20query%20combined%20with%20a,specific%20delta%20token)). If items changed or were added/deleted, the response will contain only those changed records since last time, and a new deltaLink for the current state. This is very efficient: “It will return an empty result set as long as nothing changes. Once an item changes it will be returned.” ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=previous%20query%20combined%20with%20a,specific%20delta%20token)).
  - Repeat this at an appropriate interval (could be a scheduled job, e.g., every few hours or daily for each site/list). This way, you can catch changes that might be missed by webhooks (for example, if your webhook service was down or a subscription expired). It also handles large batches of changes gracefully. Microsoft’s documentation recommends delta queries as a way to “discover newly created, updated, or deleted entities without performing a full read…reducing the amount of data [requested] and likely limiting the chances of being throttled.” ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Delta%20query%2C%20also%20called%20change,with%20a%20local%20data%20store)). In other words, use delta to avoid repeating expensive full crawls.
You can use delta queries on drives as well: `GET /sites/{siteId}/drive/root/delta` will track changes in a document library’s files. This might actually be easier for document libraries than subscribing to every drive via webhook. If using delta, maintain the last delta token per library (a sketch follows below). The Graph SDK can help manage this (it exposes methods to get delta pages and follow nextLinks).
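A rough sketch of one delta pass over a site’s default document library (raw Graph REST; `TokenStore` is a hypothetical abstraction over wherever you persist the deltaLink, such as the crawl-state store described later):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: run one delta sync pass for a site's default document library.
public class DriveDeltaSync {

    /** Hypothetical store that remembers the last @odata.deltaLink per drive. */
    public interface TokenStore {
        String loadDeltaLink(String driveKey);
        void saveDeltaLink(String driveKey, String deltaLink);
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private final HttpClient http = HttpClient.newHttpClient();
    private final TokenStore tokenStore;

    public DriveDeltaSync(TokenStore tokenStore) {
        this.tokenStore = tokenStore;
    }

    public void sync(String siteId, String accessToken) throws Exception {
        String driveKey = siteId + ":defaultDrive";
        // Start from the saved deltaLink if we have one; otherwise do the initial (full) delta read.
        String url = tokenStore.loadDeltaLink(driveKey);
        if (url == null) {
            url = "https://graph.microsoft.com/v1.0/sites/" + siteId + "/drive/root/delta";
        }

        while (url != null) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET().build();
            JsonNode body = MAPPER.readTree(http.send(request, HttpResponse.BodyHandlers.ofString()).body());

            for (JsonNode item : body.path("value")) {
                if (item.has("@removed")) {
                    // Tombstone: emit a delete event for this item ID.
                } else {
                    // New or changed item: fetch details if needed and emit an add/update event.
                }
            }

            if (body.has("@odata.nextLink")) {
                url = body.get("@odata.nextLink").asText();       // more pages of changes
            } else {
                tokenStore.saveDeltaLink(driveKey, body.path("@odata.deltaLink").asText());
                url = null;                                       // caught up; reuse deltaLink next run
            }
        }
    }
}
```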
Hybrid Approach: A best practice is to combine webhooks and delta. Webhooks give you near-real-time pushes for changes, and delta gives you a reliable way to periodically reconcile state. For example, the search company Glean describes that they use drive item webhooks to know which drives changed, and then schedule an incremental crawl (delta sync) on those drives, as well as a periodic full sync: “Glean also subscribes to webhooks on drives to understand which drives need to be crawled incrementally… Glean schedules an incremental crawl over all drives with a webhook seen recently, or if the drive has not been crawled for a period of time (by default at least weekly)” ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=,change%20to%20a%20single%20item)). This means if a particular site or library has a lot of changes, the webhook ensures they sync immediately; if a site has no webhook (not subscribed) or missed events, a weekly delta crawl will still catch up on changes. You can implement a similar strategy: use webhooks for rapid updates on high-value content, and run a scheduled incremental scan (using delta or change log) for everything on a schedule (e.g., nightly or weekly) to pick up any missed updates or new sites that weren’t subscribed yet.
Micronaut’s scheduling features (`@Scheduled`) can be used to run these periodic delta sync jobs. Also, Micronaut’s reactive or async capabilities can handle processing webhook events concurrently; a webhook controller sketch follows below. Ensure thread-safety when updating the stored crawl state (delta tokens, etc.).
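A minimal Micronaut controller sketch for the webhook endpoint; the route and the downstream handling are assumptions, and the two essential behaviors are echoing the `validationToken` as plain text and acknowledging notifications quickly:

```java
import io.micronaut.core.annotation.Nullable;
import io.micronaut.http.HttpResponse;
import io.micronaut.http.MediaType;
import io.micronaut.http.annotation.Body;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Post;
import io.micronaut.http.annotation.QueryValue;

import java.util.Map;

// Sketch: receive Graph change notifications. Graph first calls this URL with a
// validationToken query parameter (subscription handshake), then POSTs notification batches.
@Controller("/graph/notifications")
public class GraphWebhookController {

    @Post(consumes = MediaType.ALL, produces = MediaType.TEXT_PLAIN)
    public HttpResponse<String> receive(@Nullable @QueryValue("validationToken") String validationToken,
                                        @Nullable @Body Map<String, Object> payload) {
        if (validationToken != null) {
            // Subscription validation: echo the token back as plain text.
            return HttpResponse.ok(validationToken);
        }
        if (payload != null && payload.containsKey("value")) {
            // payload.get("value") is a list of notifications; each carries the subscriptionId and the
            // changed resource path. Hand them off to an async worker that re-fetches the changed items
            // from Graph and publishes update events to Kafka.
        }
        // Acknowledge quickly so Graph does not retry or eventually drop the subscription.
        return HttpResponse.accepted();
    }
}
```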
To prevent redundant indexing and know what to crawl, the system should maintain a crawl history (state store of what has been seen and when). Key elements:
- Delta Tokens / Change Tokens: If using Graph delta, the `@odata.deltaLink` (or the token within it) is effectively your bookmark for that resource. Store this per resource (e.g., per list or drive). For example, maintain a table or in-memory cache where each SharePoint list (identified by siteId+listId or driveId) has its last delta token and last crawl timestamp. This allows the next incremental job to use the token to get only changes since last time ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=The%20second%20approach%20we%20will,link%20is%20the%20combination%20of)). If you get a fresh deltaLink after each sync, update the stored token. If the token expires or becomes invalid (which can happen if too much time has passed or Graph can’t find the state), fall back to a full resync of that resource to restore the delta cycle.
- Item Hash or ETag: Another technique is storing an ETag or content hash for each item you’ve indexed. SharePoint Graph objects often have an `eTag` property (especially files) ([Working with SharePoint sites in Microsoft Graph - Microsoft Graph v1.0 | Microsoft Learn](https://learn.microsoft.com/en-us/graph/api/resources/sharepoint?view=graph-rest-1.0#:~:text=%22createdDateTime%22%3A%20%222016,%7D)). You can cache the last eTag or lastModifiedDateTime that was indexed for each item ID, then on recrawl skip items whose `lastModifiedDateTime` has not changed since the last indexed version. However, doing this at scale requires a fast lookup (such as a local database keyed by item ID). This is feasible but can be heavy. More efficiently, rely on Graph to tell you what’s changed (delta queries or search queries by modified date).
- Change Log (SharePoint): SharePoint internally has a change log. If not using Graph, you could use SharePoint’s `GetChanges` API on the site or list – for example, `/_api/site/GetChanges` with a change token. This returns changes (adds/edits/deletes) since a given change token. It’s similar in concept to Graph delta, and you’d still need to track the last change token. This is an alternative if you need to use SharePoint SOAP/REST directly. In general, Graph delta is preferred as it’s simpler and integrates with the Graph permissions model ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Delta%20query%20uses%20a%20pull,notifies%20the%20application%20of%20changes)).
- Avoiding Redundant Work: Use the stored history to scope recrawls. For example, if a site hasn’t changed (no new items in delta), you can skip recrawling it. If an item was already indexed and is unchanged, there’s no need to send it to Kafka again. This saves bandwidth and index processing. The combination of webhooks + delta means you’ll mostly only handle changed items. If neither mechanism is available, you might resort to comparing timestamps (e.g., query Graph for items modified in the last day), but Graph supports `$filter=lastModifiedDateTime ge {timestamp}` only on some endpoints, and that could still be a large query. Storing a crawl history is more efficient.
- Deleted Items: Tracking deletions is crucial – otherwise the search index will show results that no longer exist. Graph delta queries will include tombstone records for deletions (often an item with an `@removed` field in the response). Webhook notifications for deletions will include a `changeType: deleted`. Your crawler should catch these and send a message to Kafka/Solr to delete or mark the document as removed. Maintain a list of deleted IDs if needed to ensure they are processed. If a delta token approach is used, it inherently reports deletions since the last check ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Delta%20query%2C%20also%20called%20change,with%20a%20local%20data%20store)). If doing manual tracking, you might occasionally do a full compare of indexed IDs vs. current items to catch any missed deletions.
In summary, design a small CrawlState store (maybe a Micronaut-managed bean that uses a database or even a file) which keeps: last crawl time, last delta token per list, and possibly a high-watermark of item count or IDs. This will guide incremental crawl runs and prevent re-indexing content that hasn’t changed.
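A minimal sketch of such a state bean (in-memory here purely for brevity – in practice back it with a database table or key-value store so the bookmarks survive restarts):

```java
import jakarta.inject.Singleton;

import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: minimal crawl-state store keyed by resource (siteId+listId or driveId).
@Singleton
public class CrawlStateStore {

    /** Per-resource bookmark: last delta link and when it was last crawled. */
    public record ResourceState(String deltaLink, Instant lastCrawled) {}

    private final Map<String, ResourceState> state = new ConcurrentHashMap<>();

    public Optional<ResourceState> get(String resourceKey) {
        return Optional.ofNullable(state.get(resourceKey));
    }

    public void recordCrawl(String resourceKey, String deltaLink) {
        state.put(resourceKey, new ResourceState(deltaLink, Instant.now()));
    }

    /** Called when Graph rejects a stale delta token: forget it so the next run does a full resync. */
    public void invalidate(String resourceKey) {
        state.remove(resourceKey);
    }
}
```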
Once the data is extracted from SharePoint, the crawler needs to publish it to Kafka (Amazon MSK is just managed Kafka, so treat it as Kafka from the client perspective). Here’s how to integrate:
- Data Structure: Define a clear schema for the messages. Using JSON is common for search indexing pipelines, or Avro for a strongly typed schema. The message should include all necessary fields for indexing in Solr. For example:

```json
{
  "id": "<unique id>",
  "siteId": "...",
  "siteName": "...",
  "listId": "...",
  "path": "/sites/Team/Shared Documents/filename.docx",
  "title": "filename.docx",
  "content": "<extracted text or summary>",
  "author": "John Doe",
  "created": "2021-01-01T12:00:00Z",
  "modified": "2021-06-01T08:30:00Z",
  "fileType": "docx",
  "url": "https://tenant.sharepoint.com/sites/Team/Shared%20Documents/filename.docx",
  "acl": ["User:AADGUID1", "Group:AADGUID2", "..."],
  "changeType": "add/update/delete"
}
```

Include a field for `changeType` or similar so the consumer knows whether this is a new/updated document or a delete (for deletes, you might omit content). The `id` could be the SharePoint item unique ID or a composite of siteId_itemId – just something stable to use as the Solr document key.
- Micronaut Kafka Client: Micronaut has integration for Kafka (e.g., using `@KafkaClient` to produce messages). You can create a producer interface, for example:

```java
@KafkaClient
public interface SharePointProducer {
    @Topic("sharepoint-crawl")
    void sendCrawlEvent(@KafkaKey String id, CrawlEvent event);
}
```

Micronaut will handle creating the Kafka producer under the hood (a sketch of the `CrawlEvent` payload type follows this list). Alternatively, use the native Kafka producer API (`org.apache.kafka.clients.producer.KafkaProducer`). Micronaut’s config can supply the Kafka bootstrap servers (point it to your MSK brokers and provide authentication config if needed, e.g., SASL credentials).
- Key and Partitioning: Use a Kafka message key – for instance, the `id` of the item – so that updates to the same document go to the same partition (ordering of events per key is then guaranteed). This ensures that if a create and an update come in sequence, they end up in order for the consumer. If ordering across the whole topic isn’t required, this per-key ordering is sufficient for correctness on a per-document basis.
- Throughput Considerations: The crawler can send messages as it goes to avoid buffering too much in memory. If using the Graph SDK in a reactive stream or simply iterating, after retrieving an item’s data, call the Kafka producer to send the JSON. You might batch multiple items into one Kafka message if that suits your indexing pipeline, but usually one item per message is easier. For large binary content, consider not sending the raw file via Kafka due to size. If needed, you could send a reference (like the SharePoint URL or an S3 pointer if you choose to copy it) and have the indexer fetch it. Since Solr can index attachments via Apache Tika, you might also send the binary in a field if needed, but ensure Kafka and Solr can handle the size (maybe compress or base64 encode if small).
- Error Handling: Be sure to handle any exceptions from Kafka (e.g., broker downtime). The Micronaut Kafka client can be configured with retries. You don’t want to lose data, so implement a retry or DLQ (Dead Letter Queue) for messages that consistently fail to send. Usually, Kafka is durable enough that as long as the brokers are up, the send will succeed. You can also send with `acks=all` to ensure each message is committed.
- Consumption in Solr: (Though the question doesn’t focus on the Solr side, for completeness:) You likely have a separate consumer application that reads from the Kafka topic and updates Solr (perhaps using SolrJ or its API). Make sure the message format aligns with what that consumer expects. Using a well-defined schema (even registering it in a Schema Registry if using Avro) can help decouple the crawler from the indexer.
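For reference, here is a sketch of the `CrawlEvent` payload type referenced by the producer above. The field names simply mirror the JSON example; swapping in an Avro-generated class with a schema registry would work the same way from the producer’s point of view.

```java
import java.util.List;

// Sketch: payload type for crawl events published to Kafka (mirrors the JSON schema above).
public record CrawlEvent(
        String id,
        String siteId,
        String siteName,
        String listId,
        String path,
        String title,
        String content,
        String author,
        String created,
        String modified,
        String fileType,
        String url,
        List<String> acl,          // e.g. "User:<AAD object id>", "Group:<AAD object id>"
        String changeType          // "add", "update", or "delete"
) {}
```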
Managing incremental updates efficiently and staying within API limits is crucial:
- Use Delta and Webhooks to Minimize Load: As described, rely on change notifications and delta queries so you’re not repeatedly pulling full data. “Delta query…requests only data that changed since the last request…reducing the cost of the request and likely limit the chances of the requests being throttled.” ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Using%20delta%20query%20helps%20you,of%20the%20requests%20being%20throttled)) This means your incremental crawls should usually be dealing with a small subset of items (only those changed). This keeps your Graph API calls light.
- Exponential Backoff on Throttling: Microsoft Graph enforces throttling if you exceed certain thresholds (which are not always documented, but bursts of too many requests will return 429). Implement a handler: if a 429 or 503 is returned with a `Retry-After` header, pause for that duration. If none is given, use an exponential backoff (e.g., wait 1s, then 2s, 4s, etc.); a helper sketch follows this list. The Graph SDK may handle some of this for you, but be prepared to catch `GraphServiceException`. Design the crawler to be resilient – e.g., if a delta sync fails due to throttling, it should retry after waiting, not skip those changes.
- Parallelism vs. Throttling: Find a balance in how many Graph calls you do in parallel. You might parallelize by site or list, but Graph may count all calls from the app together. One approach is to use a small thread pool (say 5–10 threads) for Graph calls and tune based on observed throttle behavior. Also, combine calls where possible – for instance, use batch requests (Graph supports a batch API to put multiple calls in one HTTP request) if you need to fetch many individual items by ID triggered from notifications. But note the batch has its own limits (20 sub-requests per batch request).
- Subscription Management: Keep track of your active subscriptions (if using webhooks). You might store them (subscription ID, resource, expiration) in a DB or config. On service restart, you may need to re-subscribe or at least know the existing ones. A best practice is to have an automated process to renew or recreate subscriptions regularly (perhaps a daily check). Also handle notification validation and security (Graph posts a validation token and expects the response in plain text; Graph can also sign notifications if you set up client certificates for verification – consider this for security).
- Initial vs. Ongoing Crawl Infrastructure: The initial full crawl might be a one-time heavy operation. It could even be implemented as a separate tool or one-off job. The ongoing crawler is a continuously running service. Ensure that once the full import is done, the system transitions to incremental mode gracefully. You might include a flag in each message indicating whether it’s from the full crawl or incremental, though the downstream likely doesn’t care.
- Dealing with New Sites or Libraries: If new site collections are created in SharePoint, a full crawl app with Sites.FullControl should discover them (Graph search for sites will eventually list them). To be proactive, you can periodically run a site discovery (e.g., a daily query for all sites) to catch new ones, then initiate a crawl on those. If using the Sites.Selected permission, you’d need an external trigger (an admin tells the crawler about the new site and you add it to the crawl list). Make sure to handle those so no content is missed.
- Micronaut Specific: Leverage Micronaut’s strengths – its HTTP client can be used if not using the Graph SDK (the Micronaut HTTP client is reactive and can easily call Graph endpoints with an OAuth token). Micronaut’s dependency injection will help manage singletons like the Graph client or Kafka client. Also consider using Micronaut’s configuration to externalize things like API keys, the Okta domain, Kafka brokers, etc., and its built-in support for AWS Parameter Store or Vault if needed for secrets.
- Testing & Debugging: Use Microsoft Graph Explorer or Postman with your app’s credentials to test queries (making sure your app’s permissions are correctly consented). This helps verify that you can list sites, get items, delta links, etc. For Kafka, test end-to-end by sending a sample message to a dev topic and ensuring the Solr indexing picks it up.
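A sketch of the retry behavior described in the throttling item above – a generic helper you could wrap around any Graph HTTP call; the attempt count and base delay are arbitrary starting points to tune:

```java
import java.net.http.HttpResponse;
import java.util.concurrent.Callable;

// Sketch: retry a Graph HTTP call on 429/503, honoring Retry-After when present,
// otherwise falling back to exponential backoff (1s, 2s, 4s, ...).
public final class GraphRetry {

    public static HttpResponse<String> withBackoff(Callable<HttpResponse<String>> call) throws Exception {
        int maxAttempts = 5;
        long backoffMillis = 1_000;
        for (int attempt = 1; ; attempt++) {
            HttpResponse<String> response = call.call();
            int status = response.statusCode();
            if (status != 429 && status != 503) {
                return response;                     // success or a non-throttling error
            }
            if (attempt == maxAttempts) {
                return response;                     // give up; let the caller handle it
            }
            // Graph sends Retry-After in seconds; fall back to exponential backoff if absent.
            long waitMillis = response.headers().firstValue("Retry-After")
                    .map(GraphRetry::parseSeconds)
                    .orElse(0L) * 1_000;
            if (waitMillis <= 0) {
                waitMillis = backoffMillis;
                backoffMillis *= 2;
            }
            Thread.sleep(waitMillis);
        }
    }

    private static long parseSeconds(String value) {
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return 0L;
        }
    }
}
```

Full-crawl page fetches, delta passes, and per-notification lookups can all be routed through a helper like this so throttling is handled in one place.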
By following this approach – an initial comprehensive crawl, followed by event-driven and token-based incremental updates, with careful tracking of state – you’ll have an efficient crawler that keeps your search index in sync with SharePoint. This Java/Micronaut solution, combined with Microsoft Graph (for SharePoint data) and Okta (for authentication), will be scalable and maintainable. It embraces Microsoft’s latest APIs and change tracking capabilities to avoid reprocessing unchanged content and to stay within rate limits, while reliably delivering content messages to Kafka for indexing.
References:
- Microsoft Graph supports full access to SharePoint sites, lists and document libraries via its REST API and SDK ([Working with SharePoint sites in Microsoft Graph - Microsoft Graph v1.0 | Microsoft Learn](https://learn.microsoft.com/en-us/graph/api/resources/sharepoint?view=graph-rest-1.0#:~:text=The%20SharePoint%20API%20in%20Microsoft,supports%20the%20following%20core%20scenarios)). This allows reading all items and files (with application permissions for a service app).
- Microsoft Graph delta queries enable tracking changes (new/updated/deleted items) without a full recrawl, reducing data transfer and avoiding throttling ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Delta%20query%2C%20also%20called%20change,with%20a%20local%20data%20store)) ([Use Microsoft Graph delta approach to increase performance getting SharePoint list items – Markus Moeller's SharePoint and M365Dev Blog](https://mmsharepoint.wordpress.com/2022/08/22/use-microsoft-graph-delta-approach-to-increase-performance-getting-sharepoint-list-items/#:~:text=previous%20query%20combined%20with%20a,specific%20delta%20token)). After an initial `/delta` call, using the returned delta token will yield only changes since the last sync.
- Graph change notifications (webhooks) can deliver real-time updates for SharePoint list items and drive items. Subscribing to list resources (such as a document library) will notify the app of any changes in that list ([Set up notifications for changes in resource data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/change-notifications-overview#:~:text=list%20under%20a%20SharePoint%20site,id%7D%2Fconversations%60)). Webhooks combined with periodic incremental crawls ensure up-to-date indexing with minimal delay ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=,change%20to%20a%20single%20item)).
- To use Graph webhooks on OneDrive/SharePoint content, your app needs sufficient permissions (e.g., Files.ReadWrite.All) and you must renew subscriptions periodically (up to ~30 days for SharePoint lists) ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=To%20set%20up%20and%20maintain,All)) ([Set up notifications for changes in resource data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/change-notifications-overview#:~:text=SharePoint%20list%2042%2C300%20minutes%20,under%20seven%20days)). Designing the crawler to handle subscription life cycle and batched change processing is important for long-term reliability.