How to crawl the full SharePoint Online tree using a Java-based solution - krickert/search-api GitHub Wiki

This guide describes how to crawl the full SharePoint Online tree using a Java-based solution (Micronaut framework) that integrates with Okta-based authentication and Azure ACLs. The approach covers an initial full crawl for seeding, real-time updates via change tracking, and best practices for tracking crawl history to enable efficient recrawls, along with recommended tools and libraries.

Overview

Crawling a SharePoint Online tenant in Java (Micronaut framework) involves using Microsoft’s APIs to enumerate content, then capturing changes continuously. The solution will use a service account (or app principal) with broad read access, authenticating via Okta (as the Identity Provider federated with Azure AD) to obtain the necessary OAuth tokens. The crawler will perform an initial full crawl to seed a search index, then subscribe to real-time changes (via Microsoft Graph change notifications or delta queries) for incremental updates. It will maintain a crawl history state to avoid re-fetching unchanged data, and publish structured content to AWS MSK (Kafka) for downstream indexing in Solr. Below is a breakdown of the approach, including authentication, crawl implementation, Kafka integration, and best practices (incremental updates, rate limiting, etc.), with recommended libraries for each part.

Authentication & Access Setup (Okta + Azure AD)

Azure AD App Registration: Start by registering an Azure AD application with the permissions needed to read SharePoint content. For full tenant-wide crawling, grant app-only permissions such as Sites.Read.All or Sites.FullControl.All (the latter may be needed to discover all sites) ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Sites)) ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Files)). You can also use the granular Sites.Selected permission to limit access to specific site collections if full access is a concern ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=)). The app principal (service account) should be given read access across the tenant (either via the app permissions or by being added as a site admin on all sites, though app permissions are the preferred approach).

Okta Integration: If your organization uses Okta as the identity provider for Office 365, you’ll need to obtain an access token for Microsoft Graph. In many cases Azure AD is federated with Okta for user authentication, but for app-only authentication you can typically bypass Okta and call Azure AD’s token endpoint directly using the client credentials flow, presenting the app registration’s client ID and secret (or certificate) to obtain a Graph access token.
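
As a minimal sketch of that client credentials flow (assuming plain java.net.http and placeholder tenant/client values; the Graph SDK described below can also acquire tokens for you), the request goes straight to Azure AD’s v2.0 token endpoint:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class GraphTokenClient {

        // Placeholder values: supply your own tenant ID, client ID, and client secret.
        private static final String TENANT_ID = "<tenant-id>";
        private static final String CLIENT_ID = "<client-id>";
        private static final String CLIENT_SECRET = "<client-secret>";

        /** Requests an app-only Graph access token via the client credentials grant. */
        public static String requestToken() throws Exception {
            String body = "grant_type=client_credentials"
                    + "&client_id=" + CLIENT_ID
                    + "&client_secret=" + CLIENT_SECRET
                    + "&scope=https%3A%2F%2Fgraph.microsoft.com%2F.default";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://login.microsoftonline.com/" + TENANT_ID + "/oauth2/v2.0/token"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The response JSON carries "access_token" and "expires_in"; parse it with your JSON library of choice.
            return response.body();
        }
    }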

Microsoft Graph Java SDK: Use the official [Microsoft Graph SDK for Java](https://github.com/microsoftgraph/msgraph-sdk-java) for convenient API access. This SDK can handle authentication via a provider you supply (you’d implement an AuthenticationProvider that supplies the OAuth token). The Graph SDK provides high-level classes to enumerate sites, lists, list items, drive items, etc. – which is ideal for crawling. (Alternatively, you can use Micronaut’s HTTP client or Apache HttpClient to call the REST endpoints directly, but the SDK will save time.) Ensure the token has the “Sites.Read.All” (or higher) scope so Graph can read sites, lists, and files. With proper permissions, Graph supports access to SharePoint sites, lists, and drives (document libraries), including reading list items and documents ([Working with SharePoint sites in Microsoft Graph - Microsoft Graph v1.0 | Microsoft Learn](https://learn.microsoft.com/en-us/graph/api/resources/sharepoint?view=graph-rest-1.0#:~:text=The%20SharePoint%20API%20in%20Microsoft,supports%20the%20following%20core%20scenarios)).
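
If you use the Graph SDK, a minimal sketch of wiring it up with an app-only credential looks like the following (assuming the azure-identity library and the v6-style GraphServiceClient constructor; v5 of the SDK wraps the credential in a TokenCredentialAuthProvider instead):

    import com.azure.identity.ClientSecretCredential;
    import com.azure.identity.ClientSecretCredentialBuilder;
    import com.microsoft.graph.serviceclient.GraphServiceClient;

    public class GraphClientFactory {

        /** Builds an app-only Graph client using the client credentials flow (IDs and secret are placeholders). */
        public static GraphServiceClient create(String tenantId, String clientId, String clientSecret) {
            ClientSecretCredential credential = new ClientSecretCredentialBuilder()
                    .tenantId(tenantId)
                    .clientId(clientId)
                    .clientSecret(clientSecret)
                    .build();

            // The ".default" scope resolves to whatever application permissions were granted (e.g., Sites.Read.All).
            return new GraphServiceClient(credential, "https://graph.microsoft.com/.default");
        }
    }

Registering the resulting client as a Micronaut singleton lets the full crawl and the incremental jobs share one instance.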

Handling Azure ACLs: To enforce security trimming in search results, capture the Access Control Lists (ACLs) for each item. Azure AD/SharePoint permissions can be retrieved via Graph or SharePoint API. For example, Graph’s drive and list item resources have a permissions relationship that lists who has access (e.g., users, Azure AD groups, sharing links) – you can call /sites/{siteId}/drives/{driveId}/items/{itemId}/permissions for each file. This may show individual permissions and share links. However, for full fidelity (including role inheritance and group membership), you might need to call SharePoint’s REST API (or Microsoft Graph beta) to get role assignments. An approach is to use the SharePoint REST endpoint /_api/web/GetFileById('<id>')/ListItemAllFields/RoleAssignments which returns the principals and roles with access. This extra ACL info can be included in the data sent to Kafka so the indexing process can store it for search trimming. Ensure your service principal has rights to read permissions (the Sites.FullControl.All permission will allow reading permissions, whereas read-only might not expose all ACL info ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Therefore%2C%20Glean%20requires%20FullControl%20to,is%20best%20for%20our%20customers))).
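
As a rough illustration of the Graph call mentioned above (a sketch using raw REST with placeholder IDs; the same request is available through the SDK), fetching one drive item’s permissions and mapping them into the ACL field might look like:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AclFetcher {

        /** Fetches the Graph permissions collection for one drive item (siteId/driveId/itemId are placeholders). */
        public static String fetchItemPermissions(String accessToken, String siteId,
                                                  String driveId, String itemId) throws Exception {
            String url = "https://graph.microsoft.com/v1.0/sites/" + siteId
                    + "/drives/" + driveId + "/items/" + itemId + "/permissions";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Each entry in "value" describes a grant (user, group, or sharing link);
            // map these to "User:<id>" / "Group:<id>" strings for the acl field of the Kafka message.
            return response.body();
        }
    }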

Initial Full Crawl (Seeding the Index)

The initial crawl will traverse the entire SharePoint Online hierarchy accessible to the service account and extract all content and metadata for indexing. Key steps for the full crawl:

  1. Discover All Site Collections: Use Microsoft Graph to list sites. Graph provides an endpoint to search for sites: GET /sites?search=* which can return all site collections (and perhaps subsites) to which the app has access. Another option is to query the SharePoint Admin API to get all site URLs, but with Sites.FullControl.All, Graph can auto-discover new sites ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=Sites)). If using Sites.Selected, you would maintain a list of site IDs/URLs to crawl (perhaps provided to your app). Retrieve the list of site IDs and URLs as the starting points.

  2. Enumerate Content in Each Site: For each SharePoint site:

    • Lists and Libraries: List all lists (GET /sites/{siteId}/lists). Identify document libraries (list baseTemplate == 101) vs. other lists. For document libraries, you can either use the Lists API or the Drives API. Document libraries are exposed as Drives in Graph, which is convenient for files. For example, the default Documents library of a site can be accessed via GET /sites/{siteId}/drive/root/children (root folder items) and recursively enumerated. Other document libraries will appear in GET /sites/{siteId}/drives. You can iterate through each drive’s folders and files. For each file (driveItem), collect its metadata (name, URL, lastModifiedDateTime, etc.) and if needed download the content stream via GET /sites/{siteId}/drive/items/{itemId}/content. For list items (non-document lists or site pages library), use GET /sites/{siteId}/lists/{listId}/items with expand=fields to get list columns. This returns items with their field values.

    • SharePoint Pages: SharePoint modern pages are typically stored in a Site Pages library (which is a document library). They can be fetched like other drive items in that library. Classic pages might be in the “Pages” library. Ensure to include those libraries in the crawl if present so that site pages (HTML content) can be indexed.

    • Permissions (ACLs): As mentioned, retrieve permissions for each item or each library. You might do this on a second pass or lazily, since pulling ACLs for every item can be heavy. One strategy is to fetch at least the library-level or site-level permission info (which applies to all items unless inheritance is broken), and only fetch item-specific permissions when an item has a unique ACL. Microsoft Graph does not directly expose a flag indicating whether a driveItem has unique permissions, so you may need to call SharePoint REST for HasUniqueRoleAssignments. Depending on your indexing needs, you could index security at the library level (most content inherits it) and handle exceptions for items with unique ACLs.

    • Data Extraction: For files (like PDFs, Word docs), you may need their text content for full-text search. You can either extract text in the crawler (e.g., download the file and run a text extractor like Apache Tika, then send the text to Kafka) or send the raw file content through Kafka to be parsed later. Often it’s better to do text extraction in the indexing pipeline to offload work from the crawler. At minimum, send the file’s URL or an ID so the indexer can fetch content if needed. If the crawler should handle it, be mindful of binary size when publishing to Kafka. You might convert documents to text and include the text in the message, or store the file in a blob store and send a reference.

  3. Pagination & Throttling: The Graph API will page results (often 200-1000 items per page for drives and lists). The Graph SDK or REST calls will return an @odata.nextLink when more results are available. Continue paging until all items are retrieved. Implement throttle handling – if Graph returns HTTP 429 or 503 with "Retry-After", respect that and pause accordingly. During a full crawl of a large tenant, it’s easy to hit Graph’s throughput limits, so you may need to crawl in segments (e.g., one site at a time, with slight delays or concurrency control). Microsoft notes that using a delta approach (explained below) can reduce the data transfer and risk of throttling ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Delta%20query%2C%20also%20called%20change,with%20a%20local%20data%20store)) ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Using%20delta%20query%20helps%20you,of%20the%20requests%20being%20throttled)). For the initial crawl, you have to fetch everything once, but do so efficiently: use $select to only retrieve fields you need (e.g., file name, URL, lastModified, etc., plus fields for lists) and $expand=fields judiciously to avoid extra calls per item.

  4. Parallelism: Micronaut can run tasks asynchronously – consider crawling multiple sites in parallel threads to speed up the seed crawl, but be cautious not to overrun API limits. A reasonable approach is to process one site (or a few small sites) at a time, and within a site parallelize fetching different document libraries concurrently. The Kafka backend can absorb out-of-order messages, but ensure you don’t overload the network. Monitor memory as well if you store a lot of data before sending to Kafka; it’s better to stream items out as you retrieve them.
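
Below is a condensed sketch of steps 1-3: site discovery, paging via @odata.nextLink, and Retry-After handling, using raw Graph REST calls. The endpoint paths are the ones referenced above; the extractNextLink helper and the overall class shape are illustrative only.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SharePointCrawler {

        private final HttpClient http = HttpClient.newHttpClient();
        private final String token; // app-only Graph token (see the authentication section)

        public SharePointCrawler(String token) {
            this.token = token;
        }

        /** Walks every page of a Graph collection URL, honoring Retry-After on throttling. */
        public void crawlPaged(String url) throws Exception {
            String next = url;
            while (next != null) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(next))
                        .header("Authorization", "Bearer " + token)
                        .GET()
                        .build();
                HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

                if (response.statusCode() == 429 || response.statusCode() == 503) {
                    long waitSeconds = response.headers().firstValueAsLong("Retry-After").orElse(30);
                    Thread.sleep(waitSeconds * 1000);
                    continue; // retry the same page after backing off
                }

                // Parse response.body() with your JSON library: publish each entry in "value"
                // to Kafka, then follow "@odata.nextLink" if present.
                next = extractNextLink(response.body()); // hypothetical JSON helper
            }
        }

        /** Seed crawl: list sites, then each site's drives, then each drive's items. */
        public void fullCrawl() throws Exception {
            crawlPaged("https://graph.microsoft.com/v1.0/sites?search=*&$select=id,webUrl,displayName");
            // For each discovered siteId:
            //   crawlPaged(".../sites/{siteId}/drives");
            //   crawlPaged(".../drives/{driveId}/root/children?$select=id,name,webUrl,lastModifiedDateTime");
            //   crawlPaged(".../sites/{siteId}/lists/{listId}/items?expand=fields");
        }

        private String extractNextLink(String json) {
            return null; // placeholder: return the @odata.nextLink value, or null when paging is done
        }
    }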

Each piece of content retrieved (documents, list items, pages) should be transformed into a structured JSON (or Avro) message including key metadata: e.g., id, siteId, path/url, title, lastModified, created, author, file type, content (or content excerpt), and ACLs (list of allowed users/groups). Then publish it to Kafka (details below). This full crawl will feed the search indexer with an initial complete dataset.

Real-Time Updates (Graph Change Notifications & Delta Queries)

After seeding, the crawler should switch to incremental update mode. This ensures the search index stays up-to-date as users add or modify SharePoint content. There are two complementary strategies for this: Graph change notifications (webhooks) for push-based updates, and delta queries for pull-based incremental sync. A combined approach can offer real-time responsiveness with reliability:

Hybrid Approach: A best practice is to combine webhooks and delta. Webhooks give you near-real-time pushes for changes, and delta gives you a reliable way to periodically reconcile state. For example, the search company Glean describes that they use drive item webhooks to know which drives changed, and then schedule an incremental crawl (delta sync) on those drives, as well as a periodic full sync: “Glean also subscribes to webhooks on drives to understand which drives need to be crawled incrementally… Glean schedules an incremental crawl over all drives with a webhook seen recently, or if the drive has not been crawled for a period of time (by default at least weekly)” ([Microsoft OneDrive/Sharepoint Connector | Glean Help Center](https://help.glean.com/en/articles/6974822-microsoft-onedrive-sharepoint-connector#:~:text=,change%20to%20a%20single%20item)). This means if a particular site or library has a lot of changes, the webhook ensures they sync immediately; if a site has no webhook (not subscribed) or missed events, a weekly delta crawl will still catch up on changes. You can implement a similar strategy: use webhooks for rapid updates on high-value content, and run a scheduled incremental scan (using delta or change log) for everything on a schedule (e.g., nightly or weekly) to pick up any missed updates or new sites that weren’t subscribed yet.
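
On the webhook side, creating a Graph change-notification subscription for a single list or library is one POST to /subscriptions. The sketch below uses raw REST with placeholder IDs and a placeholder notification URL; SharePoint list subscriptions support the "updated" change type and must be renewed before they expire:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.OffsetDateTime;
    import java.time.format.DateTimeFormatter;

    public class SubscriptionManager {

        /** Subscribes to change notifications for one SharePoint list/library (IDs and URL are placeholders). */
        public static String subscribe(String accessToken, String siteId, String listId,
                                       String notificationUrl) throws Exception {
            String expiration = OffsetDateTime.now().plusDays(7)
                    .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);

            // SharePoint list subscriptions use the "updated" change type.
            String body = """
                    {
                      "changeType": "updated",
                      "notificationUrl": "%s",
                      "resource": "sites/%s/lists/%s",
                      "expirationDateTime": "%s",
                      "clientState": "my-secret-client-state"
                    }
                    """.formatted(notificationUrl, siteId, listId, expiration);

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://graph.microsoft.com/v1.0/subscriptions"))
                    .header("Authorization", "Bearer " + accessToken)
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Persist the returned subscription id and expirationDateTime so the subscription can be renewed before it lapses.
            return response.body();
        }
    }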

Micronaut’s scheduling features (@Scheduled) can be used to run these periodic delta sync jobs. Also, Micronaut’s reactive or async capabilities can handle processing webhook events concurrently. Ensure thread-safety when updating the stored crawl state (delta tokens etc.).
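
A minimal sketch of such a scheduled reconciliation job in Micronaut follows; SharePointCrawler, CrawlStateStore, and the deltaSync method are hypothetical collaborators (one possible shape for the state store is sketched in the next section), and the cron and delay values are arbitrary:

    import io.micronaut.scheduling.annotation.Scheduled;
    import jakarta.inject.Singleton;

    @Singleton
    public class DeltaSyncJob {

        // Hypothetical collaborators: the crawler that runs delta queries and the store of delta tokens.
        private final SharePointCrawler crawler;
        private final CrawlStateStore crawlState;

        public DeltaSyncJob(SharePointCrawler crawler, CrawlStateStore crawlState) {
            this.crawler = crawler;
            this.crawlState = crawlState;
        }

        /** Nightly reconciliation: run a delta query for every tracked drive/list and publish any changes. */
        @Scheduled(cron = "0 0 2 * * *")
        void nightlyDeltaSync() {
            crawlState.trackedDrives().forEach(state ->
                    crawler.deltaSync(state.resourceId(), state.deltaToken()));
        }

        /** Frequent pass over drives that recently received a webhook notification. */
        @Scheduled(fixedDelay = "5m")
        void webhookTriggeredSync() {
            crawlState.drivesWithPendingNotifications().forEach(state ->
                    crawler.deltaSync(state.resourceId(), state.deltaToken()));
        }
    }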

Tracking Crawl History & State

To prevent redundant indexing and to know what to crawl next, the system should maintain a crawl history (a state store recording what has been seen and when).

Design a small CrawlState store (for example, a Micronaut-managed bean backed by a database or even a file) that keeps the last crawl time, the last delta token per list or drive, and possibly a high-watermark of item counts or IDs. This state guides incremental crawl runs and prevents re-indexing content that hasn’t changed.
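
A hedged sketch of what that store could look like (the record fields and method names are illustrative; back the interface with a database table, DynamoDB, or even a JSON file):

    import java.time.Instant;
    import java.util.List;
    import java.util.Optional;

    /** One row of crawl state per drive or list (names are illustrative). */
    record CrawlState(
            String resourceId,      // driveId or listId
            String siteId,
            Instant lastFullCrawl,  // when the resource was last fully enumerated
            Instant lastDeltaSync,  // when the last incremental pass completed
            String deltaToken       // the @odata.deltaLink token to resume from
    ) {}

    /** Minimal persistence contract for crawl state. */
    interface CrawlStateStore {

        Optional<CrawlState> find(String resourceId);

        void save(CrawlState state);

        /** Resources whose webhook notifications arrived since their last delta sync. */
        List<CrawlState> drivesWithPendingNotifications();

        /** Everything known to the crawler, for scheduled reconciliation passes. */
        List<CrawlState> trackedDrives();
    }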

Kafka Integration (Publishing to AWS MSK)

Once the data is extracted from SharePoint, the crawler needs to publish it to Kafka (Amazon MSK is just managed Kafka, so treat it as Kafka from the client perspective). Here’s how to integrate:

  • Data Structure: Define a clear schema for the messages. Using JSON is common for search indexing pipelines, or Avro for a strongly typed schema. The message should include all necessary fields for indexing in Solr. For example:

    {
      "id": "<unique id>",
      "siteId": "...", "siteName": "...", "listId": "...", "path": "/sites/Team/Shared Documents/filename.docx",
      "title": "filename.docx",
      "content": "<extracted text or summary>",
      "author": "John Doe",
      "created": "2021-01-01T12:00:00Z",
      "modified": "2021-06-01T08:30:00Z",
      "fileType": "docx",
      "url": "https://tenant.sharepoint.com/sites/Team/Shared%20Documents/filename.docx",
      "acl": ["User:AADGUID1", "Group:AADGUID2", "..."],
      "changeType": "add/update/delete"
    }

    Include a field for changeType or similar so the consumer knows whether this is a new/updated document or a delete (for deletes, you might omit content). The id could be the SharePoint item’s unique ID or a composite such as siteId_itemId; anything stable enough to serve as the Solr document key.

  • Micronaut Kafka Client: Micronaut has integration for Kafka (e.g., using @KafkaClient to produce messages). You can create a producer class, for example:

    @KafkaClient
    public interface SharePointProducer {
        @Topic("sharepoint-crawl")
        void sendCrawlEvent(@KafkaKey String id, CrawlEvent event);
    }

    Micronaut will handle creating the Kafka producer under the hood. Alternatively, use the native Kafka Producer API (org.apache.kafka.clients.producer.KafkaProducer). Micronaut’s config can supply the Kafka bootstrap servers (point it to your MSK brokers and provide authentication config if needed, e.g., SASL credentials); a producer configuration sketch for MSK appears after this list.

  • Key and Partitioning: Use a Kafka message key – for instance, the id of the item – so that updates to the same document go to the same partition (ordering of events per key is then guaranteed). This ensures if a create and an update come in sequence, they end up in order for the consumer. If ordering across the whole topic isn’t required, this per-key ordering is sufficient for correctness on a per-document basis.

  • Throughput Considerations: The crawler can send messages as it goes to avoid buffering too much in memory. If using the Graph SDK in a reactive stream or simply iterating, after retrieving an item’s data, call the Kafka producer to send the JSON. You might batch multiple items into one Kafka message if that suits your indexing pipeline, but usually one item per message is easier. For large binary content, consider not sending the raw file via Kafka due to size. If needed, you could send a reference (like the SharePoint URL or an S3 pointer if you choose to copy it) and have the indexer fetch it. Since Solr can index attachments via Apache Tika, you might also send the binary in a field if needed, but ensure Kafka and Solr can handle the size (maybe compress or base64 encode if small).

  • Error Handling: Ensure to handle any exceptions from Kafka (e.g., broker downtime). Micronaut Kafka client can be configured with retries. You don’t want to lose data, so implement a retry or DLQ (Dead Letter Queue) for messages that consistently fail to send. Usually, Kafka is durable enough that as long as the brokers are up, the send will succeed. You can also send with acks=all to ensure it’s committed.

  • Consumption in Solr: (Though the question doesn’t focus on Solr side, for completeness:) You likely have a separate consumer application that reads from the Kafka topic and updates Solr (perhaps using SolrJ or API). Make sure the message format aligns with what that consumer expects. Using a well-defined schema (even registering it in Schema Registry if using Avro) can help decouple the crawler from the indexer.
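
Referring back to the producer bullet above, if you configure the native Kafka producer yourself, the MSK connection details are ordinary Kafka client properties. The sketch below assumes IAM authentication via the aws-msk-iam-auth library and a placeholder broker address; SCRAM or mTLS clusters follow the same pattern with different SASL settings:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class MskProducerFactory {

        /** Builds a producer for an MSK cluster using IAM auth (the broker address is a placeholder). */
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.mycluster.kafka.us-east-1.amazonaws.com:9098");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");              // don't lose crawl events
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

            // MSK IAM authentication (requires the aws-msk-iam-auth jar on the classpath).
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "AWS_MSK_IAM");
            props.put("sasl.jaas.config", "software.amazon.msk.auth.iam.IAMLoginModule required;");
            props.put("sasl.client.callback.handler.class",
                    "software.amazon.msk.auth.iam.IAMClientCallbackHandler");

            return new KafkaProducer<>(props);
        }
    }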

Best Practices for Incremental Crawling & Rate Limits

Managing incremental updates efficiently and staying within API limits is crucial:

  • Use Delta and Webhooks to Minimize Load: As described, rely on change notifications and delta queries so you’re not repeatedly pulling full data. “Delta query…requests only data that changed since the last request…reducing the cost of the request and likely limit the chances of the requests being throttled.” ([Use delta query to track changes in Microsoft Graph data - Microsoft Graph | Microsoft Learn](https://learn.microsoft.com/en-us/graph/delta-query-overview#:~:text=Using%20delta%20query%20helps%20you,of%20the%20requests%20being%20throttled)) This means your incremental crawls should usually be dealing with a small subset of items (only those changed). This keeps your Graph API calls light.

  • Exponential Backoff on Throttling: Microsoft Graph enforces throttling if you exceed certain thresholds (which are not always documented, but bursts of too many requests will generally return 429). Implement a handler: if a 429 or 503 is returned with a Retry-After header, pause for that duration. If none is given, use an exponential backoff (e.g., wait 1s, then 2s, 4s, etc.). The Graph SDK may handle some of this for you, but be prepared to catch GraphServiceException. Design the crawler to be resilient: if a delta sync fails due to throttling, it should retry after waiting, not skip those changes. A minimal backoff helper is sketched after this list.

  • Parallelism vs. Throttling: Find a balance in how many Graph calls you do in parallel. You might parallelize by site or list, but Graph might count all calls from the app together. One approach is to use a small thread pool (say 5-10 threads) for Graph calls and tune based on observed throttle behavior. Also, combine calls where possible – for instance, use batch requests (Graph supports a batch API to put multiple calls in one HTTP request) if you need to fetch many individual items by ID triggered from notifications. But note the batch has its own limits (20 sub-requests per batch request).

  • Subscription Management: Keep track of your active subscriptions (if using webhooks). You might store them (subscription ID, resource, expiration) in a DB or config. On service restart, you may need to re-subscribe or at least know existing ones. A best practice is to have an automated process to renew or recreate subscriptions regularly (perhaps a daily check). Also handle the notification validation and security (Graph posts a validation token and expects response in plaintext; Graph also signs notifications if you set up client certificates for verification – consider this for security).

  • Initial vs Ongoing Crawl Infrastructure: The initial full crawl might be a one-time heavy operation. It could even be implemented as a separate tool or one-off job. The ongoing crawler is a continuously running service. Ensure once the full import is done, the system transitions to the incremental mode gracefully. You might include a flag in each message indicating whether it’s from full crawl or incremental, though the downstream likely doesn’t care.

  • Dealing with New Sites or Libraries: If new site collections are created in SharePoint, a full crawl app with Sites.FullControl should discover them (Graph search for sites might eventually list them). To be proactive, you can periodically run a site discovery (e.g., daily query for all sites) to catch new ones, then initiate a crawl on those. If using Sites.Selected permission, you’d need an external trigger (admin tells the crawler about the new site and you add it to crawl list). Make sure to handle those so no content is missed.

  • Micronaut Specific: Leverage Micronaut’s strengths – its HTTP client can be used if not using Graph SDK (Micronaut HTTP client is reactive and can easily call Graph endpoints with OAuth token). Micronaut’s dependency injection will help manage singletons like the Graph client or Kafka client. Also consider using Micronaut’s configuration to externalize things like API keys, Okta domain, Kafka brokers, etc., and its built-in support for AWS parameter store or Vault if needed for secrets.

  • Testing & Debugging: Use Microsoft Graph Explorer or Postman with your app’s credentials to test queries (making sure your app’s permissions are correctly consented). This helps verify that you can list sites, get items, delta links, etc. For Kafka, test the end-to-end by sending a sample message to a dev topic and ensure the Solr indexing picks it up.

By following this approach – an initial comprehensive crawl, followed by event-driven and token-based incremental updates, with careful tracking of state – you’ll have an efficient crawler that keeps your search index in sync with SharePoint. This Java/Micronaut solution, combined with Microsoft Graph (for SharePoint data) and Okta (for authentication), will be scalable and maintainable. It embraces Microsoft’s latest APIs and change tracking capabilities to avoid reprocessing unchanged content and to stay within rate limits, while reliably delivering content messages to Kafka for indexing.
