Notes on protobuf serialization in Solr - krickert/search-api GitHub Wiki

Here's how to add specialized handling for google.protobuf.Struct to your existing serializer (ProtobufSolrSerializer) so that dynamic fields, nested structures, and lists are handled robustly:

Improved flattenFields method with special Struct handling:

The complete serializer, including the Struct-aware flattenFields logic, follows:

package com.krickert.search.solr.protobuf;

import com.google.protobuf.*;
import org.apache.solr.common.SolrInputDocument;

import java.time.Instant;
import java.util.List;
import java.util.Map;

public class ProtobufSolrSerializer {

    private static final String FIELD_DELIMITER = "__";

    public SolrInputDocument serialize(Message message, String typeName) {
        SolrInputDocument doc = new SolrInputDocument();

        flattenFields("", message, doc);

        doc.addField("proto_type_s", typeName);
        doc.addField("proto_blob", message.toByteArray());

        return doc;
    }

    private void flattenFields(String prefix, Message message, SolrInputDocument doc) {
        if (message instanceof Struct) {
            handleStruct(prefix, (Struct) message, doc);
            return;
        }

        for (Map.Entry<Descriptors.FieldDescriptor, Object> fieldEntry : message.getAllFields().entrySet()) {
            Descriptors.FieldDescriptor field = fieldEntry.getKey();
            Object value = fieldEntry.getValue();
            
            String fieldName = prefix.isEmpty() ? field.getName() : prefix + FIELD_DELIMITER + field.getName();

            if (field.isRepeated()) {
                List<?> values = (List<?>) value;
                for (Object val : values) {
                    handleFieldValue(fieldName, val, doc);
                }
            } else {
                handleFieldValue(fieldName, value, doc);
            }
        }
    }

    private void handleFieldValue(String fieldName, Object value, SolrInputDocument doc) {
        if (value instanceof Struct) {
            handleStruct(fieldName, (Struct) value, doc);
        } else if (value instanceof Value) {
            handleStructValue(fieldName, (Value) value, doc);
        } else if (value instanceof ListValue) {
            handleListValue(fieldName, (ListValue) value, doc);
        } else if (value instanceof Timestamp) {
            Timestamp ts = (Timestamp) value;
            Instant instant = Instant.ofEpochSecond(ts.getSeconds(), ts.getNanos());
            doc.addField(fieldName + "_dt", instant.toString());
        } else if (value instanceof Message) {
            flattenFields(fieldName, (Message) value, doc);
        } else {
            addPrimitiveField(fieldName, value, doc);
        }
    }

    private void handleStruct(String prefix, Struct struct, SolrInputDocument doc) {
        struct.getFieldsMap().forEach((key, val) -> {
            String fieldName = prefix.isEmpty() ? key : prefix + FIELD_DELIMITER + key;
            handleStructValue(fieldName, val, doc);
        });
    }

    private void handleStructValue(String fieldName, Value value, SolrInputDocument doc) {
        switch (value.getKindCase()) {
            case NULL_VALUE:
                // Skip nulls: Solr does not index null values, and the field's absence carries the same meaning
                break;
            case NUMBER_VALUE:
                doc.addField(fieldName + "_d", value.getNumberValue());
                break;
            case STRING_VALUE:
                doc.addField(fieldName + "_s", value.getStringValue());
                break;
            case BOOL_VALUE:
                doc.addField(fieldName + "_b", value.getBoolValue());
                break;
            case STRUCT_VALUE:
                handleStruct(fieldName, value.getStructValue(), doc);
                break;
            case LIST_VALUE:
                handleListValue(fieldName, value.getListValue(), doc);
                break;
            default:
                doc.addField(fieldName + "_s", value.toString());
                break;
        }
    }

    private void handleListValue(String fieldName, ListValue listValue, SolrInputDocument doc) {
        List<Value> values = listValue.getValuesList();
        for (Value val : values) {
            handleStructValue(fieldName, val, doc); // Multiple values stored under same field as multi-valued
        }
    }

    private void addPrimitiveField(String fieldName, Object value, SolrInputDocument doc) {
        if (value instanceof String) {
            doc.addField(fieldName + "_s", value);
        } else if (value instanceof Integer) {
            doc.addField(fieldName + "_i", value);
        } else if (value instanceof Long) {
            doc.addField(fieldName + "_l", value);
        } else if (value instanceof Float) {
            doc.addField(fieldName + "_f", value);
        } else if (value instanceof Double) {
            doc.addField(fieldName + "_d", value);
        } else if (value instanceof Boolean) {
            doc.addField(fieldName + "_b", value);
        } else if (value instanceof Descriptors.EnumValueDescriptor) {
            // Enum values obtained via the reflection API (getAllFields) arrive as EnumValueDescriptor, not Java enums
            doc.addField(fieldName + "_s", ((Descriptors.EnumValueDescriptor) value).getName());
        } else if (value instanceof Enum<?>) {
            doc.addField(fieldName + "_s", ((Enum<?>) value).name());
        } else if (value instanceof ByteString) {
            // Raw bytes are not useful as indexed text; they remain retrievable via the stored proto_blob
        } else {
            doc.addField(fieldName + "_s", value.toString());
        }
    }
}

Explanation of Struct-specific Enhancements:

  • handleStruct:
    Recursively flattens the fields in a Struct into Solr fields, using dynamic field names (key) with nested prefixes.

  • handleStructValue:
    Properly handles Value type, which can be one of several types (null, number, string, boolean, struct, or list).

  • handleListValue:
    Supports lists of mixed types, adding multiple values to the same Solr field (multi-valued fields). Each list element is recursively processed, ensuring robust parsing of deeply nested data.

  • Dynamic Fields in Solr schema:
    Use dynamic fields defined earlier (such as _s, _d, _b, etc.) to allow proper indexing without manual schema updates.

Example JSON Representation in Solr after Flattening a Struct:

Suppose your original Protobuf contained a field called userDetails of type Struct:

"userDetails__name_s": "John Doe",
"userDetails__age_d": 42,
"userDetails__preferences__colors_s": ["red", "green"],
"userDetails__preferences__notifications__email_b": true,
"userDetails__preferences__notifications__sms_b": false

With this serializer, all nested keys and lists are correctly expanded and indexed with appropriate field suffixes (_s, _d, _b, etc.).


Deserialization (for completeness):

Because a Struct carries its type information dynamically, deserializing from the raw proto_blob is just Protobuf's built-in parsing (assuming the stored blob is itself a Struct; for a full message, parse the enclosing message type and read the Struct field from it):

Struct struct = Struct.parseFrom(protoBlob);

This approach simplifies deserialization considerably, leveraging Protobuf's built-in type handling. The key part of the complexity (flattening and indexing) is at serialization.
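
For completeness, here is a minimal retrieval sketch (assumptions: a SolrJ client, a collection named "protos", and that the stored BinaryField comes back as byte[] over the javabin format; with the JSON response format it arrives base64-encoded instead):

import com.google.protobuf.Struct;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;

public class ProtobufSolrReader {

    public static Struct fetchStruct(SolrClient client, String id) throws Exception {
        SolrQuery query = new SolrQuery("id:" + id);
        query.setFields("proto_blob", "proto_type_s");
        SolrDocument doc = client.query("protos", query).getResults().get(0);

        // Binary stored fields are returned as byte[] by SolrJ (base64 strings in JSON responses)
        byte[] blob = (byte[]) doc.getFieldValue("proto_blob");

        // Assumes the stored blob is itself a Struct; a full message would use its own parseFrom()
        return Struct.parseFrom(blob);
    }
}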


Key benefits of this solution:

  • Complete type coverage: all Struct nested types and lists handled properly.
  • Dynamic fields: new fields auto-accepted in Solr without schema changes.
  • Lossless: raw Protobuf bytes stored for exact reconstruction.

This implementation robustly handles the complexities introduced by the dynamic Struct and related Protobuf types.

Solr Extension for Protobuf: Full Fidelity and Searchable Fields

Overview

Integrating Protocol Buffers (Protobuf) with Apache Solr requires converting Protobuf messages into a flat, indexable format while preserving the ability to reconstruct the original message. The goal is to serialize incoming Protobuf structures into Solr fields for efficient querying, and deserialize (reconstruct) them with full fidelity when needed. Key requirements include handling Protobuf’s well-known types, flattening nested fields for search, storing raw data for lossless retrieval, and designing a flexible schema that can evolve over time. In practice, this means flattening complex message hierarchies into Solr’s schema, using dynamic fields to accommodate unknown or new fields, and possibly writing a custom Solr plugin to manage the serialization/deserialization process.

Flattening Protobuf Data for Indexing

Solr’s indexing model is essentially flat – each document is a set of field name/value pairs. Nested structures in Protobuf must be flattened into this flat schema to be indexed. In fact, Solr cannot directly index deeply nested data without such transformation: “Solr can only maintain a 'flat' representation of the data. What you were trying to do is not really possible [with nested objects]. There are a number of workarounds, such as using dynamic fields and using a Solr join to link multiple data sets.” ([Deeply nested JSON documents in Apache Solr - Stack Overflow](https://stackoverflow.com/questions/35502812/deeply-nested-json-documents-in-apache-solr#:~:text=Upd1%20%3A%20Solr%20can%20only,to%20link%20multiple%20data%20sets)). The most straightforward approach is to flatten nested Protobuf messages into hierarchical field names. For example, if you have a Protobuf field address that itself has sub-fields city and zip, you might index them as address__city and address__zip in Solr. Here we use a delimiter (e.g. double underscore __) to represent nesting levels in field names. This flattening is done recursively for any nested message structures. Each leaf field in the Protobuf (a primitive, string, etc.) becomes a separate Solr field, prefixed by the path of its parent message fields joined by the delimiter.

When flattening, repeated fields (lists) need special handling. If a Protobuf field is repeated (e.g. an array of strings or a list of message objects), you can map it to a multi-valued Solr field. For repeated primitives, simply add multiple values under the same Solr field name. For repeated embedded messages, one approach is to flatten each repeated element with an index in the field name (e.g. addresses__0__city, addresses__1__city…), though this can complicate querying. An alternative is to index repeated messages as separate child documents, but that requires Solr’s nested document support and more complex queries. In many cases, treating repeated subfields as multi-valued fields (concatenating values or indexing each value in a multi-valued field) is sufficient. The main idea is that after flattening, every nested field becomes a first-class field in Solr that can be directly queried, faceted, or filtered.

This flattening logic can be implemented using Protobuf’s reflection API or generated classes. For instance, you can iterate over all fields of a message and recursively add them to a Solr document. Pseudocode for flattening might look like this:

// Pseudocode: Flattening a Protobuf message into Solr fields
void flattenMessage(Message msg, String prefix, SolrInputDocument doc) {
    for (Map.Entry<FieldDescriptor, Object> entry : msg.getAllFields().entrySet()) {
        FieldDescriptor fd = entry.getKey();
        Object value = entry.getValue();
        String fieldName = prefix.isEmpty() ? fd.getName() : prefix + "__" + fd.getName();
        if (fd.isRepeated()) {
            // Handle repeated fields
            for (Object element : (List<?>) value) {
                if (fd.getJavaType() == FieldDescriptor.JavaType.MESSAGE) {
                    // Flatten each message in the repeated field
                    flattenMessage((Message) element, fieldName, doc);
                } else {
                    doc.addField(fieldName, element);  // add each primitive element
                }
            }
        } else if (fd.getJavaType() == FieldDescriptor.JavaType.MESSAGE) {
            // Nested message field – flatten it recursively
            flattenMessage((Message) value, fieldName, doc);
        } else {
            // Primitive or string field – add directly
            doc.addField(fieldName, value);
        }
    }
}

After this process, a Protobuf message is represented in Solr as a flat document with potentially many fields named by the protobuf field path. Flattening with a delimiter improves queryability because Solr can now index each sub-field independently. For example, a query can directly filter on address__city:"New York" or address__zip:10001 as if they were top-level fields.
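
As a concrete illustration, a SolrJ query against these flattened fields might look like the following sketch (the collection name "protos" and the exact type suffixes are assumptions following this page's conventions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FlattenedFieldQueryExample {

    public static QueryResponse findByCity(SolrClient client) throws Exception {
        SolrQuery query = new SolrQuery("address__city_s:\"New York\"");
        query.addFilterQuery("address__zip_i:[10000 TO 10999]"); // numeric range on a flattened sub-field
        query.addFacetField("proto_type_s");                     // facet on the stored message type
        return client.query("protos", query);
    }
}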

Handling Data Types and Well-Known Protobuf Types

When mapping Protobuf fields to Solr, it’s important to preserve data types for accurate querying and reconstruction. Here are best practices for common types and well-known Protobuf types:

  • Numeric fields (int32, int64, float, double): Map these to Solr numeric field types (e.g. pint or plong for 32-bit vs 64-bit integers, pfloat/pdouble for floating point) so that range queries and sorting work correctly. For example, a proto int64 user_id could be stored in a Solr field user_id_l (with a dynamic field *_l of type long). This retains numeric precision and allows numeric filtering/ranges.

  • Strings: Simple string fields can be stored in Solr as either a keyword (untokenized string) or a text field (with analysis) depending on how you need to query them. For exact matches or identifiers, use a Solr StrField (e.g. dynamic *_s). For full-text search content, use a text field type (e.g. *_t or *_txt using text_general analyzer). By default, treat Protobuf strings as keywords unless certain fields are known to contain free-form text.

  • Booleans: Map to Solr boolean fields. Solr has a BooleanField type (true/false). Simply store as true/false values.

  • Enumerations: You can store enums as their name (string) or numeric value. Storing the name is usually more convenient for querying (e.g. status:"ACTIVE"). Enums in Protobuf are integers under the hood, but you have the enum descriptors – you can add a field with the enum’s string name (or even index both name and number if needed). Using the name in a *_s field works well for exact matching.

  • Timestamps (google.protobuf.Timestamp): Convert to a Solr date/time field. A Timestamp in proto contains seconds and nanos fields (UTC epoch). Convert this to a standard ISO 8601 UTC timestamp string (RFC 3339 format), which Solr’s date field type (DatePointField, the pdate dynamic type used below) can parse; see the sketch after this list. For example, a proto Timestamp of seconds=1609459200 (2021-01-01T00:00:00Z) would be stored as "2021-01-01T00:00:00Z" in Solr. The Protocol Buffers JSON mapping specifies that Timestamp should be encoded as an RFC 3339 timestamp string ([c# - Protobufs Timestamp as RFC 3339 string - Stack Overflow](https://stackoverflow.com/questions/76167711/protobufs-timestamp-as-rfc-3339-string#:~:text=expected%20behaviour%20is%20for%20RFC,used%20when%20mapping%20to%20JSON)), which aligns with Solr’s date format. Using Solr’s date type allows range queries (e.g. find all events between two dates). If nanosecond precision is needed, you might format the fractional seconds up to 9 digits (Solr dates handle millisecond precision by default, so consider whether you need to store nanos separately or as part of a string).

  • Struct and Value (google.protobuf.Struct, Value, ListValue): These types represent dynamic JSON content (a map of string keys to dynamically typed values). To index a Struct, treat it like a nested message: each key in the Struct becomes a sub-field in Solr. For example, if a Struct field details contains {"foo": 123, "bar": "text"}, you could index details__foo = 123 (as an int field) and details__bar = "text" (as a string). The google.protobuf.Value can hold a primitive, list or struct – you’ll need to check the kind of value. If it’s a list (ListValue), you can treat it similar to a repeated field (multiple values for a Solr field or flatten each element if they are structs). If it’s a null, you might skip indexing or index as a special marker. Essentially, flatten the Struct’s contents as if it were just another nested message with dynamic keys. (If the set of keys is unpredictable, dynamic fields in Solr will catch them as new fields at index time.)

  • Any (google.protobuf.Any): This type can wrap an arbitrary message. An Any contains a type_url (a string identifying the actual message type URL) and a binary blob of the value. To handle Any, you have a couple of options:

    • If the type is known/expected: You can use the type_url to determine the embedded message type, parse the embedded bytes into that message, and then flatten it as usual (prefixed by the Any’s field name). This requires that your Solr extension has access to the descriptors or classes of those potential message types. For example, if an Any actually contains a MyMessage, you would parse it and then produce fields like anyField__myMessageField1, etc.
    • If the type is unknown or not supported by the index: Use a fallback. You might store the raw bytes or a placeholder. One common approach is to index the type_url as a string field (so you at least know what kind of message it was), and store the entire Any value in a separate blob or text field for later retrieval. For searchability, you could include the JSON representation of the Any (if you can decode it generically) in a text_general field. This way, if the embedded message had some text, it could be found via full-text search even if not explicitly mapped. The key is that unsupported content in an Any should not break the indexing; you provide a graceful fallback (index what you can, store the rest).
  • Bytes (bytes fields in proto): Raw bytes aren’t searchable in Solr (unless they represent text). If the bytes field represents binary data (images, etc.), you will likely skip indexing its content. You can still store it (perhaps as part of the raw protobuf blob we plan to keep). If the bytes are actually encoded text (like a JSON string in bytes), you could index them as text, but that’s rare. Generally, treat bytes as non-searchable payload – store them for completeness but do not index. If needed, you could index a hash or length of the bytes just for filtering or debugging.

  • Wrapper types (google.protobuf.Int32Value, BoolValue, etc.): These are just wrappers around primitives (they exist to represent optional primitives in proto3). You can simply treat them as their contained value (the presence of the wrapper means the field is set). So an Int32Value becomes an integer field in Solr, a StringValue becomes a string field, etc. There is no special handling needed beyond unwrapping the value.

By handling each type appropriately, we maintain type fidelity in Solr – e.g. dates remain dates, numbers remain numeric – which enables proper querying (range queries, sorting, etc.) and ensures that when we reconstruct the Protobuf, we can get the same typed values back.
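
The following sketch illustrates a few of these conversions in isolation (the helper class and method names are hypothetical; the suffixes follow the dynamic-field conventions described below):

import com.google.protobuf.Any;
import com.google.protobuf.Int32Value;
import com.google.protobuf.Timestamp;
import org.apache.solr.common.SolrInputDocument;

import java.time.Instant;

class WellKnownTypeMapper {

    // Timestamp -> RFC 3339 / ISO 8601 UTC string for a *_dt (pdate) dynamic field
    static void addTimestamp(String fieldName, Timestamp ts, SolrInputDocument doc) {
        Instant instant = Instant.ofEpochSecond(ts.getSeconds(), ts.getNanos());
        doc.addField(fieldName + "_dt", instant.toString());
    }

    // Wrapper types are simply unwrapped to their primitive value
    static void addInt32Wrapper(String fieldName, Int32Value wrapped, SolrInputDocument doc) {
        doc.addField(fieldName + "_i", wrapped.getValue());
    }

    // Fallback for an Any whose type is not registered: index the type_url so documents can
    // be filtered by embedded type; the payload itself stays recoverable from the stored proto_blob
    static void addUnknownAny(String fieldName, Any any, SolrInputDocument doc) {
        doc.addField(fieldName + "__type_url_s", any.getTypeUrl());
    }
}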

Storing Raw Protobuf Blobs for Full Fidelity

While flattening and indexing fields provides searchability, we also want to be able to reconstruct the exact original Protobuf message (with all the same data, including any fields that might not have been indexed or known). To ensure full fidelity reconstruction, it’s recommended to store the raw Protobuf binary data in Solr alongside the indexed fields. This can be done by having a dedicated Solr field (for example, proto_blob) of type BinaryField to hold the serialized Protobuf bytes. Mark this field as stored (so it can be retrieved), but it need not be indexed (since we won’t search on the raw bytes) ([solr4 - Querying binary fields in Solr - Stack Overflow](https://stackoverflow.com/questions/32484920/querying-binary-fields-in-solr#:~:text=)). The Solr BinaryField type is meant exactly for this purpose: “BinaryField is only intended for storage of binary data. Nothing more, nothing less.” ([solr4 - Querying binary fields in Solr - Stack Overflow](https://stackoverflow.com/questions/32484920/querying-binary-fields-in-solr#:~:text=)). By storing the exact bytes, we can retrieve a document from Solr and get the original Protobuf blob back, then parse it or return it directly to a client, ensuring no information was lost in the indexing process.

In addition to the blob, store metadata about it, notably the Protobuf message type. For example, you might have a field proto_type_s that stores the fully qualified type name (e.g. "com.mycompany.MyMessageV2"). This helps in choosing the correct parser or schema for deserialization later. If you expect multiple message types in the same Solr collection, this type field is essential to know how to interpret the blob. If it’s a single message type overall, the type might be implicit, but it’s still useful for future-proofing or debugging.
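
A minimal schema sketch for these two fields might look like this (assuming the standard string field type is already defined, as in Solr’s default configset):

<!-- Stored-but-not-indexed blob plus the type-name metadata field -->
<fieldType name="binary" class="solr.BinaryField"/>
<field name="proto_blob"   type="binary" indexed="false" stored="true"/>
<field name="proto_type_s" type="string" indexed="true"  stored="true"/>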

Storing the raw blob has a few advantages:

  • Lossless retrieval: Even if your indexing logic doesn’t handle a field (or a new field is added in the protobuf later which wasn’t flattened), the raw data is still there. You can reparse it with updated code to extract new fields.
  • Simpler reconstruction: Instead of rebuilding the Protobuf object from dozens of Solr fields, you can often just take the stored bytes and directly deserialize into the Protobuf class. This avoids any errors in reassembly and guarantees the message is exactly as it was originally. For example, if someone needs the Protobuf message in a service response, you can fetch the proto_blob field and send it as-is, or parse it and send a JSON representation – either way, it’s using the source-of-truth data.
  • Unknown/Unused field preservation: Proto messages may contain unknown fields (when decoded with an older schema) – these would be preserved in the binary blob even if they weren’t indexed. Thus, if you later update your schema, you could re-index or reprocess the blob to surface those fields.

In practice, storing the blob means a bit of storage overhead (since data is duplicated: once in parsed fields, once as blob). You could choose to compress the blob (Protobuf is already compact, but if needed, Solr can store compressed binary data). Typically, this trade-off is worthwhile for the flexibility it gives. The blob field should be stored but not indexed (and not doc-valued) to minimize performance impact. Retrieval of binary fields in Solr is usually done as base64-encoded strings in the JSON/XML response, or you can use a binary response format.

Ensuring Searchability and Fallbacks for Nested Fields

With all fields flattened and indexed, nested data becomes fully searchable. Each nested Protobuf field can be queried in Solr using standard field queries or filters. For example, if you indexed user__address__city = "San Francisco", a query for user__address__city:"San Francisco" will match that document. This approach effectively exposes the deep structure of Protobuf as if it were a simple document in Solr. You can also use Solr’s faceting, sorting, and other features on these fields. If a Protobuf has an array of values and you indexed them as multi-valued fields, Solr can still search and facet on them (Solr supports multi-valued field queries seamlessly). In Slack’s case, they noted that “Solr offers an easy query mechanism for array-based fields… and adding new fields to a Solr document is easier than adding a new column to a database table” ([Managing Slack Connect - Engineering at Slack](https://slack.engineering/managing-slack-connect/#:~:text=had%20a%20five%20second%20delay,choice%3A%20we%20now%20have%20a)), which was one reason they chose Solr for indexing their Protobuf-based data. This highlights that once your data is in Solr fields, you benefit from Solr’s rich query capabilities even for what were nested or repeated structures in the original protos.

It’s important to provide a fallback mechanism for any data that isn’t easily mapped to a first-class Solr field. Despite our best efforts, there may be some parts of the Protobuf that are not indexed (for example, an unknown message type inside an Any, a byte blob we decided not to parse, or extremely nested data that we chose not to fully explode). For such cases, consider using a catch-all text field (e.g. a Solr field of type text_general) to store a JSON or stringified representation of that data. For instance, if you have an Any that you don’t know how to parse, you might index the type_url as a string (so you can at least query by type), and also take the entire value and JSON-encode it and put it into a all_content_txt field (of type text_general). That way, if someone searches for a keyword that happens to appear in the JSON of that embedded message, the document can still be found. Another example: if you skip indexing a bytes field that contains text data, you might at least dump a hex string of it into a text field (not ideal for human search, but possibly for exact matching or debugging).

Additionally, you can configure Solr’s schema to have a dynamic copy field rule, copying all or certain fields into a single catch-all field for free-text search. A common pattern is to copy all textual fields into a field named “text” or “all_text” (of type text_general). This allows a user to perform a broad search without specifying field names (just like a full-text search across the entire document) ([Deeply nested JSON documents in Apache Solr - Stack Overflow](https://stackoverflow.com/questions/35502812/deeply-nested-json-documents-in-apache-solr#:~:text=Upd1%20%3A%20Solr%20can%20only,to%20link%20multiple%20data%20sets)). In our context, we might copy every string field (or every field) into an all_text field. This ensures that even structured fields can be searched in an unstructured way if needed. For Protobuf data, this can be useful if someone wants to find a term without knowing which field it resides in. It also provides a safety net for any content we shoved into text blobs as a fallback – it’s all searchable via the all_text catch-all.
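
A schema sketch for such a catch-all (the field name and copyField patterns are illustrative):

<!-- Catch-all free-text field plus copyField rules feeding it -->
<field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="*_s"   dest="all_text"/>
<copyField source="*_txt" dest="all_text"/>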

Summary of fallback strategy: Index what you can in structured form (for precise queries), and for anything else, index it in a generalized text form. By doing so, no piece of data is completely unsearchable. Unsupported types or unknown fields are not dropped on the floor; they end up in some index form (raw or text) that can later be utilized. This approach balances query power with completeness of data indexing.

Using Dynamic Fields for Evolving Schemas

One challenge with indexing Protobuf messages is that the schema might evolve – new fields could be added, or different message types might be indexed over time. Managing a static Solr schema under these conditions can be cumbersome, as you’d have to predefine every possible field. This is where dynamic fields in Solr are extremely useful. Dynamic fields are schema field definitions that use wildcards in their name, so they can match arbitrary field names at index time ([Dynamic Fields :: Apache Solr Reference Guide](https://solr.apache.org/guide/solr/latest/indexing-guide/dynamic-fields.html#:~:text=Dynamic%20fields%20allow%20Solr%20to,explicitly%20define%20in%20your%20schema)). For example, you might define in Solr’s schema:

  • <dynamicField name="*_s" type="string" indexed="true" stored="true"/> for any string fields
  • <dynamicField name="*_i" type="pint" indexed="true" stored="true"/> for 32-bit ints
  • <dynamicField name="*_l" type="plong" indexed="true" stored="true"/> for 64-bit longs
  • <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/> for booleans
  • <dynamicField name="*_d" type="pdouble" indexed="true" stored="true"/> for double precision floats
  • <dynamicField name="*_f" type="pfloat" indexed="true" stored="true"/> for single precision floats
  • <dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/> for date/timestamps
  • <dynamicField name="*_txt" type="text_general" indexed="true" stored="false"/> for catch-all text content (if using such suffix)

These are just examples – the naming scheme is up to you. One strategy is to suffix field names with a type indicator (as shown above), which the dynamic field patterns then match. If our flattening code follows a convention like __ for nesting and _s or other suffixes for type, we can ensure the field gets the correct Solr type. For instance, the code could name a field address__zip_i for an integer ZIP code, or name_s for a string name. Solr will then automatically assign the right field type based on the dynamic field rules when these fields appear ([Dynamic Fields :: Apache Solr Reference Guide](https://solr.apache.org/guide/solr/latest/indexing-guide/dynamic-fields.html#:~:text=A%20dynamic%20field%20is%20just,matched%20with%20a%20dynamic%20field)). This avoids the need to explicitly define each field in the schema.
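
For instance, the flattening code could derive the suffix from the field descriptor along these lines (a sketch; the helper is hypothetical and the suffixes match the patterns listed above):

import com.google.protobuf.Descriptors;

class FieldSuffixes {

    // Map a protobuf field's Java type to a dynamic-field suffix
    static String suffixFor(Descriptors.FieldDescriptor fd) {
        switch (fd.getJavaType()) {
            case INT:     return "_i";
            case LONG:    return "_l";
            case FLOAT:   return "_f";
            case DOUBLE:  return "_d";
            case BOOLEAN: return "_b";
            case ENUM:    return "_s";  // indexed by enum name
            case STRING:  return "_s";  // or "_txt" for known free-text fields
            default:      return "_s";  // MESSAGE and BYTE_STRING are handled separately
        }
    }
}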

Using dynamic fields makes the system extensible and schema changes low-effort. If tomorrow you add a new field in your Protobuf message (say age as an int), when the Solr plugin indexes it as age_i, Solr will accept it because it matches *_i. You don’t have to modify the Solr schema manually. Slack’s engineering team leveraged this capability when indexing Protobuf-backed data – it allowed new fields (new “settings” in their case) to be indexed automatically without a schema update: “we took advantage of the ability to dynamically add fields to a Solr document, which allowed for all newly created settings to be automatically indexed in Solr.” ([Managing Slack Connect - Engineering at Slack](https://slack.engineering/managing-slack-connect/#:~:text=without%20the%20need%20to%20manually,document%2C%20which%20allowed%20for%20all)). This is a powerful feature for rapid iteration and evolution.

However, be mindful of a few best practices/challenges with dynamic fields:

  • Consistent naming: Your flattening code should consistently generate field names that align with the dynamic field patterns. Decide on a scheme (prefix vs suffix vs both). Suffixes for type are common, but you might also encode type in the field type (e.g. use separate dynamic rules for known field name patterns). Using suffixes like _s, _i, _l, _dt etc., as above, is straightforward and self-explanatory.
  • Field explosion: If your Protobuf has truly dynamic keys (e.g. using Struct or maps with arbitrary keys), you could end up with a very large number of distinct field names in Solr (especially across many documents). Solr can handle a large number of fields, but extremely high cardinality of field names can impact memory (each field adds some overhead in the schema). Monitor how many unique fields get created. If unbounded, you may need to periodically review and possibly clean up unused fields, or reconsider indexing extremely dynamic content (maybe index as a blob of JSON text instead).
  • Multi-valued fields: Define dynamic fields as multiValued=true if you expect repeated values. For example, if tags_txt is used for a list of tags, define *_txt as multiValued (or define a separate dynamic pattern for known list fields). In many cases, Solr can infer multi-valued if you send an array in JSON, but it’s safer to mark them accordingly in schema. Slack noted Solr’s ease with array fields as an advantage ([Managing Slack Connect - Engineering at Slack](https://slack.engineering/managing-slack-connect/#:~:text=had%20a%20five%20second%20delay,choice%3A%20we%20now%20have%20a)) – just ensure the schema knows which fields can have arrays.
  • Schema filtering: If multiple message types go into one Solr index, not all fields will exist on all documents. That’s fine; dynamic fields will just be sparsely populated. You might want a field on each document indicating the message type (as discussed earlier) so you can filter queries by type if needed (e.g. only search "User" messages vs "Order" messages, etc., since their fields differ).

In summary, dynamic fields provide the flexibility needed for indexing Protobufs without constant schema maintenance ([Dynamic Fields :: Apache Solr Reference Guide](https://solr.apache.org/guide/solr/latest/indexing-guide/dynamic-fields.html#:~:text=Dynamic%20fields%20allow%20Solr%20to,explicitly%20define%20in%20your%20schema)). They are a critical part of this design, enabling us to map flattened Protobuf fields to Solr types on the fly and accommodate new fields and message variations with minimal effort.

Implementing a Custom Solr Plugin (Serialization/Deserialization)

To tie all these pieces together, a custom Solr extension (plugin) is highly recommended. Specifically, an Update Request Processor (URP) is a suitable Solr plugin type for this purpose. An Update Request Processor can intercept documents as they are being indexed and transform them – many built-in Solr features (like field name aliasing, auto-adding timestamps, etc.) use URPs under the hood ([Update Request Processors | Apache Solr Reference Guide 8.3](https://solr.apache.org/guide/8_3/update-request-processors.html#:~:text=Every%20update%20request%20received%20by,Update%20Request%20Processors%2C%20or%20URPs)). We can create a Protobuf indexing URP that performs the following on each incoming document/record:

  1. Receive the Protobuf input – This could be in the form of a binary payload or a base64 string in a field. For example, you might send a document to Solr with a field proto_blob containing the Protobuf bytes (encoded as Base64 in JSON). Alternatively, you could write a custom UpdateRequestHandler that directly accepts binary (but using the existing JSON handler with a base64 field is simpler). The URP will extract this raw data. It would also look at the proto_type field (or know the expected type via configuration).

  2. Deserialize to a Protobuf Message – Using the proto_type info or a configured message class, the processor will parse the binary blob into a Protobuf Message object. This can be done via generated code (e.g. MyMessage.parseFrom(bytes) in Java) or via the dynamic Message API if you want a generic solution. You need the protobuf class definitions available on the classpath (or descriptors). If multiple types are allowed, one approach is to include the fully qualified type name in the document (as mentioned) and use a registry or map of type name to a parser. Another approach is to wrap all messages in an Any and just always parse to Any then unpack, but that’s effectively similar.

  3. Flatten the Message into Solr Fields – Now apply the flattening logic described earlier. The URP will create Solr fields for every nested field in the Protobuf message. In Solr’s SolrInputDocument API, you can add fields or remove fields easily. For example, remove the original proto_blob field (if it was just an input carrier), and instead add the structured fields (name_s, address__city_s, age_i, etc.). If the message has fields that need special handling (Timestamp, Any, etc.), the URP will perform those conversions (e.g. converting Timestamp to formatted date string, unpacking Any or storing it as needed). This is the core serialization logic where your code ensures that each Protobuf field is mapped to a Solr field value with the right type.

  4. Add metadata fields – The URP should add the proto_blob field back (for storage) and the proto_type_s field (with the type name). Essentially, the output Solr document will contain all the search-friendly fields plus these metadata fields. You might choose to compress or not re-add the blob if the input already had it, but typically you’d keep it stored. Ensure that the blob field in Solr is marked stored="true" indexed="false" so it doesn’t bloat the index but can be retrieved when needed.

  5. Next in chain – Finally, the URP passes the modified document along to the next processor (or to indexing) by calling super.processAdd(cmd) (in Java URP API, or processor.processAdd(cmd) in older style). From Solr’s perspective, it now has a regular flat document with all necessary fields, and it will index it according to the schema.

By implementing this as a Solr-side plugin, you centralize the logic and ensure every document goes through the same transformation. This is easier to maintain in the long run – if the Protobuf schema changes or you need to tweak mapping, you update the plugin code in one place. It also means clients don’t need to pre-process Protobufs into JSON themselves; they can send the binary (or minimally processed data) and Solr will handle it.

To illustrate, here’s a simplified sketch of how an UpdateRequestProcessor might handle an incoming doc (in Java-like pseudocode):

public class ProtobufUpdateProcessor extends UpdateRequestProcessor {

    public ProtobufUpdateProcessor(UpdateRequestProcessor next) {
        super(next);  // pass the rest of the chain to the base class
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // Extract raw proto bytes (assuming it was sent as base64 string in a field)
        byte[] protoBytes = null;
        Object protoFieldVal = doc.getFieldValue("proto_blob_input");
        if (protoFieldVal != null) {
            protoBytes = Base64.getDecoder().decode((String) protoFieldVal);
            doc.removeField("proto_blob_input");
        }
        if (protoBytes != null) {
            // Determine type
            String typeName = (String) doc.getFieldValue("proto_type");  // e.g. "com.my.MessageType"
            Message protoMessage = parseProtoBytes(typeName, protoBytes);
            // Flatten the protobuf message into Solr fields
            flattenMessage(protoMessage, "", doc);
            // Add stored raw blob and type metadata
            doc.addField("proto_blob", protoBytes);
            doc.addField("proto_type_s", typeName);
        }
        // continue the chain
        super.processAdd(cmd);
    }
}

In the above pseudo-code, proto_blob_input is the incoming field carrying data (which we remove after parsing), and proto_blob / proto_type_s are the final stored fields. The method flattenMessage would be similar to the earlier snippet, adding fields to doc. The actual implementation would need to handle exceptions (e.g. if parse fails) and possibly multiple message types elegantly. The URP factory (and Solr config) could specify what classes or descriptors to use for parsing if not provided per document.
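
For completeness, registering such a processor in solrconfig.xml could look like this (the chain name and the factory class wrapping the processor above are hypothetical; the built-in log and run processors are Solr’s standard tail of the chain):

<updateRequestProcessorChain name="protobuf-indexing" default="true">
  <processor class="com.krickert.search.solr.protobuf.ProtobufUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>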

If performance is a concern, consider that parsing Protobuf is very fast (much faster than JSON) and the overhead of flattening is comparable to building a JSON object. Solr indexing cost will primarily be affected by the number of fields added. Batch indexing can be done by sending multiple protos in one request; the URP will handle each doc in turn.

Additionally, you might implement a Solr Response Writer as a complementary plugin to output search results as Protobuf, if needed. For example, a custom response writer could take Solr documents and reconstruct Protobuf messages on the fly (using stored fields) and return a binary stream or base64. This would allow retrieving data from Solr directly as Protobuf objects. However, this is an optional extension – many use-cases might simply fetch JSON and parse it into proto in the client if needed, or fetch the raw blob and decode client-side. The main point is the data in Solr is sufficient to reconstruct the proto either in Solr or outside.

Schema Evolution and Type Definitions

Handling schema evolution is critical for long-term maintenance. Protobuf schemas will change over time – new fields get added, some fields might become deprecated, message versions could increment. To ensure that your Solr indexing keeps working and that you can still deserialize old data, you should store type definitions or version information separately and design for forward/backward compatibility.

A recommended practice is to use an approach analogous to a schema registry. In the same way Confluent Schema Registry manages Avro/Protobuf schemas with version IDs, you can maintain a registry of Protobuf definitions that your Solr plugin knows about. Notably, Protobuf data itself does not carry its schema with it (unlike, say, Avro where each record can embed schema info). Instead, you rely on the message type and pre-compiled classes to parse. Confluent’s approach for Protobuf is to include a schema ID with the message, rather than the full schema ([Protobuf Schema Serializer and Deserializer for Schema Registry on Confluent Platform | Confluent Documentation](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/serdes-protobuf.html#:~:text=The%20Confluent%20Schema%20Registry%20based,To%20learn)). We can apply a similar idea: include a version identifier for the Protobuf schema in the indexed data. For example, you might add a field proto_schema_version_i or incorporate version into the type name (like MyMessage:v2). This version could correspond to a specific proto definition stored elsewhere (in a file or database).

By storing the version (or using the type name which implies a version), you enable proper deserialization later. If your Protobuf changes in a backward-compatible way (added fields), old messages can still be parsed by the new parser (the unknown fields would be preserved in the UnknownFieldSet). If changes are breaking (e.g. a field’s meaning changed or was removed), you might need to keep the old proto definition around. In such cases, your Solr plugin can use the version to decide how to parse: e.g. if proto_type_s is "MyMessage" and schema_version=1, use MyMessageV1 parser; if version=2, use MyMessageV2. This means you’d maintain multiple versions of the Protobuf class in your indexing code (or use DynamicMessage with a stored FileDescriptor for version 1 vs version 2). It’s extra work, but it ensures that data indexed with an older schema isn’t misinterpreted. Another approach is to re-index or transform old data whenever you do a schema migration (i.e. backfill new fields or adjust data to new format), but that may not always be feasible immediately.
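
A sketch of such a version-aware parser registry (class and method names are hypothetical):

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Message;
import com.google.protobuf.Parser;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ProtoParserRegistry {

    private final Map<String, Parser<? extends Message>> parsers = new ConcurrentHashMap<>();

    // e.g. register("com.mycompany.MyMessage", 2, MyMessageV2.parser())
    void register(String typeName, int version, Parser<? extends Message> parser) {
        parsers.put(typeName + ":v" + version, parser);
    }

    Message parse(String typeName, int version, byte[] blob) throws InvalidProtocolBufferException {
        Parser<? extends Message> parser = parsers.get(typeName + ":v" + version);
        if (parser == null) {
            throw new IllegalStateException("No parser registered for " + typeName + " v" + version);
        }
        return parser.parseFrom(blob);
    }
}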

Storing the actual .proto definitions or descriptors somewhere accessible is highly useful. You might keep a repository of proto descriptor files (e.g. compiled .desc file per version) in your application. If an old record fails to parse with the latest proto class, you could fall back to parsing it using a DescriptorPool loaded with the older schema. This is advanced, but the idea is to never lose the ability to interpret your data. The type metadata stored in Solr helps as a key to find the right schema. As one expert noted, if you want a registry for Protobuf, you can use the descriptor APIs in protobuf to manage schema definitions outside of code ([Protobuf and Schema Registry - Google Groups](https://groups.google.com/g/protobuf/c/Te82NJTzQ5c#:~:text=Protobuf%20and%20Schema%20Registry%20,and%20DescriptorDatabase%20and%20related%20classes)).

In practice, many Protobuf evolutions are backward-compatible (additive). If you follow Protobuf guidelines (never reuse field numbers, only add fields or rename but keep numbers the same), then old data can be parsed by new code: the old fields might appear as unknowns in the new parser, but they won’t be lost if you preserved the raw bytes. When you re-serialize (or if you have the unknown field set), those unknowns can be forwarded. With our strategy of storing the raw blob, you always have the option to retrieve the exact original bytes and perhaps decode them with an older version of the software if absolutely needed. This is another safety net for fidelity.

Potential challenges and best practices for schema evolution:

  • Multiple message types: If your Solr index stores different kinds of messages (say, a generic event log with various event types), you must index a type field (we did) and ensure your client queries filter by type when necessary. This also means maintaining all those proto definitions in the Solr plugin or available to it.
  • Deployment and compatibility: When updating the Protobuf schema and Solr plugin, consider a rolling upgrade strategy. You may need to handle a window where new data (with new fields) is coming in while the plugin is still an old version or vice versa. One approach is to deploy the new plugin that knows about both old and new schema (if possible) before rolling out producers that send new fields. Dynamic fields in Solr will handle new fields fine, but the plugin must not choke on unknown fields if it sees them. If your plugin uses Any or generic handling, it might simply dump unknown fields to the fallback text – which is okay, as later you can reindex properly.
  • Schema storage: Keep your proto IDL files under version control and maybe bundle them with the Solr config (in ZooKeeper or similar) so that the exact schema of the data is documented. In a complex system, you might even dedicate a collection to schema versions (though typically a file or database is enough). The goal is to avoid a situation where you have bytes in Solr and no clue how to decode them because the code moved on.

In summary, plan for schema evolution by embedding version/type info in each document and maintaining older schema knowledge. This will ensure proper deserialization for the life of the indexed data. It’s much like how a schema registry works – data records carry an identifier, and you look up the schema by that id to decode the record ([Protobuf Schema Serializer and Deserializer for Schema Registry on Confluent Platform | Confluent Documentation](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/serdes-protobuf.html#:~:text=The%20Confluent%20Schema%20Registry%20based,To%20learn)). Adopting this pattern in your Solr extension will make the system robust to changes and prevent data from becoming unreadable.

Conclusion

Implementing Protobuf support in Solr involves a combination of data transformation, schema design, and custom extension development. By flattening nested structures into Solr fields, using dynamic fields to catch new or variable elements, and handling special Protobuf types carefully, we can achieve full searchability of the data. Storing the original Protobuf bytes and type metadata ensures we can always get back the exact original message, meeting the full fidelity requirement. A custom Solr plugin (like an Update Request Processor) is the glue that performs the serialization/deserialization, giving fine-grained control over how each message is indexed and retrieved. Throughout the process, following best practices – such as leveraging RFC3339 for timestamps ([c# - Protobufs Timestamp as RFC 3339 string - Stack Overflow](https://stackoverflow.com/questions/76167711/protobufs-timestamp-as-rfc-3339-string#:~:text=expected%20behaviour%20is%20for%20RFC,used%20when%20mapping%20to%20JSON)), indexing all meaningful fields, and falling back gracefully for unsupported content – will result in a robust solution.

Potential challenges like performance overhead, large numbers of fields, and evolving schemas can be mitigated with the strategies discussed: for example, dynamic fields keep schema maintenance low ([Dynamic Fields :: Apache Solr Reference Guide](https://solr.apache.org/guide/solr/latest/indexing-guide/dynamic-fields.html#:~:text=Dynamic%20fields%20allow%20Solr%20to,explicitly%20define%20in%20your%20schema)), and storing schema version info allows handling of changes over time. It’s wise to test the approach with real data samples to ensure that queries perform well (e.g. ensure that extremely large nested structures don’t produce too many fields or that queries on multi-valued fields behave as expected). Also consider the size of the stored blob versus the indexed content – if messages are very large, you may choose to index only key fields to control index size, relying on the blob for the rest on retrieval.

In the end, this Solr extension will enable powerful querying on Protobuf data (something not natively possible since Protobuf is a binary format optimized for transport, not search ([python - Storing and Searching Protobuf messages in a database - Stack Overflow](https://stackoverflow.com/questions/67978343/storing-and-searching-protobuf-messages-in-a-database#:~:text=I%20think%20you%20should%20discard,wire%29%20not%20searching))) while retaining the rich data structure that Protobuf provides. By combining the strengths of Solr (text search, faceting, scalability) with a careful translation of Protobuf messages, you get the best of both worlds: a fully searchable index and fidelity to your source data.
