Idea Exploration: Data Markings Alternatives

Use Case

These solutions explore alternatives to meet the "advanced" use case identified in Asserting Data Markings on Content. As an example, imagine this indicator:

  • Indicator
    • Title: TLP:GREEN
    • Description: TLP:RED
    • Pattern: TLP:GREEN
    • Producer: TLP:RED

For the most part, the indicator is TLP:GREEN. Two fields, however, are TLP:RED. For the exchange, imagine that Organization A wants to share the full indicator with Organization B. Organization B then wants to be able to share that indicator with Organization C, but because the Producer and Description are RED those fields should be removed.

Note: all examples are simplified for illustration.

Options

Option 1: Markings reference content

This is the STIX 1.x approach: markings are defined at the head of the document and reference the content that they mark. In the STIX XML binding this is accomplished via XPath:

<STIX_Package>
  <Marking>
    <Controlled_Structure>//Indicator[@id="1234"]</Controlled_Structure>
    <Marking_Structure>TLP:GREEN</Marking_Structure>
  </Marking>
  <Marking>
    <Controlled_Structure>//Indicator[@id="1234"]/Producer</Controlled_Structure>
    <Controlled_Structure>//Indicator[@id="1234"]/Description</Controlled_Structure>
    <Marking_Structure>TLP:RED</Marking_Structure>
  </Marking>
  <Indicator id="1234">
    <Title>Indicator Title</Title>
    <Description>Indicator Description</Description>
    <Pattern><Substructure>Here</Substructure></Pattern>
    <Producer>
      <Name>Name</Name>
      <Address>Address</Address>
    </Producer>
  </Indicator>
</STIX_Package>

While this approach makes a lot of sense in theory, in practice it's very difficult to implement. Anecdotal evidence has one project spending many months writing just data markings code.
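To make the resolution step concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports only a limited XPath subset. It assumes the Indicator element carries id="1234", and it rewrites each leading `//` as `.//` because ElementTree only evaluates paths relative to an element. `resolve_markings` is a hypothetical helper for illustration, not part of any STIX library, and it records only the elements that are targeted directly (it does not propagate markings to children).

```python
import xml.etree.ElementTree as ET

DOC = """<STIX_Package>
  <Marking>
    <Controlled_Structure>//Indicator[@id="1234"]</Controlled_Structure>
    <Marking_Structure>TLP:GREEN</Marking_Structure>
  </Marking>
  <Marking>
    <Controlled_Structure>//Indicator[@id="1234"]/Producer</Controlled_Structure>
    <Controlled_Structure>//Indicator[@id="1234"]/Description</Controlled_Structure>
    <Marking_Structure>TLP:RED</Marking_Structure>
  </Marking>
  <Indicator id="1234">
    <Title>Indicator Title</Title>
    <Description>Indicator Description</Description>
    <Producer><Name>Name</Name></Producer>
  </Indicator>
</STIX_Package>"""


def resolve_markings(root):
    """Map each directly targeted element to the markings that reference it."""
    marked = {}
    for marking in root.findall("Marking"):
        label = marking.findtext("Marking_Structure")
        for structure in marking.findall("Controlled_Structure"):
            # Simplification: turn "//x" into ".//x" so the path is relative
            # to the package root. Other absolute forms are not handled.
            path = "." + structure.text.strip()
            for element in root.findall(path):
                marked.setdefault(element, []).append(label)
    return marked


root = ET.fromstring(DOC)
result = {el.tag: labels for el, labels in resolve_markings(root).items()}
print(result)
```

Even in this toy form, the consumer has to run a second pass over the parsed tree and keep a side table from elements to markings, which is the bookkeeping the disadvantages below describe.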

Advantages

  • Markings are defined only once
  • Content not defined by STIX (including CybOX and extensions like CIQ) can easily be marked
  • The content itself is not burdened by redirection and complexity due to the markings (compare Option 2)

Disadvantages

  • Perhaps the biggest downside is that even simple uses (marking a top-level object, per the basic use case) are very complicated because they still require XPath (unless you hardcode a set of XPaths, which would need to be agreed-upon beforehand).
  • The XPath statements are hard to generate programmatically against arbitrary content.
  • After content is parsed into an object model, it's difficult to go back and apply the markings contained in the XPath (JAXB apparently can do this automatically, but those in other languages are out of luck).
  • Due to XML-specific issues regarding namespaces and namespace aliases, the XPath statements may not transfer from one version of a document to another (e.g. if you change a namespace prefix, the XPath becomes invalid)
  • Running arbitrary XPaths from shared content can introduce security issues.
  • The "pointer" approach requires that the language support XPath-type queries. In XML you obviously have XPath, in JSON you have JSONPath, but what else?

Open Questions

  1. Does anyone want to defend this option over Option 2 or 3, or can we eliminate it?
  2. If we go with JSON, is JSONPath mature enough?

Option 2: Content references markings

This is a more traditional marking approach: each markable construct contains a reference to one or more markings that should be applied to it. For the basic use case, markable constructs might only be top-level objects (or even individual packages). For the advanced use case explored here, markable constructs include all data points (perhaps excluding structural data).

For this example, I'll include both JSON and XML representations because the two formats lend themselves to different approaches.

JSON

Value Indirection

In this approach, each "object" structure has an @markings field that references the markings. Primitive value fields are broken into @value (with the actual value) and @markings.

{
  "markings": [
    {"@id": "us-cert:TLP-GREEN", "tlp": "GREEN"},
    {"@id": "us-cert:TLP-RED", "tlp": "RED"}
  ],
  "indicators": [
    {
      "@id": "example.com:indicator-1234",
      "@markings": ["us-cert:TLP-GREEN"],
      "title": {
        "@value": "Some title"
      },
      "description": {
        "@value": "Indicator Description",
        "@markings": ["us-cert:TLP-RED"]
      },
      "pattern": {
        "substructure": {
          "@value": "here"
        }
      },
      "producer": {
        "name": {
          "@value": "Name"
        },
        "address": {
          "@value": "Address"
        },
        "@markings": ["us-cert:TLP-RED"]    
      }
    }
  ]
}

This approach is very obvious (low cognitive load, it's very clear how things are marked) but has a lot of overhead. For producers and consumers that just want to work with the data, it will have a high burden.

Advantages

  • Obvious how things are marked
  • Easily support inheritance of markings (you can say markings inherit until you encounter another markings specification)
  • Easy to apply markings while you parse the document (since they're just data on the field, rather than being applied from an external location)
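The inheritance rule and parse-time application described in these points can be sketched in Python as a single recursive walk. This is an illustrative sketch only: `redact` and the `DROP` sentinel are hypothetical names, and the sketch assumes that an `@markings` list on a node replaces (rather than adds to) whatever was inherited from its parent.

```python
DROP = object()  # sentinel meaning "remove this node from the output"


def redact(node, blocked, inherited=()):
    """Return a copy of node without parts whose effective markings are blocked."""
    if isinstance(node, list):
        items = [redact(item, blocked, inherited) for item in node]
        return [item for item in items if item is not DROP]
    if not isinstance(node, dict):
        return node  # primitive value: nothing to check
    # Markings inherit until another @markings specification is encountered.
    effective = tuple(node.get("@markings", inherited))
    if blocked.intersection(effective):
        return DROP
    out = {}
    for key, value in node.items():
        if key.startswith("@"):
            out[key] = value  # @id, @value, @markings are copied as-is
            continue
        child = redact(value, blocked, effective)
        if child is not DROP:
            out[key] = child
    return out


package = {
    "markings": [
        {"@id": "us-cert:TLP-GREEN", "tlp": "GREEN"},
        {"@id": "us-cert:TLP-RED", "tlp": "RED"},
    ],
    "indicators": [{
        "@id": "example.com:indicator-1234",
        "@markings": ["us-cert:TLP-GREEN"],
        "title": {"@value": "Some title"},
        "description": {"@value": "Indicator Description",
                        "@markings": ["us-cert:TLP-RED"]},
        "pattern": {"substructure": {"@value": "here"}},
        "producer": {"name": {"@value": "Name"},
                     "address": {"@value": "Address"},
                     "@markings": ["us-cert:TLP-RED"]},
    }],
}

# What Organization B could pass on to Organization C.
shareable = redact(package, blocked={"us-cert:TLP-RED"})
print(sorted(shareable["indicators"][0]))
```

The walk needs no lookup outside the node being visited, which is the property the third point above relies on.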

Disadvantages

  • Content and markings are intermingled, so there's a massive amount of indirection for value fields (i.e. strings, numbers, booleans) that now require an @value to support the @markings.
  • Not able to implement markings on extensions, except at the highest level (i.e. like an adapter)
  • Not supportable as an extension: the structure would change if you try to implement this approach in an extension to STIX.

Markings next to content

In this approach, each key has an associated key.markings element with references to the markings.

{
  "markings": [
    {"@id": "us-cert:TLP-GREEN", "tlp": "GREEN"},
    {"@id": "us-cert:TLP-RED", "tlp": "RED"}
  ],
  "indicators": [
    {
      "@id": "example.com:indicator-1234",
      "title": "Some title",
      "description": "Indicator Description",
      "description.markings": ["us-cert:TLP-RED"],
      "pattern": {
        "substructure": "here"
      },
      "producer": {
        "name": "Name",
        "address": "Address"
      },
      "producer.markings": ["us-cert:TLP-RED"]
    }
  ],
  "indicators.markings": [["us-cert:TLP-GREEN"]]
}

This approach is somewhat less obvious than putting the markings directly on the objects. In particular, for JSON arrays you need to match up items in the array with items in the associated markings array. It could be somewhat mitigated by a hybrid approach where only primitive types require the "field.markings" approach, but that's inconsistent and still fairly indirect when you have primitives in arrays.

Advantages

  • Somewhat obvious how things are marked
  • Easily support inheritance of markings (you can say markings inherit until you encounter another markings specification for that field)
  • Somewhat easy to apply markings as you parse the document (just look in the field "next door")
  • Would be supportable as an extension (in particular the hybrid approach). If you define a "markings" field at the top-level object and standardize that in STIX 2.0, you can support markings at other levels via this approach.

Disadvantages

  • The approach to marking items in an array is clunky
  • The hybrid approach would be inconsistent
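To show where the array handling gets clunky, here is a hedged Python sketch of redaction under the "markings next to content" layout. The helper names (`redact_object`, `redact_value`, `DROP`) are hypothetical, and the sketch assumes that the markings for an array field form a parallel array with one marking list per item, as in the `indicators.markings` example above.

```python
SUFFIX = ".markings"
DROP = object()  # sentinel meaning "remove this value"


def redact_value(value, marks, blocked, inherited):
    """Return (new_value, new_marks) for one field; new_value may be DROP."""
    if isinstance(value, list):
        # Arrays: marks is assumed to be a parallel array, one entry per item.
        per_item = marks if marks is not None else [None] * len(value)
        kept_values, kept_marks = [], []
        for item, item_marks in zip(value, per_item):
            new_item, _ = redact_value(item, item_marks, blocked, inherited)
            if new_item is not DROP:
                kept_values.append(new_item)
                kept_marks.append(item_marks)  # keep both arrays aligned
        return kept_values, (kept_marks if marks is not None else None)
    effective = tuple(marks) if marks is not None else inherited
    if blocked.intersection(effective):
        return DROP, None
    if isinstance(value, dict):
        return redact_object(value, blocked, effective), marks
    return value, marks


def redact_object(obj, blocked, inherited=()):
    out = {}
    for key, value in obj.items():
        if key.endswith(SUFFIX):
            continue  # handled together with the field it describes
        new_value, new_marks = redact_value(
            value, obj.get(key + SUFFIX), blocked, inherited)
        if new_value is DROP:
            continue
        out[key] = new_value
        if new_marks is not None:
            out[key + SUFFIX] = new_marks
    return out


package = {
    "indicators": [{
        "@id": "example.com:indicator-1234",
        "title": "Some title",
        "description": "Indicator Description",
        "description.markings": ["us-cert:TLP-RED"],
        "pattern": {"substructure": "here"},
        "producer": {"name": "Name", "address": "Address"},
        "producer.markings": ["us-cert:TLP-RED"],
    }],
    "indicators.markings": [["us-cert:TLP-GREEN"]],
}
shareable = redact_object(package, blocked={"us-cert:TLP-RED"})
```

Most of the code exists to keep a value array and its markings array in step, which is a fair measure of the array-handling cost noted in the disadvantages.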

XML

Note: the XML below is simplified for illustration.

Element-based approach

This approach is similar to (or identical to, really) the "value indirection" approach in JSON.

<STIX_Package>
  <Marking id="us-cert:TLP-GREEN">TLP:GREEN</Marking>
  <Marking id="us-cert:TLP-RED">TLP:RED</Marking>
  <Indicator>
    <Title>
      <Value>Indicator Title</Value>
    </Title>
    <Description>
      <Value>Indicator Description</Value>
      <Markings>
        <Marking idref="us-cert:TLP-RED"/>
      </Markings>
    </Description>
    <Pattern>
      <Substructure>
        <Value>Here</Value>
      </Substructure>
    </Pattern>
    <Producer>
      <Name>
        <Value>Name</Value>
      </Name>
      <Address>
        <Value>Address</Value>
      </Address>
      <Markings>
        <Marking idref="us-cert:TLP-RED"/>
      </Markings>
    </Producer>
  </Indicator>
</STIX_Package>

As with the corresponding JSON-based method, this approach is very obvious (low cognitive load, it's very clear how things are marked) but has a lot of overhead. For producers and consumers that just want to work with the data, it will have a high burden.

Advantages

  • Obvious how things are marked
  • Easily support inheritance of markings (you can say markings inherit until you encounter another markings specification)
  • Easy to apply markings while you parse the document (since they're just data on the field, rather than being applied from an external location)
  • Easily allows for a list of markings via standard XML structures

Disadvantages

  • Content and markings are intermingled, so there's a massive amount of indirection for value fields (i.e. strings, numbers, booleans) that now require a Value to support the Markings.
  • Not able to implement markings on extensions, except at the highest level (i.e. like an adapter)
  • Not supportable as an extension: the structure would change if you try to implement this approach in an extension to STIX.

Attribute-Based Approach

This approach leverages XML attributes (the list capability in particular) to support the same capabilities as the element-based approach but without as much indirection.

<STIX_Package>
  <Marking id="us-cert:TLP-GREEN">TLP:GREEN</Marking>
  <Marking id="us-cert:TLP-RED">TLP:RED</Marking>
  <Indicator markings="us-cert:TLP-GREEN">
    <Title>Indicator Title</Title>
    <Description markings="us-cert:TLP-RED">Indicator Description</Description>
    <Pattern>
      <Substructure>Here</Substructure>
    </Pattern>
    <Producer markings="us-cert:TLP-RED">
      <Name>Name</Name>
      <Address>Address</Address>
    </Producer>
  </Indicator>
</STIX_Package>

This approach greatly reduces the amount of indirection by leveraging XML attribute lists to refer to markings.

Advantages

  • Obvious how things are marked
  • Easily support inheritance of markings (you can say markings inherit until you encounter another markings specification)
  • Easy to apply markings while you parse the document (since they're just data on the field, rather than being applied from an external location)
  • Easily allows for a list of markings via standard XML structures
  • Almost no indirection
  • Works as an extension

Disadvantages

  • Not able to implement markings on extensions, except at the highest level (i.e. like an adapter)
  • Requires that consumers (including their software libraries) be able to implement XML List types.
  • Does not allow for markings to be directly embedded (must be referenced)
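For comparison with the XPath approach, a consumer of the attribute-based form could apply markings in a single tree walk. The following Python sketch uses the standard library's xml.etree.ElementTree; `redact` is a hypothetical helper, and the sketch assumes the `markings` attribute is a whitespace-separated list that replaces any inherited markings.

```python
import xml.etree.ElementTree as ET

DOC = """<STIX_Package>
  <Marking id="us-cert:TLP-GREEN">TLP:GREEN</Marking>
  <Marking id="us-cert:TLP-RED">TLP:RED</Marking>
  <Indicator markings="us-cert:TLP-GREEN">
    <Title>Indicator Title</Title>
    <Description markings="us-cert:TLP-RED">Indicator Description</Description>
    <Pattern><Substructure>Here</Substructure></Pattern>
    <Producer markings="us-cert:TLP-RED">
      <Name>Name</Name>
      <Address>Address</Address>
    </Producer>
  </Indicator>
</STIX_Package>"""


def redact(element, blocked, inherited=frozenset()):
    """Remove, in place, children whose effective markings are blocked."""
    for child in list(element):  # copy: we mutate while iterating
        attr = child.get("markings")
        # An explicit attribute overrides what was inherited from the parent.
        effective = frozenset(attr.split()) if attr is not None else inherited
        if effective & blocked:
            element.remove(child)
        else:
            redact(child, blocked, effective)


root = ET.fromstring(DOC)
redact(root, blocked=frozenset({"us-cert:TLP-RED"}))
print(ET.tostring(root, encoding="unicode"))
```

No XPath evaluation or side table is needed here, although the consumer still has to split the attribute value into a list itself.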

Open Questions

  1. Do the advantages/disadvantages matter here depending on whether we do JSON or XML? Is one preferred?
  2. If we go with an extension approach, which is the preferred "sub-option"?
  3. If we go with a core approach, is that different?

Option 3: Marking only top-level packages, send multiple versions

In this approach, only the package (or, similarly, only top-level objects) is marked. Multiple packages are sent at different TLP levels.

{
  "markings": [{"tlp": "GREEN"}],
  "indicators": [
    {
      "@id": "example.com:1234",
      "title": "Indicator Title",
      "pattern": {"substructure": "here"}
    }
  ]
}

For the object marking approach, you could either define the markings once in the package and reuse them via references, or just redefine them in each object.

The advantages of this base capability are strong:

  • Very easy to describe and interpret markings, because they only appear in distinct locations
  • Very easy to manage at a system level for the same reason.

To support multiple markings in a single package you would issue multiple versions of the package (or, again, the construct if you do it at that level). For example, for the TLP:RED content in the above packages you would issue a second package:
 
{
  "markings": [{"tlp": "RED"}],
  "indicators": [
    {
      "@id": "example.com:1234",
      "title": "Indicator Title",
      "description": "Indicator Description",
      "pattern": {"substructure": "here"},
      "producer": {
        "name": "Name",
        "address": "Address"
      }
    }
  ]
}

Note that at this point we've sent the indicator twice: once as a TLP:GREEN version with a subset of the fields, and once as a TLP:RED version with all of the fields.
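A producer could generate these versions mechanically from one internal record. The Python sketch below is an assumption-laden illustration: `build_versions` and the `field_tlp` map are hypothetical, and the ordering WHITE < GREEN < AMBER < RED is used so that each version contains every field at or below its own level.

```python
TLP_ORDER = ["WHITE", "GREEN", "AMBER", "RED"]


def build_versions(indicator_id, fields, field_tlp, default_tlp):
    """Build one package per TLP level that occurs among the fields."""
    def level_of(name):
        return field_tlp.get(name, default_tlp)

    levels = sorted({level_of(name) for name in fields}, key=TLP_ORDER.index)
    packages = []
    for level in levels:
        ceiling = TLP_ORDER.index(level)
        indicator = {"@id": indicator_id}
        for name, value in fields.items():
            # Include only fields at or below this version's level.
            if TLP_ORDER.index(level_of(name)) <= ceiling:
                indicator[name] = value
        packages.append({"markings": [{"tlp": level}],
                         "indicators": [indicator]})
    return packages


versions = build_versions(
    "example.com:1234",
    fields={
        "title": "Indicator Title",
        "description": "Indicator Description",
        "pattern": {"substructure": "here"},
        "producer": {"name": "Name", "address": "Address"},
    },
    field_tlp={"description": "RED", "producer": "RED"},
    default_tlp="GREEN",
)
for package in versions:
    print(package["markings"], sorted(package["indicators"][0]))
```

Note that the producer still needs field-level knowledge internally; the approach only keeps that detail out of the exchanged documents.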

Open Questions

  1. Consumer systems receiving both versions would need to understand that they describe the same thing. One way to do this would be a relationship of "EQUIVALENT TO" that the versions could use to relate to each other, but still consumer tools would need to understand that relationship.
  2. Data gets marked as over-classified: the entire indicator is not TLP:RED in the example above, yet it's marked that way. In some organizations, policy may prohibit this and, if nothing else, it's confusing.
  3. If you have 2 different types of markings on the object (or package) you issue two packages. What if you have a lot of different types of markings (TLP:RED + Vendor A proprietary + Vendor B proprietary)? You need to issue many, many versions of the object to include the various combinations (TLP:RED + Vendor A; TLP:RED + Vendor B, TLP:RED + Vendor A + Vendor B; TLP:GREEN, TLP:GREEN + Vendor A, etc. etc.)
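The growth described in the last point can be made concrete with a small Python enumeration. This is only a rough upper bound under a stated assumption: that every pairing of one TLP level with any subset of the additional restrictions might need its own version. `marking_combinations` is a hypothetical helper.

```python
from itertools import combinations


def marking_combinations(tlp_levels, restrictions):
    """List every (TLP level + subset of restrictions) combination."""
    combos = []
    for tlp in tlp_levels:
        for size in range(len(restrictions) + 1):
            for subset in combinations(restrictions, size):
                combos.append((tlp,) + subset)
    return combos


two_vendors = marking_combinations(["TLP:GREEN", "TLP:RED"],
                                   ["Vendor A", "Vendor B"])
three_vendors = marking_combinations(["TLP:GREEN", "TLP:RED"],
                                     ["Vendor A", "Vendor B", "Vendor C"])
print(len(two_vendors), len(three_vendors))
```

Under that assumption the count doubles with each additional restriction type, so the number of versions grows exponentially rather than linearly.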

On the other hand, this approach is conducive to an extension: core STIX does not define the above behavior, but sharing communities that require it are able to define it for themselves easily using existing STIX constructs.

Option 4: Top-level objects contain their own markings

In this approach, all top level objects can contain their own markings.

{
    "type": "indicator",
    "marking": {
        "tlp": "white || green || amber || red",
        "extra": "true || false",
        "share": "public || limited || no",
        "jurisdiction": [
            "EU",
            "Safe Harbour"
        ],
        "anonymize": "true || false",
        "details": "some really long detailed text with extra context",
        "handling": {
            "encrypt-at-rest": "true || false",
            "encrypt-in-transit": "true || false"
        }
    }
}
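The `a || b` values above read as lists of alternatives rather than literal strings. As a hedged illustration of how a consumer might check a concrete marking of this shape, here is a small Python sketch. The field names come from the example; `validate_marking` is hypothetical, and treating the true/false fields as JSON booleans is an assumption, since the example does not say whether they are booleans or strings.

```python
ALLOWED = {
    "tlp": {"white", "green", "amber", "red"},
    "share": {"public", "limited", "no"},
}
BOOLEAN_FIELDS = ("extra", "anonymize")
HANDLING_FLAGS = ("encrypt-at-rest", "encrypt-in-transit")


def validate_marking(marking):
    """Return a list of problems found in an object-level marking."""
    errors = []
    for field, allowed in ALLOWED.items():
        if field in marking:
            value = marking[field]
            if not (isinstance(value, str) and value in allowed):
                errors.append(f"{field}: expected one of {sorted(allowed)}")
    for field in BOOLEAN_FIELDS:
        # Assumption: true/false fields are JSON booleans, not strings.
        if field in marking and not isinstance(marking[field], bool):
            errors.append(f"{field}: expected a boolean")
    handling = marking.get("handling", {})
    for flag in HANDLING_FLAGS:
        if flag in handling and not isinstance(handling[flag], bool):
            errors.append(f"handling.{flag}: expected a boolean")
    return errors


good = {
    "tlp": "amber",
    "extra": False,
    "share": "limited",
    "jurisdiction": ["EU"],
    "anonymize": True,
    "handling": {"encrypt-at-rest": True, "encrypt-in-transit": True},
}
bad = {"tlp": "purple", "anonymize": "yes"}
print(validate_marking(good), validate_marking(bad))
```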

Option 5: Choice of object-level or XPath

In this approach, all top level objects can contain their own markings. You can either use the Marking element, which marks the entire construct, or the Controlled_Marking element, which behaves as STIX 1.2 markings.

  • If the Marking element is used, the markings are applied to the entire construct and Controlled_Structure is prohibited. Markings are inherited by children unless overridden locally (i.e. package markings apply to all indicators in the package, unless the indicator contains a Marking element of the same type that overrides it). Additionally, controlled structures in other elements (e.g. the package) may not target fields within the construct. In other words, if you use or see a Marking element on a construct you don't need to worry about a Controlled_Marking somewhere overriding it.
  • If the Controlled_Marking element is used, it will behave the same as STIX 1.2 markings.

The conformance section for the data markings spec would specify three conformance levels:

  • Level 0: No support for markings
  • Level 1: Able to handle Marking element, but not Controlled_Marking
  • Level 2: Able to handle both Marking and Controlled_Marking
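One way a consumer might use these levels is to check, per object, whether it can honor the markings present. The Python sketch below is illustrative only: it uses the `markings` and `controlled_marking` keys from the JSON example for this option, `required_level` and `can_process` are hypothetical names, and rejecting an object that carries both kinds of marking is my reading of the mutual-exclusion rule above.

```python
def required_level(obj):
    """Return the minimum conformance level needed to honor obj's markings."""
    has_marking = "markings" in obj
    has_controlled = "controlled_marking" in obj
    if has_marking and has_controlled:
        # Assumed reading: a construct uses one mechanism or the other.
        raise ValueError("object uses both markings and controlled_marking")
    if has_controlled:
        return 2
    if has_marking:
        return 1
    return 0


def can_process(consumer_level, obj):
    """True if a consumer at consumer_level can honor obj's markings."""
    return consumer_level >= required_level(obj)


simple = {"@id": "example.com:indicator-1234",
          "markings": ["us-cert:TLP-GREEN"]}
advanced = {"@id": "example.com:indicator-4321",
            "controlled_marking": [{
                "marking_ref": "us-cert:TLP-GREEN",
                "controlled_structures": ["$['indicators'][1]"],
            }]}
print(can_process(1, simple), can_process(1, advanced))
```

A Level 1 consumer that receives the second object would have to reject it or treat it conservatively, since it cannot evaluate the controlled structures.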

XML

<STIX_Package>
  <Marking_Structure id="us-cert:TLP-RED">TLP:RED</Marking_Structure>
  <Marking_Structure id="us-cert:TLP-GREEN">TLP:GREEN</Marking_Structure>
  <Indicator>
    <Marking idref="us-cert:TLP-RED" />
    <Title>
      <Value>Indicator Title</Value>
    </Title>
    <Description>
      <Value>Indicator Description</Value>
    </Description>
    <Pattern>
      <Substructure>
        <Value>Here</Value>
      </Substructure>
    </Pattern>
    <Producer>
      <Name>
        <Value>Name</Value>
      </Name>
      <Address>
        <Value>Address</Value>
      </Address>
    </Producer>
  </Indicator>
  <Indicator>
    <Controlled_Marking marking-idref="us-cert:TLP-GREEN">
      <Controlled_Structure>.</Controlled_Structure>
    </Controlled_Marking>
    <Controlled_Marking marking-idref="us-cert:TLP-RED">
      <Controlled_Structure>./Producer</Controlled_Structure>
      <Controlled_Structure>./Description</Controlled_Structure>
    </Controlled_Marking>
    <Title>
      <Value>Indicator Title</Value>
    </Title>
    <Description>
      <Value>Indicator Description</Value>
    </Description>
    <Pattern>
      <Substructure>
        <Value>Here</Value>
      </Substructure>
    </Pattern>
    <Producer>
      <Name>
        <Value>Name</Value>
      </Name>
      <Address>
        <Value>Address</Value>
      </Address>
    </Producer>
  </Indicator>
</STIX_Package>

JSON

{
  "markings": [
    {"@id": "us-cert:TLP-GREEN", "tlp": "GREEN"},
    {"@id": "us-cert:TLP-RED", "tlp": "RED"}
  ],
  "indicators": [
    {
      "@id": "example.com:indicator-1234",
      "markings": ["us-cert:TLP-GREEN"],
      "title": {
        "@value": "Some title"
      },
      "description": {
        "@value": "Indicator Description"
      },
      "pattern": {
        "substructure": {
          "@value": "here"
        }
      },
      "producer": {
        "name": {
          "@value": "Name"
        },
        "address": {
          "@value": "Address"
        }
      }
    },
    {
      "@id": "example.com:indicator-4321",
      "controlled_marking": [
        {
          "marking_ref": "us-cert:TLP-GREEN",
          "controlled_structures": ["$['indicators'][1]"]
        },
        {
          "marking_ref": "us-cert:TLP-RED",
          "controlled_structures": ["$['indicators'][1].description", "$['indicators'][1].producer"]
        }
      ],
      "title": {
        "@value": "Some title"
      },
      "description": {
        "@value": "Indicator Description"
      },
      "pattern": {
        "substructure": {
          "@value": "here"
        }
      },
      "producer": {
        "name": {
          "@value": "Name"
        },
        "address": {
          "@value": "Address"
        }
      }
    }
  ]
}

Advantages

  • Simple things are simple (just use top-level markings)
  • More complicated use case is doable

Disadvantages

  • Not an improvement over the XPath approach for field-level markings

Action Items

  1. Determine if we need to support field-level markings as a core part of STIX (CybOX/TAXII) or whether we can do object-level (or package-level) and then have an extension to do field-level.
  2. Eliminate Option 1 if the community desires, as we tried it in STIX 1.x and it has challenges.
  3. Identify the preferred approaches for Option 2, likely across JSON and XML.
  4. Answer the open questions in Option 3 and, ideally, provide some workflows so we can see how it would work for the simple and more complicated cases.