SWIRL - themaplelab/swan Wiki

SWAN InteRmediate Language (SWIRL)

Abstract

SWIRL is a linear IR designed to transform Swift Intermediate Language (SIL) into a simplified representation for use with static analysis engines. We specifically designed SWIRL to bridge the gap between SIL and Synchronized Pushdown Systems (SPDS). The SIL to SWIRL translation process features the following:

Drawbacks

SWIRL is naturally quite similar to SIL in its data structures and general design. However, it is more explicit with dataflow and types.

SWIRL

SWAN generates SWIRL in two main stages that SIL's two-stage process directly inspired.

  1. SWIRLGen produces raw SWIRL. SWIRL at this stage is translated mostly verbatim from SIL without any guarantees or semantic transformations.
  2. SWIRLPass produces canonical SWIRL from raw SWIRL.

SWIRLGen and raw SWIRL

SWIRLGen produces raw SWIRL by blindly translating SIL into SWIRL instructions. This stage does not verify if the functions, basic blocks, and values referenced by SIL instructions exist. SWIRLGen translates every function, basic block, and instruction without any significant context. Compilation from SIL into raw SWIRL is trivial mainly because we designed SWIRL's instruction set to funnel SIL instructions into it directly. The relationship between SIL and SWIRL instructions is generally many-to-one because multiple SIL instructions can translate into a single SWIRL instruction. However, there are cases where SWIRLGen translates a SIL instruction into multiple SWIRL instructions to reduce the amount of instructions SWIRL needs. These are the main properties of raw SWIRL/SWIRLGen:

Types

SIL types are tricky and complicated because they are a combination of ObjC, SIL, and Swift types. SWIRLGen tries to convert SIL types to SWIRL types (which are just named types) as best as possible based on the SIL documentation. SIL tends to have many implicit type conversions. For instance, many instructions modify the type of the operand to produce the result type. Some type conversions are placeholders because we do not know what the type semantics are for some instructions. For instance, type attributes are sometimes added or removed from the result of an instruction.

To maintain type fidelity, we try to respect SIL's type semantics and conversions. However, SWIRL does not have a type cast instruction. It ignores SIL's type casting instructions, so there can be implicit type conversions in SWIRL. In the future, we may add a simple instruction to make type casting explicit. Swift is also a type-safe language.

Coroutines

While SIL does have asymmetric coroutines, we do not currently consider preserving their dataflow as essential. Therefore, we treat begin_apply as apply, and SWIRLGen ignores abort_apply and end_apply.

We approximate coroutine dataflow by doing the following.

These approximations should not cause any practical dataflow issues. Coroutines seem to mainly be used by SIL for deallocating the yielded value after the caller no longer needs it. We have yet to see any interesting "back and forth" dataflow.

Globals

SIL has two types of globals: one that it initializes inside a static pseudo-function (called "global variable static initializer list") with the object instruction, and one that it initializes with alloc_global at some point in the program. We cannot handle the object instruction correctly due to missing type initialization information, so we ignore the initialization functions. Doing so should not be problematic because the initializer list cannot practically taint the object unless with a hardcoded string (e.g., password). We also ignore the alloc_global instruction. Instead, SWIRLGen creates a fake function called SWAN_FAKE_MAIN_<module_name> and initializes all globals there for explicitness. Then, the fake function calls the actual main function.

We treat globals as static fields of a singleton class through singleton_read and singleton_write instructions. SPDS globals inspired this choice, and this makes it easier to map SWIRL globals to SPDS rules. The globals belong to an arbitrarily named class (class in an abstract sense). Globals are named "Globals_<module_name>".

Dynamic Dispatch

SIL's dynamic dispatch lookup tables (value and witness tables) allow us to reconstruct a type hierarchy containing Swift's class/protocol relationships. We also include method/class relationships, so the graph is not limited to type relationships. We can use that graph, called a Dynamic Dispatch Graph (DDG), to find what functions dynamic_ref instructions resolve to. Furthermore, CG construction can prune the resolution in an RTA fashion. SWIRLGen also keeps track of all instantiated types for every function for RTA.

The DDG has different types of nodes.

Resolving a dynamic_ref instruction is a traversal problem on the DDG. Each dynamic_ref instruction has an index, and an Index node in the graph represents each index. Therefore, the index node is the starting point for each resolution. The traversal can be constrained by CG construction with types (for RTA) by only looking at methods with edges from given types.

Field Aliasing

SIL has instructions that take the address of a field or otherwise element under a non-primitive value, presenting a problem for a reference-based analysis. Therefore, we use a technique called field aliasing to alias the field of an object. Then, we look for uses of the aliases and resolve them depending on how other instructions use them. SWIRLGen adds the [alias ...] property to field_read instructions when the SIL instruction is taking the address of a field/element. SWIRLPass then resolves these aliases.

If a field_read is an alias, we make a new pointer and write that field to it as a default under-approximation. e.g.,

SIL:   %1 = some_sil_instruction %0, field_foo

SWIRL: %1 = new $`*T`
       %new = field_read [alias %1] %0, field_foo, $`T`
       pointer_write %new to %1

Resolving these aliases is explained in Field Alias Resolution.

Arrays

We implicitly treat all array reads because we simple copy or assign the array value to the result. However, we handle array writes explicitly through a pointer_write instruction with the [weak] attribute. SWIRLPass later converts the pointer_write to a field_write and preserves the weak semantics with the [weak_pointer] attribute. The engine must be sure to treat this instruction as a weak write (do not kill previous value).

SWIRLGen never generates the [weak] attribute. The arrays that SIL reads are low-level arrays and are not the Swift Standard Library (SSL) arrays. Therefore, we do not bother to resolve which values we need to weak write to, similar to what we do with field aliasing. We model the SSL containers with models, and we use the [weak] attribute where needed, such as in the Array setter function.

SWIRLPass and canonical SWIRL

SWIRLPass encompasses all translator passes that work to produce a final canonical SWIRL. The passes are probably the most volatile part of SWIRL currently. For instance, we may add "guarantee" passes in the future. The following passes are run on raw SWIRL to achieve canonical SWIRL.

Field Alias Resolution

As specified in Field Aliasing, SWIRLPass under-approximates field aliases with a generated pointer by default. However, there is a case where further transformations are required.

If the alias is used as a pointer_write destination, excluding the generated pointer_write that succeeds field_read aliased instructions, change pointer_write to field_write. e.g.,

%1 = new $`*T`
%new = field_read [alias %1] %0, field_foo, $`T`
pointer_write %new to %1
[...]
pointer_write %2 to %1 -> field_write %2 to %0, field_foo

All other instructions use the generated pointer (no action required). Suppose the value is used as a basic block argument or goes inter-procedural. In that case, we likely under-approximate the value's dataflow, mainly if SIL writes to it in a meaningful way, such as with a tainted value. In those cases, any write to the pointer will not result in the aliased field value itself changing. SIL causes under-approximation like this when it uses field aliases as apply arguments, especially in Swift class modify functions. While this may be a problematic under-approximation for false-negatives, nothing further can be done at the translation/IR level.

Simplification

This pass simplifies SWIRL down to simplify instruction semantics and enables a more straightforward translation to the receiving analysis engine IR.

cond_br

bb0:  
  cond_br %1, true bb1, false bb2
// ----- to -----
bb0:
  br_if %1, bb1
bb0_0: // new branch
  br bb2

switch

bb0:
  switch %0, case %1 : bb1, case %2 : bb2, ..., default bb3
// ----- to -----
bb0:
  %1_new = binary_op %0 [eq] %1, $`Builtin.Int1`
  br_if %1_new, bb1
bb0_0:
  %2_new = binary_op %0 [eq] %2, $`Builtin.Int1`
  br_if %2_new, bb2
[...]
bb0_n:
  br bb3

If no default exists, the last br_if is simply a br.

switch_enum

bb0:
  switch_enum %0, case "#U.Case1!enumelt" : label1, \
                  case "#U.Case2!enumelt" : label2, \
                  ...
                  default labelN
// ----- to -----
bb0:
  %1 = field_read %0, type, $`Builtin.RawPointer`
  %2 = field_read %0, data, $`Any`
  %3 = literal [string] "#U.Case1!enumelt"
  %3_new = binary_op %1 [eq] %3, $`Builtin.Int1`
  br_if %3_new, label1(%2)
bb0_0:
  %4 = literal [string] "#U.Case2!enumelt"
  %4_new = binary_op %1 [eq] %4, $`Builtin.Int1`
  br_if %4_new, label2(%2)
[...]
bb0_n;
  br labelN(%2)

If no default exists, last br_if is simply a br. data is only given as basic block argument if the basic block takes a parameter.

switch_enum_assign

bb0:
  %n = switch_enum_assign %0, case "#U.Case1!enumelt" : %1, \
                       case "#U.Case2!enumelt" : %2, \
                       default %3, $`T`
// ----- to -----
bb0:
  switch_enum %0, case "#U.Case1!enumelt" : bb0_0, \
                  case "#U.Case2!enumelt" : bb0_1, \
                  ...
                  default bb0_default            
bb0_0:
  br cont(%1)
bb0_1:
  br cont(%2)
[...]
bb0_default;
  br cont(%3)
bb0_cont:(%n : $`Any`)
  // ...

Convert to the equivalent switch_enum, as per SIL.rst documentation. This must be done before switch_enum is simplified.

pointer_write

pointer_write %0 to %1
// ----- to -----
field_write %0 to %1, value

pointer_read

%0 = pointer_read %1, $`T`
// ----- to -----
%0 = field_read %1, value, $`T`

operators

We break operators up because SPDS rules do not support them in their abstract form. We can simplified them by breaking them down into instructions that create the desired dataflow. For instance, we break down the addition operation into two assign instructions (which are both weak writes). We group operators like addition, subtraction, concatenation, etc., under a single operation: [reg] (regular) because their dataflow is the same.

The [arb] operation represents an arbitrary operation. Normally these come from some low-level instruction we do not care about.

unary_op

%1 = unary_op [arb] %0, $`T`

[arb] -> %1 = new $`T`

binary_op

%2 = binary_op %0 [<operation>] %1, $`T`

[arb] -> %2 = new $`T`
[eq]  -> %2 = new $`T`
[reg] -> %2 = assign %0, $`T`
         %2 = assign %1, $`T`

Basic Block Argument Removal

This pass converts basic block arguments (which are phi nodes) to simple assignments. It also removes arguments from blocks. Lastly, canonical SWIRL has explicit function arguments, which before were implicitly the first block's arguments.

  br bb1(%0, %1)
bb1(%a, %b):
  // ...
// ----- to -----
  %a = assign %0
  %b = assign %1
  br bb1
bb1:

Control Flow Graph

SWIRLPass generates a control flow graph (CFG) for each function. The SWIRL printer can print the CFG before each function.

Symbol Table Generation

A symbol table is generated for each function and then used to make sure all referenced values exist.

Syntax

We naturally took inspiration from SIL for the formatting and style of SWIRL. This specification may not be exact because we are still refining the IR, but it should be pretty close.

Values and Operands

string-literal      ::= [A-Za-z_0-9]+
swirl-type         ::= '$' '`' string-literal '`'
swirl-identifier   ::= '`' string-literal '`'
swirl-value-name   ::= string-literal
swirl-value        ::= swirl-value-name
swirl-value        ::= 'undef'
swirl-global-value ::= sil-global-value
swirl-operand      ::= swirl-value-name ':' swirl-type
swirl-argument     ::= swirl-operand

swirl-literal      := swirl-literal-type
swirl-literal-type := '"' string-literal '"'
swirl-literal-type := int
swirl-literal-type := float

swirl-field-name   := string-literal

sil-type is a string literal because SIL types are complex, and we translate SIL type to a SWIRL type by using the string representation of the type. SWIRL also surrounds many components with ` for easier parsing.

Functions

swirl-function ::= 'swirl' swan-function-attribute?
                      swirl-function-name ':' swirl-type
                      '{' swirl-basic-block+ '}'

swan-function-name  ::= '@' '`' [A-Za-z_0-9]+ '`'

Function Attributes

Subject to change. See the Modules and Modelling section for more details on how we use these attributes.

function-attribute ::= '[coroutine]'

The function is a coroutine.

function-attribute ::= '[stub]'

SWIRLGen generates stubs for functions with no body.

function-attribute ::= '[model]'

The function is a handwritten model.

function-attribute ::= '[model_override]'

Module grouper overrode a regular (user-implemented) function with a model (as opposed to stub overridden with model - i.e., [model]).

function-attribute ::= '[linked]'

The module grouper replaced a function stub with implementation (from another Module).

Basic Blocks

swirl-basic-block ::= swirl-label swirl-instruction-def* swirl-terminator
swirl-label ::= swirl-value ('(' swirl-argument (',' swirl-argument* ')')? ':'

line := [0-9]+
column := [0-9]+
swirl-loc := 'loc' url-path ':' line ':' column
swirl-instruction-result ::= swirl-value-name
swirl-instruction-result ::= (' (swirl-value-name (',' swirl-value-name)*)? ')'
swirl-instruction-source-info ::= swirl-loc?
swirl-instruction-def ::= 
  (swirl-instruction-result '=')? swirl-instruction swirl-instruction-source-info

Basic blocks can take in arguments. These arguments are phi values because multiple blocks can branch to the same block.

Instructions

Syntax:

Whether a type is a pointer is specified for completeness and helps justify its usage. However, $T does not mean that $T is necessarily not a pointer. e.g., new instruction can instantiate both pointers and non-pointers.

SIL does implicit type conversion (in its plain-text form), but SWIRL does not because we do not make any further type transformations after transforming SIL to canonical SWIRL. Therefore, the pointer_read instruction, for instance, does not have a pointer type because it represents the concrete result type.


Operators

new

swirl-instruction ::= 'new' swirl-type

%1 = new $`T`
// %0 is of type T

Allocates a new value %1 with type T. The result type can also be a pointer.

assign

swirl-instruction ::= 'assign' swirl-value-name ',' swirl-type

%1 = assign %0, $`T`
// %1 will be of type T

Assign %0 to %1. The result type can also be a pointer. We use this instruction when a COPY is not sufficient. This is a weak assign (do not kill existing information of %1). SIL is an SSA-form IR. Therefore, we use assign to break up phi (block arguments). A weak assignment is also appropriate virtually everywhere, and we do not have a strong version of this instruction.

literal

swirl-instruction ::= 'literal' swirl-literal-type swirl-literal ',' swirl-type

swirl-literal-type := '[int]'
swirl-literal-type := '[float]'
swirl-literal-type := '[string]'

%1 = literal [string] "foo", $`T`
// %1 will be of type T

Write a literal to %1.

builtin_ref

swirl-instruction ::= 'builtin_ref' swirl-function-name ',' swirl-type

%1 = builtin_ref @`assumeNonNegative_Int64`, $`T`
// %1 will be of type T

Write an unknown builtin reference to %1.

dynamic_ref

swirl-instruction ::= 'dynamic_ref' swirl-function-name ',' swirl-type

%1 = dynamic_ref @`#A.init`, $`T`
// %1 will be of type T

Write a dynamic index to %1. See the Dynamic Dispatch section for more information.

function_ref

swirl-instruction ::= 'function_ref' swirl-function-name ',' swirl-type

%1 = function_ref @`foo`, $`T`
// %1 will be of type T

Write a function reference to %1.

apply

swirl-instruction ::= 'apply' swirl-value-name '(' (swirl-value-name (',' swirl-value-name)*)? ')' ',' swirl-type

%r = apply %0(%1, %2, ...), $`T`
// %r will be of type T

Apply (call) a function given a function reference and arguments.

singleton_read

swirl-instruction ::= 'singleton_read' swirl-identifier 'from' swirl-string ',' swirl-type

%0 = singleton_read `myglobal` from Globals, $`T`
// %0 will be of type T

This instruction reads a static field from a singleton class to the result. See Globals section.

singleton_write

swirl-instruction ::= 'singleton_write' swirl-identifier 'to' swirl-identifier 'in' swirl-string

singleton_write %0 to `myglobal` in Globals

This instruction writes a value to a static field of a singleton class.

field_read

swirl-instruction ::= 'field_read' ( '[alias ' swirl-value-name ']' )? ( '[pointer]' )? swirl-value-name ',' swirl-field-name ',' swirl-type

%1 = field_read [alias %foo] %0, bar, $`Any`
// %1 will be of type Any

Read an object's field into %1. Optionally, we specify an alias for field aliasing (see Field Aliasing section). The result type for field_read is unknown because that information is not available in plain-text SIL. Therefore we cannot determine a result type. [pointer] is a readability specifier in canonical SWIRL to indicate this field_read was originally a pointer_read.

field_write

swirl-instruction ::= 'field_write' ( swirl-field-write-attr )? swirl-value-name 'to' swirl-value-name ',' swirl-field-name

swirl-field-write-attr := '[pointer]'
swirl-field-write-attr := '[weak_pointer]'
swirl-field-write-attr := '[weak]'

field_write $0 to %1, foo

Write %0 to object %1's foo field. [pointer] is a readability specifier in canonical SWIRL to indicate this field_write was originally a pointer_write. [weak_pointer] and [weak] specifies a weak write (do not kill previous value). [weak_pointer] specifies that the field_write was original a pointer_write [weak].

unary_op

swirl-instruction ::= 'unary_op' swirl-unary-op swirl-value-name ',' swirl-type

swirl-unary-op := '[arb]'
swirl-unary-op := ...

%1 = unary_op [arb] %0, $`T`
// %1 will be of type T

Apply a unary operation on %0 and write the result to %1. Raw only.

binary_op

swirl-instruction ::= 'binary_op' swirl-value-name swirl-binary-op swirl-value-name ',' swirl-type

swirl-binary-op := '[add]'
swirl-binary-op := '[arb]'
swirl-binary-op := '[eq]'

%2 = binary_op %0 [arb] %1, $`T`
// %2 will be of type T

Apply a binary operation on %0 and %1 and write the result to %2. Raw only.

cond_fail

swirl-instruction ::= 'cond_fail' swirl-value-name

cond_fail %0

Conditional runtime failure same as SIL's cond_fail instruction. Current semantics: No return-to-caller and no dataflow continuation if the operand is true. Because it is not a terminator, we ignore this instruction as an over-approximation, but dead-code elimination could cut the CFG edge to the next instruction later (Boomerang CFG is at a statement level, not block-level).

switch_enum_assign

swirl-instruction ::= 'switch_enum_assign' swirl-value-name
                          (',' swirl-enum-assign-case)* \
                          (',' 'default' swirl-value-name)? \
                          ',' swirl-type
											
swirl-enum-assign-case ::= 'case' sil-decl-ref ':' swirl-value-name

%n = switch_enum_assign %0, case "#U.Case1!enumelt" : %1, \
                       case "#U.Case2!enumelt" : %2, \
                       default %3, $`T`
// %n will be of type T

This instruction is the same as SIL's select_enum instruction. Raw only.

switch_value_assign

swirl-instruction ::= 'switch_valye_assign' swirl-value-name
                          (',' swirl-value-assign-case)* \
                          (',' 'default' swirl-value-name)? \
                          ',' swirl-type
											
swirl-value-assign-case ::= 'case' swirl-value-name ':' swirl-value-name

%n = switch_value_assign %0, case %1 : %4, \
                       case %2 : %5, \
                       default %3, $`T`
// %n will be of type T

This instruction is the same as SIL's select_value instruction. Raw only.

pointer_read

swirl-instruction ::= 'pointer_read' swirl-value-name ',' swirl-type

%1 = pointer_read %0, $`T`
// %1 will be of type T

Represents a pointer read. Raw only.

pointer_write

swirl-instruction ::= 'pointer_write' ( '[weak]' )? swirl-value-name 'to' swirl-value-name

pointer_write %0 to %1

Represents a pointer write. We specify [weak] where we need a weak write. See the Arrays section for more details. Raw only.

Terminators

br

swirl-instruction ::= 'br' swirl-identifier \
                         ( '(' swirl-value-name (',' swirl-value-name)* ')' )?

br bb1
br bb1 (%0, %1)

Branch instruction. Optionally, passes values to branch if the branch takes arguments. The canonical form uses an argument-free version of this instruction.

br_if

swirl-instruction ::= 'br_if' swirl-value-name \
                         ','  swirl-identifier \
                         ( '(' swirl-value-name (',' swirl-value-name)* ')' )?

br_if %0, bb1(%a, %b, ...)

Branch to block if the operand is true. The operand must be of SIL type $Builtin.Int1, so semantically "true" means the operand is 1. Otherwise, fall through to the next block. The canonical form uses an argument-free version of this instruction. Only used by Canonical but is valid in Raw.

cond_br

swirl-instruction ::= 'cond_br' swirl-value-name \
                         ',' 'true' swirl-identifier \
                         ( '(' swirl-value-name (',' swirl-value-name)* ')' )?
                         ',' 'false' swirl-identifier \
                         ( '(' swirl-value-name (',' swirl-value-name)* ')' )?

cond_br %0, true bb1(%a, %b, ...), false bb2(%x, %y, ...)

Branch based on whether the operand is true. The operand must be of SIL type $Builtin.Int1, so semantically "true" means the operand is 1. Raw only.

switch

swirl-instruction ::= 'switch' swirl-value-name
                          (',' swirl-switch-case)* \
                          (',' 'default' swirl-identifier)

swirl-switch-case ::= 'case' swirl-value-name ':' swirl-identifier

switch %0, case %1 : bb1, case %2 : bb2, ..., default bb3

This instruction is the same as SIL's switch_value instruction. Destination blocks do not take in arguments. Raw only.

switch_enum

swirl-instruction ::= 'switch_enum' swirl-value-name
                          (',' swirl-switch-enum-case)* \
                          (',' 'default' swirl-identifier)?
											
swirl-switch-enum-case ::= 'case' sil-decl-ref ':' swirl-identifier

switch_enum %0, case "#U.Case1!enumelt" : label1, \
                case "#U.Case2!enumelt" : label2, \
                ...
                default labelN

This instruction is the same as SIL's switch_enum instruction. Destination blocks do not take in arguments. Raw only.

return

swirl-instruction ::= 'return' swirl-value-name

return $0

Regular return instruction. While SIL instructions can have multiple results, in SWIRL, we only support returning a single value.

unreachable

swirl-instruction ::= 'unreachable'

unreachable

Control flow never reaches this instruction. Must be preceded by an apply instruction with the return type Never. The function will abort the entire program, and therefore the next instruction after apply is never reached. Semantics: no return-to-caller and no dataflow continuation.

yield

swirl-instruction ::= 'yield' swirl-yield-values
                         ',' 'resume' swirl-identifier
                         ',' 'unwind' swirl-identifier
                         
swirl-yield-values ::= '(' (swirl-value-name (',' swirl-identifier)*)? ')'

yield (%0, ...), resume bb1, unwind bb2

This instruction is the same as SIL's yield instruction. See the Coroutines section for more details on semantics.

Implicit

COPY

Many instructions are translated as a symbol table copy when assign is unnecessary because this reduces the number of instructions by avoiding having assign instructions everywhere. In the future we may blow this away and use assign instructions, or even create and use a cast instruction, to have explicit type conversions.

We define the following as a documentation-only definition of this implicit "instruction" because it is not an actual SWIRL instruction.

swirl-instruction ::= 'COPY' swirl-value-name 'to' swirl-value-name

COPY %0 to %1

However, this operation is not printed and is internal.

NOP

Ignore the instruction.

Modules and Modelling

Each Swift application consists of one or more modules. Usually, one of these represents the user code, and the rest are libraries, which can be numerous, especially in CocoaPods applications. Therefore, each module is processed individually and contains its own functions. Modules can reference each other, and therefore we merge these modules into a single "grouped" module by carefully weaving their functions together, overwriting stubs with implementations, stubs with models, and implementations with models. We provide models as a separate module written in raw SWIRL and parsed alongside the other modules. For instance, here is the model for the SSL Array getter function.

// %0 -> value to write, %1 -> index, %2 -> array
// Write array to pointer %0. Index not used.
// Also likely returns the element, but the return value isn't used in practice.
func [model] @`Swift.Array.subscript.getter : (Swift.Int) -> A` : $`@out τ_0_0` {
bb0(%0 : $`*Any`, %1 : $`Int`, %2 : $`Array<τ_0_0>`):
  %3 = pointer_read %2, $`τ_0_0`
  pointer_write %3 to %0
  return %3
}

We allow our models to override user function implementations, and we make it evident when we do this with [model_override] function attribute. For the most part, we overwrite [stub] functions, which SWIRLGen generates when a function has no body, with models ([model] attribute). Lastly, when a module provides an implementation to another module, the [stub] is overwritten with the implementation. We give this function the [linked] attribute.


SIL to raw SWIRL translation is available here.