IR Research — Myia
References:
The Myia framework
Automatic Differentiation in Myia

Forward-Mode
Forward-mode AD applies the chain rule starting from the inputs: for a function with n inputs, computing the full derivative requires n evaluations of the transformed program, one per input direction.
In short: forward differentiation via the chain rule; its cost scales with the number of inputs.
Reverse-Mode
Reverse-mode AD starts from the outputs: for a function with m outputs, the full derivative requires m evaluations, so for a scalar output it completes in a single evaluation.
In short: reverse mode greatly reduces the amount of computation when inputs far outnumber outputs.

Two different implementation approaches are distinguished in the AD literature:
- Operator overloading (OO) traces the program at runtime. Tracing follows function calls and control flow, so the final trace is a linear series of elementary operations (the tape) to which the chain rule can be applied. This approach is easy to implement and flexible, but if the execution path changes between runs (e.g., a data-dependent branch takes a different direction), the program must be traced again and the chain rule reapplied. Moreover, the forward pass is executed regardless of whether its intermediate values are required for the backward pass.
- Source code transformation (SCT) applies the chain rule directly to an intermediate representation of the program, producing a new program that computes the derivative. This requires a transformation that reverses the execution path of the program, which means control flow operators and function calls must be handled explicitly. Note that the resulting program must still use a runtime data structure to store intermediate values.
Automatic differentiation in ML
- Theano: dataflow graph (computation graph)
- TensorFlow: dataflow programming paradigm
- PyTorch: operator overloading
- Like Theano or TensorFlow, it should be amenable to static analysis and global optimization.
- Like PyTorch, the automatic differentiation should be fully general and support all major programming language constructs (function calls, conditionals, loops, recursion, etc.)
- Tight integration with the Python ecosystem, which is still the language of choice for most deep learning researchers.
This approach, source code transformation producing back-propagator closures, satisfies the goal of differentiating any valid program while supporting higher-order differentiation. However, a naive implementation still raises potential optimization issues:
- The J operator is dynamic and may require runtime introspection. For example, it is possible to apply J to itself arbitrarily many times, including a number of times computed by the program itself. However, we do not want to encumber our runtime with a compiler. We must thus limit valid programs to those for which we can identify all possible arguments to J, so that we can compile their transforms ahead of time. Fortunately, unbounded self-application of J is not typical, so we believe this restriction is acceptable.
- SCT modifies functions to return two values: the original return value and a back-propagator closure. Likewise, the back-propagator returns multiple values corresponding to the gradients with respect to each of its inputs (see the sketch after this list). This is represented as tuples of return values, so many calls now involve the allocation of data structures. However, as one may observe in figure 2 of the paper, these tuples are deconstructed as soon as they are returned, so the issue can be solved with a form of copy elision/return value optimization.
- Backpropagation is performed by calling closures that encapsulate the necessary information, but calling a closure is more costly than calling a known function. That said, in any situation where the target of a call is known in the original program, the targets of the calls produced by the transform are also known: the transform does not add uncertainty to the program, so optimizations such as inlining can still be applied.
- The transform produces code to compute a lot of gradients that we have no use for, for example gradients with respect to constant arguments. This is something that inlining and dead code elimination can deal with.
Benefits of the AST-based approach
- Since these graphs are essentially a modified form of an abstract syntax tree (AST), the transformations and analyses performed on computation graphs can be considered program transforms and program analyses. While other DL frameworks also adopt this perspective, their graph-based approaches have made it difficult to bring the full arsenal of traditional compiler and programming-language techniques to bear.
- Most traditional source code transformation methods use a stack (tape) to store intermediate values at runtime. This introduces a mutable runtime structure into the program, which complicates type inference and optimization. Higher-order gradients are complicated by the fact that the gradient transform must explicitly handle read and write operations on this tape. If, on the other hand, the transform produces a program that is free of side effects and valid in the original formulation, then higher-order gradients can be obtained simply by applying the transform repeatedly (see the sketch below). Since no explicit runtime data structures are introduced, all regular optimization methods remain valid and effective.
A Type System
In this paradigm, knowing that all values are tensors allows compiler writers to design and implement optimizations over the AST in a uniform manner. For example, if a user of Relay wants to write an optimization that lifts a computation up one dimension, they can uniformly add a dimension without needing to handle scalar cases. This is very useful for optimizations that change dimensions (e.g. auto-batching, spatial packing, or layout changes).
Functional expression example
Figure: Transform of a simple Python program into Myia's functional representation. The original function (left) is split into three functions or basic blocks (right): the main body, the condition, and the loop body. The latter two take as parameters the free variables used and set by the loop; thus the ostensibly imperative code on the left is successfully converted to a functional representation. The Python syntax used on the right does not correspond to Myia's actual representation and is only illustrative.