Math - RicoJia/notes GitHub Wiki

========================================================================

Basics

========================================================================

  1. factorial(0) = 1

  2. 0^0=1

  3. Evaluating polynomials:

    • Qin Jiushao's / Horner's method: a_0 + x(a_1 + x(a_2 + ...)) minimizes the number of multiplications (one per coefficient); addition is a lot faster than multiplication. See the sketch below.
    • Estrin's Scheme:
      (a_0 + a_1 x) + (a_2 + a_3 x)*x^2 + (a_4 + a_5 x)*x^4 + ..., can be parallelized, like divide and conquer
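
A minimal sketch of both schemes (the coefficient ordering c[0] + c[1]*x + ... and the test values are assumptions for illustration):

```python
# Horner's: one multiplication per coefficient; Estrin's: pairwise, parallelizable.
def horner(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... by nesting: (...(c[n]*x + c[n-1])*x + ...) + c[0]."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def estrin(coeffs, x):
    """Combine adjacent coefficient pairs, then pairs of pairs with x^2, x^4, ...
    Each level's combinations are independent, so they can run in parallel."""
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:                 # pad to an even length
            terms.append(0.0)
        terms = [terms[i] + terms[i + 1] * power for i in range(0, len(terms), 2)]
        power *= power                     # x, x^2, x^4, ...
    return terms[0]

assert abs(horner([2, 3, 1], 2.0) - estrin([2, 3, 1], 2.0)) < 1e-12   # 2 + 3x + x^2 at x=2 is 12
```
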
  4. Distance between two parallel lines (hyperplanes): w^T x + b1 = 0, w^T x + b2 = 0, distance = |b2 - b1| / ||w||. A quick numerical check follows the derivation.

    w^T x1 = -b1
    w^T x2 = -b2
    projection of (x1 - x2) onto the normal vector w: w^T(x1 - x2) / ||w|| = (b2 - b1) / ||w||
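
A quick numerical check of the distance formula (the plane form w^T x + b = 0 and the sample values are assumptions for illustration):

```python
import numpy as np

w, b1, b2 = np.array([3.0, 4.0]), 1.0, 11.0
x1 = -b1 * w / w.dot(w)                      # a point on w^T x + b1 = 0
x2 = -b2 * w / w.dot(w)                      # a point on w^T x + b2 = 0 (both lie along w)
d_geometric = np.linalg.norm(x1 - x2)        # distance measured along the normal
d_formula = abs(b2 - b1) / np.linalg.norm(w)
assert np.isclose(d_geometric, d_formula)    # both give 10 / 5 = 2
```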
    
  5. Numerically stable sigmoid (see the sketch below):

    1/(1+exp(-x)) # fine as x -> +inf (output goes to 1), but as x -> -inf the output should go to 0, yet exp(-x) overflows.
    
    • sol: use the algebraically equivalent form
    exp(x)/(1+exp(x)) for x<0
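
A minimal numpy sketch of the branching trick above (the function name stable_sigmoid is made up here):

```python
import numpy as np

def stable_sigmoid(x):
    """Use whichever equivalent form keeps the exp() argument non-positive."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos, neg = x >= 0, x < 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp(-x) cannot overflow for x >= 0
    exp_x = np.exp(x[neg])                     # exp(x) cannot overflow for x < 0
    out[neg] = exp_x / (1.0 + exp_x)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))   # [0.  0.5 1. ] with no overflow warning
```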
    

========================================================================

Basic Calculus

========================================================================

Derivatives

  1. Gradient of vectors

    • Look at them element-wise, proof is here, and here
    • In many cases A^T = A (symmetric A)
    • Is the gradient always the steepest-ascent direction (and its negative the steepest descent)? Yes, but we need to be clear about which function we mean. For f(x, y) = x + y, the constraint x + y = 1 is a vertical plane in 3D space, and along it the gradient direction (1, 1) can never actually be followed.

  2. Gateaux Derivative

    • Gateaux derivative, aka directional derivative (definition below).

      Small perturbation with epsilon - directional derivative
    • Examples.

    • Differentiability doesn't always follow from directional derivatives

      Differentiability != the existence of directional derivatives
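
A minimal statement of the definition and of the gap mentioned above ($v$ is the direction, $\epsilon$ the perturbation size):

$$ D_v f(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon v) - f(x)}{\epsilon} $$

The classic counterexample $f(x,y) = \frac{x^2 y}{x^4 + y^2}$ (with $f(0,0)=0$) has every directional derivative at the origin, yet it is not differentiable (not even continuous) there: along the parabola $y = x^2$ it is identically $\frac{1}{2}$.
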
  3. Scalar - Vector, Vector - Vector, Matrix - Vector

    • Reference
    • Vector - Vector, row/column vectors matter!
      1. Definition

      2. Key rules

      3. Examples

========================================================================

Linear Algebra

========================================================================

  1. Good Linear Algebra Book

    • Differentiate Scalar w.r.t Vector:
      • Case 1: $$ f(x) = Ax = A_0x_0 + A_1x_1 + ..., \quad \frac{\partial f(x)}{\partial x} = [A_0 | A_1 | ...] = A $$
      • Case 2: $$ f(x) = x^TA = [x^TA_0, x^TA_1, ...], \quad \frac{\partial f(x)}{\partial x} = \begin{bmatrix} a_{0,0} & a_{1,0} & ... \\ a_{0,1} & a_{1,1} & ... \\ ... \end{bmatrix} = A^T $$
      • Case 3: $$ f(x) = x^TAx = x_0(a_{00}x_0 + a_{01}x_1 + ...) + x_1(a_{10}x_0 + a_{11}x_1 + ...) + ..., \quad \frac{\partial f(x)}{\partial x} = \left[\sum_j a_{j0}x_j, \sum_j a_{j1}x_j, ...\right] + \left[\sum_i a_{0i}x_i, \sum_i a_{1i}x_i, ...\right] = A^Tx + Ax $$ (numerical check below)
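
A quick finite-difference check of Case 3 (random A and x; the helper num_grad is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
eps = 1e-6

def num_grad(f, x):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

g_numeric = num_grad(lambda v: v @ A @ v, x)
g_formula = (A + A.T) @ x                             # A^T x + A x
print(np.allclose(g_numeric, g_formula, atol=1e-4))   # True
```
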
  2. Condition Number

    • Definition: ratio of the largest to the smallest eigenvalue (more generally, singular value). $$ c = \frac{\lambda_{max}}{\lambda_{min}} $$
    • When one eigenvalue is 0, we have a singular matrix.
    • Example of a poorly conditioned matrix: assume A is nearly singular, $$ A = \begin{bmatrix} 2 & 2.01 \\ 2 & 2.00 \end{bmatrix} $$
      • Then $Ax=b$ can have a very unstable $x$: small changes in $b$ result in huge changes in $x$ (demo below).
      • Floating-point rounding errors in modern computers make this worse. You run into this in the Gauss-Newton method.
    • For an ill-conditioned matrix, its eigen-ellipsoid is very flat.
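
A minimal demo of the near-singular example above (the perturbation size of 0.01 is an assumption):

```python
import numpy as np

A = np.array([[2.0, 2.01],
              [2.0, 2.00]])
print(np.linalg.cond(A))                  # ~8e2: nearly singular, poorly conditioned

b1 = np.array([4.01, 4.00])
b2 = np.array([4.02, 4.00])               # perturb one entry of b by 0.01
x1 = np.linalg.solve(A, b1)               # [1, 1]
x2 = np.linalg.solve(A, b2)               # [0, 2]
print(np.linalg.norm(x2 - x1))            # ~1.41: the solution moved ~140x the perturbation
```
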
  3. Cross product can be written as matrix multiplication (by the skew-symmetric "hat" matrix; see the sketch below)
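
A short check of the identity a x b = hat(a) b (the vector values are arbitrary):

```python
import numpy as np

def hat(a):
    """Skew-symmetric matrix such that hat(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
print(np.allclose(np.cross(a, b), hat(a) @ b))   # True
```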

  4. The Hessian matrix is the quadratic matrix Q in the quadratic form P^T Q P (up to a factor of 2 when Q is symmetric)

  5. Cholesky factorization to solve equations (sketch below)
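
A minimal sketch of solving A x = b with a Cholesky factorization, assuming A is symmetric positive definite (the 2x2 example values are made up):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric positive definite
b = np.array([2.0, 5.0])
c, low = cho_factor(A)                # factor A = L L^T once
x = cho_solve((c, low), b)            # then solve by forward/back substitution
print(np.allclose(A @ x, b))          # True
```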

  6. (AB)^-1 = B^-1 * A^-1

Convolution

  1. 1D convolution: just flip the kernel, then slide it one step at a time (check below).

  2. 2D convolution: same idea, with the kernel flipped in both dimensions. But if you've got two vectors like these
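
A small check that 1D convolution is "flip, then slide" (the signal and kernel values are arbitrary):

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

# explicit flip + slide, only over fully overlapping positions
manual = np.array([np.sum(signal[i:i + 3] * kernel[::-1]) for i in range(len(signal) - 2)])
print(np.allclose(manual, np.convolve(signal, kernel, mode='valid')))   # True
```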

Eigen Vectors are Major and Minor Axes of an Ellipse

  1. $Ax^2+Bxy+Cy^2=1$ is the generic ellipse, centered at the origin. You may get other conics (e.g., parabolas or hyperbolas) if A, B, C don't satisfy the "positive definite" condition: $Ax^2+Bxy+Cy^2 > 0$ for all $(x,y) \neq 0$. The quadratic form can be written with a symmetric matrix, M.

  2. Lemma: M's eigenvectors must be perpendicular to each other, if the two eigenvalues are distinct

  3. Can show the eigenvector with the smaller eigenvalue corresponds to the major axis, because it yields the longest vector that still lies on the ellipse (half-axis length $1/\sqrt{\lambda}$); see the sketch below.

  4. Can also show the product of M's two eigenvalues is det(M)
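
A minimal sketch of items 3 and 4, assuming the matrix form M = [[A, B/2], [B/2, C]] and arbitrary coefficients:

```python
import numpy as np

A, B, C = 3.0, 2.0, 2.0                      # ellipse A x^2 + B x y + C y^2 = 1
M = np.array([[A, B / 2.0],
              [B / 2.0, C]])
lam, V = np.linalg.eigh(M)                   # eigenvalues ascending, orthonormal eigenvectors
half_axes = 1.0 / np.sqrt(lam)               # along eigenvector v_i the half-axis is 1/sqrt(lambda_i)
print(half_axes)                             # the smaller eigenvalue gives the longer (major) axis
print(np.isclose(np.prod(lam), np.linalg.det(M)))   # product of eigenvalues equals det(M)
```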

========================================================================

Optimization

========================================================================

  1. Lagrange Multiplier

    • Goal: maximize $f(x,y) = xy+1$, given the constraint $g(x,y)=c$, here $x^2+y^2=1$
    • Geometric intuition: at an extremum, the level curve of f(x,y) and the constraint curve it must stay on are tangent to each other. That is, a small perturbation along the constraint curve causes no first-order change in the objective value, hence a potential extremum is reached.

    • So this is equivalent to defining $L=f(x)-\lambda g(x)$ and solving $[\frac{\partial{L}}{\partial{x}}, \frac{\partial{L}}{\partial{\lambda}}] = 0$

    Lagrange multiplier: finding extrema of a function under a given constraint = finding where the objective's level curve and the constraint are tangent to each other (worked example below)
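
A minimal worked version of the stated example, f(x,y) = xy + 1 on x^2 + y^2 = 1:

$$ \nabla f = \lambda \nabla g \Rightarrow (y, x) = \lambda (2x, 2y), \quad x^2 + y^2 = 1 \\ y = 2\lambda x, \; x = 2\lambda y \Rightarrow \lambda = \pm\frac{1}{2} \\ \lambda = \frac{1}{2}: x = y = \pm\frac{1}{\sqrt{2}}, \; f = \frac{3}{2} \; (\text{max}); \quad \lambda = -\frac{1}{2}: x = -y, \; f = \frac{1}{2} \; (\text{min}) $$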

Convex Optimization

  1. Convex function and convex set (definitions below)
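
Minimal definitions (with $\theta$ as the interpolation weight):

$$ \text{Convex set } S: \; x, y \in S \Rightarrow \theta x + (1-\theta) y \in S, \; \forall \theta \in [0, 1] \\ \text{Convex function } f: \; f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y), \; \forall \theta \in [0, 1] $$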

Least Squares

Least Squares Minimization is a standard approach for computing a solution that minimizes the total squared error of an "overdetermined system". An overdetermined system is one where you have more equations than unknown variables: $$ 2x + 3y = 26 \\ x + y = 10 \\ x - y = 2 $$

  • Write the system of equations as $$ A=\begin{bmatrix} 2 & 3 \\ 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad x = [x, y]^T, \quad b = [26, 10, 2]^T, \quad Ax = b $$
  • Solve the problem using $$ argmin(|Ax-b|^2) = argmin(x^TA^TAx - 2(Ax)^Tb + b^Tb) \Rightarrow x = (A^TA)^{-1} A^Tb $$
  1. Advantages vs disadvantages (a numpy example follows):
    • Advantages: closed-form solution to the linear cost function $|Ax-b|^2$
    • Disadvantages: for a large dataset, computing $A^TA$ is very expensive. Also, when new data comes in, the closed-form solution has to be recomputed, whereas gradient descent can incorporate new data (and regularization) incrementally.
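
A minimal numpy check on the system above, comparing the normal equations with np.linalg.lstsq:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, 1.0],
              [1.0, -1.0]])
b = np.array([26.0, 10.0, 2.0])

x_normal = np.linalg.solve(A.T @ A, A.T @ b)          # x = (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_normal, x_lstsq)                              # both ~[6.27, 4.4], the least-squares fit
```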

Gradient Based Optimization

Problem formulation: minimize a least-squares cost $|F(x)|^2$. The least-squares problem above is the linear special case, where $F(x) = Ax-b$; the methods below also handle nonlinear $F(x)$.

  1. Gradient Descent:

  • Advantages: easy to compute. For SGD, only one sample per update is needed, so it's fast to compute. Also, since the variance between single-sample updates is higher than that of mini-batch gradients, we might be able to jump out of a local minimum toward the global minimum
  • Disadvantages: oscillation near a local minimum.
    • So momentum is added to counteract the sudden changes in gradient.
  2. Newton-Raphson (root finding for a non-linear function) $$ f(x) = 0, \quad f(x_n) + f'(x_n)(x - x_n) = 0 \Rightarrow x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} $$

    • This is one of the fastest-converging methods out there (quadratic convergence near the root), and the step size $1/f'(x_n)$ is adjusted automatically
  3. Gauss-Newton (for least-squares problems, using Newton's method to get a zero derivative; a minimal iteration sketch follows the disadvantages below) $$ x = argmin(|F(x)|^2), \quad F(x) \approx f(x_0) + J\Delta x \\ \frac{\partial |F(x)|^2}{\partial \Delta x} = 2J^TJ \Delta x + 2J^Tf(x_0) = 0 \Rightarrow \Delta x = -(J^TJ)^{-1}J^Tf(x_0), \text{ where } H \approx J^TJ $$

    • Disadvantages:
      • H should be positive definite, but in reality $J^TJ$ may only be positive semi-definite, so $(J^TJ)^{-1}$ can be unstable.
      • As with LM, GN is sensitive to initial values.
      • Need to calculate Jacobians.
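
A minimal Gauss-Newton iteration sketch for a made-up one-parameter residual r(a) = y - exp(a*t) (the data, the initial guess, and the iteration count are assumptions for illustration):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t)                              # data generated with a_true = 0.7

a = 0.0                                          # initial guess
for _ in range(10):
    r = y - np.exp(a * t)                        # residual f(x_0)
    J = (-t * np.exp(a * t)).reshape(-1, 1)      # Jacobian dr/da, shape (m, 1)
    delta = -np.linalg.solve(J.T @ J, J.T @ r)   # dx = -(J^T J)^{-1} J^T f(x_0)
    a += delta.item()
print(a)                                         # converges to ~0.7
```
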
  4. LM: add a regularization (damping) term to penalize the magnitude of $\Delta x$. $$ \Delta x = argmin(|f(x_0) + J(x_0) \Delta x|^2 + \mu|\Delta x|^2) \\ \frac{\partial F}{\partial \Delta x} = 2J^TJ\Delta x + 2J^Tf(x_0) + 2\mu \Delta x = 0 \Rightarrow \Delta x = -(J^TJ + \mu I)^{-1}J^Tf(x_0) $$

    • Initialize $\mu$ to a small positive value. In one iteration

      1. Compute the candidate update $\Delta x$ and evaluate the cost $F(x_{new})$ at $x_{new} = x_0 + \Delta x$
      2. Compare the actual improvement of the cost with the improvement predicted by the linearized model; if the agreement is within a region (the trust region), we can adjust $\mu$ accordingly. The gain ratio for LM is $\rho = \frac{F(x_0) - F(x_{new})}{L(0) - L(\Delta x)}$, where $L(\Delta x) = \frac{1}{2}|f(x_0) + J\Delta x|^2$ is the linearized model (actual reduction over predicted reduction).
        1. If $\rho \approx 1$, there is good agreement between the model prediction (with Jacobians) and the actual values. Accept $x_{new}$, decrease $\mu$: $\mu = \frac{\mu}{\beta}$
        2. If $\rho$ is smaller but above a threshold, keep $\mu$ and accept $x_{new}$
        3. If $\rho$ is close to zero, or even negative, reject $x_{new}$, increase $\mu$: $\mu = \mu \beta$
      3. The trust region is about assessing the agreement between the model and the actual objective function.
    • Advantage: LM is a lot more robust than GN. Why would regularization work?

      1. When $H \approx J^TJ$ is ill-conditioned, $\mu$ can stabilize the step $\Delta x$
      2. Gradient descent is more robust but slower; Gauss-Newton is more adaptive but less stable. The larger $\mu$ is, the more "gradient descent"-like the behavior (i.e., the $J^TJ$ approximation of the Hessian is effectively ignored), which is good when we are far from a minimum. The smaller $\mu$ is, the more "Gauss-Newton"-like the behavior.
    • Disadvantages:

      • $\mu$ - one more parameter to tune. Unlike the regularization parameter in neural networks, which typically stays the same, $\mu$ is adapted every iteration. A compact damping-loop sketch follows.
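
A compact LM damping-loop sketch on the same made-up one-parameter model as the Gauss-Newton sketch (beta = 2, the initial mu, and the plain cost-decrease test standing in for the full gain-ratio test are all assumptions):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t)

def residual(a): return y - np.exp(a * t)
def jacobian(a): return (-t * np.exp(a * t)).reshape(-1, 1)

a, mu, beta = 0.0, 1e-2, 2.0
cost = 0.5 * residual(a) @ residual(a)
for _ in range(50):
    r, J = residual(a), jacobian(a)
    delta = -np.linalg.solve(J.T @ J + mu * np.eye(1), J.T @ r)   # damped step
    new_cost = 0.5 * residual(a + delta.item()) @ residual(a + delta.item())
    if new_cost < cost:                          # crude stand-in for the trust-region ratio test
        a, cost, mu = a + delta.item(), new_cost, mu / beta       # accept: behave more like GN
    else:
        mu *= beta                               # reject: behave more like gradient descent
print(a)                                         # ~0.7
```
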
  5. Powell's DogLeg: an explicit combo of GN and gradient descent.

    • Steepest descent: $F(x) = 1/2 |r(x)|^2$. So update would be: $\Delta x_{sd} = - \alpha J^T(x) r(x)$
    • GN: $F(x) = 1/2 |r(x_0) + r'(x_0)\Delta x|^2$, so the update would be $\Delta x_{gn} = -(J^T(x_0)J(x_0))^{-1}J^T(x_0)r(x_0)$
    • Make the update (with trust-region radius $\Delta$)
      • If $|\Delta x_{gn}| \le \Delta$, then $\Delta x = \Delta x_{gn}$
      • Else if $|\Delta x_{sd}| \ge \Delta$, then $\Delta x = \Delta \frac{\Delta x_{sd}}{|\Delta x_{sd}|}$
      • Otherwise, find $\tau$ such that $||\Delta x_{sd} + \tau(\Delta x_{gn}-\Delta x_{sd})|| = \Delta$, and set $\Delta x = \Delta x_{sd} + \tau(\Delta x_{gn}-\Delta x_{sd})$
    • Good for scenarios where the Hessian-related computational cost of LM is prohibitive (a step-selection sketch follows)
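
A sketch of the step selection above, given precomputed GN and steepest-descent steps and a trust-region radius Delta (all numbers are made up):

```python
import numpy as np

def dogleg_step(dx_gn, dx_sd, Delta):
    if np.linalg.norm(dx_gn) <= Delta:           # GN step already fits in the trust region
        return dx_gn
    if np.linalg.norm(dx_sd) >= Delta:           # even the SD step is too long: scale it back
        return Delta * dx_sd / np.linalg.norm(dx_sd)
    d = dx_gn - dx_sd                            # walk from the SD step toward the GN step
    # solve ||dx_sd + tau * d|| = Delta for tau in [0, 1] (a quadratic in tau)
    aa, bb, cc = d @ d, 2 * dx_sd @ d, dx_sd @ dx_sd - Delta**2
    tau = (-bb + np.sqrt(bb**2 - 4 * aa * cc)) / (2 * aa)
    return dx_sd + tau * d

print(dogleg_step(np.array([3.0, 0.0]), np.array([1.0, 0.0]), 2.0))   # [2. 0.]
```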

When Could Gauss-Newton and LM go crazy

  1. When $J^T(x_0)f(x_0) = 0$, GN basically stops. That happens when we have converged, either to a saddle point or to a minimum
  2. When $J^T(x_0)J(x_0)$ is singular / near-singular, the cost function around $x_0$ is likely in a flat region, where many directions barely change the cost. In that case it's ill-conditioned (see the condition number section)

Program

  1. np.polyfit (linear least squares) is preferable to scipy.optimize.curve_fit (non-linear, LM-based) for polynomial fitting

  2. scipy.optimize.least_squares can do pretty much any Levenberg-Marquardt-style optimization. See code

    • Inside, it has linear and non-linear solvers (methods "trf", "dogbox", "lm")
    • Note that the cost function returns the residual vector rather than its square. x is an (n,) array, the residual is an (m,) array
    • result = least_squares(FixtureCalibration.cost_func, pose, method = "trf", verbose = 2, max_nfev=30, ftol=1e-8, args=(marker_id, imgs_proj, all_detected_corners, self.camera_matrix, self.marker_size_m, P_origin_cam_dict))

========================================================================

PCA, SVD

========================================================================

  1. SVD
    1. Predecessor to PCA. In recommendation systems, SVD was used a lot. It can be used to reduce dimensions

    2. Eigenvalue decomposition: an mx1 vector can be represented in the eigenspace of A

    3. Singular Value Decomposition: for an mxn matrix A with rank(A) = k, first we can get the orthonormal nx1 eigenvectors of A'A, the columns of V. Then, u_i = A v_i / sigma_i gives an orthogonal set of mx1 vectors, the columns of U

    4. Then we complete an orthonormal basis of nxn with V, and an orthonormal basis of mxm with U. The nice thing about SVD is: for an nx1 vector x, Ax can be computed through a kx1 vector of coordinates (only the k nonzero singular values matter), so that's data compression

    5. Do we need to calculate u=Av every time? No, it turns out the columns of U are eigenvectors of AA' (the ones with nonzero eigenvalues come from Av/sigma; the rest span the null space of A'). A numerical sketch follows
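
A minimal numpy sketch of points 3-5 (the random rank-2 matrix is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # 5x4 matrix with rank k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 6))                            # only the first 2 singular values are nonzero

print(np.allclose(A @ Vt[0], s[0] * U[:, 0]))    # u_0 = A v_0 / sigma_0

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]         # keep only k singular triplets
print(np.allclose(A, A_k))                       # rank-k reconstruction recovers A exactly
```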

========================================================================

Lie Algebra

========================================================================

  1. Lie Algebra

  2. Basic properties (group axioms, Chinese mnemonic "凤姐咬你"): closure, associativity (a+(b+c)), identity, inverse

    • 4 ways to represent rotation:

      1. Rotation matrix (not closed for addition)
      2. Lie Algebra, the vector space of rotation axes (scaled by the rotation angle)
      3. Quaternion
      4. Euler angles
        • Gimbal lock: rotation in 3D space can be represented in the frames of 3 gimbals. But when the three gimbal axes are in the same plane, rotations in only 2 dimensions can be represented.

    • Inverse: R*R^T = I

  3. Derivative of R

    1. hat operator and skew symmetric matrices

    2. R can be represented as exp(W^ t), which lies in a Lie group (smooth because of t)

      • R'R^T is skew-symmetric if R is a function of theta (differentiate R R^T = I to see this)

      • so R(theta) = exp(W^ theta). Note: t below should be theta

    3. d[exp(W^ theta)]/d(theta) = W^ exp(W^ theta)

    4. Rodrigues formula: represent exp(W^ theta) using theta and W^ (numerical check below)

      • Using these properties

      • Derivative of R: use the log map, phi^ = ln(R)
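
A quick check of Rodrigues' formula exp(theta W^) = I + sin(theta) W^ + (1 - cos(theta)) (W^)^2 against scipy's matrix exponential (the axis and angle are arbitrary):

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    """Skew-symmetric (hat) matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

w = np.array([1.0, 2.0, 2.0]) / 3.0              # unit rotation axis
theta = 0.8                                      # rotation angle in radians
W = hat(w)
R_rodrigues = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * W @ W
print(np.allclose(R_rodrigues, expm(theta * W)))                 # True
print(np.allclose(R_rodrigues @ R_rodrigues.T, np.eye(3)))       # R is orthogonal
```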

  4. T's representation: SE(3), where xi is [p1, p2, p3, theta1, theta2, theta3] (translation and rotation parts)

    • T can be represented as matrix exp, too. matrix exponential: exp(A) = sum(1/factorial(n) * A^n)

    • Proof: write it out

    • using the same properties (note phi is w above), you can prove it

========================================================================

Probability Theory

========================================================================

  1. Any single-variate Gaussian distribution can be written in terms of Z ~ N(0,1) (aka the standard Gaussian distribution). This means we can use the relative distance in units of sigma (the z-score) to locate a value in its distribution

  2. Standard multivariate distribution: an IID (independent & identically distributed) vector Z = [z1, z2 ... zn] is N(0, I)

  3. Arbitrary multivariate Gaussian distribution: X = [x1, x2 ... xn], where the components are not necessarily independent of each other. But we have a theorem: any such X (with positive-definite covariance) has a full-rank matrix B such that Z = B^-1 (X - mu) is standard normal. So we can write out the PDF of Z, and the covariance of X is recovered as Sigma = B B^T

  4. Then, through the CDF (change of variables), we can replace z with B^-1 (x - mu) and write out the PDF of X, picking up the Jacobian factor 1/|det B|; the result is below.
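
Written out (mu is the mean, Sigma = B B^T the covariance, n the dimension):

$$ Z = B^{-1}(X - \mu) \sim N(0, I), \quad \Sigma = BB^T \\ p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right) $$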

========================================================================

Misc

========================================================================

  1. Bhattacharyya distance: better than the Mahalanobis distance (which assumes the two distributions have the same standard deviation). D_B = -ln(Bhattacharyya coefficient)
  2. Bhattacharyya Coefficient:

    The Bhattacharyya coefficient is the sum of the square roots of the element-wise products, BC(p, q) = sum_i sqrt(p_i q_i), which is the cosine of the angle between the vectors sqrt(p) and sqrt(q) (sketch below)
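
A minimal sketch for two discrete histograms (the histogram values are made up):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
bc = np.sum(np.sqrt(p * q))     # Bhattacharyya coefficient, cosine between sqrt(p) and sqrt(q)
d_b = -np.log(bc)               # Bhattacharyya distance
print(bc, d_b)                  # bc close to 1 and d_b close to 0 for similar histograms
```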