Math - RicoJia/notes GitHub Wiki
========================================================================
========================================================================
- factorial(0) = 1
- 0^0 = 1
- Evaluating polynomials (sketches of both schemes below):
    - Qin Jiushao's algorithm (Horner's method): rewrite $a_0 + a_1x + a_2x^2 + ...$ as $((a_nx + a_{n-1})x + a_{n-2})x + ...$, which minimizes the number of multiplications. Addition is a lot faster than multiplication.
    - Estrin's scheme: group the terms as $(a_0 + a_1x) + (a_2 + a_3x)x^2 + (a_4 + a_5x)x^4 + ...$, which can be parallelized, divide-and-conquer style.
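A minimal sketch of both evaluation orders (the example polynomials are arbitrary):

```python
import numpy as np

def horner(coeffs, x):
    """Evaluate a_n*x^n + ... + a_1*x + a_0, coefficients given highest-order first (Qin Jiushao / Horner)."""
    result = 0.0
    for a in coeffs:
        result = result * x + a   # one multiply + one add per coefficient
    return result

# 3x^2 + 2x + 1 at x = 2 -> 17
print(horner([3.0, 2.0, 1.0], 2.0))       # 17.0
print(np.polyval([3.0, 2.0, 1.0], 2.0))   # numpy's polyval uses the same scheme

def estrin4(a0, a1, a2, a3, x):
    """Estrin-style grouping of a_0 + a_1 x + a_2 x^2 + a_3 x^3: the two halves can run in parallel."""
    x2 = x * x
    return (a0 + a1 * x) + (a2 + a3 * x) * x2

print(estrin4(1.0, 2.0, 3.0, 4.0, 2.0))   # 49.0 == 1 + 2*2 + 3*4 + 4*8
```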
- Distance between two parallel hyperplanes $w^Tx + b_1 = 0$ and $w^Tx + b_2 = 0$: $d = \frac{|b_2 - b_1|}{\|w\|}$
    - Derivation: pick $x_1$, $x_2$ on each plane, so $w^Tx_1 = -b_1$ and $w^Tx_2 = -b_2$. Projecting $x_1 - x_2$ onto the normal vector $w$ gives $\frac{w^T(x_1 - x_2)}{\|w\|} = \frac{b_2 - b_1}{\|w\|}$.
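A quick numerical check of the formula (the plane parameters below are arbitrary):

```python
import numpy as np

w, b1, b2 = np.array([3.0, 4.0]), 1.0, -9.0
x1 = -b1 * w / w.dot(w)                   # a point on w^T x + b1 = 0
x2 = -b2 * w / w.dot(w)                   # a point on w^T x + b2 = 0
print(np.linalg.norm(x1 - x2))            # 2.0 (x1 - x2 is along the normal)
print(abs(b1 - b2) / np.linalg.norm(w))   # 2.0, matches the formula
```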
- Numerically stable sigmoid:
    - `1/(1+exp(-x))` is fine as x -> +inf (it goes to 1), but as x -> -inf, `exp(-x)` overflows.
    - Solution: use `exp(x)/(1+exp(x))` for x < 0 (sketch below).
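A minimal numpy sketch that picks the branch whose `exp()` argument is non-positive:

```python
import numpy as np

def stable_sigmoid(x):
    """Numerically stable sigmoid: never call exp() with a large positive argument."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp(-x) is safe for x >= 0
    ex = np.exp(x[~pos])                        # exp(x) is safe for x < 0
    out[~pos] = ex / (1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0., 0.5, 1.] with no overflow warnings
```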
========================================================================
========================================================================
- Gradient of vectors
- Gâteaux derivative
- Gâteaux derivative, a.k.a. the directional derivative
- Examples
- The existence of the directional (Gâteaux) derivative doesn't always imply differentiability
- Scalar - Vector, Vector - Vector, Matrix - Vector derivatives
    - Reference
    - Vector - Vector: row vs. column vectors matter!
- Definition
- Key rules
- Examples
========================================================================
========================================================================
- Differentiate w.r.t. a vector:
    - Case 1: $$ f(x) = Ax = A_0x_0 + A_1x_1 + ..., \quad \frac{\partial f(x)}{\partial x} = [A_0 \;|\; A_1 \;|\; ...] = A $$
    - Case 2: $$ f(x) = x^TA = [x^TA_0,\; x^TA_1,\; ...], \quad \frac{\partial f(x)}{\partial x} = \begin{bmatrix} a_{0,0} & a_{1,0} & \cdots \\ a_{0,1} & a_{1,1} & \cdots \\ \vdots & \vdots & \end{bmatrix} = A^T $$
    - Case 3 (scalar w.r.t. vector): $$ f(x) = x^TAx = x_0(a_{00}x_0 + a_{01}x_1 + ...) + x_1(a_{10}x_0 + a_{11}x_1 + ...) + ... \\ \frac{\partial f(x)}{\partial x} = \left[\sum_j a_{j0}x_j,\; \sum_j a_{j1}x_j,\; ...\right] + \left[\sum_i a_{0i}x_i,\; \sum_i a_{1i}x_i,\; ...\right] = A^Tx + Ax $$
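A quick numerical check of case 3 (random A and x, finite differences):

```python
import numpy as np

# Check d(x^T A x)/dx = A^T x + A x
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

analytic = A.T @ x + A @ x

eps = 1e-6
numeric = np.array([
    (np.dot(x + eps * e, A @ (x + eps * e)) - np.dot(x, A @ x)) / eps
    for e in np.eye(3)                      # forward difference along each unit direction
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```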
- Condition Number
    - Definition: largest / smallest eigenvalue: $$ c = \frac{\lambda_{max}}{\lambda_{min}} $$
    - When one eigenvalue is 0, we have a singular matrix.
    - Example of a poorly conditioned matrix: assume A is nearly singular
      $$ A = \begin{bmatrix} 2 & 2.01 \\ 2 & 2.00 \end{bmatrix} $$
    - Then $Ax = b$ could have a very unstable $x$: small changes in $b$ result in huge changes in $x$. Modern computers introduce rounding errors, so that's bad. You can run into this in the Gauss-Newton method.
    - For an ill-conditioned matrix, its eigen-ellipsoid is very flat.
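A numpy demonstration with the matrix above:

```python
import numpy as np

# Ill-conditioned example: a tiny change in b flips the solution wildly
A = np.array([[2.0, 2.01],
              [2.0, 2.00]])
print(np.linalg.cond(A))        # ~800, a large condition number

b1 = np.array([4.01, 4.00])
b2 = np.array([4.02, 4.00])     # perturb the first entry of b by 0.01
print(np.linalg.solve(A, b1))   # [1., 1.]
print(np.linalg.solve(A, b2))   # [0., 2.] -- a huge jump in x
```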
- Cross product can be written as a matrix multiplication with a skew-symmetric matrix (see below).
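Concretely, with the hat (skew-symmetric) operator:

$$ a \times b = \hat{a}\,b, \quad \hat{a} = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix} $$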
- The Hessian matrix is the quadratic matrix $Q$ in $p^TQp$ (e.g., in the second-order Taylor expansion of a cost function).
- Cholesky factorization to solve equations (sketch below).
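A minimal sketch of solving $Ax = b$ via a Cholesky factorization, assuming A is symmetric positive definite (the A and b below are arbitrary):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])

c_and_lower = cho_factor(A)   # factor A = L L^T once
x = cho_solve(c_and_lower, b) # reuse the factor for any right-hand side
print(np.allclose(A @ x, b))  # True
```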
- $(AB)^{-1} = B^{-1}A^{-1}$
- 1D convolution: just flip the kernel, then slide it one step at a time.
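For example (the signal and kernel below are arbitrary):

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])
# np.convolve flips the kernel (to [-1, 0, 1]) and slides it across the signal
print(np.convolve(signal, kernel, mode="valid"))  # [2., 2.]  e.g. 1*(-1) + 2*0 + 3*1 = 2
```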
- 2D convolution: same idea, with the kernel flipped in both dimensions. If the kernel is the outer product of two vectors (a separable kernel), it can be applied as two 1D convolutions.
- $Ax^2 + Bxy + Cy^2 = 1$ is the generic ellipse centered at the origin. You may get hyperbolas or degenerate conics if A, B, C don't satisfy the positive-definite condition: for all $(x, y) \neq 0$, $Ax^2 + Bxy + Cy^2 > 0$. The quadratic form can be written with a symmetric matrix $M$ (see below).
    - Lemma: M's eigenvectors must be perpendicular to each other if the two eigenvalues are distinct.
    - Can show the smaller eigenvalue corresponds to the major axis, because it yields the longest possible vector on the ellipse.
    - Can also show the product of the two eigenvalues is det(M).
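In matrix form (this restates the bullets above):

$$ \begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} A & B/2 \\ B/2 & C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 1, \quad M = \begin{bmatrix} A & B/2 \\ B/2 & C \end{bmatrix} $$

The semi-axis along the unit eigenvector with eigenvalue $\lambda_i$ has length $\frac{1}{\sqrt{\lambda_i}}$, so the smaller eigenvalue gives the major axis, and $\lambda_1 \lambda_2 = \det(M)$.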
========================================================================
========================================================================
- Lagrange Multiplier
    - Goal: maximize $f(x,y) = xy + 1$, given the constraint $g(x,y) = x^2 + y^2 = 1$
    - Geometric intuition: the level curve of $f(x,y)$ and the constraint curve it must stay on are tangent to each other. That is, a small perturbation along the constraint curve does not change the value function, hence a potential extremum is achieved.
    - So this is equivalent to: $L = f(x,y) - \lambda (g(x,y) - c)$, then solve $[\frac{\partial{L}}{\partial{x}}, \frac{\partial{L}}{\partial{y}}, \frac{\partial{L}}{\partial{\lambda}}] = 0$ (worked out below).
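Working out the stated example:

$$ L = xy + 1 - \lambda(x^2 + y^2 - 1) \\ \frac{\partial L}{\partial x} = y - 2\lambda x = 0, \quad \frac{\partial L}{\partial y} = x - 2\lambda y = 0, \quad \frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0 \\ \Rightarrow \lambda = \pm\tfrac{1}{2}, \quad x = \pm y = \pm\tfrac{1}{\sqrt{2}}, \quad f_{max} = \tfrac{1}{2} + 1 = \tfrac{3}{2} $$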
- Convex function and convex set
- Least Squares Minimization is a standard approach for computing a solution that minimizes the total squared error of an "overdetermined system". An overdetermined system is one where you have more equations than unknowns: $$ 2x + 3y = 26 \\ x + y = 10 \\ x - y = 2 $$
    - Write the system of equations in matrix form: $$ A = \begin{bmatrix} 2 & 3 \\ 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad x = \begin{bmatrix} x \\ y \end{bmatrix}, \quad b = \begin{bmatrix} 26 \\ 10 \\ 2 \end{bmatrix}, \quad Ax = b $$
    - Solve the problem using the normal equations (numpy sketch after this list): $$ \operatorname{argmin}(\|Ax - b\|^2) = \operatorname{argmin}(x^TA^TAx - 2(Ax)^Tb + b^Tb) \\ x = (A^TA)^{-1} A^Tb $$
    - Advantages vs Disadvantages:
        - Advantages: closed-form solution to the linear cost function $\|Ax - b\|^2$
        - Disadvantages: for a large dataset, computing $A^TA$ is very expensive. Also, when new data keeps arriving, iterative methods such as gradient descent can incorporate it (and regularization) more easily.
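A minimal numpy check of the system above:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, 1.0],
              [1.0, -1.0]])
b = np.array([26.0, 10.0, 2.0])

# Closed-form normal equations: x = (A^T A)^{-1} A^T b
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b
# Preferred in practice: a dedicated least-squares solver (more numerically stable)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_normal, x_lstsq)   # both ~[6.27, 4.4]; the equations are inconsistent, so this is a best fit
```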
- Problem formulation: for gradient descent, the problem is minimization of the least squares cost.
    - Gradient Descent:
        - Advantages: easy to compute. For SGD, only one sample per update is needed, so each update is fast. Also, since the variance between single-sample updates is higher than that of mini-batch gradients, we might be able to jump out of a local minimum towards the global minimum.
        - Disadvantages: oscillation near a local minimum.
        - So, momentum is added to counteract the sudden changes in gradient (sketch below).
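A minimal sketch of (full-batch) gradient descent with momentum on the least-squares cost above; the learning rate and momentum coefficient are arbitrary:

```python
import numpy as np

A = np.array([[2.0, 3.0], [1.0, 1.0], [1.0, -1.0]])
b = np.array([26.0, 10.0, 2.0])

x = np.zeros(2)
v = np.zeros(2)                      # momentum buffer
lr, beta = 0.02, 0.9                 # arbitrary hyperparameters
for _ in range(500):
    grad = 2 * A.T @ (A @ x - b)     # gradient of ||Ax - b||^2
    v = beta * v + grad              # momentum smooths out sudden gradient changes
    x -= lr * v
print(x)                             # converges towards the least-squares solution ~[6.27, 4.4]
```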
- Newton-Raphson (root finding for a non-linear function): linearize $f$ around the current estimate $x_n$ and solve for the root: $$ f(x) \approx f(x_n) + f'(x_n)(x - x_n) = 0 \\ \Rightarrow x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} $$
    - This is one of the fastest-converging methods out there, and the $1/f'(x_n)$ factor already adjusts the step size.
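A minimal sketch, using $f(x) = x^2 - 2$ as an example (so the root is $\sqrt{2}$):

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find a root of f via x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))  # 1.41421356...
```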
- Gauss-Newton (for least squares problems; uses Newton's method to reach a zero derivative): $$ x = \operatorname{argmin}(\|F(x)\|^2), \quad F(x) \approx f(x_0) + J\Delta x \\ \frac{\partial \|F(x)\|^2}{\partial \Delta x} = 2J^TJ\Delta x + 2J^Tf(x_0) = 0 \\ \Delta x = -(J^TJ)^{-1}J^Tf(x_0), \text{ where } H \approx J^TJ $$
    - Disadvantages (a sketch of GN iterations follows this list):
        - H should be positive definite, but in reality it could be positive semi-definite, so $(J^TJ)^{-1}$ could be unstable.
        - As with LM, GN is sensitive to initial values.
        - Need to calculate Jacobians.
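A minimal Gauss-Newton sketch, fitting $y = a e^{bt}$ to noisy data (the model, data, and initial guess are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * t) + 0.01 * rng.normal(size=t.size)   # ground truth a=2, b=1.5

x = np.array([1.0, 1.0])                          # initial guess [a, b]
for _ in range(10):
    a, b = x
    r = a * np.exp(b * t) - y                     # residual vector f(x)
    J = np.column_stack([np.exp(b * t),           # dr/da
                         a * t * np.exp(b * t)])  # dr/db
    dx = -np.linalg.solve(J.T @ J, J.T @ r)       # GN step: -(J^T J)^{-1} J^T f
    x += dx
print(x)                                          # close to [2.0, 1.5]
```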
- LM: add a regularization (damping) term to penalize the magnitude of $\Delta x$ (sketch after this block): $$ \Delta x = \operatorname{argmin}(\|f(x_0) + J(x_0)\Delta x\|^2 + \mu\|\Delta x\|^2) \\ \frac{\partial F}{\partial \Delta x} = 2J^TJ\Delta x + 2J^Tf(x_0) + 2\mu\Delta x = 0 \\ \Delta x = -(J^TJ + \mu I)^{-1}J^Tf(x_0) $$
- Initialize $\mu$ to a small positive value. In one iteration:
    - Calculate the cost at $x_{new} = x_0 + \Delta x$ after the update $\Delta x$.
    - Compute the gain ratio $\rho$ = actual improvement of cost / predicted improvement of cost. Depending on which region (trust region) $\rho$ falls in, we adjust $\mu$ accordingly. For LM, $\rho = \frac{F(x_{new}) - F(x_0)}{\Delta x^T (J^TJ \Delta x + J^Tr)}$.
        - If $\rho \approx 1$: there's good agreement between the model prediction (with Jacobians) and the actual values. Accept $x_{new}$, decrease $\mu$: $\mu = \frac{\mu}{\beta}$.
        - If $\rho$ is smaller but above a threshold: keep $\mu$ and accept $x_{new}$.
        - If $\rho$ is close to zero, or even negative: reject $x_{new}$, increase $\mu$: $\mu = \mu \beta$.
    - The trust region is about assessing the agreement between the model and the actual objective function.
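A minimal LM-style damped step, reusing the exponential-fit example from the Gauss-Newton sketch (the μ schedule here is a simplified accept/reject rule, not a full trust-region implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * t) + 0.01 * rng.normal(size=t.size)

def residual_and_jac(x):
    a, b = x
    r = a * np.exp(b * t) - y
    J = np.column_stack([np.exp(b * t), a * t * np.exp(b * t)])
    return r, J

def cost(x):
    return np.sum(residual_and_jac(x)[0] ** 2)

x, mu, beta = np.array([1.0, 1.0]), 1e-2, 2.0
for _ in range(30):
    r, J = residual_and_jac(x)
    dx = -np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ r)   # damped (LM) step
    if cost(x + dx) < cost(x):
        x, mu = x + dx, mu / beta   # good step: accept, behave more like Gauss-Newton
    else:
        mu *= beta                  # bad step: reject, behave more like gradient descent
print(x)                            # close to [2.0, 1.5]
```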
- Advantage: LM is a lot more robust than GN. Why does regularization work?
    - When $H \approx J^TJ$ is ill-conditioned, $\mu$ can stabilize the step size.
    - Gradient descent is more robust but slower; Gauss-Newton is more adaptive but more unstable. The larger the $\mu$, the more "gradient descent"-like the behavior (that is, we don't rely on the approximation of the Hessian by $J^TJ$), which is good when we are far from a minimum. The smaller the $\mu$, the more "Gauss-Newton"-like the behavior.
- Disadvantages:
    - $\mu$ is one more parameter to tune. (In neural networks, by contrast, the regularization parameter usually stays the same.)
- Powell's Dogleg: an explicit combination of GN and gradient descent (step-selection sketch below).
    - Steepest descent: $F(x) = \frac{1}{2}\|r(x)\|^2$, so the update would be $\Delta x_{sd} = -\alpha J^T(x) r(x)$
    - GN: $F(x) = \frac{1}{2}\|r(x_0) + r'(x_0)\Delta x\|^2$, so the update would be $\Delta x_{gn} = -(J^T(x_0)J(x_0))^{-1}J^T(x_0)r(x_0)$
    - Make the update, given a trust-region radius $\Delta$:
        - If $\|\Delta x_{gn}\| \le \Delta$, then $\Delta x = \Delta x_{gn}$
        - If $\|\Delta x_{sd}\| \ge \Delta$, then $\Delta x = \Delta \frac{\Delta x_{sd}}{\|\Delta x_{sd}\|}$
        - Otherwise, find $\tau$ such that $\|\Delta x_{sd} + \tau(\Delta x_{gn}-\Delta x_{sd})\| = \Delta$, and set $\Delta x = \Delta x_{sd} + \tau(\Delta x_{gn}-\Delta x_{sd})$
    - Good for scenarios where the computational cost of LM's repeated (approximate) Hessian solves is prohibitive.
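A minimal sketch of the dogleg step selection, given the two candidate steps and a trust-region radius (the vectors in the usage line are arbitrary):

```python
import numpy as np

def dogleg_step(dx_gn, dx_sd, radius):
    """Pick the dogleg step from the GN step, the steepest-descent step, and the trust radius."""
    if np.linalg.norm(dx_gn) <= radius:
        return dx_gn                                   # full GN step fits inside the trust region
    if np.linalg.norm(dx_sd) >= radius:
        return radius * dx_sd / np.linalg.norm(dx_sd)  # even the SD step is too long: scale it down
    # Otherwise walk from dx_sd towards dx_gn until we hit the trust-region boundary
    d = dx_gn - dx_sd
    a, b, c = d @ d, 2 * dx_sd @ d, dx_sd @ dx_sd - radius**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return dx_sd + tau * d

print(dogleg_step(np.array([3.0, 0.0]), np.array([1.0, 0.0]), radius=2.0))  # [2., 0.]
```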
- When $J^T(x_0)f(x_0) = 0$, GN basically stops. That happens when we have converged, either to a saddle point or a minimum.
- When $J^T(x_0)J(x_0)$ is singular / near-singular, the cost function $F(x_0)$ is likely in a flat region, where there are infinitely many equally good descent directions. In that case, the problem is ill-conditioned (see the condition number section).
- For polynomial fitting, `np.polyfit` (a linear least-squares solve) is preferable to `scipy.optimize.curve_fit` (non-linear, LM-based).
- `scipy.optimize.least_squares` can do pretty much any Levenberg-Marquardt-style optimization (`method="lm"`, or the trust-region methods `"trf"` / `"dogbox"`). See the code below.
    - Inside, it has linear and non-linear solvers.
    - Note that the cost function returns the residuals rather than their squares. `x` is a `(1, n)` array, the residual is a `(1, m)` array.
    - Example call: `result = least_squares(FixtureCalibration.cost_func, pose, method="trf", verbose=2, max_nfev=30, ftol=1e-8, args=(marker_id, imgs_proj, all_detected_corners, self.camera_matrix, self.marker_size_m, P_origin_cam_dict))`
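A self-contained sketch of `scipy.optimize.least_squares` (the residual model here is a made-up exponential fit, not the calibration cost above):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * t) + 0.01 * rng.normal(size=t.size)

def residuals(x, t, y):
    # Return the residual vector (m,), NOT the squared cost
    a, b = x
    return a * np.exp(b * t) - y

result = least_squares(residuals, x0=np.array([1.0, 1.0]), method="trf", args=(t, y))
print(result.x)   # close to [2.0, 1.5]
```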
========================================================================
========================================================================
- SVD
    - Predecessor to PCA. In recommendation systems, SVD was used a lot. Can be used to reduce dimensions.
    - Eigenvalue decomposition: an `mx1` vector can be represented in the eigenspace of `A`.
    - Singular Value Decomposition: for an `mxn` matrix A with `rank(A) = k`, first we can get the distinct `nx1` eigenvectors of `A'A`, which form `V`. Then, we can get an orthogonal set of `mx1` vectors, `U`.
    - Then we complete an orthonormal basis of `nxn` with `V`, and an orthonormal basis of `mxm` with `U`. The nice thing about SVD: for an `nx1` vector `x`, `Ax` can be calculated equivalently with a `kx1` vector, so that's data compression (sketch below).
    - Do we need to calculate `u = Av` every time? No; it turns out `U` consists of the eigenvectors of `AA'` (spanning its eigenspace plus null space).
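A minimal numpy sketch of rank-k compression via SVD (the nearly rank-2 matrix below is constructed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 2))
C = rng.normal(size=(2, 4))
A = B @ C + 1e-6 * rng.normal(size=(6, 4))        # a nearly rank-2 (6x4) matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt
k = 2
x = rng.normal(size=4)
z = np.diag(s[:k]) @ (Vt[:k, :] @ x)              # the k-dim compressed representation of x
print(np.linalg.norm(A @ x - U[:, :k] @ z))       # tiny: Ax is recovered from a k-dim vector
```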
========================================================================
========================================================================
- Basic group properties: closure (封闭), associativity (结合律, e.g. a+(b+c)), identity (幺元), inverse (逆). (Chinese mnemonic: 凤姐咬你)
- 4 ways to represent rotation:
    - Rotation matrix (not closed under addition)
    - Lie algebra (李代数): the vector space of rotation axes scaled by angle (axis-angle vectors)
    - Quaternion
    - Euler angles
        - Gimbal lock: a rotation in 3D space can be represented in the frames of 3 gimbals. But when the three gimbals are on the same plane, rotations in only 2 dimensions can be represented.
- Inverse: $R^{-1} = R^T$, since $RR^T = I$
- Derivative of R
    - Hat operator and skew-symmetric matrices
    - R can be represented as `exp(W^ t)`, which is a Lie group (李群, smooth because of t)
        - `R'R^T` is skew-symmetric, if `R` is parameterized by `theta`
        - So `R(theta) = exp(W^ theta)` (Note: `t` below should be `theta`)
    - `d[exp(W^ theta)]/dt = W^ * exp(W^ theta)`
    - Rodrigues formula: represents `exp(W^ theta)` using `theta` and `W^` (sketch below)
    - Using these properties
    - Derivative of R: `phi = ln(R)` (the log map recovers the rotation vector)
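A minimal sketch of the hat operator and the Rodrigues formula (the axis and angle below are arbitrary), checked against scipy:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def hat(w):
    """Hat operator: 3-vector -> skew-symmetric matrix, so that hat(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(axis, theta):
    """R = exp(hat(axis) * theta) = I + sin(theta) W + (1 - cos(theta)) W^2, for a unit axis."""
    W = hat(axis)
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * W @ W

axis = np.array([0.0, 0.0, 1.0])
theta = np.pi / 3
R = rodrigues(axis, theta)
print(np.allclose(R @ R.T, np.eye(3)))                                 # R is orthogonal (R^-1 = R^T)
print(np.allclose(R, Rotation.from_rotvec(axis * theta).as_matrix()))  # matches scipy
```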
- `T`'s representation: SE(3), where `ksi` is `[p1, p2, p3, theta1, theta2, theta3]`
    - T can be represented as a matrix exponential, too. Matrix exponential: `exp(A) = sum(1/factorial(n) * A^n)` (numerical check below)
    - Proof: write it out
    - Using the same properties (note `phi` is `w` above), you can prove it
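A quick numerical check of the series definition against scipy (the skew-symmetric matrix below is an arbitrary so(3) element):

```python
import numpy as np
from math import factorial
from scipy.linalg import expm

W = np.array([[0.0, -0.3, 0.2],
              [0.3, 0.0, -0.1],
              [-0.2, 0.1, 0.0]])   # arbitrary skew-symmetric matrix

# Truncated series sum_n W^n / n! converges quickly for small matrices
series = sum(np.linalg.matrix_power(W, n) / factorial(n) for n in range(20))
print(np.allclose(series, expm(W)))  # True: exp(A) = sum(1/factorial(n) * A^n)
```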
========================================================================
========================================================================
- Any single-variate Gaussian distribution can be written in terms of `Z ~ N(0,1)` (a.k.a. the standard Gaussian distribution). This means we can use the relative distance in sigmas to represent a value within its distribution.
- Standard multivariate distribution: an IID (independent & identically distributed) vector `Z = [z1, z2 ... zn]` is `N(0, I)`.
- Arbitrary multivariate distribution: `X = [x1, x2 ... xn]`, where the components are not necessarily independent of each other. But we have a theorem: any such X has a full-rank matrix B such that `Z = B^-1 * X` is standard normal, so we can write out the PDF of `Z`. The covariance of `X` is then `B B^T` (while `Z` has identity covariance).
- Then, through the CDF (change of variables), we can replace `z` with `x` and write out the PDF of `x` (sampling sketch below).
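A minimal sketch of the X = B Z relationship via a Cholesky factor (the mean and covariance below are arbitrary):

```python
import numpy as np

mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

B = np.linalg.cholesky(cov)                  # cov = B @ B.T
rng = np.random.default_rng(0)
Z = rng.standard_normal((100000, 2))         # standard multivariate samples, N(0, I)
X = mean + Z @ B.T                           # X = mean + B Z has covariance B B^T = cov

print(np.cov(X, rowvar=False))               # close to cov
print(np.linalg.solve(B, (X - mean).T).T.std(axis=0))  # whitening B^-1 (X - mean) -> std ~[1, 1]
```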
========================================================================
========================================================================
- Bhattacharyya distance: works where Mahalanobis distance falls short (Mahalanobis assumes both distributions share the same standard deviation / covariance).
    - Bhattacharyya coefficient: $BC(p, q) = \sum_x \sqrt{p(x)q(x)}$ (an integral for continuous distributions).
    - Bhattacharyya distance: $D_B = -\ln(BC(p, q))$ (sketch below)
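A minimal sketch for two discrete distributions (the histograms below are made up):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.3, 0.2])

bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient, equals 1.0 iff the distributions are identical
d_b = -np.log(bc)             # Bhattacharyya distance
print(bc, d_b)
```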