Multi-objective Least Squares and Applications

Author: Bruce Qu. Revisions by: Neil Kane.

Introduction

This is an introductory article designed to give readers with some basic programming background an overview of the multi-objective least squares problem. Table of Contents:

  • What is Least Squares?
  • Why would you care?
  • Preliminaries
  • Multi-objective least squares
  • Regularized data fitting
  • Different Flavors and Extensions
  • An interesting Application
  • Conclusion

What is Least Squares?

Before diving into the in-depth mathematics, let's take a few steps back and understand, at a high level, what least squares is. As the name suggests, the idea is to minimize the sum of the squares of some quantity. This quantity is generally a metric that we want to penalize; in other words, we prefer values that are small over values that are large. The goal is therefore to find the parameters that give the least sum of squares, or simply least squares. These quantities are usually things like the distances between data points and our line (or hyperplane, in higher dimensions). The fundamental idea is to determine a mapping from data points in one space to outputs in another.

Why would you care?

Least squares problems, especially multi-objective least squares problems, are important in many fields of science, engineering, and data analysis. In this article, we will also see that they can be applied to noise reduction: least squares can be used to filter out noise and recover a signal that best represents the "ground-truth" data. Towards the end of this article, we will apply the multi-objective least squares concept to an image-deblurring task. Without further ado, let's dive right in!

Preliminaries

Before we start, we need to review some fundamental concepts that we will use later on.

  • Least Squares

Least Squares

In this section, we will briefly go over the least squares problem, how to solve it, and what its solution looks like.

Problem: Given $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^{m}$, find a vector $x \in \mathbb{R}^n$ that minimizes

$$||Ax-b||^2=\sum_{i=1}^{m}(\sum_{j=1}^{n}A_{ij}x_j-b_i)^2$$

The term "least squares" comes from the summation of squares of a function $$\sum_{j=1}^{n}A_{ij}x_j-b_i$$

Example: Let

$$A = \begin{bmatrix} 2 & 0 \\ -1 & 1 \\ 0 & 2 \\ \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 0 \\ -1 \\ \end{bmatrix} $$

The least squares solution $\hat{x}$ minimizes $$f(x) = ||Ax-b||^2 = (2x_1-1)^2 + (-x_1 + x_2)^2 + (2x_2 + 1)^2$$

To find $\hat{x}$, set the partial derivatives with respect to $x_1$ and $x_2$ to zero:

$$\begin{align*} 10x_1 - 2x_2 - 4 &= 0 \\ -2x_1 + 10x_2 + 4 &= 0 \end{align*} $$

The solution is $(\hat{x}_1, \hat{x}_2) = (1/3, -1/3)$.

Summary: In general, we formulate a least squares problem in the following format.


$$\text{Find any } \hat{x} \enspace \text{that satisfies} \enspace ||A\hat{x} - b|| \leq ||Ax-b|| \enspace \text{for all} \enspace x$$

To make this more compact, let $\hat{r} = A\hat{x} - b$ denote the residual vector. If $\hat{r} = 0$, then $\hat{x}$ solves the set of equations $Ax = b$ exactly. If $\hat{r} \neq 0$, then $\hat{x}$ is a least squares approximate solution of the set of equations.

Solution: If we assume that $A$ has linearly independent columns, then the vector

$$\begin{align*} \hat{x} &= (A^TA)^{-1}A^Tb \\ &= A^\dagger b \end{align*}$$

Here $A^\dagger = (A^TA)^{-1}A^T$ denotes the pseudo-inverse of $A$. This $\hat{x}$ is the unique solution of the least squares problem $\text{minimize} \enspace ||Ax-b||^2$.
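
As a quick numerical check, the small example above can be solved in MATLAB; a minimal sketch is shown below, and both the closed-form expression and MATLAB's backslash operator should return approximately $(1/3, -1/3)$.

% MATLAB sketch: verify the 3-by-2 example numerically
A = [2 0; -1 1; 0 2];
b = [1; 0; -1];
x_formula = (A'*A) \ (A'*b);    % closed form (A'A)^(-1) A'b via the normal equations
x_backslash = A \ b;            % MATLAB's built-in least squares solve
disp([x_formula, x_backslash])  % both columns are approximately [1/3; -1/3]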

Proof: There are many ways to prove that the previous answer is the correct solution to a general least squares problem. To accommodate readers who are not familiar with basic linear algebra, we will approach the solution from a calculus perspective.

Keep in mind that we want to minimize the function $$f(x) = ||Ax-b||^2=\sum_{i=1}^{m}(\sum_{j=1}^{n}A_{ij}x_j-b_i)^2$$

We can then take the partial derivative of $f$ with respect to $x_k$

$$\frac{\partial{f}}{\partial{x_k}}(x) = 2\sum_{i=1}^{m}{A_{ik}}(\sum_{j=1}^{n}{A_{ij}x_j - b_i})= 2(A^T(Ax-b))_k$$

We can apply this concept to every term in the gradient of $f$, which is defined as $$\nabla f(x) = \left(\frac{\partial{f}}{\partial{x_1}}(x), \frac{\partial{f}}{\partial{x_2}}(x), \frac{\partial{f}}{\partial{x_3}}(x), ..., \frac{\partial{f}}{\partial{x_n}}(x)\right) = 2A^T(Ax-b)$$

Solving $\nabla f(x) = 0$, i.e. the normal equations $A^TAx = A^Tb$, gives us $\hat{x} = (A^TA)^{-1}A^Tb$.
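
To make the gradient expression concrete, here is a minimal sketch (using a small hypothetical random problem) that compares $2A^T(Ax-b)$ against a central finite-difference approximation of $\nabla f$; the two should agree up to numerical error.

% MATLAB sketch: finite-difference check of the gradient 2*A'*(A*x - b)
rng(0);                               % hypothetical random problem
A = randn(5, 3);  b = randn(5, 1);  x = randn(3, 1);
f = @(v) norm(A*v - b)^2;
g_exact = 2*A'*(A*x - b);             % gradient from the derivation above
h = 1e-6;  g_fd = zeros(3, 1);
for k = 1:3
  e = zeros(3, 1);  e(k) = 1;
  g_fd(k) = (f(x + h*e) - f(x - h*e)) / (2*h);   % central difference
end
disp(max(abs(g_exact - g_fd)))        % should be close to zero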

Now, we are ready to investigate the Multi-objective least squares problem!

Multi-objective least squares

In a multi-objective least squares problem, we seek a single $x$ that simultaneously makes $n$ objectives small:

$$J_1 = ||A_1x-b_1||^2, J_2 = ||A_2x-b_2||^2, ..., J_n = ||A_nx-b_n||^2$$

These objectives are combined into a single weighted-sum objective, yielding a weighted least squares problem:

$$ \textnormal{minimize} \quad J = \lambda_1||A_1x-b_1||^2 + ... + \lambda_n||A_nx-b_n||^2$$

In general, the coefficients $\lambda_1, ..., \lambda_n$ are positive, and they express the relative importance of the different objectives. An extreme example is $\lambda_1 = 1$ and $\lambda_2 = \cdots = \lambda_n = 0$, which reduces the problem to a single-objective least squares problem in $J_1$.

Solution: Assuming all the $\lambda$'s are positive, we can absorb the weights into the matrices and rewrite the objective.

$$ \begin{aligned} \textnormal{minimize} \quad J &= \lVert\sqrt{\lambda_{1}}A_1x-\sqrt{\lambda_1}b_1 \rVert^2 + ... + \lVert\sqrt{\lambda_{n}}A_nx-\sqrt{\lambda_n}b_n \rVert^2\\ &= \sum_{k=1}^{n} \lVert\sqrt{\lambda_{k}}A_kx-\sqrt{\lambda_k}b_k \rVert^2\\ &= \lVert \begin{bmatrix} \sqrt{\lambda_{1}}A_1 \\ \sqrt{\lambda_{2}}A_2 \\ \vdots\\ \sqrt{\lambda_{n}}A_n \end{bmatrix} x - \begin{bmatrix} \sqrt{\lambda_{1}}b_1 \\ \sqrt{\lambda_{2}}b_2 \\ \vdots\\ \sqrt{\lambda_{n}}b_n \end{bmatrix} \rVert^2 \end{aligned} $$

In this final form, the problem looks exactly like a single-objective least squares problem on the stacked matrix and stacked vector. If the stacked matrix has linearly dependent columns, there are multiple solutions. Note, however, that we do not require each individual $A_k$ to have linearly independent columns (i.e., to be full column rank); only the stacked matrix matters.

Hence the solution to the multi-objective problem follows from the single-objective least squares solution applied to the stacked problem:

$$\hat{x} = (\lambda_1A_1^TA_1+\cdots+\lambda_nA_n^TA_n)^{-1}(\lambda_1A_1^Tb_1+\cdots+\lambda_nA_n^Tb_n)$$
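
The equivalence between the stacked formulation and this closed form is easy to verify numerically. The sketch below uses two hypothetical random objectives and compares the solution of the stacked least squares problem with the weighted normal-equations formula above.

% MATLAB sketch: two-objective least squares, stacked form vs. closed form
rng(1);                                         % hypothetical data
A1 = randn(6, 3);  b1 = randn(6, 1);
A2 = randn(4, 3);  b2 = randn(4, 1);
lam1 = 1;  lam2 = 0.5;                          % relative weights
A_stack = [sqrt(lam1)*A1; sqrt(lam2)*A2];       % stacked matrix
b_stack = [sqrt(lam1)*b1; sqrt(lam2)*b2];       % stacked vector
x_stack = A_stack \ b_stack;
x_formula = (lam1*(A1'*A1) + lam2*(A2'*A2)) \ (lam1*(A1'*b1) + lam2*(A2'*b2));
disp(norm(x_stack - x_formula))                 % should be close to zero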

Regularized data fitting

Motivation: Occam's razor highlights the idea that simple explanations are better than more complex ones. The same principle is often applied to the models we fit with techniques like least squares. Sometimes the resulting parameter values are too "complex," and we would like to penalize this in some way. This motivates regularization, one of the most widely used instances of multi-objective least squares. Specifically, we add another objective built from the parameters we are solving for. There are several options here, such as taking absolute values or squaring them as in least squares (these are often called L1 and L2 regularizers), but in essence we prefer smaller (simpler) models over larger (more complex) ones. As an added benefit, the closed-form least squares solution in the linear case relies on linearly independent columns, and adding L2 regularization is a simple way to ensure this criterion is always met. We can now leverage the solution from the previous section to explore an example of regularized data fitting.

Problem: We have data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), ..., (x^{(N)}, y^{(N)})$ and model $$\hat{f}(x)=\theta_1f_1(x)+...+\theta_pf_p(x)$$

If $f_k(x)$ is a high-order polynomial of $x$, then a large $\theta_k$ will amplify perturbations in $x$. This results in a large variance in $\hat{f}(x)$ and hence an overfitted model. We will later see a similar trade-off at work in image deblurring.

Now, our problem becomes two-fold:

  1. We want to fit the model $\hat{f}(x)$ to data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), ..., (x^{(N)}, y^{(N)})$, i.e. minimize the difference between our prediction and ground-truth.
  2. We want to keep the coefficients $\theta_2, ..., \theta_p$ small to avoid over-fitting (the constant coefficient $\theta_1$ is conventionally not penalized).

Method: We can formulate this as a multi-objective least squares problem, known as L2 regularization (ridge regression).

$$J_1(\theta) = \sum_{k=1}^{N}(\hat{f}(x^{(k)}) - y^{(k)})^2,\quad J_2(\theta) = \sum_{j=2}^{p}\theta_j^2$$

The regularization coefficient $\lambda$ controls the strength of the regularization: increasing $\lambda$ puts more weight on keeping the coefficients small, while decreasing it puts more weight on fitting the data. That is,

$$ \begin{aligned} \textnormal{minimize} \quad J_1(\theta) + \lambda J_2(\theta) &= \sum_{k=1}^{N}(\hat{f}(x^{(k)}) - y^{(k)})^2 + \lambda \sum_{j=1}^{p}\theta_j^2 \\ &= \lVert \begin{bmatrix} A_1 \\ \sqrt{\lambda}A_2 \\ \end{bmatrix} \theta - \begin{bmatrix} y^d\\ 0\\ \end{bmatrix} \rVert^2 \end{aligned} $$

$$\textnormal{with} \quad y^d = (y^{(1)}, ..., y^{(N)}), \quad A_1 = \begin{bmatrix} 1 & f_2(x^{(1)}) & \cdots & f_p(x^{(1)}) \\ 1 & f_2(x^{(2)}) & \cdots & f_p(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & f_2(x^{(N)}) & \cdots & f_p(x^{(N)}) \\ \end{bmatrix}, \quad A_2 = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \end{bmatrix} $$

Solution: Similar to the previous multi-objective least squares problem:

$$\hat{\theta} = (A_1^TA_1 + \lambda A_2^TA_2)^{-1}(A_1^Ty^d)$$
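
A minimal sketch of this recipe for a polynomial fit is shown below; the synthetic data, the number of basis functions $p$, and the weight $\lambda$ are hypothetical choices. It builds $A_1$ from the basis functions $f_j(x) = x^{j-1}$, forms the selector matrix $A_2$ that leaves $\theta_1$ unpenalized, and evaluates the closed-form solution above.

% MATLAB sketch: regularized (ridge) polynomial data fitting
rng(2);
N = 30;  p = 8;  lambda = 0.1;             % hypothetical sizes and weight
x = linspace(-1, 1, N)';
y = sin(3*x) + 0.1*randn(N, 1);            % noisy synthetic data
A1 = x.^(0:p-1);                           % columns are f_j(x) = x^(j-1)
A2 = [zeros(p-1, 1), eye(p-1)];            % selects theta_2, ..., theta_p
theta = (A1'*A1 + lambda*(A2'*A2)) \ (A1'*y);
yhat = A1*theta;                           % fitted values at the data points
disp(norm(yhat - y))                       % residual norm of the fit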


Different Flavors and Extensions

In the prior example we explored cases where each dimension of our input data is treated equally, resulting in a linear relationship. This is not the only type of least squares problem we can consider. We can also augment our data by raising certain dimensions to different powers, applying a logarithm, or really anything we desire. There may not be a closed-form solution from a linear algebra perspective, but these problems are still of interest, since our data is not always related linearly to the outputs. If the data follows a parabolic curve, for example, it does not make sense to approximate it with a line.

The main limitation of these extra cases is that we cannot easily obtain a solution the way we can in the linear case. This has a strong relationship to many machine learning techniques, whose goal is also to minimize such quantities, typically with approaches based on gradient descent. The concept of squaring corresponds to the L2 norm and is popular because it is differentiable and leads to nice closed-form solutions. It is by no means the only option: there is a lot of flexibility in how the penalized quantity is defined, and we could just as easily have taken absolute values combined with polynomial features. These are just some of the many approaches one can take, and the specific choice is usually made through empirical trial and error combined with prior knowledge about the data. Hopefully this article has given some more insight into the various least squares approaches. If you want to learn more, related topics include least squares with constraints on the solution (constrained least squares) and other machine learning methods that build on this idea of minimizing a cost.

An interesting application

In this part, we will briefly discuss how we can treat some problems in real life as multi-objective least squares problems.

An interesting application of estimation (or multi-objective least squares problem) is image deblurring/denoising.

Let $x_{ex}$ be an unknown image and $y$ be the observed image.

In the MATLAB code below, we load in an image from the USC-SIPI Image Database at http://sipi.usc.edu/database.

% MATLAB script
% Load the blurred, noisy image Y and the blurring kernel B from deblur.mat
load('deblur.mat');   % provides the variables Y and B
imshow(Y);            % display the blurred image
Fig. 1: The blurred image Y

Problem. In this image deblurring problem, we are given a noisy and blurred image $Y$, which comes from a clear yet unknown image $x_{ex}$. We can model this transformation as $Y = Ax_{ex} + n$, where $A$ is a known blurring matrix and $n$ is unknown noise.

Method. We will construct an estimate $\hat{x}$ so that 1) the noise is suppressed, and 2) the image no longer looks blurry.

We will introduce the cost function/objective:

$$\textnormal{minimize} \quad J_1 + \lambda J_2 = \lVert Ax - y \rVert^2 + \lambda(\lVert D_vx\rVert^2 + \lVert D_hx \rVert^2) $$

$$\textnormal{where} \quad \lVert D_vx\rVert^2 + \lVert D_hx \rVert^2 = \sum_{i=1}^{M}\sum_{j=1}^{N-1}(X_{i, j+1}-X_{i,j})^2+\sum_{i=1}^{M-1}\sum_{j=1}^{N}(X_{i+1, j}-X_{i,j})^2$$

Here $X$ is the $M \times N$ array of pixel values stored in the vector $x$.

Intuition. The term $\lVert D_vx\rVert^2 + \lVert D_hx \rVert^2$ is the sum of squared differences between neighboring (vertical and horizontal) pixel values. If this term is small, neighboring pixels transition smoothly. If this term is large, $\hat{x}$ will look like a mosaic image.
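
Before continuing with the FFT-based script, here is a small sketch of how $D_v$ and $D_h$ could be formed explicitly as sparse first-difference matrices; the image size is a hypothetical placeholder, and the script below never forms these matrices (it works in the frequency domain instead).

% MATLAB sketch: explicit sparse difference operators for an M-by-N image
M = 64;  N = 64;                                         % hypothetical size
Dm = spdiags([-ones(M,1), ones(M,1)], [0, 1], M-1, M);   % 1-D first difference
Dn = spdiags([-ones(N,1), ones(N,1)], [0, 1], N-1, N);
Dv = kron(speye(N), Dm);    % vertical differences of x = X(:)
Dh = kron(Dn, speye(M));    % horizontal differences of x = X(:)
% For any image X: norm(Dv*X(:))^2 + norm(Dh*X(:))^2 equals the double sum above.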

% MATLAB script (cont.)
% E is the (cyclic) first-difference kernel used for the vertical differences;
% E' gives the horizontal ones. B is the 1024-by-1024 blurring kernel.
E = [1, zeros(1, 1023); zeros(1022, 1024); -1, zeros(1, 1023)];
% Frequency-domain denominator: |F(B)|^2 + lambda*(|F(E)|^2 + |F(E')|^2)
D = @(lambda) abs(fft2(B)).^2 + lambda.*abs(fft2(E)).^2 + lambda.*abs(fft2(E')).^2;

for i = -6:2:0
  % Regularized solution for lambda = 10^i, evaluated with the 2-D FFT
  X = real(ifft2((conj(fft2(B)).*fft2(Y))./D(10.^i)));
  figure();
  imshow(X);
  title(sprintf('lambda = 10^{%d}', i));
end
Fig. 2: Deblurred images for different values of lambda

Analysis: In the above code, we used the fast Fourier transform (FFT) and its inverse to evaluate the deblurring solution efficiently; the details are outside the scope of this article. We notice that when $\lambda = 10^{-6}$, the deblurred image looks like a mosaic image. This is similar to "overfitting" in regression problems, since the cost function largely ignores the variation between neighboring pixels. On the other hand, when $\lambda = 1$, the cost function heavily penalizes abrupt changes in neighboring pixel values, resulting in an over-smoothed image.


Conclusion

In this wiki article, we formulated the multi-objective least squares problem, derived its solution, and introduced the regularization term. We also explored other variations that expand on this idea before turning to a real-world example: we treated the image deblurring task as a multi-objective least squares problem.

References

Boyd, S., & Vandenberghe, L. (2019). Chapter15: Multi-objective Least Squares. In Introduction to applied linear algebra: Vectors, matrices, and least squares (pp. 309–325). essay, Cambridge University Press.

USC Signal and Image Processing Institute. (n.d.). Volume 1: Mosaics. SIPI Image Database. Retrieved February 10, 2023, from http://sipi.usc.edu/database

Johari, A. (2020, May 13). A 101 guide on the least squares regression method. Medium. Retrieved February 10, 2023, from https://medium.com/edureka/least-square-regression-40b59cca8ea7
