What to learn and introduction to problems solvable by analytical approach

Idealistic models that inspire deep network structures

What learning is all about? Or at least, what are modern methods actually doing?

In nature, data in high dimensional space has low dimensional support. Data lie precisely on geomoetric structures such as subspaces or surfaces. Some belong to a subspace, some belong to a mixture of subspaces or manifolds, most of the time we don’t know the dimension of each component. Such constraint make data highly dependent on one another. In other words, natural data is predictable. Completion is perhaps another name of prediction. Denoise and error correction, are tasks that are enabled by the fact that data has constraints.

Thus, we can say that learning is useful as it aims at, first, identifying and representing the low-dimensional structure of data. Next, we utilize the low-dimensional structure to inference for useful downstream tasks.

Why learning the model $p(x)$ is useful? - Bayesian inference point of view

Most modern machine learning methods attempt to identify the low dimensional structure of the data distribution, either by getting hold of thosse distributions analytically (e.g. PCA) or empirically(e.g. Neural Network). Why is that useful? Such learnt low-dimensional structure of data is in fact the model of the world, and in Bayesian language, the prior. Almost all practical applications can be viewed as a special case of the following inference problem: given an observation $y$ that depends on $x$,

\[\begin{align} y = h(x) + w \end{align}\]

where $h(\cdot)$ represents the measurements of a part of $x$ or its certain observed attributes and $w$ represent the measurement noise, solve the “inverse problem” of obtaining the most likely estimate $\hat{x}(y)$ of $x$. Solving this is equivalent to

\[\begin{align} \hat{x} = \arg\max_{x} p(x\mid y) \end{align}\]

And by Bayes’ rule:

\[\begin{align} p(x \mid y) = \frac{p(y \mid x) p(x)}{p(y)} \end{align}\]

Therefore we can rewrite the problem as: $\begin{align} \hat{x} = \arg\max_{x} [\log p(y \mid x) + \log p(x)] \end{align}$

Last but not the least, another application of get ahold of the conditional probability $p(x \mid y)$ is computing the conditional expectation estimate:

\[\begin{align} \hat{x} = \mathbb{E}[x \mid y] = \int x\, p(x \mid y)\, dx \end{align}\]

(Please confirm if (5) is a different application than (4)).

Our questions then become:

How to learn and represent the data
How to make use of the representation obtained from 1 to conduct useful inference.

In this series of blogs, we tackle these questions by claiming that we learn and represent the data as distribution $p(x)$ with low dimenstional structure. Then we make use of $p(x)$ for inferencing.

Learning problems that are solvable by analytical approach

Analytical approaches involve explicit low-dimensional structure being assumed while empirical do not. In this section, we discuss classical problem settings and solutions for learning models with assumption that data having geometrically (nearly, piece-wise) linear structures and statistically independent components. Such model is what we call analytical models. Such assumptions offer us efficient algorithm with provable efficiency guarantees for processing data at scale. You may ask why we are learning these models when the real world data is far more complex? We will see in the upcoming blogs that many modern deep learning architectures for complex data actually have structures that are similar to algorithm for modeling data with linear and independent structures, such as overcomplete dictionary learning.

Learning distribution with idealistic, linear models with independent structures has efficient solutions. Examples include:

PCA
ICA
Sparse Coding and Overcomplete Dictionary Learning

Distributions for the above modeling problems can be learned and represented explicitly. We will see later that general distribution does not yield efficient analytical solution and cannot be represented explicitly but they can be learnt implicitly as a denoiser.

Reminder on what we mean by explicit family of parametric model: The model class that is directly and mathematically specified, for example, “all Gaussians with arbitary mean and covariance”. When learning this type of model, we fix the form of the model and tune the set of parameters. Neural networks on the other hands, are considered empirical approaches without clearly stating analytically the form of distribution they are trying to learn. One might argue that the network architecture may limit or define the type of distribution they are able to learn, but the design process of neural networks does not involve explicitly define the form of distribution with clear mathematical descriptions.

PCA - finding the single subspace that best fits the data

In PCA, data is assumed, explicitly, to live on a single Gaussian subspace $\mathcal{S} \subseteq \mathbb{R}^D$ of dimension $d$ with basis represented by an orthonormal matrix $\mathbf{U} \in O(D, d) \subseteq \mathbb{R}^{D \times d}$ such that the columns of $\mathbf{U}$ span $\mathcal{S}$. The problem of learning thus becomes:

Given observed data ${x_i}_{i=1}^N$, finding the orthonoraml matrix $\mathbf{U} \in O(D,d)$ such that

\[\begin{align} \mathbf{x}_i = \mathbf{U} \mathbf{z}_i + \boldsymbol{\varepsilon}_i, \quad \forall i \in [N] \end{align}\]

where

\[\begin{align} \{\mathbf{z}_i\}_{i=1}^N \subseteq \mathbb{R}^d, \quad d \ll D \end{align}\]

The optimization problem for finding the optimal subspace $\mathbf{U}^*$ can be formulated as:

\[\begin{align} \arg \min_{\tilde{\mathbf{U}}} \frac{1}{N} \sum_{i=1}^{N} \| \mathbf{x}_i - \tilde{\mathbf{U}} \tilde{\mathbf{U}}^T \mathbf{x}_i \|_2^2 &= \arg \max_{\tilde{\mathbf{U}}} \frac{1}{N} \sum_{i=1}^{N} \| \tilde{\mathbf{U}}^T \mathbf{x}_i \|_F^2 \\ &= \arg \max_{\tilde{\mathbf{U}}} \text{tr} \left\{ \tilde{\mathbf{U}}^T \left( \frac{\mathbf{X} \mathbf{X}^T}{N} \right) \tilde{\mathbf{U}} \right\} \end{align}\]

That is, the optimal solution $(\mathbf{U}^, {\mathbf{z}^i}{i=1}^N)$ to the above optimization problem has $\mathbf{z}^_i = (\mathbf{U}^)^T \mathbf{x}_i$. That’s why the optimization problem above becomes a problem over $\mathbf{U}$ only. The solution of the above optimization problem is given by the top $d$ eigenvectors of:

\[\begin{align} \frac{\mathbf{X} \mathbf{X}^T}{N} \end{align}\]

It is worth noting that projection matrix:

\[\begin{align} \mathbf{U}^* (\mathbf{U}^*)^T \approx \mathbf{U} \mathbf{U}^T \end{align}\]

can project the noisy data point $\mathbf{x}_i$ onto subspace $\mathcal{S}$.

Reminder of linear algebra concept: Matrix multiplication, let’s say multiplied by $\mathbf{U}$ has 3 interpretations:

Transformation of a vector
Express original vector $\mathbf{x}$ in terms of a vector based on coordinate provided by $\mathbf{U}$.

Power Iteration - Workhorse of PCA

Computing full SVD for a matrix is computationally intensive. There is, however, an efficient way to comput the eigenvector of a symmetric, positive semidefinite matrix $\mathbf{M}$. The method is called power iteration.

Power iteration is a fundamental algorithm for finding the dominant eigenvector of a matrix $\mathbf{M}$. The key insight is that repeatedly applying the matrix and normalizing will converge to the eigenvector corresponding to the largest eigenvalue.

The Algorithm.

Assume that $\lambda_1 > \lambda_i$ for all $i > 1$. We want to find the fixed point of:

\[\begin{align} \mathbf{w} = \frac{\mathbf{M}\mathbf{w}}{\|\mathbf{M}\mathbf{w}\|_2} \end{align}\]

We can solve this using the following iterative procedure:

\[\begin{align} \mathbf{v}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad \mathbf{v}_{t+1} = \frac{\mathbf{M}\mathbf{v}_t}{\|\mathbf{M}\mathbf{v}_t\|_2} \end{align}\]

Why does this work?

Proof Sketch.

For any iteration $t$, we have:

\[\begin{align} \mathbf{v}_t = \frac{\mathbf{M}^t \mathbf{v}_0}{\|\mathbf{M}^t \mathbf{v}_0\|_2} \end{align}\]

Let’s decompose the initial vector $\mathbf{v}_0$ in the eigenbasis of $\mathbf{M}$:

\[\begin{align} \mathbf{v}_0 = \sum_{i=1}^D \alpha_i \mathbf{w}_i \end{align}\]

where $\mathbf{M}\mathbf{w}_i = \lambda_i \mathbf{w}_i$ and $\lambda_1 > \lambda_2 \geq \dotsb \geq \lambda_D \geq 0$.

Substituting this into our iteration formula:

\[\begin{align} \mathbf{v}_t = \frac{ \sum_{i=1}^D \lambda_i^t \alpha_i \mathbf{w}_i }{ \left\| \sum_{i=1}^D \lambda_i^t \alpha_i \mathbf{w}_i \right\|_2 } \end{align}\]

As $t \to \infty$, since $\lambda_1 > \lambda_i$ for $i > 1$, the terms with $i > 1$ vanish exponentially faster than the first term. This gives us:

\[\begin{align} \lim_{t \to \infty} \mathbf{v}_t = \frac{ \alpha_1 \mathbf{w}_1 }{ |\alpha_1| } = \operatorname{sign}(\alpha_1)\mathbf{w}_1 \end{align}\]

Therefore, $\mathbf{v}_t$ converges to a unit eigenvector of $\mathbf{M}$ corresponding to the largest eigenvalue.

Estimating the eigenvalue.

Once we have the eigenvector, the corresponding eigenvalue can be estimated as:

\[\begin{align} \lambda_1 \approx \mathbf{v}_t^\top \mathbf{M} \mathbf{v}_t \end{align}\]

This quantity converges to $\lambda_1$ at the same rate as the eigenvector convergence.

Implementation:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
 import numpy as np

def power_iteration(M, num_iterations=1000, tol=1e-6):
    """
    Compute the dominant eigenvalue and eigenvector of a symmetric positive semidefinite matrix
    using the Power Iteration method.

    Parameters:
    M (numpy.ndarray): Symmetric positive semidefinite matrix
    num_iterations (int): Maximum number of iterations
    tol (float): Convergence tolerance

    Returns:
    tuple: (eigenvalue, eigenvector)
    """
    # Input validation
    if not np.allclose(M, M.T):
        raise ValueError("Matrix must be symmetric")
    if not np.all(np.linalg.eigvals(M) >= -tol):  # Allow small negative values due to numerical errors
        raise ValueError("Matrix must be positive semidefinite")

    # Initialize random vector
    n = M.shape[0]
    v = np.random.rand(n)
    v = v / np.linalg.norm(v)  # Normalize initial vector

    # Power iteration
    for _ in range(num_iterations):
        # Matrix-vector multiplication
        v_new = M @ v

        # Compute eigenvalue (Rayleigh quotient)
        eigenvalue = np.dot(v, v_new)

        # Normalize the new vector
        v_new_norm = np.linalg.norm(v_new)
        if v_new_norm < tol:  # Check for zero vector
            raise ValueError("Power iteration converged to zero vector")
        v_new = v_new / v_new_norm

        # Check convergence
        if np.linalg.norm(v - v_new) < tol or np.linalg.norm(v + v_new) < tol:
            break

        v = v_new

    return eigenvalue, v

# Example usage

# Create a sample 3x3 symmetric positive semidefinite matrix
M = np.array([[4, 2, 0],
              [2, 5, 1],
              [0, 1, 3]])

# Run power iteration
eigenvalue, eigenvector = power_iteration(M)

print("Dominant eigenvalue:", eigenvalue)
print("Corresponding eigenvector:", eigenvector)

# Verify result
print("\nVerification:")
print("M @ eigenvector:", M @ eigenvector)
print("eigenvalue * eigenvector:", eigenvalue * eigenvector)

Limitation of PCA

PCA can denoise data by projecting original data on a single principal subspace. However, it has 2 key limitations:

It fails to model nonlinear structure
Real world data comes from multiple subspaces or surfaces. PCA also fails to model.

The following 3D interactive plots demonstrate visually and computationally why PCA is limited when the data lies on a nonlinear manifold (e.g., a sine wave in 3D). The PCA algorithm is applied to the noisy data, attempting to find the best linear subspace that fits the data. PCA projects the data onto the top 2 principal components, which are linear. The denoised result is obtained by reconstructing the data from this 2D linear subspace. However, the learned model does not offer generative model for nonlinear distribution, since only variance along the principal direction is captured. That is not enough to describe a sinusoidal relationship.

Mixtures of Subspaces and Sparsely-Used Dictionaries

A key limitation is the assumption of a single linear subspace that is responsible for generating the structured observations. In many practical applications, structure generated by a mixture of distinct low-dimensional subspaces provides a more realistic model. For example, consider a video sequence that captures the motion of several distinct objects, each subject to its own independent displacement.

In mixture of Gaussians, the random variable is generated by, randomly select a Gaussian from K Gaussians, then randomly select a random variable in that selected Gaussian.

The probability density function can be written as: $\begin{align} p(\mathbf{x}_n) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \end{align}$

Please don’t confuse this with superposition: $\begin{align} \mathbf{x} = \sum_{i=1}^{n} w_i \mathbf{x}_i, \quad \mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{U}_i \mathbf{U}_i^\top) \end{align}$

After assuming a mixture of Gaussian model, our task is to learn $\mathbf{U}$ which satisfy the following:

\[\begin{align} \mathbf{x} = \begin{bmatrix} | & & | \\ \mathbf{U}_1 & \cdots & \mathbf{U}_K \\ | & & | \end{bmatrix} \begin{bmatrix} \mathbf{z}_1 \\ \vdots \\ \mathbf{z}_K \end{bmatrix} = \mathbf{U}\, \mathbf{z}, \quad \left\| \begin{bmatrix} \|\mathbf{z}_1\|_2 \\ \vdots \\ \|\mathbf{z}_K\|_2 \end{bmatrix} \right\|_0 = 1 \end{align}\]

$\mathbf{U} \in \mathbb{R}^{D \times (K d)}$ is a dictionary consisting of the collection of codewords $\{\mathbf{U}_i\}_{i=1}^K$.
$\mathbf{z} \in \mathbb{R}^{K d}$ is a lifted vector that is $d$-sparse, with support on one block of size $d$.

where $\mathbf{U} \in \mathbb{R}^{D \times Kd}$

A relaxation is to assume matrix $\mathbf{U} \in \mathbb{R}^{D \times m}$ where $m$ may be smaller or larger than $D$. This leads to sparse dictionary learning problem. There are both geometric and physical/modeling motivations for considering $d \ll m$. The problem is thus turned to be an overcomplete dictionary learning problem. The motivation includes having a richer mixtures of subspaces and some computational experiments in the past reveals the additional modeling power conferred by an overcomplete representation, for example, Bruno Olshausen’s paper on overcomplete dictionary learning.

Dictionary Learning vs Traditional “by-design” method

Advantages of Dictionary Learning Over Traditional “By-Design” Methods

Traditional “by-design” methods for constructing bases—such as the Fourier Transform, Wavelet Transform, or even principal component analysis (PCA)—use mathematically engineered bases that are optimal for broad classes of signals, but agnostic to the specific data at hand. Dictionary learning, in contrast, seeks to adapt the basis (the dictionary) directly from the dataset itself, enabling several key advantages:

1. Data Adaptivity

By-Design Methods: Bases like Fourier are fixed a priori; they are not tailored for the specifics of your data.
Dictionary Learning: Learns a dictionary directly from data, giving basis elements that closely match typical data patterns (edges, textures, motifs, etc.). This often leads to sparser and more informative representations.

2. Sparsity and Efficiency

By-Design Methods: Real-world signals are often not sparsely represented in global bases like Fourier or even wavelets. For example, Discrete Cosine Transform had been applied on image compression but in fact it was a poor method to deal with images, since most real-world images have sharp edges which result in the image being not band-limited as a signal, thus it is poorly represented by the transformed representation.
Dictionary Learning: By adapting to data, learned dictionaries can represent each sample using only a few active atoms (columns), which can improve compression, denoising, and interpretability.

3. Expressivity and Overcompleteness

By-Design Methods: Predefined bases are usually complete or orthogonal but not overcomplete, limiting the richness of representation.
Dictionary Learning: Allows learning overcomplete dictionaries (more atoms than dimensions), enabling richer, more flexible representations and capturing more complex or nuanced structures in the data.

In essence, dictionary learning empowers us to discover the most efficient building blocks for our data, often leading to better representations and improved performance in practical machine learning and signal processing applications.

Overcomplete dictionary learning

To learn $\mathbf{U}$, the corresponding optimization problem can be written as follow:

\[\begin{align} \min_{\tilde{\mathbf{Z}},\, \tilde{\mathbf{A}}: \|\tilde{\mathbf{A}}_j\|_2 \leq 1} \Big\{\, \|\mathbf{X} - \tilde{\mathbf{A}}\, \tilde{\mathbf{Z}}\,\|_F^2 + \lambda \|\tilde{\mathbf{Z}}\|_1 \, \Big\} \end{align}\]

There are multiple things worth a separate blog to discuss in this optimization problem:

The constraint of having norm in each basis vector of matrix $\mathbf{A}$
Such minimization problem is non-convex. How to minimize it with alternating minization of $\mathbf{Z}$ and $\mathbf{A}$?
How does the number of rows affect the “quality” of the features learnt?

Despite the dictionary learning problem being a nonconvex problem, it is easy to see that fixing one of the 2 unknowns and optimize the other makes the problem convex and easy to solve. It has been shown that alternating minimization type algorithms indeed converge to the correct solution, at least locally. The algorithm is as follow:

\[\begin{align} \mathbf{Z}^{\ell+1} = S_{\eta\lambda}\left(\mathbf{Z}^{\ell} - 2 \eta\, \mathbf{A}^{+\top} \left(\mathbf{A}^+ \mathbf{Z}^{\ell} - \mathbf{X}\right)\right),\quad \mathbf{Z}^1=\mathbf{0},\quad \forall\, \ell \in [L] \end{align}\] \[\begin{align} \mathbf{Z}^+ = \mathbf{Z}^L \end{align}\] \[\begin{align} \mathbf{A}^{t+1} = \mathrm{proj}_{\|(\cdot)_j\|_2 \leq 1,\, \forall j}\left(\mathbf{A}^t - 2\nu\, (\mathbf{A}^t \mathbf{Z}^+ - \mathbf{X}) (\mathbf{Z}^+)^\top \right) \end{align}\] \[\begin{align} (\mathbf{A}^1)_j \sim \mathrm{i.i.d.}\, \mathcal{N}\left( \mathbf{0}, \frac{1}{D}\mathbf{I} \right),\quad \forall j \in [m] ,\quad \forall t \in [T] \end{align}\] \[\begin{align} \mathbf{A}^+ = \mathbf{A}^T \end{align}\]

The main insight from the above algorithm is that, when we fix $\mathbf{A}$, the ISTA update of $\mathbf{Z}^{\ell}$ looks like the forward pass of neural networks. Then the using the thrid line to update $\mathbf{A}$ based on the residual of using the current estimate of the sparse codes $\mathbf{Z}$

(Draft) Equation 18 and 19 are ISTA algorithm for LASSO problem.

ISTA - An algorithm for solving sparse coding in overcomplete dictionary learning

Learned ISTA (LISTA)

The above deep-network interpretation of the alternating miniziation is more conceptual than practical, as the process can be rather inefficient and take many layers or iterations to converge. This is because we are trying to infer both $\mathbf{A}$ and $\mathbf{Z}$ from $\mathbf{X}$. The problem can be simplified in a supervised setting where $(\mathbf{X}, \mathbf{Z})$ are provided and use, say, back propagation type of algorithm to learn $\mathbf{A}$.

The same methodology can be used as a basis to understand the representations computed in other network architectures, such as Transformer. Modern unsupervised learning paradigms are generally more data friendly but still, LISTA algorithm provide a useful practical basis for us to interpret the features in pretrained large-scale deep networks.

The connection between low-dimensional-structure-seeking optimization algorithms and deep network architecture suggests scalable and natural neural learning architectures which may even be usable without backpropagation.

Sparse Autoencoders

Unlike LISTA, which requires large amounts of labeled $(\mathbf{X}, \mathbf{Z})$, unsupervised learning paradigm is more data friendly. We can use our development of the LISTA algorithm above to understand common practices in this field of research.

Connection of LISTA to Sparse Autoencoder

Enjoy Reading This Article?

Here are some more articles you might like to read next:

Temporal Straightening for Latent Planning in Control Theory Language

Compression via denoising

Optimization from basics to ISTA