Understanding Independent Component Analysis
The Cocktail Party Problem
Consider the classical blind source separation problem. Suppose $n$ independent source signals $s_1(t), \ldots, s_n(t)$ are simultaneously active, and $n$ sensors record linear mixtures of these sources. The observation model is
$$\mathbf{x}(t) = A\,\mathbf{s}(t)$$where $A \in \mathbb{R}^{n \times n}$ is an unknown invertible mixing matrix. Given only the observed mixtures $\mathbf{x}(t)$, the goal is to recover both $A$ and $\mathbf{s}(t)$. This is the cocktail party problem: separating individual speakers from microphone recordings that capture superpositions of all voices.
Two assumptions make this problem tractable. First, the source signals must be statistically independent — knowledge of one source provides no information about any other. Second, at most one source may have a Gaussian distribution. Under these conditions, the mixing matrix $A$ is identifiable up to permutation and scaling of its columns.
The demonstration below generates three independent sources — a sawtooth wave, a square pulse, and a noisy sinusoid — and mixes them through a random $3 \times 3$ matrix $A$. Press Run FastICA to observe the algorithm recover the original sources from the mixtures alone. Note that the recovered signals may appear reordered or sign-flipped; this inherent ambiguity is a fundamental property of ICA.
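This setup can be sketched in a few lines of NumPy; the exact waveforms, seed, and sample count below are illustrative stand-ins for the demo's signals:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
t = np.linspace(0, 8, T)

# Three independent sources: sawtooth, square pulse, noisy sinusoid.
s1 = 2 * (t % 1) - 1                                # sawtooth in [-1, 1]
s2 = np.sign(np.sin(3 * t))                         # square wave
s3 = np.sin(5 * t) + 0.1 * rng.standard_normal(T)   # noisy sinusoid
S = np.vstack([s1, s2, s3])                         # (3, T) source matrix

A = rng.standard_normal((3, 3))                     # unknown mixing matrix
X = A @ S                                           # (3, T) observed mixtures
```

Each row of $X$ is what one sensor records: a different weighted sum of all three sources.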
Non-Gaussianity and the Central Limit Theorem
The core mathematical insight behind ICA derives from the Central Limit Theorem (CLT): a sum of independent random variables is, loosely speaking, closer to Gaussian than the individual summands. Turned around, this means a projection $\mathbf{w}^T\mathbf{x}$ that mixes several sources is more Gaussian than any single source — so a direction $\mathbf{w}$ for which $\mathbf{w}^T\mathbf{x}$ is maximally non-Gaussian must isolate one of the original independent sources.
To quantify departures from Gaussianity, ICA employs negentropy, defined as $J(y) = H(y_{\mathrm{gauss}}) - H(y)$, where $H$ denotes differential entropy and $y_{\mathrm{gauss}}$ is a Gaussian variable with the same variance as $y$. Since the Gaussian distribution maximizes entropy for a given variance, negentropy is always non-negative and equals zero if and only if $y$ is Gaussian.
Computing negentropy directly requires estimating the probability density, which is impractical in high dimensions. FastICA instead uses the approximation
$$J(y) \approx \bigl[E\{G(y)\} - E\{G(\nu)\}\bigr]^2$$where $\nu \sim \mathcal{N}(0,1)$ and $G$ is a smooth nonlinear contrast function. Three standard choices are $G(u) = \log\cosh(u)$ (robust and general-purpose), $G(u) = u^4/4$ (sensitive to kurtosis, but fragile under outliers), and $G(u) = -\exp(-u^2/2)$ (well suited to highly super-Gaussian sources). The derivative $g(u) = G'(u)$ serves as a scoring function in the fixed-point update, weighting each data point by its contribution to the non-Gaussianity measure.
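The contrast functions and the negentropy approximation translate directly into code; a minimal NumPy sketch (the dictionary keys and helper name are my own):

```python
import numpy as np

# Contrast functions G and their derivatives g = G' (the scoring functions).
CONTRASTS = {
    "logcosh": (lambda u: np.log(np.cosh(u)), np.tanh),
    "quartic": (lambda u: u**4 / 4,           lambda u: u**3),
    "gauss":   (lambda u: -np.exp(-u**2 / 2), lambda u: u * np.exp(-u**2 / 2)),
}

def negentropy_approx(y, G, n_ref=100_000, seed=0):
    """J(y) ~ (E[G(y)] - E[G(nu)])^2 with nu ~ N(0, 1).

    Assumes y is standardized (zero mean, unit variance)."""
    nu = np.random.default_rng(seed).standard_normal(n_ref)
    return (G(y).mean() - G(nu).mean()) ** 2
```

A standardized uniform sample, being non-Gaussian, scores visibly higher than a Gaussian sample of the same size.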
Explore the three contrast functions below. Select a function and a distribution type to observe how $g(u)$ maps samples from different distributions. The highlighted points (amber) have large $|g(u)|$ values and contribute most strongly to the FastICA update direction.
The Negentropy Landscape
To build geometric intuition, consider the two-dimensional case. After whitening (described in the next section), the unit-norm constraint $\|\mathbf{w}\| = 1$ reduces the search space to a circle, parameterized by a single angle $\theta$. The negentropy $J(\theta)$ traces a landscape over this circle, with peaks at directions that recover independent sources.
FastICA performs an approximate Newton's method on this landscape, ascending toward a peak from any starting angle. Drag the slider below to observe how the projected histogram changes with $\theta$, or press Run FastICA to watch the algorithm climb to a peak of $J(\theta)$.
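The landscape itself can be computed by brute force: sweep $\theta$ over the half-circle and evaluate the negentropy approximation at each angle. A sketch, assuming whitened two-dimensional data and the log-cosh contrast:

```python
import numpy as np

def negentropy_landscape(Z, n_angles=360, seed=0):
    """J(theta) for projections w(theta) = (cos theta, sin theta) of whitened Z (2, T)."""
    G = lambda u: np.log(np.cosh(u))
    nu = np.random.default_rng(seed).standard_normal(100_000)
    E_G_nu = G(nu).mean()                        # Gaussian reference E{G(nu)}
    thetas = np.linspace(0, np.pi, n_angles)     # directions repeat after pi
    J = np.empty(n_angles)
    for i, th in enumerate(thetas):
        y = np.cos(th) * Z[0] + np.sin(th) * Z[1]
        J[i] = (G(y).mean() - E_G_nu) ** 2
    return thetas, J
```

For two independent unit-variance uniform sources, the peaks sit on the source axes, while the 45° direction — a genuine mixture, hence more Gaussian — lies in a valley.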
The FastICA Algorithm
FastICA operates in three stages: centering, whitening, and fixed-point iteration.
Centering subtracts the sample mean to obtain $\tilde{\mathbf{x}} = \mathbf{x} - E\{\mathbf{x}\}$, so that $E\{\tilde{\mathbf{x}}\} = 0$. This simplifies subsequent computations without altering the mixing structure.
Whitening transforms the centered data to have identity covariance: $\mathbf{z} = V\tilde{\mathbf{x}}$, where $V = D^{-1/2}E^T$ and $C_{\mathbf{x}} = EDE^T$ is the eigendecomposition of the covariance matrix. After whitening, $E\{\mathbf{z}\mathbf{z}^T\} = I$, and the effective mixing matrix $\tilde{A} = VA$ becomes orthogonal. This is the key insight: whitening reduces the search from an arbitrary invertible matrix to an orthogonal rotation.
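Centering and whitening together take only a few lines; a sketch following the eigendecomposition recipe above:

```python
import numpy as np

def whiten(X):
    """Center X (n, T), then whiten: Z = V Xc with V = D^{-1/2} E^T."""
    Xc = X - X.mean(axis=1, keepdims=True)     # centering
    C = Xc @ Xc.T / Xc.shape[1]                # sample covariance (n, n)
    d, E = np.linalg.eigh(C)                   # C = E diag(d) E^T
    V = np.diag(d ** -0.5) @ E.T               # whitening matrix
    return V @ Xc, V
```

By construction the whitened data has identity sample covariance, which is exactly the property the fixed-point iteration relies on.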
The fixed-point update finds each unmixing direction $\mathbf{w}_i$ by iterating
$$\mathbf{w}^{+} = E\bigl\{\mathbf{z}\cdot g(\mathbf{w}^T\mathbf{z})\bigr\} - E\bigl\{g'(\mathbf{w}^T\mathbf{z})\bigr\}\cdot\mathbf{w}$$followed by normalization $\mathbf{w} \leftarrow \mathbf{w}^{+} / \|\mathbf{w}^{+}\|$. Here $g = G'$ is the derivative of the contrast function. The second term $E\{g'(\mathbf{w}^T\mathbf{z})\}\cdot\mathbf{w}$ provides Newton-like stabilization.
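A one-unit implementation of this update is short. The sketch below uses the log-cosh contrast, so $g = \tanh$ and $g'(u) = 1 - \tanh^2(u)$; the convergence test and iteration cap are my own choices:

```python
import numpy as np

def fastica_one_unit(Z, g, g_prime, max_iter=200, tol=1e-8, seed=0):
    """Fixed-point iteration for one unmixing direction on whitened Z (n, T)."""
    n, T = Z.shape
    w = np.random.default_rng(seed).standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = w @ Z                                           # (T,) projections
        w_new = (Z * g(y)).mean(axis=1) - g_prime(y).mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1) < tol:                   # converged up to sign
            return w_new
        w = w_new
    return w
```

On a whitened two-source mixture, the recovered projection $\mathbf{w}^T\mathbf{z}$ correlates almost perfectly (up to sign) with one of the true sources.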
The diagram below tracks the matrix dimensions throughout the pipeline. We assume the square case ($n$ sensors for $n$ sources), so every transform matrix is $(n \times n)$ and every data matrix is $(n \times T)$.
A few remarks on computational cost. Inside the fixed-point update, $\mathbf{w}_i^T\mathbf{z}$ produces a $(1 \times T)$ vector — the projection of all $T$ samples onto the current direction. The function $g(\cdot)$ is applied elementwise and is therefore $O(T)$ per iteration. The expectation $E\{\mathbf{z}\cdot g(\mathbf{w}_i^T\mathbf{z})\}$ is a matrix-vector product $(n \times T)\cdot(T \times 1)/T$, yielding the new $(n \times 1)$ direction.
The interactive demonstration at the end of the next section walks through the complete FastICA pipeline — from raw sources through centering, whitening, and fixed-point iteration to the recovered independent components.
Deflation: Extracting Multiple Components
After finding $\mathbf{w}_1$, how is $\mathbf{w}_2$ determined? Running the same fixed-point iteration from a different random initialization will, in general, converge to the same peak — both runs are drawn to the single strongest non-Gaussian projection. The solution is deflationary Gram-Schmidt orthogonalization: after each fixed-point update of $\mathbf{w}_2$, we subtract its projection onto $\mathbf{w}_1$ and renormalize:
$$\mathbf{w}_2 \leftarrow \mathbf{w}_2 - (\mathbf{w}_2^T\mathbf{w}_1)\,\mathbf{w}_1, \qquad \mathbf{w}_2 \leftarrow \frac{\mathbf{w}_2}{\|\mathbf{w}_2\|}$$This orthogonality constraint is valid in the whitened space because whitening makes the effective mixing matrix orthogonal: $\tilde{A}^T\tilde{A} = I$. The true unmixing directions in whitened space are therefore orthonormal.
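Putting the fixed-point update and the deflation step together gives the full extraction loop. A sketch on whitened data, with the orthogonalization applied inside every iteration (function name and convergence details are my own):

```python
import numpy as np

def fastica_deflation(Z, g, g_prime, max_iter=200, tol=1e-8, seed=0):
    """Extract all n unmixing directions from whitened Z (n, T) by deflation."""
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for i in range(n):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ Z
            w_new = (Z * g(y)).mean(axis=1) - g_prime(y).mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)   # Gram-Schmidt against found rows
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1) < tol:
                break
            w = w_new
        W[i] = w_new
    return W                                     # recovered sources: W @ Z
```

Because each new direction is projected off the previously found rows before normalization, the returned $W$ is orthonormal in the whitened space, as the theory requires.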
An important distinction follows. The directions $\mathbf{w}_1$ and $\mathbf{w}_2$ are orthogonal in the whitened space, but the full unmixing vectors $W_1$ and $W_2$ in the original observation space are generally not orthogonal. Unlike PCA, ICA components are statistically independent but not geometrically orthogonal.
The visualization below demonstrates the complete FastICA pipeline, including deflation. Click through all eight phases: from the original sources, through preprocessing, to finding both unmixing directions and recovering the independent components. In particular, observe how naive iteration for $\mathbf{w}_2$ (phase 6) converges to the same direction as $\mathbf{w}_1$, and how Gram-Schmidt orthogonalization (phase 7) forces it to the orthogonal complement.
The Non-Square Case
When the number of sensors $m$ exceeds the number of sources $n$, the only modification occurs in the whitening step. The covariance eigendecomposition retains only the top $n$ eigenvalues, producing a whitening matrix $V \in \mathbb{R}^{n \times m}$ that simultaneously reduces dimensionality and whitens. After this step, $\mathbf{z} = V\tilde{\mathbf{x}}$ is $(n \times T)$, and the FastICA core operates identically to the square case.
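The reduced whitening step differs from the square case only in which eigenpairs are kept. A sketch (the helper name is my own):

```python
import numpy as np

def whiten_reduce(X, n):
    """Center (m, T) data, keep the top-n eigenpairs, and whiten: Z is (n, T)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])   # eigenvalues in ascending order
    d, E = d[-n:], E[:, -n:]                         # top n eigenpairs
    V = np.diag(d ** -0.5) @ E.T                     # (n, m) reduce-and-whiten matrix
    return V @ Xc, V
```

After this step the data lives in an $n$-dimensional whitened space, and the square-case FastICA code applies unchanged.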
The comparison below highlights precisely where the two pipelines diverge. The amber-highlighted dimensions are those that change; everything in the whitened space (blue) remains $(n \times n)$.
Application: fMRI Analysis
Independent Component Analysis is a standard tool in functional neuroimaging, where it decomposes brain activity into spatially or temporally independent networks. A 4D fMRI volume (three spatial dimensions plus time) is first reshaped into a 2D matrix $X$ by flattening all spatial dimensions into rows. Each row represents the BOLD time series at one voxel.
PCA reduces the dimensionality from $m$ voxels to $n$ components (typically $n \approx 20$–$40$) while simultaneously whitening. FastICA then operates in the compact $(n \times n)$ whitened space. Finally, each row of the combined unmixing transform $WV \in \mathbb{R}^{n \times m}$ is reshaped back into a 3D brain map, revealing spatially localized networks. The resulting components include both neural networks (default mode, visual, motor) and structured noise (head motion, physiological pulsation), the latter identifiable by edge-localized spatial patterns and spiky temporal profiles.
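The reshaping bookkeeping is worth seeing explicitly. The toy grid below stands in for a real acquisition (actual fMRI grids and component counts differ), and the component matrix is random filler in place of a fitted unmixing transform:

```python
import numpy as np

nx, ny, nz, T = 8, 8, 4, 100                    # toy spatial grid and scan length
rng = np.random.default_rng(0)
volume = rng.standard_normal((nx, ny, nz, T))   # stand-in 4D fMRI volume

m = nx * ny * nz                                # number of voxels
X = volume.reshape(m, T)                        # (voxels, time): one row per voxel

# ... center, PCA-whiten to (n, T), run FastICA ...
n = 3
maps_flat = rng.standard_normal((n, m))         # filler for the fitted spatial maps
brain_maps = maps_flat.reshape(n, nx, ny, nz)   # one 3D map per component
```

The reshape is lossless bookkeeping: flattening and unflattening are exact inverses, so voxel identities are preserved end to end.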
Click through the pipeline steps below to trace the transformation from raw 4D data to interpreted brain networks.
Spatial ICA vs. Temporal ICA
A fundamental observation: ICA always enforces statistical independence on the rows of $S$ in the decomposition $X = AS$. The orientation of the data matrix $X$ therefore determines the nature of the independent components.
In spatial ICA, the data matrix is organized as $X \in \mathbb{R}^{T \times m}$ (time $\times$ voxels). The rows of $S$ are then $m$-dimensional spatial vectors — independent spatial maps. The columns of $A$ are the associated time courses, which carry no independence guarantee. This formulation exploits the spatial sparsity of brain networks: each network occupies a localized set of voxels, and most voxels participate in at most one network. Spatial ICA is the standard approach in tools such as FSL MELODIC and the Human Connectome Project.
In temporal ICA, the matrix is transposed: $X \in \mathbb{R}^{m \times T}$. The rows of $S$ are now $T$-dimensional temporal vectors — independent time courses. The columns of $A$ give spatial maps that may overlap. This formulation assumes that the temporal dynamics of different processes (task responses, slow drifts, cardiac pulsation) are statistically independent.
The cocktail party problem is temporal ICA: the source signals are temporal waveforms, and independence holds across time. A useful mnemonic: the dimension in the columns of $X$ is the dimension along which ICA enforces independence, because it becomes the row dimension of $S$.
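The shape arithmetic behind this mnemonic can be checked mechanically; the sizes below are illustrative, and the matrices are random stand-ins for fitted factors:

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, n = 120, 500, 4     # timepoints, voxels, components (illustrative)

# Spatial ICA: X (T x m) = A (T x n) @ S (n x m)
#   rows of S -> independent spatial maps of length m
#   cols of A -> associated time courses (no independence guarantee)
A_sp = rng.standard_normal((T, n))
S_sp = rng.standard_normal((n, m))
X_sp = A_sp @ S_sp

# Temporal ICA: transpose the data, X (m x T) = A (m x n) @ S (n x T)
#   rows of S -> independent time courses of length T
#   cols of A -> spatial maps, free to overlap
A_tm = rng.standard_normal((m, n))
S_tm = rng.standard_normal((n, T))
X_tm = A_tm @ S_tm
```

In both orientations the column dimension of $X$ equals the row length of $S$ — the axis along which independence is enforced.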