Understanding Independent Component Analysis
The Cocktail Party Problem
Consider the classical blind source separation problem. Suppose $n$ independent source signals $s_1(t), \ldots, s_n(t)$ are simultaneously active, and $n$ sensors record linear mixtures of these sources. The observation model is
$$\mathbf{x}(t) = A\,\mathbf{s}(t)$$where $A \in \mathbb{R}^{n \times n}$ is an unknown invertible mixing matrix. Given only the observed mixtures $\mathbf{x}(t)$, the goal is to recover both $A$ and $\mathbf{s}(t)$. This is the cocktail party problem: separating individual speakers from microphone recordings that capture superpositions of all voices.
Two assumptions make this problem tractable. First, the source signals must be statistically independent — knowledge of one source provides no information about any other. Second, at most one source may have a Gaussian distribution. Under these conditions, the mixing matrix $A$ is identifiable up to permutation and scaling of its columns.
The demonstration below generates three independent sources — a sawtooth wave, a square pulse, and a noisy sinusoid — and mixes them through a random $3 \times 3$ matrix $A$. Press Run FastICA to observe the algorithm recover the original sources from the mixtures alone. Note that the recovered signals may appear reordered or sign-flipped; this inherent ambiguity is a fundamental property of ICA.
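This setup can be sketched in a few lines of NumPy; the exact waveforms, seed, and sample count below are illustrative stand-ins for the demo's signals:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
t = np.linspace(0, 8, T)

# Three independent sources: sawtooth, square pulse, noisy sinusoid.
s1 = 2 * (t % 1) - 1                                # sawtooth in [-1, 1]
s2 = np.sign(np.sin(3 * t))                         # square wave
s3 = np.sin(5 * t) + 0.1 * rng.standard_normal(T)   # noisy sinusoid
S = np.vstack([s1, s2, s3])                         # (3, T) source matrix

A = rng.standard_normal((3, 3))                     # unknown mixing matrix
X = A @ S                                           # (3, T) observed mixtures
```

Each row of $X$ is what one sensor records: a different weighted sum of all three sources.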
Non-Gaussianity and the Central Limit Theorem
The core mathematical insight behind ICA derives from the Central Limit Theorem (CLT): a sum of independent random variables is, loosely speaking, closer to Gaussian than the individual summands. Turned around, this means a projection $\mathbf{w}^T\mathbf{x}$ that mixes several sources is more Gaussian than any single source — so a direction $\mathbf{w}$ for which $\mathbf{w}^T\mathbf{x}$ is maximally non-Gaussian must isolate one of the original independent sources.
To quantify departures from Gaussianity, ICA employs negentropy, defined as $J(y) = H(y_{\mathrm{gauss}}) - H(y)$, where $H$ denotes differential entropy and $y_{\mathrm{gauss}}$ is a Gaussian variable with the same variance as $y$. Since the Gaussian distribution maximizes entropy for a given variance, negentropy is always non-negative and equals zero if and only if $y$ is Gaussian.
Computing negentropy directly requires estimating the probability density, which is impractical in high dimensions. FastICA instead uses the approximation
$$J(y) \approx \bigl[E\{G(y)\} - E\{G(\nu)\}\bigr]^2$$where $\nu \sim \mathcal{N}(0,1)$ and $G$ is a smooth nonlinear contrast function. Three standard choices are $G(u) = \log\cosh(u)$ (robust and general-purpose), $G(u) = u^4/4$ (sensitive to kurtosis, but fragile under outliers), and $G(u) = -\exp(-u^2/2)$ (well suited to highly super-Gaussian sources). The derivative $g(u) = G'(u)$ serves as a scoring function in the fixed-point update, weighting each data point by its contribution to the non-Gaussianity measure.
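The contrast functions and the negentropy approximation translate directly into code; a minimal NumPy sketch (the dictionary keys and helper name are my own):

```python
import numpy as np

# Contrast functions G and their derivatives g = G' (the scoring functions).
CONTRASTS = {
    "logcosh": (lambda u: np.log(np.cosh(u)), np.tanh),
    "quartic": (lambda u: u**4 / 4,           lambda u: u**3),
    "gauss":   (lambda u: -np.exp(-u**2 / 2), lambda u: u * np.exp(-u**2 / 2)),
}

def negentropy_approx(y, G, n_ref=100_000, seed=0):
    """J(y) ~ (E[G(y)] - E[G(nu)])^2 with nu ~ N(0, 1).

    Assumes y is standardized (zero mean, unit variance)."""
    nu = np.random.default_rng(seed).standard_normal(n_ref)
    return (G(y).mean() - G(nu).mean()) ** 2
```

A standardized uniform sample, being non-Gaussian, scores visibly higher than a Gaussian sample of the same size.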
Explore the three contrast functions below. Select a function and a distribution type to observe how $g(u)$ maps samples from different distributions. The highlighted points (amber) have large $|g(u)|$ values and contribute most strongly to the FastICA update direction.
The Negentropy Landscape
To build geometric intuition, consider the two-dimensional case. After whitening (described in the next section), the unit-norm constraint $\|\mathbf{w}\| = 1$ reduces the search space to a circle, parameterized by a single angle $\theta$. The negentropy $J(\theta)$ traces a landscape over this circle, with peaks at directions that recover independent sources.
FastICA performs an approximate Newton's method on this landscape, ascending toward a peak from any starting angle. Drag the slider below to observe how the projected histogram changes with $\theta$, or press Run FastICA to watch the algorithm climb to a peak of $J(\theta)$.
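The landscape itself can be computed by brute force: sweep $\theta$ over the half-circle and evaluate the negentropy approximation at each angle. A sketch, assuming whitened two-dimensional data and the log-cosh contrast:

```python
import numpy as np

def negentropy_landscape(Z, n_angles=360, seed=0):
    """J(theta) for projections w(theta) = (cos theta, sin theta) of whitened Z (2, T)."""
    G = lambda u: np.log(np.cosh(u))
    nu = np.random.default_rng(seed).standard_normal(100_000)
    E_G_nu = G(nu).mean()                        # Gaussian reference E{G(nu)}
    thetas = np.linspace(0, np.pi, n_angles)     # directions repeat after pi
    J = np.empty(n_angles)
    for i, th in enumerate(thetas):
        y = np.cos(th) * Z[0] + np.sin(th) * Z[1]
        J[i] = (G(y).mean() - E_G_nu) ** 2
    return thetas, J
```

For two independent unit-variance uniform sources, the peaks sit on the source axes, while the 45° direction — a genuine mixture, hence more Gaussian — lies in a valley.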
The FastICA Algorithm
FastICA operates in three stages: centering, whitening, and fixed-point iteration.
Centering subtracts the sample mean to obtain $\tilde{\mathbf{x}} = \mathbf{x} - E\{\mathbf{x}\}$, so that $E\{\tilde{\mathbf{x}}\} = 0$. This simplifies subsequent computations without altering the mixing structure.
Whitening transforms the centered data to have identity covariance: $\mathbf{z} = V\tilde{\mathbf{x}}$, where $V = D^{-1/2}E^T$ and $C_{\mathbf{x}} = EDE^T$ is the eigendecomposition of the covariance matrix. After whitening, $E\{\mathbf{z}\mathbf{z}^T\} = I$, and the effective mixing matrix $\tilde{A} = VA$ becomes orthogonal. This is the key insight: whitening reduces the search from an arbitrary invertible matrix to an orthogonal rotation.
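Centering and whitening together take only a few lines; a sketch following the eigendecomposition recipe above:

```python
import numpy as np

def whiten(X):
    """Center X (n, T), then whiten: Z = V Xc with V = D^{-1/2} E^T."""
    Xc = X - X.mean(axis=1, keepdims=True)     # centering
    C = Xc @ Xc.T / Xc.shape[1]                # sample covariance (n, n)
    d, E = np.linalg.eigh(C)                   # C = E diag(d) E^T
    V = np.diag(d ** -0.5) @ E.T               # whitening matrix
    return V @ Xc, V
```

By construction the whitened data has identity sample covariance, which is exactly the property the fixed-point iteration relies on.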
The fixed-point update finds each unmixing direction $\mathbf{w}_i$ by iterating
$$\mathbf{w}^{+} = E\bigl\{\mathbf{z}\cdot g(\mathbf{w}^T\mathbf{z})\bigr\} - E\bigl\{g'(\mathbf{w}^T\mathbf{z})\bigr\}\cdot\mathbf{w}$$followed by normalization $\mathbf{w} \leftarrow \mathbf{w}^{+} / \|\mathbf{w}^{+}\|$. Here $g = G'$ is the derivative of the contrast function. The second term $E\{g'(\mathbf{w}^T\mathbf{z})\}\cdot\mathbf{w}$ provides Newton-like stabilization.
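A one-unit implementation of this update is short. The sketch below uses the log-cosh contrast, so $g = \tanh$ and $g'(u) = 1 - \tanh^2(u)$; the convergence test and iteration cap are my own choices:

```python
import numpy as np

def fastica_one_unit(Z, g, g_prime, max_iter=200, tol=1e-8, seed=0):
    """Fixed-point iteration for one unmixing direction on whitened Z (n, T)."""
    n, T = Z.shape
    w = np.random.default_rng(seed).standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = w @ Z                                           # (T,) projections
        w_new = (Z * g(y)).mean(axis=1) - g_prime(y).mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1) < tol:                   # converged up to sign
            return w_new
        w = w_new
    return w
```

On a whitened two-source mixture, the recovered projection $\mathbf{w}^T\mathbf{z}$ correlates almost perfectly (up to sign) with one of the true sources.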
The diagram below tracks the matrix dimensions throughout the pipeline. We assume the square case ($n$ sensors for $n$ sources), so every transform matrix is $(n \times n)$ and every data matrix is $(n \times T)$.
A few remarks on computational cost. Inside the fixed-point update, $\mathbf{w}_i^T\mathbf{z}$ produces a $(1 \times T)$ vector — the projection of all $T$ samples onto the current direction. The function $g(\cdot)$ is applied elementwise and is therefore $O(T)$ per iteration. The expectation $E\{\mathbf{z}\cdot g(\mathbf{w}_i^T\mathbf{z})\}$ is a matrix-vector product $(n \times T)\cdot(T \times 1)/T$, yielding the new $(n \times 1)$ direction.
The interactive demonstration at the end of the next section walks through the complete FastICA pipeline — from raw sources through centering, whitening, and fixed-point iteration to the recovered independent components.
Deflation: Extracting Multiple Components
After finding $\mathbf{w}_1$, how is $\mathbf{w}_2$ determined? Running the same fixed-point iteration from a different random initialization will, in general, converge to the same peak — both runs are drawn to the single strongest non-Gaussian projection. The solution is deflationary Gram-Schmidt orthogonalization: after each fixed-point update of $\mathbf{w}_2$, we subtract its projection onto $\mathbf{w}_1$ and renormalize:
$$\mathbf{w}_2 \leftarrow \mathbf{w}_2 - (\mathbf{w}_2^T\mathbf{w}_1)\,\mathbf{w}_1, \qquad \mathbf{w}_2 \leftarrow \frac{\mathbf{w}_2}{\|\mathbf{w}_2\|}$$This orthogonality constraint is valid in the whitened space because whitening makes the effective mixing matrix orthogonal: $\tilde{A}^T\tilde{A} = I$. The true unmixing directions in whitened space are therefore orthonormal.
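Putting the fixed-point update and the deflation step together gives the full extraction loop. A sketch on whitened data, with the orthogonalization applied inside every iteration (function name and convergence details are my own):

```python
import numpy as np

def fastica_deflation(Z, g, g_prime, max_iter=200, tol=1e-8, seed=0):
    """Extract all n unmixing directions from whitened Z (n, T) by deflation."""
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for i in range(n):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ Z
            w_new = (Z * g(y)).mean(axis=1) - g_prime(y).mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)   # Gram-Schmidt against found rows
            w_new /= np.linalg.norm(w_new)
            if abs(abs(w_new @ w) - 1) < tol:
                break
            w = w_new
        W[i] = w_new
    return W                                     # recovered sources: W @ Z
```

Because each new direction is projected off the previously found rows before normalization, the returned $W$ is orthonormal in the whitened space, as the theory requires.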
An important distinction follows. The directions $\mathbf{w}_1$ and $\mathbf{w}_2$ are orthogonal in the whitened space, but the full unmixing vectors $W_1$ and $W_2$ in the original observation space are generally not orthogonal. Unlike PCA, ICA components are statistically independent but not geometrically orthogonal.
The visualization below demonstrates the complete FastICA pipeline, including deflation. Click through all eight phases: from the original sources, through preprocessing, to finding both unmixing directions and recovering the independent components. In particular, observe how naive iteration for $\mathbf{w}_2$ (phase 6) converges to the same direction as $\mathbf{w}_1$, and how Gram-Schmidt orthogonalization (phase 7) forces it to the orthogonal complement.
The Non-Square Case
When the number of sensors $m$ exceeds the number of sources $n$, the only modification occurs in the whitening step. The covariance eigendecomposition retains only the top $n$ eigenvalues, producing a whitening matrix $V \in \mathbb{R}^{n \times m}$ that simultaneously reduces dimensionality and whitens. After this step, $\mathbf{z} = V\tilde{\mathbf{x}}$ is $(n \times T)$, and the FastICA core operates identically to the square case.
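The reduced whitening step differs from the square case only in which eigenpairs are kept. A sketch (the helper name is my own):

```python
import numpy as np

def whiten_reduce(X, n):
    """Center (m, T) data, keep the top-n eigenpairs, and whiten: Z is (n, T)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])   # eigenvalues in ascending order
    d, E = d[-n:], E[:, -n:]                         # top n eigenpairs
    V = np.diag(d ** -0.5) @ E.T                     # (n, m) reduce-and-whiten matrix
    return V @ Xc, V
```

After this step the data lives in an $n$-dimensional whitened space, and the square-case FastICA code applies unchanged.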
The comparison below highlights precisely where the two pipelines diverge. The amber-highlighted dimensions are those that change; everything in the whitened space (blue) remains $(n \times n)$.
Application: fMRI Analysis
Independent Component Analysis is a standard tool in functional neuroimaging, where it decomposes brain activity into spatially or temporally independent networks. A 4D fMRI volume (three spatial dimensions plus time) is first reshaped into a 2D matrix $X$ by flattening all spatial dimensions into rows. Each row represents the BOLD time series at one voxel.
PCA reduces the dimensionality from $m$ voxels to $n$ components (typically $n \approx 20$–$40$) while simultaneously whitening. FastICA then operates in the compact $(n \times n)$ whitened space. Finally, each row of the combined unmixing transform $WV \in \mathbb{R}^{n \times m}$ is reshaped back into a 3D brain map, revealing spatially localized networks. The resulting components include both neural networks (default mode, visual, motor) and structured noise (head motion, physiological pulsation), the latter identifiable by edge-localized spatial patterns and spiky temporal profiles.
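The reshaping bookkeeping is worth seeing explicitly. The toy grid below stands in for a real acquisition (actual fMRI grids and component counts differ), and the component matrix is random filler in place of a fitted unmixing transform:

```python
import numpy as np

nx, ny, nz, T = 8, 8, 4, 100                    # toy spatial grid and scan length
rng = np.random.default_rng(0)
volume = rng.standard_normal((nx, ny, nz, T))   # stand-in 4D fMRI volume

m = nx * ny * nz                                # number of voxels
X = volume.reshape(m, T)                        # (voxels, time): one row per voxel

# ... center, PCA-whiten to (n, T), run FastICA ...
n = 3
maps_flat = rng.standard_normal((n, m))         # filler for the fitted spatial maps
brain_maps = maps_flat.reshape(n, nx, ny, nz)   # one 3D map per component
```

The reshape is lossless bookkeeping: flattening and unflattening are exact inverses, so voxel identities are preserved end to end.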
Click through the pipeline steps below to trace the transformation from raw 4D data to interpreted brain networks.
Spatial ICA vs. Temporal ICA
A fundamental observation: ICA always enforces statistical independence on the rows of $S$ in the decomposition $X = AS$. The orientation of the data matrix $X$ therefore determines the nature of the independent components.
In spatial ICA, the data matrix is organized as $X \in \mathbb{R}^{T \times m}$ (time $\times$ voxels). The rows of $S$ are then $m$-dimensional spatial vectors — independent spatial maps. The columns of $A$ are the associated time courses, which carry no independence guarantee. This formulation exploits the spatial sparsity of brain networks: each network occupies a localized set of voxels, and most voxels participate in at most one network. Spatial ICA is the standard approach in tools such as FSL MELODIC and the Human Connectome Project.
In temporal ICA, the matrix is transposed: $X \in \mathbb{R}^{m \times T}$. The rows of $S$ are now $T$-dimensional temporal vectors — independent time courses. The columns of $A$ give spatial maps that may overlap. This formulation assumes that the temporal dynamics of different processes (task responses, slow drifts, cardiac pulsation) are statistically independent.
The cocktail party problem is temporal ICA: the source signals are temporal waveforms, and independence holds across time. A useful mnemonic: the dimension in the columns of $X$ is the dimension along which ICA enforces independence, because it becomes the row dimension of $S$.
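The shape arithmetic behind this mnemonic can be checked mechanically; the sizes below are illustrative, and the matrices are random stand-ins for fitted factors:

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, n = 120, 500, 4     # timepoints, voxels, components (illustrative)

# Spatial ICA: X (T x m) = A (T x n) @ S (n x m)
#   rows of S -> independent spatial maps of length m
#   cols of A -> associated time courses (no independence guarantee)
A_sp = rng.standard_normal((T, n))
S_sp = rng.standard_normal((n, m))
X_sp = A_sp @ S_sp

# Temporal ICA: transpose the data, X (m x T) = A (m x n) @ S (n x T)
#   rows of S -> independent time courses of length T
#   cols of A -> spatial maps, free to overlap
A_tm = rng.standard_normal((m, n))
S_tm = rng.standard_normal((n, T))
X_tm = A_tm @ S_tm
```

In both orientations the column dimension of $X$ equals the row length of $S$ — the axis along which independence is enforced.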