In mathematics, low-rank approximation is a minimization problem, in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure.
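In its basic unstructured form the problem can be stated as follows (a standard formulation; here D denotes the given data matrix, the Frobenius norm is one common choice of cost function, and r is the desired rank):

\[
\min_{\widehat{D}} \; \| D - \widehat{D} \|_F \quad \text{subject to} \quad \operatorname{rank}(\widehat{D}) \le r .
\]

By the Eckart–Young–Mirsky theorem, a solution is given by the truncated singular value decomposition of D, as the example below illustrates.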



Example: Rank-two approximation via SVD

A matrix A with row size m = 4 and column size n = 5:

\[
A = \begin{bmatrix}
1 & 0 & 0 & 0 & 2 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 4 & 0 & 0 & 0
\end{bmatrix}
\]

The SVD is given by A = U S V^T, with

\[
U = \begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0
\end{bmatrix},
\qquad
S = \begin{bmatrix}
4 & 0 & 0 & 0 & 0 \\
0 & 3 & 0 & 0 & 0 \\
0 & 0 & \sqrt{5} & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix},
\qquad
V^T = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
\sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\
0 & 0 & 0 & 1 & 0 \\
-\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2}
\end{bmatrix}
\]

The matrix A has rank r = 3. A rank-two approximation is obtained by zeroing out the smallest singular value, which produces

\[
A_2 = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 4 & 0 & 0 & 0
\end{bmatrix}
\]
We check that the Frobenius norm of the error ||A − A2||F equals the square root of the sum of squares of the zeroed-out singular values, which here reduces to σ3 = √5:

\[
\| A - A_2 \|_F
= \left\| \begin{bmatrix}
1 & 0 & 0 & 0 & 2 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix} \right\|_F
= \sqrt{1^2 + 2^2} = \sqrt{5} = \sigma_3 .
\]
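The same computation can be reproduced numerically. The following is a minimal sketch using NumPy (the matrix A and the truncation rank k = 2 match the example above; the helper name truncated_svd is our own):

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A in the Frobenius norm
    (Eckart-Young), obtained by keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 4, 0, 0, 0]], dtype=float)

A2 = truncated_svd(A, k=2)

# Singular values are 4, 3, sqrt(5), 0; the rank-2 error is sigma_3 = sqrt(5).
print(np.linalg.svd(A, compute_uv=False))  # [4.  3.  2.2360679...  0.]
print(np.linalg.norm(A - A2, 'fro'))       # 2.2360679... == sqrt(5)
```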

Dimensionality Reduction for Information Retrieval

The choice of r is made "by the seat of the pants": how many singular values (dimensions) to keep is decided more or less arbitrarily, or must be determined experimentally, since each document collection is different. This is an active, open area of research. Studies with thousands of documents suggest that values of r around 100 give improved retrieval performance, and the selection of r clearly affects the results. The sketch below shows how the truncated SVD is applied in this setting.
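The following is a minimal sketch of this idea in the style of latent semantic indexing: a small hypothetical term-document matrix is reduced to r dimensions, and documents and a query are compared in the reduced space (the matrix X, the query q, and r = 2 are made-up values for illustration):

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1],
              [0, 0, 1, 2]], dtype=float)

r = 2  # number of dimensions kept (chosen experimentally in practice)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Ur, sr, Vtr = U[:, :r], s[:r], Vt[:r, :]

# Documents represented in the r-dimensional latent space (one column each).
docs_r = np.diag(sr) @ Vtr

# Fold a query (term-frequency vector) into the same latent space.
q = np.array([1, 0, 1, 0], dtype=float)
q_r = np.diag(1.0 / sr) @ Ur.T @ q

# Rank documents by cosine similarity to the query.
sims = (docs_r.T @ q_r) / (np.linalg.norm(docs_r, axis=0) * np.linalg.norm(q_r))
print(np.argsort(-sims))  # document indices, most similar first
```

Varying r in this sketch makes the trade-off concrete: too small an r conflates unrelated documents, while too large an r retains the noise that the truncation was meant to suppress.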


