# Advances in Convex NMF: Linear Programming

Today I am going to look at a very important advance in one of my favorite Machine Learning algorithms, NMF (Non-Negative Matrix Factorization) [1].

NMF is a curious algorithm in that it allows us to do something very natural and seemingly straightforward, namely, to find clusters of data points; and yet, mathematically, it is known to be non-convex and NP-hard [2]. Practically, this means that NMF, and many other common clustering algorithms such as k-means, LDA, etc., can suffer from convergence problems and yield unstable and/or just plain unphysical results.

Several approaches exist to improve traditional NMF [1], such as sparse NMF a la Hoyer [3], SVD-based initialization (i.e. NNDSVD) [4], and the convex NMF (CNMF) approach suggested by Jordan et al. [5].

Recently, however, it has been shown how to re-formulate the traditional NMF problem, under fairly reasonable assumptions, as a true convex optimization problem, namely a linear programming (LP) problem [6].   Is this new? Is it revolutionary? Yes and no.

In fact, it has been known for some time that CNMF could be re-formulated as a purely convex optimization, called convex-hull non-negative matrix factorization (CHNMF). Indeed, both CNMF and CHNMF are implemented in the Python Matrix Factorization (PyMF) module [7]. The PyMF package, however, is not very fast, and in production I would recommend the more popular scikit-learn NMF, with an NNDSVD seed, over the PyMF CNMF or CHNMF implementations, unless you really, really need a (more) sparse NMF solution.
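That production recipe is only a few lines. Here is a minimal sketch on a random stand-in data matrix (note scikit-learn's rows-as-samples convention; the shapes and component count here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
Y = rng.rand(20, 10)                 # random nonnegative stand-in data matrix

# init='nndsvd' seeds the factorization with the SVD-based scheme of [4]
model = NMF(n_components=3, init='nndsvd', max_iter=500)
F = model.fit_transform(Y)           # (20 x 3) factor
W = model.components_                # (3 x 10) factor

print(F.shape, W.shape)
```

With the NNDSVD seed, repeated runs start from the same deterministic initialization, which side-steps much of the run-to-run instability discussed above.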

So what is so exciting about the Bittorf et al. paper? Their ‘CNMF-LP’ problem can be solved using a very fast, shared-memory, lock-free implementation of a stochastic gradient descent (SGD) solver, called Hottopixx. (This is the same kind of solver used in other very high performance ML toolkits, such as the fantastic GraphLab package.) IMHO, this is HUGE. It means we can stop wasting time with half-baked approaches like Hadoop+Mahout and actually solve very large scale, applied problems with the same performance we have come to expect from our SVMs.

I will break this post into 3-4 parts. In this first post, I review the basic re-formulation of convex NMF as a linear program.

#### Standard NMF problem

Let $Y$ be a nonnegative $(f\times n)$ data matrix, with the $f$ rows indexing the features and the $n$ columns indexing the examples. That is,

$Y=(\mathbf{y}_{1},\mathbf{y}_{2}\cdots\mathbf{y}_{n})$

where $\mathbf{y}_{k}$ is a column data vector. In standard NMF, we seek to factor $Y$ into $r$ components, such that

$Y\approx FW$, where

$F$ is an $(f\times r)$ feature matrix, and $W$ an $(r\times n)$ weight matrix, and both $F,W$ are non-negative matrices.
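To make the factorization concrete, here is a bare-bones NumPy sketch of the classic Lee-Seung multiplicative updates [1], on random data with a fixed iteration count (a toy, not a production solver):

```python
import numpy as np

rng = np.random.RandomState(42)
Y = rng.rand(8, 12)            # nonnegative (f x n) data matrix
f, n, r = 8, 12, 3             # r = number of components

F = rng.rand(f, r)             # (f x r) feature matrix
W = rng.rand(r, n)             # (r x n) weight matrix
eps = 1e-9                     # guard against division by zero

err0 = np.linalg.norm(Y - F @ W)
for _ in range(200):
    # multiplicative updates keep F and W nonnegative at every step
    W *= (F.T @ Y) / (F.T @ F @ W + eps)
    F *= (Y @ W.T) / (F @ W @ W.T + eps)
err = np.linalg.norm(Y - F @ W)

print(err < err0)              # reconstruction error decreases
```

Each update multiplies by a nonnegative ratio, so no projection step is ever needed to maintain $F,W\geq 0$; but this is exactly the scheme that can stall in poor local minima, which motivates everything that follows.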

#### Convex NMF (CNMF) Problem

In convex NMF (Jordan et al.), we require that the feature vectors $\mathbf{f}_{k}$ be convex linear combinations of the original data matrix, such that

$\mathbf{f}_{k}=c_{1,k}\mathbf{y}_{1}+c_{2,k}\mathbf{y}_{2}+\ldots+c_{n,k}\mathbf{y}_{n}=Y\mathbf{c}_{k}$

This lets us relate the feature vectors $F=(\mathbf{f}_{1},\mathbf{f}_{2}\cdots\mathbf{f}_{r})$ to $r$ centroids of the data, as is common in k-means methods. It also leads to very sparse solutions. Let us write

$F=YC$.

where $C$ is an $(n\times r)$ matrix, called the factorization localizing matrix. This lets us express the convex NMF approximation as

$Y\approx YCW$

although other papers define $C$ differently, and there we see the more cryptic form $Y\approx YC$, which is the starting point for grokking the LP formulation.
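The centroid interpretation above is easy to see in code. In this toy NumPy sketch (the data points and cluster labels are made up for illustration), taking each column of $C$ to be a normalized cluster-membership indicator makes the columns of $F=YC$ exactly the cluster means:

```python
import numpy as np

# Columns of Y are data points in the plane; `labels` are assumed cluster
# assignments (made up for this illustration)
Y = np.array([[1.0, 1.2, 5.0, 5.2],
              [0.0, 0.2, 3.0, 3.2]])
labels = np.array([0, 0, 1, 1])
n, r = Y.shape[1], 2

# Each column of C is a normalized indicator of one cluster, so its entries
# are nonnegative and sum to 1: a convex combination of the data points
C = np.zeros((n, r))
for k in range(r):
    members = labels == k
    C[members, k] = 1.0 / members.sum()

F = Y @ C          # column k of F is the centroid (mean) of cluster k
print(F)
```

Here $F$ comes out as the two cluster means, $(1.1, 0.1)$ and $(5.1, 3.1)$, and $C$ is sparse: each column touches only the members of its own cluster.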

#### Notational Confusion:  from CNMF to ‘CNMF-LP’

We make a short side-note on notation.

The Jordan et al. paper calls the data matrix $X$ and writes the factorization as

$X=FG^{\dagger}$,

and

$X=XWG^{\dagger}$.

The Bittorf et al. paper also calls the data matrix $X$, but adds the additional pre-factorization

$X=Y+\Delta$

The $X$ matrix is the true data matrix, and the $Y$ matrix is the part which satisfies the constraints that make the problem convex. They assume $\Delta$ is small in a formal sense in order to make their theoretical argument and form an error bound (not addressed here). They then write the factorization almost interchangeably as

$X=FW$,

and / or

$Y=FW$.

Also, note that they write the matrix $C$ on the left:

$X=CX$,

and / or

$Y=CY$.

(Switching between $C$ on the left or right makes it hard to compare the details in the two CNMF papers, so we only sketch out the ideas here, and will work through the explicit details in the next post when we code it up.)

#### Exact Convex NMF and Linear Programming

Certainly there are cases where NMF is well posed. That is, the data is sparse enough, and the centroids separable enough, that even though the Jordan et al. convex NMF per se is non-convex and probably NP-hard, any practical solver with a good initial starting value will rapidly converge to a stable and unique solution.

Bittorf et al. observe that, by definition, for any well-posed convex NMF, the matrix elements $C_{i,j}$ of the factorization localizing matrix form a polyhedral set $\Phi(Y)$, such that

$\Phi(Y)=\left\{C\geq 0:\;Y=CY,\;tr(C)=r,\;C_{j,j}\leq 1\;\forall j,\;C_{i,j}\leq C_{j,j}\;\forall i,j\right\}$

To solve an exact and well-posed convex NMF problem, one simply needs to find a feasible point of the set $\Phi(Y)$ such that $r$ of the diagonal elements $C_{j,j}$ equal 1. This is readily accomplished by solving the associated linear programming problem!
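To see that this really is a garden-variety LP, here is a toy sketch using scipy.optimize.linprog on a tiny separable matrix (following the convention of $C$ on the left, acting on the rows of $Y$; the data and the cost vector on the diagonal are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Tiny separable example: rows 0 and 1 are the "hot" rows; rows 2 and 3 are
# convex combinations of them, so the exact feasibility LP should recover
# diag(C) = [1, 1, 0, 0]
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.3, 0.7]])
f, n = Y.shape
r = 2                                  # number of hot topics sought
nvar = f * f                           # C is (f x f), flattened row-major

def idx(i, j):
    return i * f + j                   # position of C[i, j] in the flat vector

# Equality constraints: Y = C Y (f*n equations) plus tr(C) = r (1 equation)
A_eq, b_eq = [], []
for i in range(f):
    for k in range(n):
        row = np.zeros(nvar)
        for j in range(f):
            row[idx(i, j)] = Y[j, k]
        A_eq.append(row)
        b_eq.append(Y[i, k])
trace = np.zeros(nvar)
for j in range(f):
    trace[idx(j, j)] = 1.0
A_eq.append(trace)
b_eq.append(float(r))

# Inequality constraints: C[i, j] <= C[j, j] for all i != j
A_ub, b_ub = [], []
for i in range(f):
    for j in range(f):
        if i != j:
            row = np.zeros(nvar)
            row[idx(i, j)] = 1.0
            row[idx(j, j)] = -1.0
            A_ub.append(row)
            b_ub.append(0.0)

# Bounds: C >= 0 everywhere, diagonal entries <= 1
bounds = [(0.0, 1.0) if i == j else (0.0, None)
          for i in range(f) for j in range(f)]

# Arbitrary cost vector p acting on diag(C), just to break ties
c = np.zeros(nvar)
for j in range(f):
    c[idx(j, j)] = 0.1 * (j + 1)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
C = res.x.reshape(f, f)
print(np.round(np.diag(C), 3))         # the diagonal flags the hot rows
```

For this matrix the feasible set pins down $C$ completely: the diagonal flags rows 0 and 1 as the hot topics, and rows 2 and 3 of $C$ carry the convex weights $(0.5, 0.5)$ and $(0.3, 0.7)$.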

#### Real-World Convex NMF

To solve a real-world problem, we assume that the data matrix $X$ is nearly well posed, up to some small amount of noise. We then solve the associated LP problem, where we replace the exact constraint $Y=CY$ with the relaxed constraint $\parallel CX-X\parallel\leq\tau$. This yields

$\min\mathbf{p}^{\dagger}C_{diag}$

subject to the constraints

$C\geq 0$;   $\parallel CX-X\parallel\leq\tau$;   $tr(C)=r$;

$C_{j,j}\leq 1\;\forall j$;   and   $C_{i,j}\leq C_{j,j}\;\forall i,j$

where $\mathbf{p}$ is some arbitrary vector, and $\tau$ is an adjustable slack parameter.

We now form the feature matrix $F$ with yet a second LP,

$F=\arg\min_{Z\geq 0}\parallel X-ZW\parallel_{\infty,1}$

where $W$ is built from the rows of $X$ picked out by the diagonal of $C$.

This entire problem is now readily addressed and solved using off-the-shelf convex optimization libraries, such as Matlab's or the cvxopt package in Python. In our next post, we will look at some toy problems and how to solve them directly using these techniques.
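For the second step, here is a quick sketch of recovering $F$ once the hot rows are known. (I substitute scipy's nonnegative least squares, nnls, for the $(\infty,1)$-norm objective above, which is simpler to demo; the tiny matrix is made up.)

```python
import numpy as np
from scipy.optimize import nnls

# Made-up tiny example: rows 0 and 1 of X are the hot topics found by the
# first LP, so W is just those rows stacked as an (r x n) matrix
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.3, 0.7]])
W = X[[0, 1], :]

# Recover F row by row: row i of F solves min_{z >= 0} ||W^T z - X_i||_2
F = np.vstack([nnls(W.T, X[i])[0] for i in range(X.shape[0])])
print(np.round(F, 3))   # F @ W reconstructs X
```

Swapping in a least-squares fit changes the error metric but not the structure: given $W$, finding $F\geq 0$ is a collection of small independent convex problems, one per row of $X$.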

#### Lagrangian form of Convex NMF

I don’t know about you, but I always prefer to see the Lagrangian, or dual, form of a convex optimization problem, just to make sure I understand the definition and the constraints.  Additionally, eventually we will want a high performance solver that we can run on very large data sets, so it is also very practical to review these equations now.

Bittorf et al. provide an incremental gradient descent algorithm in which they dualize part of the constraints. The corresponding Lagrange multiplier problem is

$\mathcal{L}(C,\beta,\mathbf{\omega})=\mathbf{p}^{\dagger}C_{diag}+\beta(tr(C)-r)+\sum_{i=1}^{f}\omega_{i}\left(\parallel X_{i}-(CX)_{i}\parallel_{1}-\tau\right)$

with the constraint set

$\phi_{0}=\left\{C:C\geq 0,diag(C)\leq 1,C_{i,j}\leq C_{j,j}\forall i,j\right\}$

Because a 1-norm appears above, this requires special techniques such as subgradients. They suggest breaking the problem into two parts, a dual subgradient ascent approach that solves the problem incrementally by

1. alternating between minimizing the Lagrangian $\mathcal{L}$ over the constraint set $\phi_{0}$ to find a solution $C^{*}$, and then

2. plugging $C^{*}$ into a subgradient step with respect to the dual variables
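In symbols, my reading of one pass of this scheme (with a step size $\alpha_{t}$, and $\left[\cdot\right]_{+}$ denoting projection onto the nonnegative values) is

$C^{*}\in\arg\min_{C\in\phi_{0}}\mathcal{L}(C,\beta,\mathbf{\omega})$,   followed by   $\omega_{i}\leftarrow\left[\omega_{i}+\alpha_{t}\left(\parallel X_{i}-(C^{*}X)_{i}\parallel_{1}-\tau\right)\right]_{+}$

so each multiplier $\omega_{i}$ grows while its row's residual violates the slack $\tau$, and shrinks toward zero once the constraint is satisfied.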

We will go into more details in a future post.

#### Conclusion:  Clustering via LP

So there we have it: a formulation of convex NMF, a workhorse algorithm of applied machine learning, as a linear program (i.e. CNMF-LP), which can be solved using high performance, large scale convex optimization techniques. This has the potential to make clustering of very large scale data sets as easy as it is now to run very large scale linear SVMs.

It is, IMHO, a game changer.

#### References

1. Lee and Seung (1999), Learning the parts of objects by non-negative matrix factorization

2. Donoho & Stodden (2004), When does non-negative matrix factorization give a correct decomposition into parts?

3. Hoyer (2004), Non-negative Matrix Factorization with Sparseness Constraints

4. Boutsidis & Gallopoulos (2007), SVD based initialization: A head start for nonnegative matrix factorization

5. Ding, Li, & Jordan (2006), Convex and Semi-Nonnegative Matrix Factorizations

6. Bittorf, Recht, Ré, & Tropp (2013), Factoring nonnegative matrices with linear programs

7. See the Python Matrix Factorization (PyMF) module and references within