(this post has been published to check the links–sorry it is not ready yet)

In our previous post, we discussed how to distinguish a correlation from a causation, looking at a technique developed by economists called Granger Causality. Here, we take a more pedagogical look at the details, from the physics perspective, and through the work of another Nobel prize winner.

#### Stochastic Problems in Astronomy and Physics

There is a classic review of Stochastic Problems in Astronomy and Physics [1] by the late Subrahmanyan Chandrasekhar, Nobel Laureate (Physics, 1983) and former head of the department of physics when I was a student at the University of Chicago.  In particular, Chandrasekhar reviews 2 kinds of processes:

• random walk, aka the Wiener process
• mean reverting walk, aka the Ornstein-Uhlenbeck process

I am a firm believer that every scientist should study and know the great works of the Nobel Laureates, so I strongly encourage the reader to dig deep into [1] on their own–or to at least continue reading.

Recall that we wish to distinguish between causation and correlation.  In this context, we shall mean that we have an underlying model for our time series, and we wish to know if our model is sensible and predictive – it is an underlying cause – or if we have been fooled by randomness… we see patterns that are not really there.

#### mean reverting process

In this sense, we seek to distinguish between a correlation and a causation by distinguishing between a random walk and a mean-reverting process.

#### Tests for Mean Reversion

How might we determine if we have a random walk or a mean-reverting process?  Let us try to express our time series as

$x_{t+1}=\rho x_{t}+\epsilon$

Is this a good model?  Do data points in the past ($x_{t}$) say anything about data points in the future ($x_{t+1}$) ?

Estimate the regression coefficient $\rho$ of $x_{t+1}$ on $x_{t}$. If the coefficient $\rho$ is significantly below 1, the process is mean reverting; if it is close to 1, the process is a random walk.
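A minimal sketch of this test in Python, on a synthetic AR(1) series (the simulated data, the no-intercept regression, and the rough standard error are illustrative choices, not a prescription):

    import numpy as np

    # simulate a synthetic AR(1) series: mean reverting if |rho| < 1
    rng = np.random.default_rng(0)
    rho_true = 0.5
    x = np.zeros(1000)
    for t in range(999):
        x[t + 1] = rho_true * x[t] + rng.normal()

    # least-squares slope of x_{t+1} on x_t (no intercept)
    x_t, x_next = x[:-1], x[1:]
    rho_hat = np.dot(x_t, x_next) / np.dot(x_t, x_t)

    # rough standard error, to judge whether rho_hat is "significantly below 1"
    resid = x_next - rho_hat * x_t
    se = np.sqrt(resid.var(ddof=1) / np.dot(x_t, x_t))
    print(f"rho_hat = {rho_hat:.3f} +/- {se:.3f}")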

##### Phillips-Perron test

Search for a unit root in the equation linking $x_{t}$ to $x_{t+1}$.  A high p-value reinforces the hypothesis of a unit root, i.e., a random walk.
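A hedged sketch of such a check; statsmodels ships the closely related Augmented Dickey-Fuller test, used here as a stand-in for Phillips-Perron (a dedicated Phillips-Perron implementation is available in, e.g., the arch package):

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    # a pure random walk, so a unit root is present by construction
    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(size=1000))

    adf_stat, pvalue, *rest = adfuller(x)
    print(f"ADF statistic = {adf_stat:.3f}, p-value = {pvalue:.3f}")
    # high p-value: cannot reject the unit root (random walk)
    # low p-value: evidence for mean reversion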

#### References

1. Chandrasekhar (1943), Stochastic Problems in Physics and Astronomy, Reviews of Modern Physics 15, 1-89

Today I am going to look at a very important advance in one of my favorite Machine Learning algorithms,   NMF (Non-Negative Matrix Factorization)  [1].

NMF is a curious algorithm in that it allows us to do something very natural and seemingly straightforward, namely, to find clusters of data points, and yet, mathematically, it is known to be non-convex and NP-HARD [2].   Practically, this means that NMF–and many other common clustering algorithms, such as KMeans, LDA, etc.–can suffer from convergence problems and yield unstable and/or just plain unphysical results.

Several approaches exist to improve traditional NMF [1], such as sparse NMF a la Hoyer [3], using SVD initialization (i.e. NNDSVD) [4], and the convex NMF (CNMF) approach suggested by Jordan et al. [5]

Recently, however, it has been shown how to re-formulate the traditional NMF problem, under fairly reasonable assumptions, as a true convex optimization problem, namely a linear programming (LP) problem [6].   Is this new? Is it revolutionary? Yes and no.

In fact, it has been known for some time that CNMF could be re-formulated as purely convex optimization, called Convex-hull non-negative matrix factorization (CHNMF).  Indeed, both CNMF and CHNMF are implemented in the Python Matrix Factorization (PyMF)  module.  The PyMF package, however, is not very fast, and in production, I would  recommend using the more popular SciKit Learn NMF, with an NNDSVD seed, over the PyMF CNMF or CHNMF implementations, unless you really, really need a (more) sparse NMF solution.

So what is so exciting about the Bittorf et al. paper?  Their 'CNMF-LP' problem can be solved using a very fast, shared-memory, lock-free implementation of a stochastic gradient descent (SGD) solver, called Hottopixx.  (This is the same kind of solver applied in other very high performance ML toolkits, such as the fantastic GraphLab package.)  IMHO, this is HUGE.  It means we can stop wasting time with half-baked approaches like Hadoop+Mahout and actually solve very large scale, applied problems with the same performance we have come to expect from our SVMs.

I will break this post into 3-4 parts.  In this first post, I review the basic re-formulation of convex NMF as a linear program.

#### Standard NMF problem

Let $Y$ be a nonnegative   $(f\times n)$ data matrix with the $f$ rows indexing the features and the $n$ columns indexing the examples. That is,

$Y=(\mathbf{y}_{1},\mathbf{y}_{2}\cdots\mathbf{y}_{n})$

where $\mathbf{y}_{k}$ is a column, data vector. In standard NMF, we seek to factor $Y$ into $r$ components, such that

$Y\approx FW$, where

$F$ is an $(f\times r)$ feature matrix, and $W$ an $(r\times n)$ weight matrix, and both $F,W$ are non-negative matrices.
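As a concrete anchor, here is a minimal scikit-learn sketch of this factorization on a synthetic $Y$, using the NNDSVD initialization mentioned above (the matrix sizes and the toy data are arbitrary):

    import numpy as np
    from sklearn.decomposition import NMF

    # synthetic nonnegative (f x n) data with an (approximately) rank-r structure
    f, n, r = 100, 500, 5
    rng = np.random.default_rng(0)
    Y = rng.random((f, r)) @ rng.random((r, n))

    model = NMF(n_components=r, init='nndsvd', max_iter=500)
    F = model.fit_transform(Y)     # (f x r) feature matrix
    W = model.components_          # (r x n) weight matrix
    print("relative error:", np.linalg.norm(Y - F @ W) / np.linalg.norm(Y))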

#### Convex NMF (CNMF) Problem

In Convex NMF (Jordan et al.), we require that the feature vectors $\mathbf{f}_{k}$ be convex linear combinations of the original data matrix, such that

$\mathbf{f}_{k}=c_{1,k}\mathbf{y}_{1}+c_{2,k}\mathbf{y}_{2}+\ldots+c_{n,k}\mathbf{y}_{n}=Y\mathbf{c}_{k}$

This lets us relate the feature vectors $F=(\mathbf{f}_{1},\mathbf{f}_{2}\cdots\mathbf{f}_{r})$ to $r$ centroids of the data, as is common in Kmeans methods.  It also leads to very sparse solutions.  Let us write

$F=YC$.

where $C$ is an $(n\times r)$ matrix, called the factorization localizing matrix.  This lets us express the convex NMF approximation in the more cryptic form

$Y\approx YC$

which is the starting point for grokking the LP formulation.

#### Notational Confusion:  from CNMF to ‘CNMF-LP’

We make a short side-note on notation.

The Jordan et al. paper calls the data matrix $X$ and writes the factorization as

$X=FG^{\dagger}$,

and

$X=XWG^{\dagger}$.

The Bittorf et al. paper also writes the data matrix as $X$, but adds the additional pre-factorization

$X=Y+\Delta$

The $X$ matrix is the true data matrix, and the $Y$ matrix is the part that satisfies the constraints needed to make the problem convex. They assume $\Delta$ is small in a formal sense in order to make their theoretical argument and form an error bound (not addressed here).  They then write the factorization almost interchangeably as

$X=FW$,

and / or

$Y=FW$.

Also, note that they write the matrix $C$ on the left:

$X=CX$,

and / or

$Y=CY$.

(Switching between $C$ on the left or right makes it hard to compare the details in the 2 CNMF papers, so we only sketch out the ideas here and will work through the explicit details in the next post when we code it up.)

#### Exact Convex NMF and Linear Programming

Certainly there are cases where NMF is well posed.  That is, the data is sparse enough and the centroids separable enough that, even though the Jordan et al. convex NMF per se is non-convex and probably NP-HARD, any practical solver with a good initial starting value will rapidly converge to a stable and unique solution.

Bittorf et al. observe that, by definition, for any well-posed convex NMF, the matrix elements of the factorization localizing matrix $C_{i,j}$ form a polyhedral set $\Phi(Y)$, such that

$\Phi(Y)=\left\{C\geq 0:Y=CY,tr(C)=r,C_{j,j}\leq 1\forall j,C_{i,j}\leq C_{j,j}\forall i,j\right\}$

To solve an exact and well-posed convex NMF problem, one simply needs to find a feasible solution of the set $\Phi(Y)$ in which $r$ of the diagonal elements $C_{j,j}=1$.  This is readily accomplished by solving the associated linear programming problem!

#### Real-World Convex NMF

To solve a real-world problem, we assume that the data matrix $X$ is nearly well posed, up to some small amount of noise.  We then solve the associated LP problem, where we replace the exact constraint $Y=CY$ with the relaxed constraint $\parallel CX-X\parallel\leq\tau$.  This yields

$\min\mathbf{p}^{\dagger}C_{diag}$

subject to the constraints

$C\geq 0$;   $\parallel CX-X\parallel\leq\tau$;  $tr(C)=r$; and

$C_{i,i}\leq 1$;  $C_{i,j}\leq C_{j,j}\quad\forall i,j$

where $\mathbf{p}$ is some arbitrary vector, and $\tau$ is an adjustable, slack parameter.
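To make the LP concrete, here is a hedged sketch in Python using cvxpy (chosen purely for readability; the cvxopt package mentioned below would also work), on a small synthetic matrix constructed to be separable.  The elementwise 1-norm below stands in for the paper's choice of norm, and we follow the Bittorf row convention $X=CX$, so the rows of $X$ are the examples:

    import numpy as np
    import cvxpy as cp

    # synthetic separable data: the first r rows are "anchors" and the remaining
    # rows are convex combinations of them, so X = CX is exactly feasible
    n, f, r, tau = 30, 10, 3, 1e-3
    rng = np.random.default_rng(0)
    anchors = rng.random((r, f))
    mix = rng.dirichlet(np.ones(r), size=n - r)
    X = np.vstack([anchors, mix @ anchors])

    p = rng.random(n) + 0.1                 # "some arbitrary vector" with positive entries
    C = cp.Variable((n, n), nonneg=True)
    constraints = [cp.sum(cp.abs(C @ X - X)) <= tau,       # relaxed  Y = CY
                   cp.trace(C) == r,
                   cp.diag(C) <= 1]
    constraints += [C[:, j] <= C[j, j] for j in range(n)]  # C_ij <= C_jj

    prob = cp.Problem(cp.Minimize(p @ cp.diag(C)), constraints)
    prob.solve()
    print("rows with diagonal entries near 1:", np.where(np.diag(C.value) > 0.9)[0])

For this toy construction, the diagonal of $C$ should light up on the $r$ anchor rows, which is exactly what the second LP below then uses.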

We now form the feature matrix $F$ (the $Z$ below plays the role of $F$), given the weight matrix $W$ read off from the rows of $X$ selected by the diagonal of $C$, with yet a second LP program,

$\arg\min_{Z\geq 0}\parallel X-ZW\parallel_{\infty,1}$.

This entire problem is now readily addressed and solved using off-the-shelf convex-optimization libraries, such as Matlab and/or the cvxopt package in Python.  In our next post, we will look at some toy problems and how to solve them directly using these techniques.

#### Lagrangian form of Convex NMF

I don’t know about you, but I always prefer to see the Lagrangian, or dual, form of a convex optimization problem, just to make sure I understand the definition and the constraints.  Additionally, eventually we will want a high performance solver that we can run on very large data sets, so it is also very practical to review these equations now.

Bittorf et al. provide an Incremental Gradient Descent algorithm in which they dualize part of the constraints.  The corresponding Lagrange multiplier problem is

$\mathcal{L}(C,\beta,\mathbf{\omega})=\mathbf{p}^{\dagger}C_{diag}+\beta(tr(C)-r)+\sum_{i=1}^{f}\omega_{i}\left(\parallel X_{i}-(CX)_{i}\parallel_{1}-\tau\right)$

with the constraint set

$\phi_{0}=\left\{C:C\geq 0,diag(C)\leq 1,C_{i,j}\leq C_{j,j}\forall i,j\right\}$

Because of the 1-norm that appears above, this requires special techniques, such as subgradients.  They suggest breaking the problem into two parts, a Dual Subgradient Ascent approach that solves the problem incrementally by

1. alternating between minimizing the Lagrangian $\mathcal{L}$ over the constraint set $\phi_{0}$ to find a solution $C^{*}$, and then

2. plugging $C^{*}$ into a subgradient step with respect to the dual variables

We will go into more details in a future post.

#### Conclusion:  Clustering via LP

So there we have it.  A formulation of convex NMF–a workhorse algorithm of applied machine learning–as a Linear Program (i.e. CNMF-LP), which can be solved using high performance, large scale convex optimization techniques.  This has the potential to make clustering of very large scale data sets as easy as it is now to run very large scale linear SVMs.

It is, IMHO, a game changer.

#### References

1. Lee and Seung (1999), Learning the parts of objects by non-negative matrix factorization

2. Donoho & Stodden (2004), When does non-negative matrix factorization give a correct decomposition into parts?

3. Hoyer (2004), Non-negative Matrix Factorization with Sparseness Constraints

4. Boutsidis & Gallopoulos (2007), SVD based initialization: A head start for nonnegative matrix factorization

5. Ding, Li, & Jordan (2006), Convex and Semi-Nonnegative Matrix Factorizations

6. Bittorf, Recht, Re, & Troppy (2013), Factoring nonnegative matrices with linear programs

7. see: Python Matrix Factorization (PyMF) module  and references within

#### cloud-crawler-0.1

For the past few weeks, I have taken some time off from pure math to work on an open source platform for crawling the web.  I am happy to announce the cloud-crawler version 0.1 open source project.

The cloud-crawler is a distributed ruby dsl for crawling the web using amazon ec2 micro-instances. The goal is to create an end-to-end framework for crawling the web, eventually including the ability to crawl even dynamic javascript, and from a pool of spot instances.

This initial version is built using Qless, a redis-based queue, a redis-based bloomfilter, and a re-implementation and extension of the anemone dsl.  It also includes chef recipes for spooling up nodes on the amazon cloud, and a Sinatra app, cloud-monitor, to monitor the queue.  The basic layout is shown in the Slideshare presentation.

Here, we show an example crawl, which finds links to all Level 1 Certs on the Crossfit main page:

    urls = ["http://www.crossfit.com"]
    opts = {}  # crawler options (left empty here)

    CloudCrawler::crawl(urls, opts) do |cc|
      # follow only links whose anchor text matches "Level 1"
      cc.focus_crawl do |page|
        page.links.select { |lnk| text_for(lnk) =~ /Level 1/i }
      end

      # print the url of every page crawled
      cc.on_every_page do |page|
        puts page.url.to_s
      end
    end

This is a very early pre-release, and we are actively looking for contributors interested in getting involved.  (Also, the web documentation is still in progress.)

Rather than go into details, here we show how to install the crawler and get a test crawl up and running.

#### To install on a local machine

(i.e. Mac or Linux; Ruby does not play well with Windows)

I. Dependencies

Ruby 1.9.3 with Bundler   http://gembundler.com

it is suggested to use RVM to install ruby  https://rvm.io

and to use git to obtain the source  http://git-scm.com

II.  Installation Steps

II.0  install ruby 1.9.3, and redis 2.6.x

II.1  install bundler

  gem install bundler

II.2 clone the git source

git clone git://github.com/CalculatedContent/cloud-crawler.git

II.3  install the required gems and sources

change directories to where the Gemfile.lock file is located

  cd cloud_crawler/cloud-crawler

install the gems and required source and build the gem

   bundle install

to create a complete sandbox, you can say

  bundle install --path vendor/bundle

this will install the cloud_crawler in a local bundle gem repository

we use bundler locally because we also use this on amazon aws / ec2 machines

III. Testing the Install

III.1  start the redis server

  redis-server &

III.2  run rake

  bundle exec rake

III.3  run a test crawl

  bundle exec ./test/test_crawl.rb

IV.  Try a real crawl using the DSL

flush the redis database

  redis-cli flushdb

load the first job into redis

  bundle exec ./examples/crossfit_crawl.rb

run the worker job

  bundle exec ./bin/run_worker.rb -n crossfit-crawl

V.  To view the queue monitor in a browser

   bundle exec qless-web

this should launch a tab in the web browser.  If this fails, the monitor may still work, and may be visible in your browser at

   localhost:5678

and that’s it — you have a dsl for crawling running locally.

VI.   To run the crawler on AWS and EC2, you will need to set up an amazon account,  install chef-solo, and create some security groups and s3 buckets.

Stay tuned for extended documentation and  examples, including  seeing the crawler in action on EC2.  Feel free to email to ask questions or to express interest in getting involved.

#### Introduction:

In a previous post we introduced the problem of detecting Gravity Waves using Machine Learning and suggested using techniques like Minimum Path Basis Pursuit.  Here, we drill down into the theoretical justifications of the general approach–called Compressed Sensing–and try to at least understand what the current theory actually tells us.

##### Linear Algebra: Solving Ax=b with an Over-Complete Basis

In traditional linear algebra, we solve the problem

$\mathbf{Ax}=\mathbf{b}$

using the Penrose Pseudo-Inverse

$\mathbf{x}=(\mathbf{A}^{t}\mathbf{A})^{-1}\mathbf{A}^{t}\mathbf{b}$

where $\mathbf{x}$ is a vector of length $n$, $\mathbf{b}$ is a vector of length $m$, and $\mathbf{A}$ is an $m\times n$ matrix

A standard and very old machine learning approach is to add a small constant $\lambda$ to the diagonal, called  Tikhonov Regularization or Ridge Regression

$\mathbf{x}(\lambda)=(\mathbf{A}^{t}\mathbf{A}+\lambda\mathbf{I})^{-1}\mathbf{A}^{t}\mathbf{b}$

This is equivalent to the regularized least-squares problem

$\min\Vert\mathbf{Ax}-\mathbf{b}\Vert^{2}_{2}+\lambda\Vert\mathbf{x}\Vert^{2}_{2}$
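A minimal numpy sketch of both solutions on a synthetic over-determined system (the sizes and $\lambda$ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, lam = 50, 10, 0.1
    A = rng.normal(size=(m, n))
    b = A @ rng.normal(size=n) + 0.01 * rng.normal(size=m)

    x_ls    = np.linalg.solve(A.T @ A, A.T @ b)                    # pseudo-inverse solution
    x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)  # Tikhonov / ridge
    print(np.linalg.norm(x_ls - x_ridge))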

##### Basis Pursuit — or L1 Regularization

It has been known in geophysics and astrophysics since the 1970s that frequently we can do better if we minimize the $l_{1}$-norm of $\mathbf{x}$ instead of the $l_{2}$-norm

$\min\Vert\mathbf{Ax}-\mathbf{b}\Vert_{2}+\lambda\Vert\mathbf{x}\Vert_{1}$

where

$\Vert\mathbf{x}\Vert_{1}=\sum_{i=1}^{N}|x_{i}|$

Note: sometimes this is written as a constrained optimization problem

$\min\Vert\mathbf{x}\Vert_{1}$ subject to $\mathbf{Ax}=\mathbf{b}$

and this can be relaxed to

$\min\Vert\mathbf{x}\Vert_{1}$ subject to $\Vert\mathbf{Ax}-\mathbf{b}\Vert_{2}\le\epsilon$

The constrained solution, and related $l_{1}$-regularization problems, are readily solved using numerical techniques, and are available, in some form, in many R and MatLab packages and open source tools.
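For example, here is a hedged sketch with scikit-learn's Lasso, which solves a rescaled version of this objective, $\frac{1}{2m}\Vert\mathbf{Ax}-\mathbf{b}\Vert_{2}^{2}+\alpha\Vert\mathbf{x}\Vert_{1}$ (the synthetic sparse signal and the value of $\alpha$ are illustrative choices):

    import numpy as np
    from sklearn.linear_model import Lasso

    # over-complete system: more columns than rows, with a truly sparse x
    rng = np.random.default_rng(0)
    m, n = 100, 300
    A = rng.normal(size=(m, n))
    x_true = np.zeros(n)
    x_true[rng.choice(n, 5, replace=False)] = rng.normal(size=5)
    b = A @ x_true + 0.01 * rng.normal(size=m)

    fit = Lasso(alpha=0.01, max_iter=10000).fit(A, b)
    print("true support:     ", np.flatnonzero(x_true))
    print("recovered support:", np.flatnonzero(fit.coef_))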

##### Liblinear: (L1 Regularized Classification)

The current version of Liblinear (1.9) includes $l_{1}$-regularization for classification (along with $l_{2}$-regularized SVM regression methods):

./liblinear-1.91/train
Usage: train [options] training_set_file [model_file]
options:
-s type : set type of solver (default 1)
...
5 -- L1-regularized L2-loss support vector classification
...

To solve a regression, we can use:

##### Vowpal Wabbit (VW):

VW actually solves the more general problem (i.e. an Elastic Net)

$\min\mathcal{L}+\lambda_{1}\Vert\mathbf{x}\Vert_{1}+\lambda_{2}\Vert\mathbf{x}\Vert_{2}$

where $\mathcal{L}$ is a general Loss function for either classification or regression.  We set the regularization parameters via the command line Update Rule Options

--l1 arg (=1)                    l_1 lambda
--l2 arg (=1)                    l_2 lambda



There are even hardware implementations of compressed sensing: see the awesome Nuit Blanche blog for more details.

(Apparently) when solving a regression problem, we call the $l_{1}$-regularization problem Basis Pursuit (BP) [7]

BP is a principle for decomposing a signal into an optimal superposition of dictionary elements (an over-complete basis), where optimal means having the smallest $l_{1}$-norm of coefficients among all such decompositions.

Unfortunately, neither of these off-the-shelf programs will work to find our Gravity Waves; for that, we need to add path constraints.

Our signal is very weak (SNR << 1) and we are trying to reconstruct a continuous, differentiable function of a known form.  We will be pushing the computational and theoretical limits of these methods.  To get there (and because it is fun), we highlight a little theoretical work on Compressed Sensing.

#### Compressed Sensing

In the past decade, work by Dave Donoho and Emmanuel Candes at Stanford and Terence Tao at UCLA has formalized and developed the theoretical and practical ideas of Basis Pursuit under the general name compressed sensing.

A common use of compressed sensing is for image reconstruction

#### Beyond the Sampling Theorem

To reconstruct a signal $\mathbf{x}$, we used to think we were bandwidth limited; this is the essence of the Nyquist–Shannon Sampling Theorem

Tao, a former child prodigy who won the Fields Medal in 2006, took some time off from pure math to show us that we are, in fact, limited by the signal structure, not the bandwidth.

We call $\mathbf{x}$ S-sparse when at most $S$ of the coefficients $x_{i}$ can be non-zero ($S\ll n$).   For example, many images are S-sparse in a wavelet basis; this is the basis of the newer JPEG2000 algorithm.

This allows us to reconstruct a signal with as few data as possible–if we can guess the right basis.

##### Ideal Sparse Reconstruction

In an ideal world, we would like to find the most sparse solution, which means we solve the $l_{0}$ regularization problem

$\min\Vert\mathbf{x}\Vert_{0}$  subject to $\Vert\mathbf{Ax}-\mathbf{b}\Vert_{2}\le\epsilon$

where the $l_{0}$ norm is just the number of non-zero components.

This is very difficult to achieve numerically.  It turns out that we can approximate this problem with the $l_{1}$ regularization problem, which can be solved as a convex optimization problem.   And we even have some ideas why. But why ask why?

IMHO, applied machine learning work requires applying off-the-shelf tools, even in seemingly impossible situations, and yet avoiding wasting effort on hopeless methods.  The best applied scientists understand the limits of both the methods they use and the current underlying theory.

#### Theoretical Justifications for Compressed Sensing

An early result showed that we can, in fact, do better than the Sampling Theorem just by using random projections:

##### Theorem (Candes-Romberg, 2004):

Suppose we choose m points $[x_{i}|i=1\ldots m]$ randomly out of n.  Then, with high probability, every S-sparse signal $f(\mathbf{x}_{n})$  can be reconstructed from $[f(x_{i})|i=1\ldots m]$, so long as $m>CS\log n$ for some absolute constant C.

This result lends hope that this can be formalized.  But what kind of conditions do we need on $\mathbf{A}$ to at least have a hope of reconstructing $\mathbf{x}$?  A very simple result is

##### Proposition:

Suppose that any $2S$ columns of the matrix $\mathbf{A}$ are linearly independent. (This is a reasonable assumption once $m\geq 2S$.) Then any S-sparse signal $\mathbf{x}$ can be reconstructed uniquely from $\mathbf{Ax}$.

##### Proof:

Suppose not; then there are two distinct S-sparse signals $\mathbf{x},\mathbf{x'}$ with $\mathbf{Ax}=\mathbf{Ax'}$.  This implies $\mathbf{A}(\mathbf{x}-\mathbf{x'})=0$.  But $\mathbf{x}-\mathbf{x'}\neq 0$ is 2S-sparse, so there is a linear dependence between 2S columns of $\mathbf{A}$.  A contradiction.

So we might expect, on a good day, that every 2S columns of $\mathbf{A}$ are linearly independent.  In fact, we can say a little more.

##### Conditions on A and the Restricted Isometry Property (RIP)

There are now several theoretical results ensuring that Basis Pursuit works whenever the measurement matrix $\mathbf{A}$ is sufﬁciently “incoherent”, which roughly means that its matrix entries $A_{i,j}$ are uniform in magnitude.

In practice, numerical experiments suggest that most S-sparse signals can be recovered exactly when $m\gtrsim4S$

(Figure: spectrum of a Gaussian random matrix)

This means we can strengthen the proposition above by requiring, for example, that every 4S columns of $\mathbf{A}$ are approximately orthonormal.  It turns out that many well known Random Matrices satisfy this property, such as Gaussian Random Matrices.
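A quick numerical sketch of this claim: draw a Gaussian random matrix with entries of variance $1/m$ and check that random groups of 4S columns have singular values clustered near 1 (all sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, S = 400, 2000, 10
    A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

    for trial in range(5):
        cols = rng.choice(n, size=4 * S, replace=False)
        sv = np.linalg.svd(A[:, cols], compute_uv=False)
        print(f"trial {trial}: singular values in [{sv.min():.2f}, {sv.max():.2f}]")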

##### Technically, the RIP is stated as:

Suppose that there exists a constant $\delta_{s}$  such that, for every m × s submatrix $\mathbf{A_s}$ of $\mathbf{A}$ and for every vector $\mathbf{x}$ we have

$(1-\delta_{s})\Vert\mathbf{x}\Vert^{2}_{2}\le\Vert\mathbf{A_{s}x}\Vert^{2}_{2}\le(1+\delta_{s})\Vert\mathbf{x}\Vert^{2}_{2}$

whenever $\Vert\mathbf{x}\Vert_{0}\le s$

If the matrix $\mathbf{A}$ satisfies some form of RIP, we can get a theoretical upper bound on how well the $l_{1}$ BP will perform.    A basic result shows that true $l_{2}$ (i.e. RMS) error in the reconstructed signal is bounded by the $l_{1}$ (i.e. absolute) sample error:

##### Theorem (Candes-Romberg-Tao):

$\Vert\mathbf{x^{*}}-\mathbf{x}\Vert_{2}\le C\dfrac{\Vert\mathbf{x^{*}}-\mathbf{x_{s}}\Vert_{1}}{\sqrt{s}}+C\epsilon$

The tightest bound is due to Foucart, who shows that it holds whenever the RIP constant satisfies $\delta_{2s}<0.4652$.

The astute reader will recognize that the RIP asks that the signal be sparse in an orthonormal basis and that the data matrix be uniformly incoherent.

Is this good enough?  Have you ever seen a uniformly incoherent distribution in a real world data set?

##### D-RIP

Many signals may, in fact, be sparse only in a non-orthogonal, overcomplete basis.   Recently, it has been shown how to define a D-RIP property [10,11] that applies to this more general case, even when the matrix $\mathbf{A}$ is not as incoherent as we might expect or desire.

In some earlier posts, we introduced the Regularization (or Projection) Operator as a way to get at What a Kernel is (really). Recall that to successfully apply a Kernel, the associated, infinite order expansion should converge rapidly.

Likewise, here we define a similar Operator, $\mathcal{D}$, that projects our sample data into the Hilbert Space, or Signal Domain $D$.

It is in this space that the signal is expanded in the over-complete basis

$\mathcal{D}\mathbf{x}=\sum_{d}\phi_{d}\mathbf{x}$

We might also refer to $D$ as a frame, although I am skipping this technical detail for now.  Essentially, this means that any infinite order expansion in $\phi_{d}$ converges rapidly enough that we can use it as a basis in our infinite dimensional space.

We can now adjust the RIP, extending the notion from considering every m × s submatrix to considering $\Sigma_{s}$, the union of all subspaces spanned by subsets of $D$ of size s.

##### Technically, the D-RIP is stated as:

Suppose that there exists a constant $\delta_{s}$  such that

$(1-\delta_{s})\Vert\mathbf{x}\Vert^{2}_{2}\le\Vert\mathbf{Ax}\Vert^{2}_{2}\le(1+\delta_{s})\Vert\mathbf{x}\Vert^{2}_{2}$

whenever $\mathbf{x}\in\Sigma_{s}$

Similarly to the RIP, Gaussian, subgaussian, and Bernoulli matrices satisfy the D-RIP with m ≈ s log(d/s). Matrices with a fast multiply (DFT with random signs) also satisfy the D-RIP with m approximately of this order.

There is also a known bound.

##### D-RIP Bound for Basis Pursuit (Needell, Stanford 2010):

Let D be an arbitrary tight frame and let A be a measurement matrix satisfying D-RIP with δ2s < 0.08. Then the solution $f^{*}$ to $l_{1}$-analysis satisfies

$\Vert\mathcal{D}f^{*}-\mathcal{D}f\Vert_{2}\le C_{0}\dfrac{\Vert\mathcal{D}f^{*}-\mathcal{D}f_{s}\Vert_{1}}{\sqrt{s}}+C_{1}\epsilon$

The result says that $l_{1}$-regularization/analysis is very accurate when $\mathcal{D}f$ converges quickly (has rapidly decaying coefficients)

This is the case in applications using Wavelets, Curvelets, and Chirplets.

#### Warning: Don’t just use Random Features

This does not say that we can use any random or infinite size basis.  In particular, if we combine 2 over-complete basis sets, we might overtrain or, worse, fit a spurious, misleading signal.  Been there, seen that.  This is way too common in applied work, and is also the danger we face in trying to detect a weak signal in a sea of noise.  And this is why we will add additional constraints.

#### Weak Signal Detection: A Preview

Recall that we wish to detect a very weak signal of the general form

$f(t,t_{c})=A(t-t_{c})e^{i\phi(t-t_{c})}$

in a sea of noise, where $A(t)$ is a slowly varying function, and $\phi(t)$ is some oscillatory function of unknown frequency, phase, and envelope.  We need to detect the presence of $f(t,t_{c})$ and the unknown critical time $t_{c}$.  We might be tempted to think we can just create an overcomplete basis of functions $g_{\lambda}(t,\tau)$ for a variety of critical times $\tau\in[t_{c}|c=1\ldots N_{c}]$ and solve the Basis Pursuit problem.

##### This will most likely fail for 2 reasons
1. each critical time $\tau=t_{c}$ defines a unique $\mathcal{D}$-frame, and we probably should not combine all $N_{c}$ basis sets / frames $g_{\lambda}(t,\tau)$ in the same optimization
2. even for a single pass of BP,  there is just too much incoherent noise and/or perhaps even some other, transiently stable, detectable, but spurious signal.

Given these conditions, one might expect a clever, adaptive, matching-pursuit-like $l_{1}$ approach to work better than vanilla constrained optimization (I'll explain the difference in a bit).  Recent research [12] indicates, however, that while we might believe

Folk Theorem:  The estimation error one can get by using a clever adaptive sensing scheme is far better than what is achievable by a nonadaptive scheme.

that, in fact,

Surprise:  The folk theorem is wrong in general. No matter how clever the adaptive sensing mechanism, no matter how intractable the estimation procedure, in general [one can not do better than] a  random projection followed by $l_{1}$ minimization.

with the

Caveat:  “This ‘negative’ result should not conceal the fact that adaptivity may help tremendously if the SNR [Signal-to-Noise Ratio] is sufficiently large.”

In our next post we will look at some real world examples of Chirplet Minimum Path Basis Pursuit using various path (and maybe even structural) constraints…stay tuned!

#### References

[1] 1995 Mann & Haykin, The Chirplet Transform: Physical Considerations

[2] 2004 Emmanuel Candes &  Terence Tao,  Near Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?

[3] 2006 Candes, Charlton, & Helgason, Detecting Highly Oscillatory Signals by Chirplet Path Pursuit

[4] 2008 Greenberg & Gosse, Chirplet approximation of band-limited, real signals made easy

[6] 2009 Tao, Compressed Sensing

[7] Chen & Donoho, Basis Pursuit

[8]  2004  Candes, Chirplets: Multiscale Recovery and Detection of Chirps

[12] 2012 Arias-Castro, Candes, & Davenport,  On the Fundamental Limits of Adaptive Sensing

Recently, 7 Italian Scientists have been sentenced to prison for manslaughter for failing to predict an Earthquake in 2009!

So how in the world would a Machine Learning Scientist predict an Earthquake?  You might think: just collect all the data you can find, stuff it into Hadoop, and run some supervised machine learning algorithms.  Eh…not so much!

What we will do is apply some models from Astronomy and Theoretical Physics to model the process, and see if the techniques developed for detecting weak patterns in Astronomy can be applied to the problem of detecting Earthquakes and other crashes in nature.

Moreover, we will, eventually, see how  to convert a highly non-convex optimization problem into a convex  (LP) problem.  It is going to take me some time to get there…first some motivation

#### Scale Invariance in Nature:

When we look out at natural phenomena, we see a system of Fractals (and seemingly in Equilibrium).   Indeed, many natural phenomena resemble Fractals, such as Trees, Mountains, and Coastlines.

A famous math problem is to compute the length of the Coastline of Britain.  The problem is that the shorter your ruler, the longer the coastline seems.  If we use a 200km ruler, we measure 2400 km.  If we use a 50km one, we get 3400 km.  And so on

When this happens, we call this, mathematically, Scale Invariance.

Today we consider the problem of modeling periodic behavior in a noisy time series.  This is an old problem from Astrophysics that has taken center stage again because it is used to detect planets orbiting distant stars.  As usual, it is also important in modeling many other real world phenomena.  We will prep the discussion with some simple examples and a brief review of traditional Fourier Analysis, and then look at the classic solution called the Lomb-Scargle Periodogram.  We will look at an example using the Python AstroML toolkit [1].  Later (next blog) we will then look in detail at some modern machine learning style solutions such as Basis Pursuit and Compressed Sensing.  Finally, we use this approach to look at a hard problem–how to detect the onset of Discrete Scale Invariance and Log-Periodic behavior in a noisy time series.  Stay tuned.

#### Simple Analysis

We first consider a simple problem: Do Werewolves only come out during a Full Moon?

Suppose that every so often we observe some werewolves and we note this down in our diary.  Let us plot this data after a year of recording.

We suspect that the Werewolves only come out during a full moon.  How much confidence do we have in this?  Our goal is to determine if the distribution of Werewolf sightings is periodic, or if it is random.

As pointed out by my friend Mike Plotz, totally random sightings would most likely be governed by a Poisson process, meaning that the inter-arrival times $t_{i}-t_{j}$ would be exponentially distributed.   So we should plot the distribution of arrival times and check.

Recent research in biophysics suggests, though, that random Werewolves would actually exhibit scale invariance [10]; we will look at this in our next post.  Here, we look at a statistic that is also suitable for arrival times…

#### Rayleigh Power Statistic

The first thing we can do is check a simple, unbinned statistic for a given frequency, called the Rayleigh Power Statistic $S(\omega)$

$S(\omega)=\dfrac{1}{N}\left(\left[\sum_{j=1}^{N}\cos(\omega t_{j})\right]^{2}+\left[\sum_{j=1}^{N}\sin(\omega t_{j})\right]^{2}\right)$

By unbinned, we mean that we simply calculate the statistic for all data points without computing any histograms or using any kind of sampling.

What does this statistic measure?  Suppose our times $t_{j}$ are uniformly distributed (i.e. werewolves come out at random).  Then $S(\omega)\approx 1$, and in fact $S$ is exponentially distributed, $P(S>k)=e^{-k}$.  We can therefore compute a confidence of periodicity as

$1-P(S>S(\omega))=1-e^{-S(\omega)}$

For this data set, with $N=885$ sightings and a 28-day period, $S=\dfrac{8167}{885}\approx 9.2$, so our confidence is

$1-e^{-9.2}\approx 1-10^{-4}\approx 1$

So we have a very high confidence that werewolves do come out every 28 days.
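Here is a minimal numpy sketch of the whole calculation on synthetic sighting times clustered near the full moons (the diary data itself is not reproduced here, so the numbers below are illustrative):

    import numpy as np

    # synthetic sightings: ~13 full moons per year, observed over several years,
    # with a little jitter around each full-moon date
    rng = np.random.default_rng(0)
    period = 28.0
    moons = np.arange(13) * period + 14.0
    t = np.concatenate([moons + rng.normal(scale=1.0, size=moons.size) for _ in range(5)])

    def rayleigh_power(t, omega):
        # unbinned Rayleigh power statistic S(omega)
        return (np.cos(omega * t).sum() ** 2 + np.sin(omega * t).sum() ** 2) / len(t)

    S = rayleigh_power(t, 2 * np.pi / period)   # angular frequency for a 28-day period
    print(f"S = {S:.1f}, confidence = {1.0 - np.exp(-S):.6f}")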

What if we did not know the exact frequency $\omega$?  No problem, we can just plot $S(\omega)$ vs $\omega$ and look for the peaks

So why not just do this?  Where does this break down?

• does not work for binned / sampled data
• may perform poorly if the data is not evenly sampled
• not entirely clear how to handle multiple frequencies

Given a signal $x(t)$, we are frequently interested in the power spectrum or energy spectral density

$P(\omega)=\dfrac{1}{2\pi}\mid\int_{-\infty}^{\infty}x(t)e^{i\omega t}dt\mid^{2}=\dfrac{1}{2\pi}\mathcal{F}^{*}(\omega)\mathcal{F}(\omega)$

where $\mathcal{F}(\omega)$ is the continuous Fourier Transform of $x(t)$

We usually would approximate $\mathcal{F}(\omega)$ with the discrete-time Fourier Transform (DFT)

$\mathcal{P}_{x}(\omega)=\dfrac{1}{2\pi}\mid\sum_{n=-\infty}^{\infty}x_{n}e^{i\omega t_{n}}\mid^{2}$

although typically to obtain a good result we need to sample $x_{n}=x(t_{n})$ on equally spaced  points $n$

We can now express the Periodogram

$P_{x}(\omega)=\dfrac{1}{N}\left[\left[\sum_{j}x_{j}\cos(\omega t_{j})\right]^{2}+\left[\sum_{j}x_{j}\sin(\omega t_{j})\right]^{2}\right]$

which is clearly the binned version of the Rayleigh Power Statistic, suitable for multiple frequencies.

We can express the Periodogram as the solution of the Least Squares (LS) optimization problem

$P_{x}(\omega)=\|\hat{\beta}(\omega)\|^{2}$, where

$\hat{\beta}(\omega)=\arg\min_{\beta}\sum_{j=1}^{N}\|x(t_{j})-\beta e^{-i\omega t_{j}}\|^{2}$

Because our data is real, however, this expression is really

$\sum_{j=1}^{N}\|x(t_{j})-\beta(\omega)\cos(\omega t_{j}+\phi(\omega))\|^{2}+|\beta(\omega)|^{2}\sum_{j=1}^{N}\sin^{2}(\omega t_{j}+\phi(\omega))$

We see that the LS problem includes a spurious term that does not depend on $x(t_{j})$.

Consequently, in noisy time series, with uneven sampling, the standard Fourier approach will boost long-periodic noise.  We need a different, more robust approach.

#### LSSA: Least Squares Spectral Analysis

What is LSSA?   This approach is similar to the Periodogram above in that we formulate the problem as a LS optimization.  We replace the spurious term above, however, by including an adjustable parameter $\tau$ in the cosine basis functions.   Consider the following constrained minimization problem

$\min_{\beta,\phi}f_{\omega}(\beta,\phi)=\sum_{n=1}^{N}\left[x(t_{n})-\beta\cos(\omega(\tau-t_{n})+\phi)\right]^{2}$

where $\beta\ge0$ and $\phi\in[0,2\pi]$, and we constrain $\tau$ so that all the frequency components are mutually orthogonal at the sample times $t_{j}$

$\sum_{j=1}^{N}\cos(\omega(\tau-t_{j}))\sin(\omega(\tau-t_{j}))=0$

Rather than sample $x(t)$ on regular time intervals,  and look at all frequency components, we will use regression and / or machine learning techniques to find  the optimal $(\omega, \tau)$

We now can apply some old trigonometric identities.  The constraint on $\tau$ can be expressed as

$\tan(2\omega\tau)=\dfrac{\sum_{n}\sin(2\omega t_{n})}{\sum_{n}\cos(2\omega t_{n})}$

and we can transform the overall function minimization to a constrained linear optimization problem using the identity

$\cos(x+y)=\cos(x)\cos(y)-\sin(x)\sin(y)$

$\cos(\omega(\tau-t_{n})+\phi)=\cos(\omega(\tau-t_{n}))\cos(\phi)-\sin(\omega(\tau-t_{n}))\sin(\phi)$

Let us now define the linear equation

$\mathbf{x}=\mathbf{A}\mathbf{\Phi}$, where

$\mathbf{x}=\left[\begin{array}{c} x_{1}\\ x_{2}\\ \ldots\\ x_{N} \end{array}\right]$,  $\mathbf{A}=\left[\begin{array}{cc} \cos(\omega(\tau-t_{1})) & \sin(\omega(\tau-t_{1}))\\ \cos(\omega(\tau-t_{2})) & \sin(\omega(\tau-t_{2}))\\ \ldots & \ldots\\ \cos(\omega(\tau-t_{N})) & \sin(\omega(\tau-t_{N})) \end{array}\right]$,  $\Phi=\left[\begin{array}{c} \beta\cos(\phi)\\ \beta\sin(\phi) \end{array}\right]$

This is now a linear least squares problem, normally solved by Ordinary Least Squares (OLS).  We choose a quadratic loss function to illustrate, but, later, we will see we can also solve this problem as a Machine Learning optimization using an $L_{1}$ norm or any number of methods.  Using OLS, we obtain

$\hat{\mathbf{\Phi}}=\arg\min f(\mathbf{\Phi})=(\mathbf{A}^{t}\mathbf{A})^{-1}\mathbf{A}^{t}\mathbf{x}$

We can now express the Power Spectrum, or Periodogram, as

$P_{x}(\tau,\omega)=\dfrac{1}{N}\hat{\mathbf{\Phi}}^{t}\mathbf{A}^{t}\mathbf{A}\hat{\mathbf{\Phi}}$
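Putting the pieces together, here is a minimal numpy sketch that computes $\tau$ from the orthogonality constraint, builds $\mathbf{A}$, solves the OLS problem, and forms $P_{x}(\tau,\omega)$ over a grid of trial periods (the unevenly sampled test signal is an illustrative choice):

    import numpy as np

    def lssa_power(t, x, omega):
        # tau from  tan(2*omega*tau) = sum sin(2*omega*t) / sum cos(2*omega*t)
        tau = np.arctan2(np.sin(2 * omega * t).sum(),
                         np.cos(2 * omega * t).sum()) / (2 * omega)
        A = np.column_stack([np.cos(omega * (tau - t)),
                             np.sin(omega * (tau - t))])
        Phi, *_ = np.linalg.lstsq(A, x, rcond=None)   # OLS estimate of [beta cos(phi), beta sin(phi)]
        return (Phi @ (A.T @ A) @ Phi) / len(t)

    # unevenly sampled, noisy sinusoid with a 0.9 period
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 10, 200))
    x = np.sin(2 * np.pi * t / 0.9) + 0.5 * rng.normal(size=200)
    x = x - x.mean()

    periods = np.linspace(0.5, 2.0, 300)
    power = [lssa_power(t, x, 2 * np.pi / p) for p in periods]
    print("peak at period ~", periods[int(np.argmax(power))])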

#### Lomb-Scargle Periodogram

Without going into all the details yet, we can now express the spectral density $P_{x}(\omega)$ in closed form:

$P_{x}(\tau,\omega)=\dfrac{1}{2\sigma^{2}}\left(\dfrac{\left[\sum_{j}(x_{j}-\bar{x})\cos\omega(\tau-t_{j})\right]^{2}}{\sum_{j}\cos^{2}\omega(\tau-t_{j})}+\dfrac{\left[\sum_{j}(x_{j}-\bar{x})\sin\omega(\tau-t_{j})\right]^{2}}{\sum_{j}\sin^{2}\omega(\tau-t_{j})}\right)$

where $\bar{x}=\dfrac{1}{N}\sum^{N}_{i=1}x_{i}$ and $\sigma^{2}=\dfrac{1}{N-1}\sum^{N}_{i=1}(x_{i}-\bar{x})^{2}$ are the empirical mean and variance, respectively.

I will work out the proof soon

#### Extensions: Weighted Least Squares Spectral Analysis

When  we have the covariance of our sample data, or some other model of the noise, we can include this into the OLS problem (see [4]).  For example, we model the covariance at each frequency $\omega_{k}$ as

$\mathbf{Q_{k}}=\sum_{j\ne k}\mathbf{A}_{j}\mathbf{D}_{j}\mathbf{A^{t}}_{j}$,

where $\mathbf{D}_{k}$ is the diagonal matrix

$\mathbf{D}_{k}=\dfrac{a^{2}(\omega_{k})+b^{2}(\omega_{k})}{2}\left[\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\right]$

We may plug any estimate of the covariance $\mathbf{Q}$ into the OLS problem–as long as it is not ill-conditioned–yielding the new quadratic minimization functional

$\left[\mathbf{x}-\mathbf{A\Phi}\right]^{t}\mathbf{Q}\left[\mathbf{x}-\mathbf{A\Phi}\right]$

This leads to an improved  expression for $\Phi$

$\mathbf{\Phi_{Q}}=(\mathbf{A}^{t}\mathbf{Q}^{-1}\mathbf{A})^{-1}\mathbf{A}^{t}\mathbf{Q}^{-1}\mathbf{x}$

but the Periodogram still has the form of an expectation value of $\mathbf{A^{t}A}$

$P(\omega,\tau)=\mathbf{\Phi_{Q}}^{t}\mathbf{A}^{t}\mathbf{A}\mathbf{\Phi_{Q}}$

This should also lead to a similar, analytic expression for the Lomb-Scargle Periodogram [to come soon; see 11 and references therein].

#### Issues with LSSA

We note that neither FFT nor LSSA approaches are robust when the number of samples is small, when the sampling is uneven, or when the time between samples is longer than the period being sampled, $(t_{i+1}-t_{i})\gg\dfrac{1}{\omega}$; see [11]. This case arises frequently in Astronomy, and needs to be handled using Machine Learning (or other) techniques such as a Maximum Likelihood approach to the Rayleigh Power analysis (see [3]), or a Generalized Lomb-Scargle Periodogram [11].

#### Generalized Lomb-Scargle Periodogram:  Weights and Offsets

There are several modern research papers [3],[4] (more to come) discussing the success and limitations of LSSA and its variants.  Here we describe the variant provided in the AstroML package (below, to come).

#### Example:  AstroML Package

The AstroML package [1] is very new and is built upon the very popular Scikits Learn Machine Learning library for Python.  AstroML implements both the classical Lomb-Scargle Periodogram and the Generalized version [11].  Here is an example:
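Since the original listing is not reproduced here, the following is a hedged sketch of that example, assuming astroML's lomb_scargle helper (the exact import path and call signature may differ between versions):

    import numpy as np
    from astroML.time_series import lomb_scargle   # assumed location of the helper

    # unevenly sampled, noisy data around y = 10 + sin(2*pi*t/0.3)
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 10, 100))
    dy = 0.5 + 0.5 * rng.random(100)
    y = 10.0 + np.sin(2 * np.pi * t / 0.3) + dy * rng.normal(size=100)

    # scan trial periods in (0, 1) and compute the (generalized) periodogram
    periods = np.linspace(0.05, 1.0, 1000)
    omega = 2 * np.pi / periods
    power = lomb_scargle(t, y, dy, omega, generalized=True)
    print("dominant period ~", periods[int(np.argmax(power))])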

Here we have generated random data distributed around the function

$y=10.0+\sin(2\pi\omega t)$, where $\omega=\dfrac{1}{0.3}$

and we compute the Periodogram for periods $\dfrac{1}{\omega}\in(0,1)$.   We see the Periodogram is quite noisy, but there is a dominant peak at $\dfrac{1}{\omega}=0.3$.

Can we do better?

#### Machine Learning Alternatives

Just a hint of what's to come. Using traditional (old school) machine learning / Kernel SVM techniques, one can model the covariances and/or other noise by using a non-linear kernel-SVM (i.e. [9]).  But we note that we can also recast the quadratic optimization problem as a linear programming problem, called Basis Pursuit

$\min_{\Phi}\|\mathbf{\Phi}\|_{1}$ subject to $\mathbf{x}=\mathbf{A}\mathbf{\Phi}$

or as a convex optimization, sometimes called Basis Pursuit Denoising

$\min_{\Phi}\|\mathbf{x}-\mathbf{A}\mathbf{\Phi}\|_{2}+\lambda\|\mathbf{\Phi}\|_{1}$
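As a hedged sketch of this idea applied to the spectral problem above, we can build an over-complete dictionary of cosine and sine columns over many trial periods and ask for a sparse coefficient vector; scikit-learn's Lasso is used here as a stand-in solver for the denoising form (all sizes and the regularization strength are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso

    # noisy, unevenly sampled sinusoid with a 0.9 period
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 10, 200))
    x = np.sin(2 * np.pi * t / 0.9) + 0.5 * rng.normal(size=200)

    # over-complete dictionary: a cos/sin pair for each trial period
    periods = np.linspace(0.5, 2.0, 200)
    A = np.hstack([np.column_stack([np.cos(2 * np.pi * t / p),
                                    np.sin(2 * np.pi * t / p)]) for p in periods])

    fit = Lasso(alpha=0.05, max_iter=50000).fit(A, x)
    active = np.flatnonzero(fit.coef_)
    print("periods with non-zero coefficients:", np.unique(np.round(periods[active // 2], 3)))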

When would such approaches be useful? The LSSA method would clearly be deficient when the OLS problem is pathological, or when

$\int_{0}^{\omega_{max}}\mathbf{A}^{t}\mathbf{A}d\omega$

is  ill-conditioned.    (See the Appendix of reference [4] for more details).

In particular, when the LSSA problem is underdetermined, we need more advanced techniques.  This leads us to the next blog, where we will discuss the modern variants of these old Astrophysics and Geophysics methods–called Compressed Sensing.  Stay tuned.

#### References

[9] Chapter 5,  Kernel Methods and Their Application to Structured Data

[11] 2009 M. Zechmeister & M. Kürster, The generalised Lomb-Scargle periodogram: A new formalism for the floating-mean and Keplerian periodograms

#### Books and Review Articles

Advances in Machine Learning and Data Mining for Astronomy

Data Mining and Machine-Learning in Time-Domain Discovery & Classification