(this post has been published to check the links–sorry it is not ready yet)
In our previous post, we discussed how to distinguish a correlation from a causation, looking a technique developed by economists called Granger Causality. Here, take a more pedagogic look at the details, from the physics perspective, and another Nobel prize winner
Stochastic Problems in Astronomy and Physics
There is a classic review of Stochastic Problems in Astronomy and Physics [1] by the late Subrahmanyan Chandrasekhar, Nobel Laureate (Physics, 1983) and former head of the department of physics when I was student at the University of Chicago. In particular, Chandrasekhar reviews 2 kinds of processes:
- random walk , aka Wiener process
- mean reverting walk, aka Ornstein-Uhlenbeck process
I am a firm believer that every scientist should study and know the great works of the Nobel Laureates, so I strongly encourage the reader to dig deep [1] on their own–or to at least continue reading
Recall that we wish to distinguish between causation and correlation. In this context, we shall mean that we have an underlying model for our times series, and we wish to know if our model is sensible and predictive – it is an underlying cause – or we have been ‘fooled by randomness‘… we see patterns that are not really there.
random walk
mean reverting process
In this sense, we seek to distinguish between a correlation and causation by distinguishing between having a random walk or a mean-reverting process.
Tests for Mean Reversion
How might we determine if we have a random walk or a mean reverting processes? Let us try to express our time series as
Is this a good model? Do data points in the past () say anything about data points in the future (
) ?
Augmented Dickey-Fuller (ADF) test
Estimate the regression coefficient of
on
. If the coefficient
is significantly below 1, it means that the process is mean reverting; if it is close to 1, the process is a random walk.
Phillips-Perron test
Search for a unit root in the equation linking to
. A high p-value reinforces the hypothesis of a unit root.
References
Today I am going to look at a very important advance in one of my favorite Machine Learning algorithms, NMF (Non-Negative Matrix Factorization) [1].
NMF is a curious algorithm in that it allows us to do something very natural and seemingly straightforward, namely, to find clusters of data points, and yet, mathematically, it is known to be non-convex and NP-HARD [2]. Practically, this means that NMF–and many other common clustering algorithms, such as KMeans, LDA, etc.–can suffer from convergence problems and yield unstable and/or just plain unphysical results.
Several approaches exist to improve traditional NMF [1], such as sparse NMF ala Hoyer [3], using SVD initialization (i.e. NNDSVD) [4], and the convex NMF (CNMF) approach suggested by Jordan et. al. [5]
Recently, however, it has been shown how to re-formulate the traditional NMF problem, under fairly reasonable assumptions, as a true convex optimization problem, namely a linear programming (LP) problem [6]. Is this new? Is it revolutionary? Yes and no.
In fact, it has been known for some time that CNMF could be re-formulated as purely convex optimization, called Convex-hull non-negative matrix factorization (CHNMF). Indeed, both CNMF and CHNMF are implemented in the Python Matrix Factorization (PyMF) module. The PyMF package, however, is not very fast, and in production, I would recommend using the more popular SciKit Learn NMF, with an NNDSVD seed, over the PyMF CNMF or CHNMF implementations, unless you really, really need a (more) sparse NMF solution.
So what is so exciting about the Bittorf et. al. paper? Their ’CNMF-LP’ problem can be solved using a very fast, shared-memory, lock-free implementation of a Stochastic Conjugate Gradient (SGD) solver, called hottopix. (This is same kind of solver applied in other very high performance ML toolkits, such as the fantastic GraphLab package) IMHO, this is HUGE. It means we can stop wasting time with half-baked approaches like Hadoop+Mahout and actually solve very large scale, applied problems with the same performance we have come to expect from our SVMs.
I will break this post into a 3-4 parts. In this first post, I review the basic re-formulation of convex-NMF as a linear program.
Standard NMF problem
Let be a nonnegative
data matrix with the
rows indexing the features and the
columns indexing the examples. That is,
where is a column, data vector. In standard NMF, we seek to factor
into
components, such that
, where
is an
feature matrix, and
an
weight matrix, and both
are non-negative matrices.
Convex NMF (NMF) Problem
In Convex NMF (Jordan et. al.), we require that the feature vectors be convex linear combinations of the original data matrix, such that
This lets us relate the feature vectors to
centroids of the data, as is common in Kmeans methods. Also, this leads to very sparse solutions. Lets us write
.
where is an
matrix, called the factorization localizing matrix. This lets express convex NMF approximation in the more cryptic form
which is the starting point for grokking the LP formulation
Notational Confusion: from CNMF to ‘CNMF-LP’
We make a short side-note on notation.
The Jordan et. al. paper calls the data matrix and writes the factorization as
,
and
.
The Bittorf et. al. paper writes the data matrix the but adds the additional pre-factorization as
The matrix is the true data matrix, and the
matrix is that part which satisfies the constraints to make the problem convex. They assume
is small a in formal sense in order to make their theoretical argument and form an error bound (not addressed here). They then write the factorization almost interchangeably as
,
and / or
.
Also, note that they write the matrix on the left:
,
and / or
.
(switching between C on the left or right makes it hard to compare the details in the 2 CNMF papers, so we only sketch out the ideas here, and will work through the explicit details for you in the next post when we code it up)
Exact Convex NMF and Linear Programming
Certainly there are cases where NMF is well posed. That is, the data is sparse enough and the centroids separable enough that even though the Jordan et. al. convex NMF per-se is non-convex and probably NP-HARD, any practical solution with a good initial starting value will rapidly converge to a stable and unique solution.
The Bittorf et. al. observe that, by definition, for any well-posed convex NMF, the matrix elements of the factor loading matrix form a polyhedral set
, such that
To solve an exact and well posed convex NMF problem, one simply needs to find a feasible solution of the set , such that all the diagonal elements
. This is readily accomplished by solving the associated linear programming problem!
Real-World Convex NMF
To solve a real-world problem, we assume that the data matrix is nearly well-posed, upto some small amount of noise. We then solve the associated LP problem, where we replace the exact constraint
with the relaxed constraint
. This yields
subject to the constraints
;
;
; and
;
where is some arbitrary vector, and
is an adjustable, slack parameter.
We now form the Feature matrix with yet a second LP program,
.
This entire problem is now readily addressable and solving using off the self convex-optimization libraries, such as Matlab and/or the cvxopt package in Python. In our next post, we will look at some toy problems and how to solve them directly using these techniques.
Lagrangian form of Convex NMF
I don’t know about you, but I always prefer to see the Lagrangian, or dual, form of a convex optimization problem, just to make sure I understand the definition and the constraints. Additionally, eventually we will want a high performance solver that we can run on very large data sets, so it is also very practical to review these equations now.
Bittorf et. al. provide an Incremental Gradient Descent algorithm where they dualize a part of the constraints. The corresponding Lagrange multiplier problem is
with the constraint set
Since the 1-norm that appears above, this requires special techniques such as subgradients. They suggest breaking the problem into two parts, a Dual Subgradient Ascent approach, that solves this problem incrementally by
1. alternating between minimizing the Lagrangian over the constraint set
to find a solution
, and then
2. plugging into a subgradient step with respect to the dual variables
We will go into more details in a future post.
Conclusion: Clustering via LP
So there we have it. A formulation of a convex NMF–a workhorse algorithm of applied machine learning–as a Linear Program (i.e. CNMF-LP), which can be solved using high performance, large scale convex optimization techniques. This has the potential to make clustering of very data scale data sets as easy as it is now to run very large scale linear SVMs.
It is, IMHO, a game changer.
References
1. Lee and Seung (1999), Learning the parts of objects by non-negative matrix factorization
2 Donoho & Stodden (2004), When does non-negative matrix factorization give a correct decomposition into parts?
3. Hoyer (2004), Non-negative Matrix Factorization with Sparseness Constraints
4. Boutsidis & Gallopoulos (2007), SVD based initialization: A head start for nonnegative matrix factorization
5. Ding, Li, & Jordan (2006), Convex and Semi-Nonnegative Matrix Factorizations
6. Bittorf, Recht, Re, & Troppy (2013), Factoring nonnegative matrices with linear programs
7. see: Python Matrix Factorization (PyMF) module and references within
cloud-crawler-0.1
For the past few weeks, I have taken some time off from pure math to work on an open source platform for crawling the web. I am happy to announce the cloud-crawler version 0.1 open source project
The cloud-crawler is a distributed ruby dsl for crawling the web using amazon ec2 micro-instances. The goal is to create an end-to-end framework for crawling the web, eventually including the ability to crawl even dynamic javascript, and from a pool of spot instances.
This initial version is built using Qless, a redis-based queue, a redis-based bloomfilter, and a re-implementation and extension of the anemone dsl. It also includes chef recipes for spooling up nodes on the amazon cloud, and a Sinatra app, cloud-monitor, to monitor the queue. the basic layout is shown below from the Slideshare presentation
Here, we show an example crawl, which finds links to all Level 1 Certs on the Crossfit main page:
urls = ["http://www.crossfit.com"] CloudCrawler::crawl(urls, opts) do |cc| cc.focus_crawl do |page| page.links.keep_if do |lnk| text_for(lnk) =~ /Level 1/i end end cc.on_every_page do |page| puts page.url.to_s end end
This is a very pre-release, and we are actively looking for contributors interested in getting involved. (also, the web documentation is still in progress)
Rather than go into details, here we show how to install the crawler and get a test crawl up and running.
To install on a local machine
(i.e Mac or Linux, Ruby does not play well with Windows)
I. Dependencies
Ruby 1.9.3 with Bundler http://gembundler.com
Redis 2.6.x (stable) http://redis.io/download
it is suggested to use RVM to install ruby https://rvm.io
and to use git to obtain the source http://git-scm.com
II. Installation Steps
II.0 install ruby 1.9.3, and redis 2.6.x
II.1 install bundler
gem install bundler
II.2 clone the git source
git clone git://github.com/CalculatedContent/cloud-crawler.git
II.3 install the required gems and sources
change directories to where the Gemfile.lock file is located
cd cloud_crawler/cloud-crawler
install the gems and required source and build the gem
bundle install
to create a complete, sandbox, you can say
bundle install --path vendor/bundle
this will install the cloud_crawler in a local bundle gem repository
we use bundler locally because we also use this on amazon aws / ec2 machines
III. Testing the Install
III.1 start the redis server
redis-server &
III.2 run rake
bundle exec rake
III.3 run a test crawl
bundle exec ./test/test_crawl.rb
IV try a real crawl using the DSL
flush the redis database
redis-cli flushdb
load the first job into redis
bundle exec ./examples/crossfit_crawl.rb
run the worker job
bundle exec ./bin/run_worker.rb -n crossfit-crawl
V. To view the queue monitor in a browser
bundle exec qless-web
this should launch a tab in the web browser if this fails, the monitor may still work , and may be visible in your browser at
localhost:5678
and that’s it — you have a dsl for crawling running locally.
VI. To run the crawler on AWS and EC2, you will need to set up an amazon account, install chef-solo, and create some security groups and s3 buckets.
Stay tuned for extended documentation and examples, including seeing the crawler in action on EC2. Feel free to email to ask questions or to express interest in getting involved.
Introduction:
In a previous post we introduced the problem of detecting Gravity Waves using Machine Learning and suggested using techniques like Minimum Path Basis Pursuit. Here, we drill down into the theoretical justifications of the general approach–called Compressed Sensing–and try to at least understand what the current theory actually tells us.
Linear Algebra: Solving Ax=b with an Over-Complete Basis
In traditional linear algebra, we solve the problem
using the Penrose Pseudo-Inverse
where is a vector of length
, and
is an m x n matrix
A standard and very old machine learning approach is to add a small constant to the diagonal, called Tikhonov Regularization or Ridge Regression
This is equivalent to constrained optimization problem
Basis Pursuit — or L1 Regularization
It has been known in geophysics and astrophysics since the 1970s that frequently we can do better if we minimize the -norm instead of the
-norm of
where
Note: sometimes this is written as an unconstrained optimization problem
subject to
appland this can be relaxed to
subject to
The constrained solution, and related -regularization problems, are readily solved using numerical techniques, and are available, in some form, in many R and MatLab packages and open source tools.
Familiar Open Source Tools
Liblinear: (L1 Regularized Classification)
The current version of Liblinear (1.9) includes -regularization for classification in (along with
-regularization SVM regression methods ):
./liblinear-1.91/train Usage: train [options] training_set_file [model_file] options: -s type : set type of solver (default 1) ... 5 -- L1-regularized L2-loss support vector classification ...
To solve a regression, we can use
Vowpal Wabbit (VW):
VW actually solves the more general problem (i.e. an Elastic Net)
where is a general Loss function for either classification or regression. We set the regularization parameters via the command line Update Rule Options
--l1 arg (=1) l_1 lambda
--l2 arg (=1) l_2 lambda
There even hardware implementations of compressed sensing: see the awesome Nuit Blanche blog for more details.
(Apparently) when solving a regression problem, we call the -regularization problem Basis Pursuit (BP) [7]
BP is a principle for decomposing a signal into an optimal superposition of dictionary elements (an over-complete basis), where optimal means having the smallest -norm of coefficients among all such decompositions.
Unfortunately, neither of these off-the-shelf programs will work to find our Gravity Waves; for that, we need to add path constraints.
Our signal is very weak (SNR << 1) and we are trying to reconstruct a continuous, differentiable function of a known form. We will be push the computational and theoretical limits of these methods. To get there (and because it is fun), we highlight a little theoretical work on Compressed Sensing
In the past decade, work by Dave Donoho and Emmanuel Candes at Stanford and Terence Tao at Berkeley have formalized and developed the theoretical and practical ideas of Basis Pursuit under the general name compressed sensing.
A common use of compressed sensing is for image reconstruction

Beyond the Sampling Theorem
To reconstruct a signal , we used to think we were bandwidth limited; this is the essence of the Nyquist–Shannon Sampling Theorem
Tao, a former child prodigy who won the Fields Medal in 2006, took some time off from pure math to show us that we are, in fact, limited by the signal structure, not the bandwidth.
We call S-sparse when at most only $latex S $ of the coefficients
can be non-zero (
. For example, many images are S-sparse in a wavelet basis; this is the basis of the newer JPEG2000 algorithm.
This allows us to reconstruct a signal with as few data as possible–if we can guess the right basis.
Ideal Sparse Reconstruction
In an ideal world, we would like to find the most sparse solution, which means we solve the regularization problem
subject to
where the norm is just the number of non-zero components.
This is very difficult to achieve numerically. It turns out that we can approximate this problem with the regularization problem, which can be solved as a convex optimization problem. And we even have some ideas why. But why ask why?
IMHO, applied machine learning work requires applying off-the-shelf tools, even in seemingly impossible situations, and yet avoiding wasting effort on hopeless methods. The best applied scientists understand the limits of both the methods they use and the current underlying theory.
Theoretical Justifications for Compressed Sensing
Any early idea showed that we can, in fact, do better than the Sampling Theorem just using random projections:
Theorem (Candes-Romber 2004):
Suppose we choose m points randomly out of n. Then, with high probability, every S-sparse signal
can be reconstructed from
, so long as
for some absolute constant C.
This result leads hope that this can be formalized. But what kind of conditions do we need on to at least have a hope of reconstructing
? A very simple result is
Proposition:
Suppose that any columns of matrix
are linearly independent. (This is a reasonable assumption once
.) Then, any S-sparse signal
can be reconstructed uniquely from
.
Proof:
Suppose not; then there are two S-sparse signals with
. This implies
But
is 2S-sparse, so there is a linear dependence between 2S columns of
. A contradiction.
So we might expect, on a good day. that every 2S columns of A are linearly independent. In fact, we can say a little more
Conditions on A and the Restricted Isometry Property (RIP)
There are now several theoretical results ensuring that Basis Pursuit works whenever the measurement matrix is sufficiently “incoherent”, which roughly means that its matrix entries
are uniform in magnitude.
In practice, numerical experiments suggest that most S-sparse signals can be recovered exactly when
Spectrum of a Gaussian Random Matrix
This means we can strengthen the proposition above by requiring, for example, that every 4S columns of are approximately orthonormal. It turns out that many well known Random Matrices satisfy this property, such as Gaussian Random Matrices.
Technically, the RIP is stated as:
Suppose that there exists a constant such that, for every m × s submatrix
of
and for every vector
we have
whenever
If the matrix satisfies some form of RIP, we can get a theoretical upper bound on how well the
BP will perform. A basic result shows that true
(i.e. RMS) error in the reconstructed signal is bounded by the
(i.e. absolute) sample error:
Theorem (Candes-Romber-Tao):
The tightest bound is due to Foucart, who shows that this bound holds when

The astute reader will recognize that the RIP asks that the signal be sparse in an orthonormal basis and that the data matrix be uniformly incoherent.
Is this good enough? Have you ever seen a uniformly incoherent distribution in a real world data set?
D-RIP
Many signals may in-fact, be only sparse in the non-orthogonal, overcomplete basis. Recently, it has been shown how to define a D-RIP property [10,11] that applies to the more general case of even when the matrix is not as incoherent as we might expect or desire.
In some earlier posts, we introduced the Regularization (or Projection) Operator as a way to get at What a Kernel is (really). Recall that to successfully apply a Kernel, the associated, infinite order expansion should converge rapidly.
Likewise, here we define a similar Operator, , that projects our sample data into the Hilbert Space, or Signal Domain
.

It is in this space that the signal is expanded in the over-complete basis
We might also refer to as a frame, although I skipping this technical detail for now. Essentially, this means that any infinite order expansion in
converges rapidly enough that we can use it as a basis in our infinite dimensional space.
We can now adjust the RIP, extending the notion of considering every m × s submatrix to considering , the union of all subspaces spanned by all subsets
of size s.
Technically, the D-RIP is stated as:
Suppose that there exists a constant such that
whenever
Similarly to the RIP, Guassian, subgaussian, and Bernoulli matrices satisfy the D-RIP with m ≈ s log(d/s). Matrices with a fast multiply (DFT with random signs) also satisfy the D-RIP with m approximately of this order.
There is also a known bound.
D-RIP Bound for Basis Pursuit (Needell, Stanford 2010):
Let D be an arbitrary tight frame and let A be a measurement matrix satisfying D-RIP with δ2s < 0.08. Then the solution to
-analysis satisfies
The result says that -regularization/analysis is very accurate when
converges quickly (has rapidly decaying coefficients)
This is the case in applications using Wavelets, Curvelets, and Chirplets.
Warning: Don’t just use Random Features
This does not say that we can use any random or infinite size basis. In particular, if we combine 2 over-complete basis sets, we might overtrain or, worse, fit a spurious, misleading signal. Been there, seen that. This is way too common in applied work, and is also the danger we face in trying to detect a weak signal in a sea of noise. And this is why we will add additional constraints.
Weak Signal Detection: A Preview
Recall that we wish to detect a very weak signal of the general form
in a sea of noise, where is a slowly varying function, and
is some oscillatory function of unknown frequency, phase, and envelope. We need to detect the presence of
–and the unknown critical time
. We might be tempted to first think we can just create an overcomplete basis of functions
for variety of critical times
and solve the Basis Pursuit problem.
This will most likely fail for 2 reasons
- each critical time
defines a unique
-frame, and we probably should not combine all
basis sets / frames $latex g_{\lambda}(t,\tau) $ in the same optimization
- even for a single pass of BP, there is just too much incoherent noise and/or perhaps even some other, transiently stable, detectable, but spurious signal.
Given these conditions, one might expect a clever, adaptive or matching pursuit-like approach to work better than vanilla constrained optimization (I’ll explain the difference in a bit). Recent research [12] indicates, however, that while we might believe
Folk
Theorem: The estimation error one can get by using a clever adaptive sensing scheme is far better than what is achievable by a nonadaptive scheme.
that, in fact,
Surprise: The folk theorem is wrong in general. No matter how clever the adaptive sensing mechanism, no matter how intractable the estimation procedure, in general [one can not do better than] a random projection followed by minimization.
with the
Caveat: ”This ‘negative’ result should not conceal the fact that adaptivity may help tremendously if the SNR [Signal-to-Noise Ratio] is sufficiently large
In our next post we will look at some real world examples of Chirplet Minimum Path Basis Pursuit using various path (and maybe even structural) constraints…stay tuned!
References
[1] 1995 Mann & Haykin, The Chirplet Transform: Physical Considerations
[2] 2004 Emmanuel Candes & Terence Tao, Near Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?
[3] 2006 Candes, Charlton, & Helgason, Detecting Highly Oscillatory Signals by Chirplet Path PursuitThe Chirplet Transform: Physical Considerations
[4] 2008 Greenberg & Gosse, Chirplet approximation of band-limited, real signals made easy
[5] 2008, NTNU’s Onsager Lecture, Compressed Sensing by Terence Tao, part 1 of 7
[6] 2009 Tao, Compressed Sensing
[7] Chen & Donoho, Basis Pursuit
[8] 2004 Candes, Chirplets: Multiscale Recovery and Detection of Chirps
[10] 2010 Emmanuel J. Candes, Yonina C. Eldar, Deanna Needell, Paige Randall, Compressed Sensing with Coherent and Redundant Dictionaries
[10a] http://www.cmc.edu/pages/faculty/DNeedell/RhodesTalk.pdf
[11] 2012 Mark A. Davenport, Deanna Needell, Michael B. Wakin, Signal Space CoSaMP for Sparse Recovery with Redundant Dictionaries
[12] 2012 Arias-Castro, Candes, & Davenport, On the Fundamental Limits of Adaptive Sensing
So how in the world would a Machine Learning Scientist predict an Earthquake? You might probably think, just collect all the data you can find, stuff it into Hadoop, and run some supervised machine learning algorithms. Eh…not so much!
What we will do is apply some models from Astronomy and Theoretical Physics to model the process, and see if the techniques developed for detecting weak patterns in Astronomy can be applied to the problem of detecting Earthquakes and other crashes in nature.
Moreover, we will, eventually, see how to convert a highly non-convex optimization problem into a convex (LP) problem. It is going to take me some time to get there…first some motivation
Scale Invariance in Nature:
When we look out at natural phenomena, we see a system of Fractals (and seemingly in Equilibrium). Indeed, many natural phenomena resemble Fractals, such as Trees, Mountains, and Coastlines.
A famous math problem is to compute the length of the Coastline of Britain. The problem is that the shorter your ruler, the longer the coastline seems. If we use a 200km ruler, we measure 2400 km. If we use a 50km one, we get 3400 km. And so on
When this happens, we call this, mathematically, Scale Invariance. Read more…
Today we consider the problem of modeling periodic behavior in a noisy time series. This is an old problem from Astrophysics that a has taken center stage again because of it is used to detecting planets orbiting around distant stars. As usual, it is also important in modeling many other real world phenomena. We will prep the discussion with some simple examples and a brief review of traditional Fourier Analysis, and then look at the classic solution called the Lomb-Scargle Periodogram. We will look at an example using the Python AstroML toolkit [1]. Later (next blog) we will will then look at in detail at some modern machine learning style solutions such as Basis Pursuit and Compressed Sensing. Finally, we use this approach to look at a hard problem–how to detect the onset of Discrete Scale Invariance and Log-Periodic behavior in a noisy time series. Stay tuned
Simple Analysis
We first consider a simple problem: Do Werewolves only come out during a Full Moon?
Suppose that every so often we observe some werewolves and we note this down in our diary. Let us plot this data after a year of recording.
We suspect that the Werewolves only come out during a full moon. How much confidence do we have in this? Our goal is to determine if the distribution of Werewolves sightings is periodic, or if it is random.
As pointed out by my friend Mike Plotz, totally random sightings would most be governed by a Poisson process, meaning that the arrival times would be exponentially distributed. So we should plot the distribution of arrival times and check.
Although recent research in biophysics suggest that random Werewolves would actually exhibit scale invariance [10]. We will look at this in our next post. Here, we look at a statistic that is also suitable for arrival times…
Rayleigh Power Statistic
The first thing we can do is simply check a simple, unbinned statistic for a given frequency, called the Rayleigh Power Statistic
By unbinned, we mean that we simply calculate the statistic for all data points without computing any histograms or using any kind of sampling.
What does this statistic measure? Suppose our times are uniformly distributed (i.e. werewolves come out at random). Then
We can compute a confidence of periodicity by using the statistic for a uniform distribution
For this data set, S=8167 for N=885, and days, our confidence is
So we have a very high confidence that werewolves do come out every 28 days.
What if we did not know the exact frequency ? No problem, we can just plot
vs
and look for the peaks
So why not just do this? Where does this break down?
- does not work for binned / sampled data
- may perform poorly if the data is not evenly sampled?
- not entirely clearly how to handle multiple frequencies
Traditional Spectral Analysis
Given a signal , we are frequently interested in the power spectrum or energy signal density
where is the continuous Fourier Transform of
We usually would approximate with the discrete-time Fourier Transform (DFT)
although typically to obtain a good result we need to sample on equally spaced points
We then can now express the Periodogram
which is clearly the binned version of the Rayleigh Power Statistic, suitable for multiple frequencies.
We can express the Periodogram as the solution of the Least Squares (LS) optimization problem
, where
Because our data is real, however, this expression is really
We see that LS problem includes a spurious term that does not depend on .
Consequently, in noisy time series, with uneven sampling, the standard Fourier approach will boost long-periodic noise. We need a different, more robust approach.
LSSA: Least Squares Spectral Analysis
What is LLSA? This approach is similar to the Periodogram above in that we formulate the problem as a LS optimization. We replace the spurious term above, however, by including an adjustable parameter in the cosine basis functions. Consider the following constrained minimization problem
where and
, and we constrain
so that all the frequency components are mutually orthogonal at the sample times
Rather than sample on regular time intervals, and look at all frequency components, we will use regression and / or machine learning techniques to find the optimal
We now can apply some old trigonometric identities. The constraint on can be expressed as
and we can transform the overall function minimization to a constrained linear optimization problem using the identity
This leads to
Let us now define the linear equation
, where
,
,
This is now a linear constrained problem, normally solved by Ordinary Least Squares (OLS). We choose a quadratic loss function to illustrate, but, later, we will see we can also solve this problem as a Machine Learning optimization using an norm or any number of methods. Using OLS, we obtain
We can now express the Power Spectrum, or Periodgram, as
Lomb-Scargle Periodgram
Without going into all the details yet, we can now express the spectral density in closed form:
where and
are the empirical mean and standard deviation, resp.
I will work out the proof soon
Extensions: Weighted Least Squares Spectral Analysis
When we have the covariance of our sample data, or some other model of the noise, we can include this into the OLS problem (see [4]). For example, we model the covariance at each frequency as
,
where is the diagonal matrix
We may plug in any estimate of the covariance to the OLS problem–as long as it is not ill-conditioned–yielding the new quadratic minimization functional
This leads to an improved expression for
but the Periodgram stil has the form of an expectation value of
This should also lead to a similar, analytic expression for the Lomb-Scargle Periodgram [to come soon, see 11 and references therein]
Issues with LSSA
We note that neither FFT nor LLSA approaches are robust when the number of samples is small, when the sampling is uneven, or when the time between samples is longer than the period being sampled See [11]. This case arises frequently in Astronomy, and needs to be handled using Machine Learning (or other) techniques such as a Maximum Likelihood Approach to the Rayleigh Power analysis (see [3]), or a Generalized Lomb-Scargle Periodogram [11]
Generalized Lomb-Scargle Periodogram: Weights and Offsets
There are several modern research papers [3],[4] (more to come) discussing the success and limitations of LLSA and its variants. Here we describe the variant provided in the AstroML package (below, to come)
Example: AstroML Package
The AstroML package [1] is very new and is built upon the very popular Scikits Learn Machine Learning library for Python. AstroML implements both the classical Lomb Scargle Periodogram and the Generalized version [11]. Here is an example
Here we have generated random data distributed around the function
, where
and we compute the Periodogram for periods . We see the Periodgram is quite noisy, but there is a dominant peak at
.
Can we do better?
Machine Learning Alternatives
Just a hint of whats to come. Using traditional (odl school) machine learning / Kernel SVM techniques, one can model the covariances and/or other noise by using a non-linear, kernel-SVM (i.e. [9]) But we note that we can also recast the quadratic optimization problem as a linear programming problem, called Basis Pursuit
subject to
or as a convex optimization, sometimes called Basis Pursuit Denoising
When would such approached be useful? The LSSA method would clearly be deficient when the OLS problem is pathological, or
is ill-conditioned. (See the Appendix of reference [4] for more details).
In particular, when the LSSA problem is underdetermined, then we need advanced techniques. Which leads us to the next blog, where we will discuss the modern variants of these old AstroPhysics and Geophysics methods–called Compressed Sensing. Stay Tuned.
References
[1] AstroML: Machine Learning and Data Mining for Astronomy
[2] Python example of the Lomb Periodgram
[4] Spectral Analysis of Nonuniformly Sampled Data: A New Approach Versus the Periodogram
…
[5] Machine-Learned Classication of Variable Stars
[6] Finding Periodicities in Astronomical Light Curves using Information Theoretic Learning
[6b] Period Estimation in Astronomical Time Series Using Slotted Correntropy
[7] Support Vector Robust Algorithms for Non-parametric Spectral Analysis
[8] Learning Graphical Models for Stationary Time Series
[9] Chapter 5, Kernel Methods and Their Application to Structured Data
[10] 2012 Scale invariance in the dynamics of spontaneous behavior
[11] 2009 M. Zechmeister &M. K¨urster, The generalised Lomb-Scargle periodogram: A new formalism for the floating-mean and Keplerian periodograms
Books and Review Articles
Advances in Machine Learning and Data Mining for Astronomy
Data Mining and Machine-Learning in Time-Domain Discovery & Classification








