My first blog on machine learning is to discuss a pet peeve I have about working in the industry, namely why not to apply an RBF kernel to text classification tasks.

I wrote this as a follow up to a Quora Answer on the subject:

*Please refer to*

Smola, Scholkopf, and Muller*, The connection between regularization operators and support vector kernels *http://cbio.ensmp.fr/~jvert/svn/bibli/local/Smola1998connection.pdf

I expand on one point–*why not to use Radial Basis Function (RBF) Kernels for Text Classification.* I encountered this while a consultant a few years ago eBay, where not one but 3 of the teams (local, German, and Indian) were all doing this, with no success They are were treating a multi-class text classification problem using an SVM with an RBF Kernel. What is worse, they were claiming the RBF calculations would take upto 2 weeks to run, whereas I could run a linear SVM in a few minutes on my desktop. At the time, it seemed to me that anyone who simply read the SVM papers could see this was foolish–and still…

So lets look at the SVM optimization problem and see why RBF Kernels do not apply to text problems.

Suppose we have *m* documents / training instances (*x < X*) and labels (*y* < Y). We seek a to learn a function f that maps X to Y

An SVM searches for a hyperplane that separates the data. If we write the hyperplane as f(x), then we seek optimal function f such that

This is obtained by minimizing the norm of f subject to the constraint that the positively labeled stay on one side of the hyperplane, and the negatively labelled instances on the other. This leads to the following constrained optimization problem

Most data is not perfectly separable, so we can add some slack to the problem by also minimizing the Hinge Loss function

leading the to complete, soft-margin, SVM optimization problem (constraints not shown):

More generally in machine learning, we want to learn a function f that maps instances (X) into labels (Y) that minimizes the expected value of our loss function

where L is defined as a Hinge Loss for Classification problems, and a Squared Loss for continuous Regression problems

We define a general, convex functional to minimize the linear Regularized Risk functional:

which may or may not be subject to (linear or quadratic) constraints, and where the norm of f is defined as the dot product for the Hilbert Space F, f < F (in Dirac notation)

Typically in Machine Learning, we define some abstract kernel function k_x such that

(Note that k_x plays the role similar to the Dirac delta function, for cases where the set of functions f are not orthonormal). This leads to a general solution to the convex optimization problem

Really this just says our solution can be expanded in a basis k_x. However, we don’t know anything about k_x yet, and that’s the problem we are trying to shed light on.

At this point, to make things more confusing, one typically uses the *Kernel trick* to introduce a Kernel (K) over a space of L2 functions such that the norm of f may be expressed in a more familiar Hilbert space:

This abstract form leads people to believe that one can just choose any vanilla Kernel and apply to any problem without further thought. To get beyond this obfuscated interpretation, we need to understand how K acts on f. And to do this, as seen above, we need the inverse of K — also known as its Green’s function g:

and explicitly write the norm of f as:

We now introduce the Regularization operator P over the space of L2 functions such that P*P is the Green’s function for K:

With P, we can now write the new, Kernelized optimization problem over an explicit class of L2 functions

We can now see that to find an explicit form of the RBF Kernel, we want to find the corresponding Regularization operator that acts on our original data points.

First, we note that when k(x,x’) is frequently function a single variable (translation invariant), then

So let us define k(x) as a simple Gaussian function (the RBF basis)

If we take the Fourier transform of k(x), we have k(w) in frequency space as

We can now find the RBF Regularization operator as the Weierstrass transform of the norm of f (also known as the Gaussian Blur function , a low band pass filter) , expressed in frequency space (note w >= 0)

where the operators O is a combination of Laplacian and Differential operators

So there we have it…the RBF Kernel is nothing more than (something like) a low-band pass filter, well known in Signal Processing as a tool to smooth images. The RBF Kernel acts as a prior that selects out smooth solutions. So the question is…does this apply to text or not…

Well of course not! What about text has to do with smooth solutions. Absolutely nothing. And, of course, this is borne out in practice, and almost everyone sees that Linear Kernels do as well or even outperform RBF Kernels for text classification.

So there is no part 2 and no experimental results ? Really ?

https://charlesmartin14.wordpress.com/2012/04/09/kernels-part-2-affine-quantum-gravity

https://charlesmartin14.wordpress.com/2012/09/06/kernels-part-2-affine-quantum-gravity-continued

https://charlesmartin14.wordpress.com/2012/09/06/kernels-part-2-affine-quantum-gravity-continued/

Well we can run some experiments and look; I have already done this nearly 10 years ago

its not that hard to do…jut run libsvm on some standard data sets and set the kernel to RBf, then compare to liblinear. It;s a few hours work. I guess I could to do and post it if there is an interest

How does one decide on which kernel to choose for an SVM (RBF vs linear vs poly kernel)?…How does one pick a Greens function to solve a differential equation? It’s the same problem I guess you can guess..that is OK. But if you actually know something about your problem, try to take advantage of it. Pick a Kernel that represents your prior…

Why does linear kernelized SVM perform much better on text data than on image data?…Image data is inherently dense and continuous. Perhaps that are 2 dimensions before applying the kernel. The desired solutions should be smooth when projected into a high dimensional space. RBF-Kernels are designed to select out smooth solutions (they …

[…] https://charlesmartin14.wordpress.com/2012/02/06/kernels_part_1 […]

[…] some earlier posts, we introduced the Regularization (or Projection) Operator as a way to get at What a Kernel is (really). Recall that we to successfully apply a Kernel, the associated, infinite order expansion should […]

Is there a follow up to this post ? I’m interested to understand why RBF is not suitable for text classifications.

the idea of the post is to suggest that an RBF kernel is not much different from a traditonal low pass filter (technically not a band pass filter, as it says)

so i would suggest taking a collection of documents a computing the fourier spectrum associated with a low pass filter to show there is not specific pattern in the frequency domain to pass through

[…] 8 Arun Iyer, I work in a small corner of the big room of ML.I work in a small corner of the big room of ML. Votes by Anonymous, John McGonagle, Anonymous, Volkan Cirik, and 3 moreLoading….Consider the polynomial kernel of degree 2 defined by, where and .Thereby, the kernel function can be written as, . Now, let us try to come up with a feature map such that the kernel function can be written as .Consider the following feature map, . Basically, this feature map is mapping the points in to points in . Also, notice that, which is essentially our kernel function.This means that our kernel function is actually computing the inner/dot product of points in . That is, it is implicitly mapping our points from to .Exercise Question : If your points are in , a polynomial kernel of degree 2 will map implicitly map it to some vector space F. What is the dimension of this vector space F? Hint: Everything I did above is a clue.Now, coming to RBF.Let us consider the RBF kernel again for points in . Then, the kernel can be written as (assuming gamma = 1). Using the taylor series you can write this as, Now, if we were to come up with a feature map just like we did for the polynomial kernel, you would realize that the feature map would map every point in our to an infinite vector. Thus, RBF implicitly maps every point to an infinite dimensional space.Exercise Question : Get the first few vector elements of the feature map for RBF for the above case?Embed QuoteComment Loading… • 19 Apr 1 Charles H Martin, Consultant, I predict thingsConsultant, I predict things See my blog posthttp://charlesmartin14.wordpress… […]

Reblogged this on @ infinity.

Reblogged this on A Positronic Brain and commented:

An intriguing article. To look at an RBF kernel as a low pass filter is something novel. It also basically shows why RBF kernels work brilliantly on high dimensional images. Given that your image features generally lie in a continuous domain, an RBF kernel generally can fit smooth solutions and thereby create more relevant separating hyperplanes,especially in case of multiple classes.

I got lost in the math when you introduced the “regularization operator”, but your conclusion is still helpful in understanding when to use the RBF.