Ask me anything … about machine learning

Today I am running an experiment and opening my blog to machine learning questions.



Please ask questions in the comments section.

I will try to answer them on the blog over the weekend or within the next week or so.

Thanks in advance.

42 thoughts on “Ask me anything … about machine learning”

  1. Hi Charles! You’ve pointed out before that for many practical semi-supervised learning tasks, it is expensive to estimate the label fractions in the unlabeled data. For some of the exciting recent results, both the labeled and unlabeled datasets are balanced. Have you seen any attempts to characterize the performance impact of deviations from this ideal?


    • This is an excellent question, thank you.

      Vapnik has been a big advocate of Transductive learning, and has been trying to do this for years. The motivation is based, in part, on the proof of the VC bounds. In the proof, it is necessary to create a statistical replica of the training data. So the idea has been around a long time that learning algorithms should be able to incorporate unlabelled data, and there has been a running effort to try to do this.
      For example, there is an old, famous paper by Blum and Mitchell
      Combining Labeled and Unlabeled Data with Co-Training

      The thing is, this paper is half implementation and half ‘theory’, since they are trying to formulate a PAC bounds approach to SemiSupervised Learning.

      You have to be careful when reading any scientific paper or new method to dig in and see what kind of assumptions are really being made and what exactly is the problem it has solved. Being theoretical, it seems to be ok to simply say “we assume we have a statistical replica of the data.” In practice, this might be the problem you actually need to solve–i.e. how to find that replica!

      Moving on to the Transductive SVMs (TSVM), these have been around as long as SVMs themselves. Indeed, I think perhaps the popularity of the SVM arose, in part, because it was thought that they could do both Supervised and SemiSupervised learning. For example, the first versions of SvmLight contained a TSVM which was sold as a tool for text classification. But as an input parameter, it was always assumed that the unlabelled data was a ‘statistical replica’ of the training data–just like in the theory.

      What does this mean in practice? Well, it means that you can simply measure the fraction R of positive-to-negative examples, and use this as an input. That is, it was assumed that you could estimate the ‘mean label’ of your unlabelled data by labelling a sample of the data. Of course, the whole point of the TSVM is to train a system with very little labelled data, and the method appears to be highly sensitive to the input parameter R.
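
      To get a feel for why a small labelled sample is a problem here, the following is a minimal sketch, assuming we estimate the positive fraction p (a stand-in for R) from n labelled examples and use the usual binomial standard error. The function name and the value p = 0.3 are illustrative choices, not anything taken from SvmLight.

```python
import math

def fraction_standard_error(p, n):
    """Standard error of a positive-label fraction p estimated from n labelled examples."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical setting: the true positive fraction is 0.3.
for n in (10, 100, 1000):
    se = fraction_standard_error(0.3, n)
    print(f"n={n:5d}  estimated fraction 0.30 +/- {se:.3f}")
```

      The uncertainty shrinks only like 1/sqrt(n), so with a handful of labels the input to the method is itself very noisy.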

      So YES–this has been analyzed to some extent and understood in the literature–but it is usually discussed in a different way that can seem impenetrable to the practitioner. For example, there is the so-called Safe-SemiSupervised SVM, or S3SVM, method
      Towards Making Unlabeled Data Never Hurt
      That is, there is an attempt to constrain the S3SVM/TSVM so that it always yields good clusters. This can be done, for example, by adding a regularization term based on some kind of unsupervised clustering metric. So, for example, say you have a metric, like the Silhouette Coefficient. You solve the S3SVM/TSVM problem as before, but you also ensure that the metric stays within an acceptable range.

      Another, related idea, proposed by Vapnik, is the Universum, which I have discussed on my blog. Here, one introduces a third class (Unknown, Other, Universvm, etc). So the TSVM problem begins to look like a 3-class SVM. Unlabelled data can then go into 1 of 3 classes. But, again, you need some constraint to make sure the unlabelled data is going into the right class, and, as usual, this can be cast into some kind of geometric argument which must hold–if you choose the correct feature space.

      Of course, the other problem is that all of these methods require a feature space and a choice of regularization. So these methods can only work if there is some way to select the regularization parameters (including the R parameter), and the standard way to do this is to apply Leave One Out Validation on the labelled data. So if you only have a tiny amount of labelled data (i.e. say 2 labels), this obviously is not going to work.

      Generally speaking, however, I think this is typical of some academic work. The professors know the issues, and they have some ideas what to do. They let the students try out the ideas, but the students don’t really understand the broad context of what they are working on. So the student tries out the idea of the professor, but does not pose the really hard problems or place the work in a context that is fully useful to practitioners, or at least practitioners who are not working deeply on the problem.

      • Your comment above about professors understanding a given problem and its context, but not their students is very tantalizing. What are those wider problems + contexts, roughly speaking?


        • What is the hardest problem today? How do people learn when they have so few labeled examples? In other words, how can we learn by combining feedback on what we have learned already with the vast amount of unstructured, unlabeled information in the world? It is such a hard problem that it is not even clear how to formulate it.

          One formulation of this is Transductive and/or SemiSupervised Learning. Here, as Matt noted, the critical realization is that knowing a little bit of information about a distribution may, in fact, be the same as knowing almost everything. This is, of course, how Gaussian distributions work–if we know the mean and variance, then we know everything. But the reverse is somewhat more puzzling–can we replace knowing everything about a general distribution by just measuring a few gross statistics like the mean and variance? The problem is, in order to actually measure the mean, we need a lot of data to get a good estimate. So we don’t ever really know these statistics! So how sensitive is the solution to the accuracy of our inputs? It is very sensitive in many cases!

          In SemiSupervised learning it is critical to understand what the inputs are because the basic TSVM problem itself can be made convex. That is, there exists an exact, unique solution to the problem, at least up to some reasonable level of accuracy (we hope). That is, there appears to be a reasonably useful convex relaxation to the problem.

          This seems like it can’t be right in general because people make mistakes and need to learn from these mistakes? So what is going on? What is the actual input to the method that makes the solution unique?

          In fact, the SemiSupervised learning methods always rely on some other kind of information available to us. So what might this information be?

          It could be classical prior information, as in a Bayesian formulation.

          It could be geometric information. It could be a heuristic metric, like an unsupervised clustering metric.

          It could be what Vapnik calls Privileged Information: knowledge a teacher provides to a student. This could be a description of the information. It could be how hard or easy a specific example is to classify, detect, recognize, etc. It could be the importance of one example over another.

          It could be what Hinton calls Dark Knowledge. This is the relative probabilities that a classifier learns about when it is wrong. In other words, how good are our ‘second guesses’, relative to each other? It is like, say, getting partial credit when we are wrong on a test.

          I would say today this is one of the harder open problems that people are working hard on.
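
          As a rough illustration of the Dark Knowledge idea, here is a minimal sketch, assuming a classifier that outputs raw scores (logits) for three classes. Dividing the logits by a temperature before the softmax is the softening used in distillation; the logit values and the temperature of 5 are made up for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; a higher temperature gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [9.0, 4.0, 1.0]  # hypothetical classifier scores for 3 classes
hard = softmax(logits, temperature=1.0)  # nearly all mass on the top class
soft = softmax(logits, temperature=5.0)  # the 'second guesses' become visible

print("T=1:", np.round(hard, 4))
print("T=5:", np.round(soft, 4))
```

          The top class is the same in both cases; what changes is how much of the relative ranking of the wrong answers survives.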


    • A good place to learn this is Smola’s new book:

      I skimmed the contents and I think it has exactly what you are looking for.

      Warning: this is a graduate level text book and assumes some solid knowledge of mathematics.
      Let me know what you think and if this is the level you are looking for.


  2. Hello Charles

    I think one topic that is neither simplified nor covered in depth is the explanation of the Hyper-Parameters and their Tuning (the optimal values of these parameters for getting the best results) for some of the key Machine Learning Algorithms, which can greatly impact the performance of those algorithms.
    Would you throw some light on the explanation of Hyper-parameters and their tuning for the below algorithms-
    Simple Decision Tree
    Random Forest
    Adaptive Boosting
    Naive Bayes Classifier

    Best Regards


    • It is necessary to tune the hyperparameters to obtain a good generalization accuracy.

      The standard way to do this is to start with 2 labeled data sets: train and test. You train a model on the train set, with default parameters, and measure how well the model predicts the labels in the test set. You can then modify the parameters of the model, retrain it, and measure its test performance. The best model parameters are those that give the best test performance.


      This is the general idea.

      Here are some things to remember when doing this:

      1.) Typically you need to rescale the data. Usually you scale the training data to have zero mean and unit variance. This can be done, for example, in scikit-learn using the StandardScaler.

      You then need to scale the test data, BUT using the parameters you learned from the training data.


      If your data is unusual, you may need something like the RobustScaler.

      2.) For a classifier, you may want to just measure the test performance using the cross validation accuracy. But this really only applies to a perfectly balanced data set. That is, every class should have the same number of instances. If this is not the case, then you may need a different performance metric.

      Alternative metrics include the AUC, the log loss, etc.

      For a Regression problem, this can be harder. Usually Regression is evaluated using R^2, but R^2 may not be a good metric at all for your data.

      3.) If the test data is very small, you cannot run standard n-fold cross validation. Instead, you need to run leave one out validation.

      4.) Sometimes your data is too large to do a brute force grid search of all the parameters. In these cases, you need to search the parameter space stochastically.

      5.) When training models and searching parameters, it is critical that no information leaks between the training data and the test data. The scikit-learn pipelining feature can help here.

      6.) On the specific methods, every method has some tuneable parameters. Each method is specific, and you have to study a bit what experience others have had and how the method behaves on your data set.

      Many methods try to offer a reasonable guess for the parameters, and there is some theory for what they should be.

      For example, with SVMs, it is the cost parameter and the kernel parameters. Still, you have to pay attention to the math. For example, for the RBF kernel, the gamma parameter should not be zero.

      You also have to be sure that the parameters are somewhat stable with respect to how you select your test/train splits. We should not expect the parameters to change greatly if you choose different splits.

      In fact, parameters should look the same across similar data sets. For example, in text classification, the cost parameter for a linear SVM is usually 0.01, using bag of words features.

      7.) Regression problems are harder than classification problems. For example, see my earlier post on Tikhonov Regression and “When Regularization Fails”.

      8.) I don’t have any specific advice for, say, Random Forests or Adaptive Boosting. I would suggest just trying out some variations on your data.
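
      Points 1, 2, 4 and 5 above can be sketched together in scikit-learn. This is only one way to set it up, under some assumptions of mine: a synthetic, imbalanced data set from `make_classification`, an RBF SVM as the model, AUC as the metric, and a random search whose ranges for `C` and `gamma` are arbitrary. Note one deliberate variation: the parameters are chosen by cross validation inside the training set, and the test set is only used once at the end.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data (roughly 80% / 20%), purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler lives inside the pipeline, so its mean and variance are learned
# from the training folds only and then applied to held-out data.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Stochastic search over the parameter space instead of a brute force grid.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svm__C": loguniform(1e-2, 1e2),
        "svm__gamma": loguniform(1e-4, 1e0),  # strictly positive, never zero
    },
    n_iter=20,
    scoring="roc_auc",  # AUC rather than accuracy, since the classes are imbalanced
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, search.decision_function(X_test))
print(search.best_params_, round(test_auc, 3))
```

      Rerunning this with a different `random_state` for the split is a cheap way to check whether the selected parameters are stable.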


    • It is a very competitive field that is destined to grow. For example, recently, PWC announced that they are going to hire 1000 data scientists to help with the M&A business. This in itself tells me the field is expanding. Many companies would like to take advantage of machine learning, and they need senior leaders who can manage and interpret this new world of data and algorithms.

      Another good example is Ayasdi, the Stanford startup that does machine learning using topological data analysis. They must have 300 employees now, and some very large clients. The thing is, data science consulting is very high touch, and it is difficult to manage that many people and maintain a massive number of complex projects and an internal brain trust. So I believe that there is a lot of room for specialization.

      Also, there are a lot of vendors, like Cloudera, Hortonworks, Databricks, etc., producing data science tools, and companies will need someone to use these tools to generate ROI. I think the sweet spot for this is about 2-3 years away, since it takes large companies some time to get all this installed and to organize their data. And while the vendors do offer outsourced consulting, it is difficult for me to believe that a large software engineering firm can also support a large, in-house machine learning consulting / professional services business. It is just not the core of what they do. It is very high touch and does bring in recurring revenue. Still, they need something like this to help sell their products.

      Instead, what I imagine is that they will partner with local specialists who can work with the vendor customers. This is similar to Oracle; they produce the software, and there is a massive secondary market for Oracle consultants.

      Still, it is a good question. There is evidence that a lot of companies are beginning to build their own internal data science capacities.

      But as one commenter put it
      “To do Data Science right you need PhD mathematicians. You can do it wrong for a lot less money.”

      It is an exciting time.


  3. What’s your opinion on post-doctoral data science bootcamps (e.g. Insight and Data Incubator)?

    For comparison, how hard is it to get into the field when you have a PhD in a mathematical science but your only programming experience consists of small personal projects?


    • I don’t think a data incubator would be a good choice for a PhD unless it assumes that you already have a very strong math background and are looking to learn coding specifically. That said, it is challenging to move from PhD programming to a commercial coding environment, if for no other reason than that there are just so many new tools and new technologies. Spark. IPython. R. Plus all the supporting tools like AWS, git, redis, etc. I certainly would not recommend joining a company and then using Mathematica for your work; I have had to deal with this and it is very painful for other team members. The thing is, I think you could be the top winner on Kaggle and still be completely dysfunctional in a startup or small company. So if you pick a data incubator, I would say try to pick one specifically for PhD trained scientists.


  4. What are the major ideas of Machine Learning (ML) in your view? I see a lot of what look to be false dichotomies: bayesians vs frequentists, convexity vs locality (max/min), autoencoders vs RBMs vs ??? etc but never a good summary of the ideas (and their importance) behind them.

    What are the most underrated/surprising ideas in ML? I recall a teacher being supremely impressed by Efron’s work on the Maximum Likelihood Estimator but his enthusiasm got lost in the telling of it. I imagine there’s some awesome modern nuggets hiding out like Hinton’s discovery of partition functions being unsolvable yet he could use MCMC to hum a few bars of it.


    • What are the major ideas of machine learning?

      Machine Learning seems to encompass a very wide number of techniques. Machine learning is a HUGE field. It is really a field for graduate level training, and requires expertise in a broad range of mathematics, plus software engineering and computer systems architecture, as well as training as a scientist to pose relevant questions.

      I will try to lay out some of the basic forms, but it will probably take me all weekend to answer this.

      A. Supervised Learning
      We have some data. We have some labels. Predict.

      This can be thought of as

      A.1 Regression problems:


      In an earlier post I laid out these concepts using the more traditional mathematical physics notation of integral operators.

      The Kernels appear in this form naturally since they are related to the Green’s functions we know and love from differential equations.

      But Machine Learning uses the Kernels differently, and this takes some getting used to. They have their own notation, and frequently they just guess a Kernel. Being able to do this is non-trivial, and I’ll discuss this below.

      The main issue with ML is that most methods are regularized. Why? It is a simple observation, which has been understood for something like 100 years, that if you try to solve Ax=b as

      x = A^{-1}b

      there are numerical instabilities that can arise

      So the classic approach is to apply Ridge Regression, or Tikhonov Regularization.
      And for a very long time, this was of great interest to machine learning researchers
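
      To make the instability concrete, here is a minimal numerical sketch. The assumptions are mine: A is an 8x8 Hilbert matrix (a textbook ill-conditioned matrix), the true solution is a vector of ones, a small amount of noise is added to b, and the regularization strength 1e-6 is picked by hand rather than tuned.

```python
import numpy as np

def ridge_solve(A, b, lam):
    """Tikhonov / Ridge solution: minimizes ||Ax - b||^2 + lam * ||x||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Hilbert matrix A[i, j] = 1 / (i + j + 1): invertible, but badly conditioned.
n = 8
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

x_true = np.ones(n)
rng = np.random.default_rng(0)
b = A @ x_true + 1e-4 * rng.standard_normal(n)  # slightly noisy right-hand side

x_naive = np.linalg.solve(A, b)        # x = A^{-1} b
x_ridge = ridge_solve(A, b, lam=1e-6)  # regularized solution

err_naive = np.linalg.norm(x_naive - x_true)
err_ridge = np.linalg.norm(x_ridge - x_true)
print(f"condition number of A: {np.linalg.cond(A):.2e}")
print(f"error without regularization: {err_naive:.3e}")
print(f"error with regularization:    {err_ridge:.3e}")
```

      The direct solve amplifies the tiny noise in b by roughly the inverse of the smallest singular value of A, while the regularized solve caps that amplification at the price of a small bias.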

      But having 1 method for solving all problems is not so useful

      One problem with Ridge Regression is that even it can be numerically unstable; see

      So an easy way to avoid this is to run Logistic Regression

      A.2 Classification (i.e. Logistic Regression)

      This is just a regression against discrete labels instead of continuous ones. It can take many forms

      Logistic Regression
      Random Forest
      Soft Max / Perceptron

      and so on

      They are all nearly the same thing: See

      However, Logistic Regression can also suffer numerical instabilities in very rare cases (it is suspected to do this when the classes are perfectly separable).

      How are these solved? Why do we use convex optimization for Logistic regression vs MAXENT? They are virtually the same thing.

      Still, all of these methods are numerical methods, and there are slight practical differences that arise when engineering a solver.

      Within just this space itself we see a number of mathematical formulations which rely upon different mathematical principles; to a trained mathematician, these methods are all related but, of course, have various tradeoffs. Why should we choose a Euclidean distance metric over, say, the Cross Entropy? What is the difference between using Regularization vs a Bayesian prior? When is a Random Forest a better choice than an SVM?

      There are hundreds of small questions that can be posed that need to be addressed on an individual level.
      I can’t address all of them but I will try to provide some highlights of the important methods

      A.3 Structural SVMs
      This is just regression onto a structure or pattern like a small tree or a graph, instead of a number
      For example, determining the rank order of a directed list, or a parse tree

      By this I don’t just mean a rank SVM – that is just ordinal regression and is an old idea
      I mean solving the regression problem subject to complex constraints

      Structural SVMs are a classic example, and took advantage of techniques from convex optimization to solve
      The modern form of this is a Recursive Deep Network

      If you read the Socher Thesis, you will see that the mathematical formulation relies on the loss function defined in a structural SVM

      A.4 Advanced Convex Optimization
      There are lots of variations here, and it is possible to reformulate the basic SVM / Regression Problem for special cases that arise in real world problems

      Indeed, there is an entire field of convex optimization that deals with this, and it is very specialized and a bit difficult to learn

      A classic example is SemiSupervised and/or Weakly Supervised Learning
      That is, say we don’t actually have all the labels: either they are missing, they have some errors, etc.
      Here, we can compensate for this by adding in what we do know. So, suppose we know the correct fraction of labels (+/-). Or we have some way to estimate it. Then we can run some form of Transductive Learning:

      A.5 Learning Using Privileged Information: LUPI

      Recently it has been recognized that real world learning involves a teacher and a student.
      For example, suppose we are trying to detect cancer in images of biopsies. We have the images.
      But we also have a radiologist who can describe the images in words.

      We say the descriptions are ‘privileged information’ because they are only available at training time. The whole point of the classifier is to augment and/or even replace the radiologist. And the descriptions are ‘privileged’ because they use a completely different set of features

      This has led to the development of a new principle, called LUPI

      In this form, LUPI appears to be a very complex convex optimization problem.

      However, the basic principle is the idea of transferring knowledge from a teacher to a student, and, as with all mathematical methods, there are different formulations

      For example, LUPI can be recast as a weighted SVM (where we assign a weight to each example itself)

      and the problem of transferring the information can be reformulated as asking “how to weight the training examples”

      Another formulation lets us replace the convex optimization problem, which is complicated to implement, using a simple linear combination of 2 cross-entropy functions

      So we see that the basic principle is not the specific form of the mathematical solver–although this is important–but the general set up of the problem itself
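
      The ‘linear combination of 2 cross-entropy functions’ can be written down in a few lines. What follows is a sketch under my own assumptions, not a reference implementation: the teacher’s logits stand in for what was learned from the privileged features, `imitation` is the mixing weight, and `temperature` softens the teacher’s outputs. All names and numbers are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature for softening."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross entropy between a target distribution and a predicted one."""
    return float(-np.sum(target_probs * np.log(predicted_probs + eps)))

def distillation_loss(student_logits, teacher_logits, true_label,
                      imitation=0.5, temperature=2.0):
    """(1 - imitation) * CE(true label, student) + imitation * CE(soft teacher, student)."""
    student = softmax(student_logits)
    hard_target = np.zeros_like(student)
    hard_target[true_label] = 1.0
    soft_target = softmax(teacher_logits, temperature)
    return ((1.0 - imitation) * cross_entropy(hard_target, student)
            + imitation * cross_entropy(soft_target, student))

teacher = [4.0, 1.0, 0.0]  # hypothetical teacher scores for one example
good = distillation_loss([5.0, 0.0, 0.0], teacher, true_label=0)
bad = distillation_loss([0.0, 5.0, 0.0], teacher, true_label=0)
print(f"student agrees with label and teacher: {good:.3f}")
print(f"student disagrees:                     {bad:.3f}")
```

      With `imitation=0` this reduces to ordinary supervised training on the labels; with `imitation=1` the student only imitates the teacher.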

      I hope this is a helpful way to answer the question. If so, please let me know, and I’ll take a stab at the Unsupervised case next. Otherwise, please follow up with detailed questions below.


      • Hmmm… a bottom-up answer. I was thinking more top-down answer with less mechanical math and more your own special human insight. Like why do you think you can get machines to answer questions from big piles of messy data? Why does Ax=b (learnedness*example=solution) make sense? Why is it insane? Same for deepnets. Why are line fitters, classifiers, recognizers cool? And why do they utterly miss the point? What is the point (of ML)?

        I can’t say much on the surprises part since this is a personal question and it entirely depends on where your thinking is at. Which is just as interesting as the surprises in particular.


        • Ah I see. I think this is like ‘when does regression work and why’

          Let me provide an example and start a conversation on this. When I first started doing text classification and NLP, it was really surprising to people that one can actually do NLP using just Ax=b. That is, just using techniques from computational statistics to model text. It is just assumed that we need to have verbs and nouns and word order, that we need to compute ngrams for phrases, etc.

          In school we learn about grammar. We learn that “dog bites man” is “subject verb object” and that “man bites dog” is really unlikely.
          So why does Ax=b work so well?

          This had been known for a very long time; the original work in the field was done at U Chicago / Bell Labs and is called Latent Semantic Analysis – or LSA. But it is not just about showing ads on web pages; the original scientific work also addressed a scientific question, “how do children acquire so many words so fast?” That is, if all reading is just analyzing sentences using grammar and logic, then how can a child acquire the knowledge of a word so quickly without seeing lots and lots of sentences with that word? Of course, anyone with a child observes that children learn words by association: Dog bites man. Cat scratches man. Cat bites mouse. Cat and Dog are similar. And this association is primarily statistical.

          There is a very long history of using mathematical techniques to analyze text in the social sciences, as well as for developing tools such as the Flesch-Kincaid Readability Metrics

          the General Inquirer for Sentiment Analysis

          the Five Factor Model of Personality Types,

          and so on. So applying machine learning to text is the natural extension of old, proven work.

 It appeared I did not have WordPress comments enabled earlier. We can go 10 levels deep now.


  5. Hi Charles,
    1. Is it true that machine learning is just a different name to application of “statistics and probability” through computers ?
    2. So much of genomic data is available now. Through machine learning is there a way to get meanings and answers for lot of medical problems with that?
    3. I agree too many sub-fields exist in machine learning and the research problems should be focused on specific area of interest. How do I identify latest research problems in machine learning other than reading all research papers or literature(which is too much time consuming)?


      1. I think it depends on what you call statistics. 20 years ago you would not find anyone in a statistics department talking about Hamiltonians and spin glasses or phase transitions. You really had to study theoretical chemistry or condensed matter field theory. Today the mathematicians and statisticians have discovered what the theoretical chemists were doing. And as my advisor used to say, “if you can’t do anything new, invent your own notation”

      Indeed, Hinton, the founder of Deep Learning, was a graduate student of Longuet-Higgins, a former and quite famous theoretical chemist.

      This is not to say that they are not doing new and great things–but I think it is accurate to say that the most modern and powerful methods like Deep Learning did not use the traditional ideas, notation, and language of statistics. It was based in the language of chemical physics.


  6. As a marketing professional I am curious to know your thoughts on how machine learning has evolved to be a marketing solution. Thanks!


    • That’s an excellent question. It amazes me how far machine learning has come as a marketing tool. Of course, machine learning is a key technology for online display advertising, e-commerce recommender systems, and even direct marketing. And there is so much more! Here is a great slide that highlights all of the new companies using machine learning for marketing


  7. So much of genomic data is available now. Through machine learning is there a way to get meanings and answers for lot of medical problems with that?


    • Genomics is a very specialized field that I think requires a lot of domain knowledge and fairly complicated hierarchical models to make sense of it.


  8. I agree too many sub-fields exist in machine learning and the research problems should be focused on specific area of interest. How do I identify latest research problems in machine learning other than reading all research papers or literature(which is too much time consuming)?

    • If I were looking for ideas I would go to a conference and talk to people. It usually takes a year of reading and hard full time work to get up to speed on a new subject.


  9. This question is more about the feature engineering aspect of ML. How do you approach the problem of crafting intelligent features from available data? How would you go about doing that, keeping in mind that they should be robust and quick to develop for efficient real time prediction? Any pointer on this specific area would be greatly appreciated.


  10. I don’t use R that much. Basically you need to generate the feature space yourself. This can be done by computing, say, all cartesian products of all features, up to a threshold, and then running an L1 SVM. This is how, for example, libshorttext works to generate all possible ngrams for classifying short text.
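
    A minimal sketch of that recipe in scikit-learn, under my own assumptions: synthetic data from `make_classification`, pairwise products of the original features standing in for ‘cartesian products up to a threshold’, and `C=0.1` chosen arbitrarily. This is not how libshorttext is implemented; it only illustrates the pattern of generating many candidate features and letting the L1 penalty prune them.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data with 6 raw features, purely for illustration.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

model = make_pipeline(
    # The 6 raw features plus all 15 pairwise products: 21 candidate features.
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    # The L1 penalty drives the weights of unhelpful features to exactly zero.
    LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000),
)
model.fit(X, y)

coef = model.named_steps["linearsvc"].coef_.ravel()
n_kept = int((coef != 0).sum())
print(f"{n_kept} of {coef.size} candidate features kept")
```

    Lowering `C` strengthens the penalty and keeps fewer features, so it acts as the threshold to tune.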


  11. So good , So professional!!!
    Thanks a lot!!
    By the way, which videos would you recommend for studying machine learning?
    I would like to understand the ideas of machine learning: SVM, deep neural networks, ensemble learning (like random trees, GBM, xgboost, etc.), how to combine the results of many machine learning algorithms, and statistical inference for big data (not big data engineering but big data mathematics).
    Thank you very much in advance

