urjnasw xkfjjkn's hot blog

Wednesday, March 27, 2013

A few things to learn about machine learning



(1)  Learning = Representation + Evaluation + Optimization
Representation: a classifier must be represented in some formal language that the computer can handle. The set of classifiers a learner can possibly produce is called its hypothesis space.
Evaluation: an objective (scoring) function is needed to distinguish good classifiers from bad ones.
Optimization: a method to search among the classifiers in the language for the highest-scoring one. A small sketch of all three components follows below.
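As a concrete illustration (a minimal sketch only; the iris dataset, the decision-tree representation, accuracy as the evaluation function, and grid search as the optimizer are example choices of mine, not anything prescribed by the paper):

# Learning = Representation + Evaluation + Optimization, in miniature.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Representation: the hypothesis space is the set of decision trees.
tree = DecisionTreeClassifier()

# Evaluation: accuracy scores each candidate classifier.
# Optimization: grid search over tree depth looks for the highest-scoring one.
search = GridSearchCV(tree, param_grid={"max_depth": [1, 2, 3, 5, None]},
                      scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)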
(2)  It’s generalization that counts
The fundamental goal of machine learning is to generalize beyond the examples in the training set. It is a huge mistake to train and test on the same data, but holding out data reduces the amount available for training. One solution to this contradiction is cross-validation: randomly divide the training data into (say) 10 subsets, hold out each one while training on the rest, test each learned classifier on the examples it did not see, and average the results to see how well the particular parameter setting does.
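A minimal sketch of that procedure (the synthetic dataset and logistic regression here are illustrative stand-ins):

# 10-fold cross-validation: hold out each subset once, train on the rest,
# test on the held-out examples, and average the results.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("mean accuracy over 10 held-out folds:", np.mean(scores))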
(3)  Data alone is not enough
Every learner must embody some knowledge or assumptions beyond the data it is given in order
to generalize beyond it. This process is induction: it turns a small amount of input knowledge
into a large amount of output knowledge.
A corollary of this is that one of the key criteria for choosing a representation is which kinds of
knowledge are easily expressed in it.
(4)  Overfitting has many faces
Overfitting occurs when we try to fit a classifier using knowledge and data that are insufficient to completely determine the correct one; the learner then ends up encoding random quirks of the training set.
One way to understand overfitting is by decomposing generalization error into bias and variance.
Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency
to learn random things irrespective of the real signal.
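One rough way to see the decomposition empirically (a sketch, not the formal derivation; the sine target and the deep regression tree are made-up examples) is to retrain the same learner on many resampled training sets and watch how its predictions at a fixed point scatter around the true value:

# Empirical look at bias and variance: refit on many noisy training sets
# and measure how predictions at one query point spread around the truth.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

x_query = np.array([[1.5]])          # fixed test point
preds = []
for _ in range(200):
    X = rng.uniform(0, 5, size=(30, 1))
    y = true_f(X.ravel()) + rng.normal(0, 0.3, size=30)   # noisy training set
    model = DecisionTreeRegressor().fit(X, y)              # low bias, high variance
    preds.append(model.predict(x_query)[0])

preds = np.array(preds)
print("squared bias ~", (preds.mean() - true_f(1.5)) ** 2)
print("variance     ~", preds.var())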
(5)  Intuition fails in high dimensions
After overfitting, the next biggest problem is the curse of dimensionality: generalizing correctly becomes exponentially harder as the number of dimensions (features) grows. With 100 boolean features and a huge training set of a trillion examples, you still cover only a fraction of about 10^-18 of the possible inputs (the arithmetic is sketched below). That's tiny! And adding more dimensions often adds more noise than signal.
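The arithmetic behind that fraction, assuming 100 boolean features (so 2^100 possible inputs):

# Fraction of the input space covered by 10^12 examples over 100 boolean features.
n_examples = 10 ** 12
n_possible_inputs = 2 ** 100
print(n_examples / n_possible_inputs)   # about 7.9e-19, i.e. on the order of 10^-18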
(6)  Theoretical guarantees are not what they seem
In machine learning, the most common theoretical guarantee is a bound on the number of examples needed to ensure good generalization. One of the major developments of recent decades has been the realization that we can in fact have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees. The caveat is that such bounds are usually extremely loose, so they are best used as a source of insight rather than as a criterion for practical decisions.
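To make the flavor of such a guarantee concrete, here is the standard bound for a finite hypothesis space: a learner that returns a hypothesis consistent with the training set has error below epsilon with probability at least 1 - delta once it sees roughly n >= (1/epsilon)(ln|H| + ln(1/delta)) examples. The numbers plugged in below are purely illustrative:

# PAC-style sample bound for a finite hypothesis class (illustrative numbers).
import math

def pac_bound(h_size, epsilon, delta):
    # Examples needed so a consistent hypothesis has error < epsilon
    # with probability at least 1 - delta.
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

print(pac_bound(h_size=10**6, epsilon=0.01, delta=0.05))   # about 1700 examples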
(7)  Feature engineering is the key
The most important factor in whether a learning project succeeds or fails is the features used. Learning is easy if you have many independent features that each correlate well with the class; it is much harder when the class depends on complicated combinations of the raw features.
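A toy illustration of the point (the XOR-style data and the product feature below are hypothetical, not from the paper): the raw features carry no linear signal, but a single engineered feature, their product, makes the class easy to learn.

# Feature engineering sketch: XOR-style labels are invisible to a linear model
# on the raw features, but adding one derived feature (x1 * x2) exposes them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)                     # XOR-style target

raw_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)          # about 0.5
X_eng = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])  # add the product feature
eng_acc = LogisticRegression(max_iter=1000).fit(X_eng, y).score(X_eng, y)  # near 1.0
print("raw features:", raw_acc, " with engineered feature:", eng_acc)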
(8)  A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it
Having enough examples usually matters more than the cleverness of the algorithm, so it often pays to gather more data before reaching for a more sophisticated learner.
(9)  Learn many models, not just one
The best learner varies from application to application. If instead of selecting the best variation
found, we combine many variations, the results are better—often much better—and at little
extra effort for the user. Bagging is an example of this, in which “we simply generate random
variations on the training set by resampling, learn a classifier on each, and combine the results
by voting. This works because it greatly reduces variance while only slightly increasing bias.”
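A minimal bagging sketch along those lines (the synthetic dataset and decision trees are stand-ins; scikit-learn's BaggingClassifier does the same thing in one call):

# Bagging: resample the training set, learn a tree on each resample,
# and combine the individual predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

votes = []
for _ in range(50):
    idx = rng.integers(0, len(X_train), size=len(X_train))   # bootstrap resample
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

majority = (np.mean(votes, axis=0) > 0.5).astype(int)         # vote across trees
print("ensemble accuracy:", (majority == y_test).mean())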
(10)  Simplicity does not imply accuracy
Occam's razor, as often applied in machine learning, states that given two classifiers with the same training error, the simpler one will have the lower test error. However, the "no free lunch" theorems imply this cannot be true in general. Simplicity should be preferred as a virtue in its own right, not because it implies accuracy.
(11)  Just because a function can be represented does not mean it can be learned
Don't be lured in by claims that every problem can be represented by a certain kind of model. A function may be representable in the hypothesis space yet still not be learnable from your training set.
(12)  Correlation does not imply causation
Correlation may point toward a causal relationship, but it should not be stated as causation without further evidence, ideally experimental.

Bibliography:
Pedro Domingos. "A Few Useful Things to Know About Machine Learning." Communications of the ACM, Vol. 55, No. 10 (October 2012), pp. 78-87.