Explanation of overfitting. MDL theory gives an elegant explanation of why overly rich representational schemes tend to overfit: when the encoding of the classifier itself is as long as, or almost as long as, the original data, nothing is gained in terms of description length. For example, you can represent K numbers exactly as the values of a degree-(K-1) polynomial, but no description length is saved, since you now need K values for the polynomial's coefficients. You can fit a decision tree exactly to the data by giving each datum its own leaf, but again there is no gain. You can cluster N points tightly into N clusters, one per point, but again no gain.
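The polynomial case can be sketched numerically. In this toy example (the variable names and the choice of K = 5 are illustrative, not from the text), a degree-(K-1) polynomial interpolates K arbitrary values exactly, yet its description still requires K coefficients, so nothing is compressed:

```python
import numpy as np

# MDL view of overfitting: K data points can always be fit exactly by a
# degree-(K-1) polynomial, but the model then needs K coefficients --
# its description is as long as the raw data it was meant to compress.
rng = np.random.default_rng(0)
K = 5
x = np.arange(K, dtype=float)
y = rng.normal(size=K)                 # K arbitrary values to "explain"

coeffs = np.polyfit(x, y, deg=K - 1)   # degree K-1 => exact interpolation
y_hat = np.polyval(coeffs, x)

print(len(coeffs))                     # K coefficients: no compression
print(np.allclose(y_hat, y))          # the fit reproduces the data exactly
```

The same accounting applies to the one-leaf-per-datum decision tree and the one-cluster-per-point clustering: the model description grows in lockstep with the data, so the total code length never shrinks.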