Measures of Complexity in the Theory of Machine Learning
Even long ago it was understood that the greater the complexity of a model, the larger should be the size of the learning set. This refers to the problem of function reconstruction based on empirical data, learning pattern recognition or, in general, model construction using experimental measurements. Probably the first theoretical result here was the Nyquist criterion (in Russia, the Kotelnikov theorem). It stated that, if one wants to reconstruct a continuous function on the basis of a set of measurements at discrete points, then the number of measurements should be proportional to the width of the function spectrum. It means that the spectrum width can serve as one of possible metrics of complexity.
In general, for the given amount of learning data, one has to limit oneself on a certain level of the model complexity depending on the data volume. But for practical implementation of this idea it is necessary to define the general notion of complexity and the way to measure it numerically.
In my works with V. Vapnik, we reduced the problem of a learning system’s ability to generalize data to the problem of the uniform convergence of frequencies to probabilities over a class of events (or means to expectations over a class of functions). If such convergence holds, then the system is able to be learned. But not on the contrary. It is possible that uniform convergence does not hold, but the system still has the ability to learn.
Conditions of the uniform convergence are formulated in terms of an index of evens class over a given sample, growth function and the so-called VC-dimension or entropy. The VC-dimension allows us to get estimates of uniform closeness of frequencies to probabilities, which does not depend on probability distribution over input space. Asymptotic entropy per symbol gives necessary and sufficient conditions of the uniform convergence, but they do depend on the probability distribution. In most important cases the VC-dimension is equal or close to the number of unknown model parameters. Very important results in this field were gained by Talagrand, Rademacher and others.
And still there are cases when a decision rule with a large number of parameters is sought, but only a few examples is sufficient to find. Let us consider the example of two classes in n-dimensional Euclidian space. Each of the classes is formed by a ball having diameter D, and the distance between the centers of the balls is equal to R. If the ratio between the distance R and the diameter D is large enough then it is sufficient to show only two examples to reach 100% of correct answers. And it does not depend on the dimension of the space. A similar situation appears for recognition of two classes under supposition of feature independence (and some other conditions). Busting algorithms constructs very large formulas, and in spite of this, they reach good results even for a limited amount of learning data. All these facts force us to search for new measures of complexity, which are not directly connected to the notion of uniform convergence. It seems that they should depend on the probability distribution. But that is the nature of things.