Without explicitly using the evidence procedure, the Occam factor argument can be summarized as follows (see [MacKay 1991, Berger et al. 1992, Loredo 1990, Jeffreys 1939, Gull 1989a, Garrett 1991]). Consider a parameter space $C$. Define a "model", or a "theorist", as a mapping from any $c \in C$ to a target function from $X$ to $Y$. (This is essentially the same as what is called a "method" in [Wolpert 1990] or an "interpreter" in [Pearl 1978].) As an example, if $X$ is the real numbers, $\mathbb{R}$, as is $Y$, and if $C$ is the set of possible quintuples of real numbers, the 4th order polynomial series
using those five parameters is a model: the model is the mapping $\{p_0, p_1, p_2, p_3, p_4\} \to \sum_{i=0}^{4} p_i x^i$. (Note that this example could be easily modified so that $X$ and/or $Y$ is not infinite.) Another example of a model, which uses the same $C$ but in a nonlinear manner, is the following 5th order series of Legendre polynomials: $\{p_0, p_1, p_2, p_3, p_4\} \to \sum_{i=0}^{4} L_i(p_i x)$. Note that the
image space of $C$ (i.e., the set of functions from $X$ to $Y$ which are expressible with some $c \in C$) differs for the two models. Together, a particular model and a particular set of parameter values define a particular target function. Accordingly, I will often write $(m, c)$ as shorthand for the function given by parameter $c$ and model $m$.
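The idea of a model as a mapping from parameters to target functions can be sketched in code. The following is only an illustration: the function names `polynomial_model` and `legendre_model` are hypothetical, and NumPy's `Legendre.basis` is used here to evaluate the Legendre polynomial $L_i$.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre


def polynomial_model(c):
    """Hypothetical model: map c = (p0, ..., p4) to x -> sum_i p_i * x**i."""
    c = np.asarray(c, dtype=float)

    def target(x):
        x = np.asarray(x, dtype=float)
        return sum(p * x**i for i, p in enumerate(c))

    return target


def legendre_model(c):
    """Hypothetical model: map the same c to x -> sum_i L_i(p_i * x).

    The parameters enter inside the Legendre polynomials, so the
    mapping is nonlinear in c.
    """
    c = np.asarray(c, dtype=float)

    def target(x):
        x = np.asarray(x, dtype=float)
        return sum(Legendre.basis(i)(p * x) for i, p in enumerate(c))

    return target


# The same parameter value c gives different target functions under the two models.
c = (1.0, 2.0, 0.0, 0.0, 0.0)
f = polynomial_model(c)   # the pair (m, c) for the polynomial model
g = legendre_model(c)     # the pair (m', c) for the Legendre model
print(f(0.5), g(0.5))
```

Each call such as `polynomial_model(c)` returns the function written $(m, c)$ in the text; the two models share the parameter space but reach different sets of functions.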
Now consider two models, $m_1$ and $m_2$, with associated parameter spaces $C_1$ and $C_2$. For simplicity, assume that both $C_1$ and $C_2$ are subsets of the same Euclidean vector space and have the same dimension. Assume further that $C_1 \subset C_2$. (For example, $C_1$ might be the interior of one hypercube in $\mathbb{R}^n$, and $C_2$ the interior of a larger hypercube, properly surrounding $C_1$.) Let $c_1$ refer to
elements of $C_1$, and similarly for $c_2$. Our event space consists of triples (data, model, parameter value from the parameter space associated with that model). So for example $P(\text{data} = L, \text{model} = m_1, C_2 \text{ parameter value} = c_2)$ is undefined.
Now in general, the posterior for a model, $P(m_i \mid L)$, equals $P(L \mid m_i) \times P(m_i) / P(L)$. In turn, $P(L \mid m_i) = \int dc_i \, P(L \mid m_i, c_i) \times P(c_i \mid m_i)$. Examine two particular models, $m_1$ and $m_2$. Since we have no way of choosing between the two models, by the "principle of indifference" [Loredo 1990], we
might wish to take $P(m_1) = P(m_2)$. Using this gives
$$\frac{P(m_1 \mid L)}{P(m_2 \mid L)} = \frac{P(L \mid m_1)}{P(L \mid m_2)} = \frac{\int dc_1 \, P(L \mid m_1, c_1) \times P(c_1 \mid m_1)}{\int dc_2 \, P(L \mid m_2, c_2) \times P(c_2 \mid m_2)}.$$
This is the so-called "Bayes factor" for model $m_1$ over model $m_2$. Dividing it by the ratio of maximum likelihood values, $\{\max_{c_1}[P(L \mid m_1, c_1)]\} / \{\max_{c_2}[P(L \mid m_2, c_2)]\}$, we get the so-called "Occam factor" [Loredo 1990].
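Written out from the definitions above, the quantity just described would presumably read as follows. The symbol $O_{12}$ is introduced here only for illustration and does not appear in the original.

```latex
O_{12} \;=\;
\frac{\displaystyle \int dc_1 \, P(L \mid m_1, c_1)\, P(c_1 \mid m_1)
      \Big/ \int dc_2 \, P(L \mid m_2, c_2)\, P(c_2 \mid m_2)}
     {\displaystyle \max_{c_1} P(L \mid m_1, c_1)
      \Big/ \max_{c_2} P(L \mid m_2, c_2)}
```

The numerator is the Bayes factor for $m_1$ over $m_2$ under equal model priors, and the denominator is the ratio of the best fits achievable by each model.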
To see why this might have something to do with Occam's razor, for simplicity assume that the ratio $\{\int dc_1 \, P(L \mid m_1, c_1)\} / \{\int dc_2 \, P(L \mid m_2, c_2)\}$ can be well approximated by the ratio $\{\max_{c_1}[P(L \mid m_1, c_1)]\} / \{\max_{c_2}[P(L \mid m_2, c_2)]\}$. (This might be reasonable, for example, if $P(L \mid m_i, c_i)$ is peaked as a function of $c_i$, for both $i = 1$ and $i = 2$.) Also assume the "uninformative"⁵ form for $P(c_i \mid m_i)$, namely a uniform density: $P(c_i \mid m_i) = 1 / [\int_{C_i} dc_i \, 1] \equiv [V(C_i)]^{-1}$. These conditions give
$$\frac{P(m_1 \mid L)}{P(m_2 \mid L)} = \frac{V(C_2) \times \max_{c_1}[P(L \mid m_1, c_1)]}{V(C_1) \times \max_{c_2}[P(L \mid m_2, c_2)]}.$$
Dividing the right-hand side by the ratio of maximum likelihoods, we see that the Occam factor for model 1 over model 2 is simply the inverse of the ratio of the volumes of the associated parameter spaces, $V(C_2) / V(C_1)$. To clarify the discussion, assume that in addition to $P(c_i \mid m_i) = 1 / [\int_{C_i} dc_i \, 1]$, we also have $\int dc_1 \, [P(L \mid m_1, c_1)] = \int dc_2 \, [P(L \mid m_2, c_2)]$ (whether or not the ratio of those integrals equals the ratio of the respective maximum likelihoods). Under this assumption, the ratio of $P(m_1 \mid L)$ to $P(m_2 \mid L)$ is just the ratio of volumes of the parameter spaces, $V(C_2) / V(C_1)$. So everything else being equal, the "bias" favoring $m_1$ over $m_2$ is given by (the reciprocal of) the ratio of the volume of $C_1$
to the volume of $C_2$. Models with a large a priori range of possible parameter values are penalized. This is the basis for the conventional "Occam factor" argument for why Occam's razor must hold a priori ([MacKay 1991, Jeffreys 1939, Berger 1992, Loredo 1990, Gull 1988, Garrett 1991]), for the case
where $C_1$ and $C_2$ have the same dimension but different volumes.
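The volume-ratio result can be checked numerically in a toy setting. The sketch below is not from the original: it assumes a one-parameter model $(m, c): x \to c\,x$ with Gaussian noise of known width, takes $C_1 = (-2, 2)$ and $C_2 = (-10, 10)$ with uniform priors, and uses synthetic data whose likelihood is sharply peaked well inside $C_1$. All names and numerical values are illustrative choices.

```python
import numpy as np


def likelihood(c, x, y, sigma=0.5):
    """P(L | m, c) on a grid of c values, for y = c * x + Gaussian noise (assumed)."""
    resid = y[None, :] - c[:, None] * x[None, :]
    norm = (2.0 * np.pi * sigma**2) ** (-0.5 * x.size)
    return norm * np.exp(-0.5 * np.sum(resid**2, axis=1) / sigma**2)


def evidence_and_max(lo, hi, x, y, n=20001):
    """Return (P(L | m), max_c P(L | m, c)) for a uniform prior on (lo, hi)."""
    c = np.linspace(lo, hi, n)
    lik = likelihood(c, x, y)
    dc = c[1] - c[0]
    integral = np.sum(0.5 * (lik[1:] + lik[:-1])) * dc  # trapezoid rule
    volume = hi - lo                                    # V(C) in one dimension
    return integral / volume, lik.max()


# Synthetic data set L (illustrative values only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 0.3 * x + rng.normal(scale=0.5, size=x.size)

ev1, max1 = evidence_and_max(-2.0, 2.0, x, y)     # m_1 with C_1, V(C_1) = 4
ev2, max2 = evidence_and_max(-10.0, 10.0, x, y)   # m_2 with C_2, V(C_2) = 20

bayes_factor = ev1 / ev2                      # P(L | m_1) / P(L | m_2)
occam_factor = bayes_factor / (max1 / max2)   # divide out the best-fit ratio

print(bayes_factor, occam_factor, 20.0 / 4.0)
```

Because the likelihood is negligible outside $C_1$, the two likelihood integrals agree and the maximum likelihoods coincide, so both printed factors should come out close to $V(C_2) / V(C_1) = 5$. This is the penalty on the model with the larger a priori parameter range.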