context | $f:X\times Y\to\mathbb R$ … measurable in $X$ |
definition | $p_X:X\times Y\to\mathbb R$ |
definition | $p_X(x,y) := N_X^*\,f(x,y)$ |
This is really just the conditional probability when coming from a joint “probability kernel”, i.e. a function $f(x,y)$.
The operator $N_X^*$ maps $f$ to a function that is normalized w.r.t. $X$.
$p_X(x,y)=N_X^*f(x,y)$ is usually written $p_X(x|y)$.
Moreover, $N_Y^*f(x,y)$ is usually written $p_Y(y|x)$.
By definition, those functions are normalized in the variable on the left of the bar $|$. Note that in the second case ($p_Y(y|x)$) the positions of the arguments are switched, but of course, as arguments of the expression, still $x\in X$ and $y\in Y$ is intended.
The normalization is often given by an integral $\int\mathrm dx$.
With
$$p_X(x,y)=\frac{f(x,y)}{\int f(x,y)\,\mathrm dx}$$
we have that
$$\int p_X(x,y_0)\,\mathrm dx=1$$
for all values $y_0\in Y$.
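For instance, in a discrete setting the integral over $x$ becomes a sum over the first index. A minimal numerical sketch (the array values below are made up for illustration):

```python
import numpy as np

# Discrete analogue of p_X = N_X^* f: rows index x, columns index y;
# the "integral dx" becomes the sum over the first axis. Values are made up.
f = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

p_X = f / f.sum(axis=0, keepdims=True)   # p_X(x, y), i.e. p(x|y)

print(p_X.sum(axis=0))                   # -> [1. 1.]  (normalized over x for every fixed y)
```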
Consider a pair of functions $S_L,S_R:(A\to\mathbb R)\to(A\to\mathbb R)$, then
$$\frac{f}{S_Lf}=\frac{f}{S_Rf}\cdot\frac{S_Rf}{S_Lf}=\frac{f}{S_Rf}\cdot\frac{\frac{1}{S_LS_Rf}\,S_Rf}{\frac{1}{S_LS_Rf}\,S_Lf}$$
If the pair of functions commute, we can write
$$\frac{f}{S_Lf}=\frac{f}{S_Rf}\cdot\frac{\frac{S_Rf}{S_L(S_Rf)}}{\frac{S_Lf}{S_R(S_Lf)}}$$
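A small numerical sanity check of this identity, taking $S_L$ and $S_R$ to be sums over the left and right index of a made-up positive array $f$; here the pair commutes, since both iterated sums give the total sum:

```python
import numpy as np

# S_L: sum over the left component (axis 0), S_R: sum over the right component (axis 1);
# f is a made-up positive array playing the role of the joint kernel.
f = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

def S_L(g):
    return g.sum(axis=0, keepdims=True)

def S_R(g):
    return g.sum(axis=1, keepdims=True)

lhs = f / S_L(f)
rhs = (f / S_R(f)) * (S_R(f) / S_L(S_R(f))) / (S_L(f) / S_R(S_L(f)))

# S_L(S_R(f)) == S_R(S_L(f)) == total sum, so the commuting version holds as well.
print(np.allclose(lhs, rhs))             # -> True
```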
Bayes rule: For $S_L$ resp. $S_R$ the sum/integral over the left resp. right component, we have
$$S_L\frac{f}{S_Lf}=\frac{S_Lf}{S_Lf}=1$$
and the above relation reads
$$N_X^*f(x,y)=\frac{N_Y^*f(x,y)\cdot N_X^*\int f(x,y)\,\mathrm dy}{p(y)}\quad\text{with}\quad p(y)=N_Y^*\int f(x,y)\,\mathrm dx.$$
This is Bayes rule for the continuous case, assuming a joint probability kernel $f$. Here the numerator is the conditional probability for $y$ (also called the likelihood) times the “prior” probability for $x$.
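A numerical sketch of this relation on a grid, using simple Riemann sums for the integrals; the joint kernel $f$ below is just an assumed example:

```python
import numpy as np

# Check N_X^* f = (N_Y^* f) * (N_X^* \int f dy) / p(y) on a grid.
x = np.linspace(-6.0, 6.0, 400)                  # parameter x
y = np.linspace(-6.0, 6.0, 400)                  # observable y
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")          # F[i, j] = f(x_i, y_j)
F = np.exp(-0.5 * (Y - X) ** 2) * np.exp(-X ** 2 / 8.0)   # made-up joint kernel

marg_x = F.sum(axis=1) * dy                      # \int f(x, y) dy, a function of x
marg_y = F.sum(axis=0) * dx                      # \int f(x, y) dx, a function of y

p_x_given_y = F / (F.sum(axis=0) * dx)           # N_X^* f, the left hand side
p_y_given_x = F / marg_x[:, None]                # N_Y^* f, the likelihood
prior_x = marg_x / (marg_x.sum() * dx)           # N_X^* \int f dy, the prior p(x)
p_y = marg_y / (marg_y.sum() * dy)               # N_Y^* \int f dx, the evidence p(y)

rhs = p_y_given_x * prior_x[:, None] / p_y[None, :]
print(np.allclose(p_x_given_y, rhs))             # -> True
```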
Using the relation even if you have to make up the functions in the numerator yourself: Bayes theorem is derived in such a way, but then stated more generally, without reference to the origin of the functions involved. Indeed, in statistical practice, providing the functions in the numerator amounts to specifying the way to improve your best guess about which parameters of some mathematical model of interest should be chosen. As the left hand side must be a probability in $x$, the denominator on the right hand side is just the normalization.
Having chosen any function to be the conditional probability $p_{\text{model}}(y|x)$ (as well as one prior, which is less relevant), and given a sequence of inputs $y_i$ (observations), the Bayes algorithm moves from one conditional probability in $x$ (the normalized function $p(x|\{y_1,\dots,y_n\})$ on the left of the theorem) to the next ($\{y_1,\dots,y_n,y_{n+1}\}$) by successive multiplication with $p_{\text{model}}(y_{n+1}|x)$ and re-normalization (so as to actually get a probability in $x$ out of it). This is indeed just the repeated application of Bayes theorem.
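A minimal sketch of this updating scheme over a discrete grid of parameter values $x$; the Gaussian model $p_{\text{model}}(y|x)$ and the observation values are made-up assumptions for illustration:

```python
import numpy as np

# Sequential Bayesian updating on a grid of candidate parameter values x.
x = np.linspace(-3.0, 3.0, 601)
dx = x[1] - x[0]

def p_model(y, x):
    # chosen conditional probability p_model(y|x): here y ~ Normal(x, 1)
    return np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2.0 * np.pi)

posterior = np.ones_like(x) / (len(x) * dx)          # flat prior p(x), normalized over x

observations = [0.7, 1.1, 0.9, 1.4]                  # a sequence of inputs y_i
for y_i in observations:
    posterior = posterior * p_model(y_i, x)          # multiply with p_model(y_i|x)
    posterior = posterior / (posterior.sum() * dx)   # re-normalize over x

# After n observations this is p(x|{y_1, ..., y_n}); print the current best guess for x.
print(x[np.argmax(posterior)])                       # close to the mean of the observations
```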
Philosophically speaking, this optimal updating of belief (when assuming a rigid model $p_{\text{model}}(y|x)$) is a major reason for the mathematical operation of multiplication as such. PS: Bayesian inference has the Cox axioms as justification for being a relevant logic of belief.
Wikipedia: Conditional probability