Conditional probability
Function
context | $f:X\times Y\to {\mathbb R}$ … measurable in $X$ |
definition | $p_X: X\times Y\to {\mathbb R}$ |
definition | $p_X(x,y) := N_X^*f(x,y)$ |
This is really just the conditional probability when coming from a joint “probability kernel”, i.e. a function $f(x,y)$.
The operator $N_X^*$ maps $f$ to a function that is normalized w.r.t. $X$.
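As a small illustration (not part of the definition), here is a sketch of $N_X^*$ as a higher-order function; the kernel $f(x,y)=x+y$ and the domain $X=Y=[0,1]$ are made-up choices.
```python
from scipy.integrate import quad

def normalize_in_x(f, x_lo=0.0, x_hi=1.0):
    """Sketch of N_X^*: divide f by its integral over X (here X = [x_lo, x_hi], an assumption)."""
    def p_X(x, y):
        z, _ = quad(lambda t: f(t, y), x_lo, x_hi)  # normalization constant, depends on y only
        return f(x, y) / z
    return p_X

f = lambda x, y: x + y      # made-up nonnegative joint kernel on [0,1]^2
p_X = normalize_in_x(f)
print(p_X(0.5, 0.3))        # f(0.5, 0.3) / \int_0^1 f(t, 0.3) dt = 0.8 / 0.8 = 1.0
```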
Discussion
Notation
$p_X(x,y) = N_X^*f(x,y)$ is usually written $p_X(x|y)$.
Moreover, $N_Y^*f(x,y)$ is usually written $p_Y(y|x)$.
By definition, those functions are normalized in the variable on the left of $|$. Note that in the second case ($p_Y(y|x)$) the positions of the arguments are switched. But of course, as parameters of the expression, still $x\in X$ and $y\in Y$ is intended.
Examples
The normalization is often given by an integral $\int{ \mathrm d}x$.
Note
With
$p_X(x,y) = \dfrac{f(x,y)}{\int f(x,y)\, {\mathrm d}x}$
we have that
$\int p_X(x,y_0)\, {\mathrm d}x = 1$
for all values $y_0\in Y$.
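For a quick numeric check of this Note on a discrete grid (the kernel, the grid and the domain $[0,1]$ are assumptions made for illustration):
```python
import numpy as np

dx = 0.005
x = np.arange(0.0, 1.0, dx)               # grid over X = [0, 1)
y = np.array([0.1, 0.3, 0.7, 0.9])        # a few fixed values y_0
X, Y = np.meshgrid(x, y, indexing="ij")   # axis 0 = x, axis 1 = y
f = np.exp(-(X - Y) ** 2)                 # any nonnegative kernel f(x, y) will do

Z = (f * dx).sum(axis=0)                  # \int f(x, y) dx, one value per y_0
p_X = f / Z                               # p_X(x, y) = f(x, y) / \int f(x, y) dx

print((p_X * dx).sum(axis=0))             # [1. 1. 1. 1.]: \int p_X(x, y_0) dx = 1 for each y_0
```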
Theorem
Consider a pair of functions $S_L, S_R : (A\to {\mathbb R})\to A\to {\mathbb R}$, then
$\dfrac{f}{S_Lf} = \dfrac{f}{S_Rf}\dfrac{S_Rf}{S_Lf} = \dfrac{f}{S_Rf}\dfrac{\frac{1}{S_LS_Rf}S_Rf}{\frac{1}{S_LS_Rf}S_Lf}$
If the pair of functions commute, we can write
$\dfrac{f}{S_Lf} = \dfrac{f}{S_Rf}\dfrac{\frac{S_Rf}{S_L(S_Rf)}}{\frac{S_Lf}{S_R(S_Lf)}}$
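A minimal numerical check of the commuting case (a sketch where $A$ is a finite grid and $S_L$, $S_R$ are the sums over the left and right index, which do commute; the matrix is made up):
```python
import numpy as np

f = np.random.default_rng(0).random((4, 6)) + 0.1   # arbitrary positive values on a 4x6 grid

S_L = lambda g: g.sum(axis=0, keepdims=True)        # sum over the left index
S_R = lambda g: g.sum(axis=1, keepdims=True)        # sum over the right index

lhs = f / S_L(f)
rhs = (f / S_R(f)) * (S_R(f) / S_L(S_R(f))) / (S_L(f) / S_R(S_L(f)))

print(np.allclose(lhs, rhs))                        # True
```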
Bayes rule: For $S_L$ resp. $S_R$ the sum/integral over the left resp. right argument, we have
$S_L\dfrac{f}{S_Lf}=\dfrac{S_Lf}{S_Lf}=1$
and the above relation reads
$N_X^*f(x,y) = \dfrac{N_Y^*f(x,y)\,\cdot\,N_X^*\int f(x,y)\, {\mathrm d}y}{p(y)}$ with $p(y) = N_Y^*\int f(x,y)\, {\mathrm d}x$.
This is Bayes rule for the continuous case, assuming a joint probability kernel $f$. Here the numerator is the conditional probability for $y$ (also called the likelihood) times the “prior” probability for $x$.
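The same relation can be checked on a made-up discrete joint table, with sums in place of the integrals (a sketch, not tied to any particular model):
```python
import numpy as np

f = np.random.default_rng(1).random((3, 4))      # made-up joint kernel, rows = x, columns = y

p_x_given_y = f / f.sum(axis=0)                  # N_X^* f
p_y_given_x = f / f.sum(axis=1, keepdims=True)   # N_Y^* f
p_x = f.sum(axis=1) / f.sum()                    # prior: normalized y-marginal
p_y = f.sum(axis=0) / f.sum()                    # evidence: normalized x-marginal

bayes = p_y_given_x * p_x[:, None] / p_y         # p(y|x) p(x) / p(y)
print(np.allclose(p_x_given_y, bayes))           # True
```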
Using the relation even if you have to make up the functions on the right hand side yourself: Bayes theorem is derived in this way, but then stated more generally, without reference to the origin of the functions involved. Indeed, in statistical practice, providing the functions in the numerator amounts to specifying how to improve your best guess about which parameters of some mathematical model of interest should be chosen. As the left hand side must be a probability in $x$, the denominator on the right hand side is just the normalization.
Having chosen any function to be the conditional probability $p_{model}(y|x)$ (as well as a prior, which is less relevant), and given a sequence of inputs $y_i$ (observations), Bayes algorithm moves from one conditional probability in $x$ (the normalized function $p(x|\{y_1,\dots,y_n\})$ on the left of the theorem) to the next (for $\{y_1,\dots,y_n,y_{n+1}\}$) by successive multiplication with $p_{model}(y_{n+1}|x)$ and re-normalization (so as to actually get a probability in $x$ out of it). This is indeed just the repeated application of Bayes theorem.
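A sketch of this updating loop (the Gaussian-shaped $p_{model}$, the parameter grid and the observations are all made-up choices):
```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 1001)                   # grid over the parameter space X
p_model = lambda y, x: np.exp(-0.5 * (y - x) ** 2)  # assumed model likelihood p_model(y|x), up to a constant

posterior = np.ones_like(xs)                        # flat prior over the grid
posterior /= posterior.sum()                        # normalize: a probability in x (on the grid)

for y_next in [1.2, 0.8, 1.5, 1.1]:                 # made-up sequence of observed inputs
    posterior *= p_model(y_next, xs)                # multiply with the likelihood of the new input
    posterior /= posterior.sum()                    # re-normalize, so it stays a probability in x

print(xs[np.argmax(posterior)])                     # posterior mode, ~ the mean of the observations (1.15)
```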
Philosophically speaking, this optimal updating of belief (when assuming a rigid model $p_{model}(y|x)$) is a major reason for the mathematical operation of multiplication as such. PS: Bayesian inference has the Cox axioms as justification for being a relevant logic of belief.
Reference
Wikipedia: Conditional probability