10 Tips and Tricks for Statistical Proofs

I’ve been taking probability theory this year, and I noticed that a lot of proofs assume the reader already knows some commonly used “tricks.” If you aren’t familiar with them, it can be hard to follow the proofs in the textbook,1 let alone write them yourself. I felt like this was happening to me a lot, so in an effort to better familiarize myself, I’ve written down some useful tips and tricks, along with explanations and/or examples.2

  1. Bounding the Union with a Sum
  • You can always bound the probability of a union by the sum of the probabilities: $$P\left(\bigcup_n A_n\right) \le \sum_n P(A_n)$$ This is a very simple trick (also known as the union bound, or Boole’s inequality) that follows from the inclusion-exclusion principle. When in doubt, try this! It’s not a very strong upper bound, but it may be sufficient for your needs.
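  • A quick numerical sanity check (a sketch, not from the original post; the normal tail events, thresholds, and sample size below are arbitrary choices):

    ```python
    import numpy as np

    # Sketch: empirically compare P(A_1 ∪ ... ∪ A_k) with the sum of P(A_i).
    # Events A_i = {Z > c_i} for a standard normal Z; thresholds and sample
    # size are arbitrary choices for illustration.
    rng = np.random.default_rng(0)
    z = rng.standard_normal(100_000)
    thresholds = [0.5, 1.0, 1.5, 2.0]

    indicators = np.array([z > c for c in thresholds])   # shape (4, 100000)
    p_union = np.mean(indicators.any(axis=0))            # P(A_1 ∪ ... ∪ A_4)
    p_sum = sum(np.mean(ind) for ind in indicators)      # sum of P(A_i)

    print(f"P(union) ≈ {p_union:.3f} <= sum of P(A_i) ≈ {p_sum:.3f}")
    ```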
  2. Cauchy-Schwarz Inequality
  • The Cauchy-Schwarz inequality is pretty much unavoidable in math. Here are some canonical forms: $$|\langle u, v\rangle| \le \|u\|\,\|v\|, \qquad \left|\sum_i u_i v_i\right|^2 \le \sum_j |u_j|^2 \sum_k |v_k|^2$$ A form you might encounter in statistics is the following: $$|E(XY)|^2 \le E(X^2)\,E(Y^2)$$

  • Example: If $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(2\lambda)$, find a bound on $P(X \ge Y)$ of the form $A\exp(-c\lambda)$. First observe that for any $t > 0$, $P(X \ge Y) = P(e^{t(X-Y)} \ge 1) \le E(e^{t(X-Y)})$ by Chebyshev’s (Markov’s) inequality. Then $E(e^{t(X-Y)}) = E(e^{tX}e^{-tY}) \le \sqrt{E(e^{2tX})\,E(e^{-2tY})}$ by Cauchy-Schwarz. Evaluating the moment-generating functions, the last expression equals $\exp\left(\left[\lambda(e^{2t}-1) + 2\lambda(e^{-2t}-1)\right]/2\right)$. Take the derivative with respect to $t$, set it to zero, and plug the optimal $t$ back in to get the bound.
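  • A minimal Monte Carlo check of this bound (a sketch; $\lambda = 3$, the sample size, and simulating $X$ and $Y$ as independent are my own choices, and $t = \ln(2)/4$ comes from the derivative step above):

    ```python
    import numpy as np

    # Sketch: check the Cauchy-Schwarz/Chernoff-style bound
    #   P(X >= Y) <= exp((lam*(e^{2t}-1) + 2*lam*(e^{-2t}-1)) / 2)
    # for X ~ Poisson(lam), Y ~ Poisson(2*lam), simulated here as independent.
    # lam and the sample size are arbitrary; t = ln(2)/4 comes from setting
    # the derivative of the exponent to zero, as in the example above.
    rng = np.random.default_rng(1)
    lam, n_sim = 3.0, 200_000

    x = rng.poisson(lam, n_sim)
    y = rng.poisson(2 * lam, n_sim)
    p_hat = np.mean(x >= y)            # Monte Carlo estimate of P(X >= Y)

    t = np.log(2) / 4                  # optimizer of the exponent
    bound = np.exp((lam * (np.exp(2 * t) - 1) + 2 * lam * (np.exp(-2 * t) - 1)) / 2)

    print(f"P(X >= Y) ≈ {p_hat:.4f} <= bound ≈ {bound:.4f}")
    ```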

  3. Continuous Mapping Theorem
  • If $X_n \Rightarrow X$ (convergence in distribution) and $g$ is a measurable function that is continuous except on a set of measure 0, then $g(X_n) \Rightarrow g(X)$. If $g$ is furthermore bounded, then $E\,g(X_n) \to E\,g(X)$. The result is often something you hope is true, and you just need to check that the function you’re applying is in fact continuous.
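  • A small simulation sketch (the standardized uniform sample mean for $X_n$ and $g(x) = x^2$ are my own choices): if $X_n \Rightarrow N(0,1)$, then $g(X_n) = X_n^2$ should converge in distribution to $\chi^2_1$.

    ```python
    import numpy as np
    from scipy import stats

    # Sketch of the continuous mapping theorem: X_n => N(0,1) by the CLT,
    # and g(x) = x^2 is continuous, so g(X_n) => chi-square with 1 df.
    # The uniform parent distribution, n, and replication count are arbitrary.
    rng = np.random.default_rng(2)
    n, reps = 500, 20_000

    u = rng.uniform(-1, 1, size=(reps, n))              # Var(U) = 1/3
    x_n = np.sqrt(n) * u.mean(axis=1) / np.sqrt(1 / 3)  # standardized sample means
    g_x_n = x_n ** 2

    # Compare the empirical distribution of g(X_n) with chi^2_1.
    ks = stats.kstest(g_x_n, stats.chi2(df=1).cdf)
    print(f"KS distance to chi^2_1: {ks.statistic:.3f}")
    ```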
  4. Convergence Theorems
  • A “last step” in many proofs is to take the limit inside the integral:

    $$\lim_{n\to\infty} \int X_n \, d\mu = \int \lim_{n\to\infty} X_n \, d\mu$$

    (An equivalent formulation is $\lim_{n\to\infty} E(X_n) = E(\lim_{n\to\infty} X_n)$, or $E(X_n) \to E(X)$, where $X$ is the limit of $X_n$.)

  • This is true under any of these three scenarios:

    1. If $X_n \ge 0$ and $X_n \le X_{n+1}$ for all $n$ (monotone convergence)
    2. If $|X_n| \le M$ for some constant $M$ (bounded convergence)
    3. If $|X_n| \le Y$ for an integrable random variable $Y$, i.e. $E|Y| < \infty$ (dominated convergence)
  • Example (dominated convergence): Consider the probability space $(\Omega, \mathcal{F}, \mu)$. Let $X \ge 0$ be an integrable random variable and define $\nu(A) = \int_A X \, d\mu$ for $A \in \mathcal{F}$. To show that $\nu$ is a measure, a necessary condition is that $\nu\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \nu(A_i)$ for disjoint $A_i$. This is true since

    $$\begin{aligned} \sum_{i=1}^{\infty} \nu(A_i) &= \lim_{n\to\infty} \sum_{i=1}^{n} \nu(A_i) = \lim_{n\to\infty} \sum_{i=1}^{n} \int_{A_i} X \, d\mu = \lim_{n\to\infty} \sum_{i=1}^{n} \int X 1_{A_i} \, d\mu \\ &= \lim_{n\to\infty} \int \sum_{i=1}^{n} X 1_{A_i} \, d\mu \\ &= \int \sum_{i=1}^{\infty} X 1_{A_i} \, d\mu && \text{since } \textstyle\sum_{i=1}^{n} X 1_{A_i} \le X \text{ (dominated convergence)} \\ &= \int X 1_{\cup_{i=1}^\infty A_i} \, d\mu && \text{since the } A_i \text{ are disjoint} \\ &= \int_{\cup_{i=1}^\infty A_i} X \, d\mu = \nu\left(\bigcup_{i=1}^{\infty} A_i\right) \end{aligned}$$
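  • A numeric sketch of taking the limit inside the expectation (the Exponential(1) distribution and the truncation levels are arbitrary choices): $X_n = \min(X, n)$ satisfies $0 \le X_n \le X$ with $X$ integrable, so $E(X_n) \to E(X)$.

    ```python
    import numpy as np

    # Sketch of taking the limit inside the expectation: X_n = min(X, n)
    # satisfies 0 <= X_n <= X with X integrable, so E(X_n) -> E(X) by
    # dominated (or monotone) convergence. Exponential(1) and the Monte
    # Carlo size are arbitrary choices.
    rng = np.random.default_rng(3)
    x = rng.exponential(scale=1.0, size=500_000)   # E(X) = 1

    for n in [1, 2, 5, 10, 20]:
        print(f"n = {n:2d}:  E[min(X, n)] ≈ {np.minimum(x, n).mean():.4f}")
    print(f"E[X] ≈ {x.mean():.4f}")
    ```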

  5. Expectation of Positive Random Variables
  • If $X$ is a nonnegative random variable, then $E(X) = \sum_{n=1}^{\infty} P(X \ge n)$ (when $X$ is integer-valued) or, in general, $E(X) = \int_0^{\infty} P(X > x) \, dx$. A proof is available on Wikipedia. This is a really useful property and is often used for stopping times. More generally, $E(X^p) = \int_0^{\infty} p\,y^{p-1} P(X > y) \, dy$ for $X \ge 0$ and $p > 0$.

  • Example: If $\tau$ is a stopping time with respect to $\mathcal{F}_n$ such that $P(\tau \le n + 3 \mid \mathcal{F}_n) > 0.1$ for every $n$, then $E(\tau) < \infty$.

    This can be proved by first showing that $P(\tau > 3k) \le (1 - 0.1)^k$, which can be done by induction. Then since $E(\tau) = \sum_{n=1}^{\infty} P(\tau \ge n)$, it follows that $E(\tau) \le 3\sum_{k=0}^{\infty} P(\tau > 3k) < \infty$.
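  • A quick check of the tail-sum identity (a sketch; Poisson(4) and the truncation point are arbitrary choices):

    ```python
    import numpy as np
    from scipy import stats

    # Sketch: for a nonnegative integer-valued X, E(X) = sum_{n>=1} P(X >= n).
    # Check it for X ~ Poisson(4), truncating where the tail is negligible.
    lam = 4.0
    n = np.arange(1, 200)
    tail_sum = stats.poisson.sf(n - 1, lam).sum()   # P(X >= n) = P(X > n-1) = sf(n-1)

    print(f"sum of P(X >= n) ≈ {tail_sum:.6f}, E(X) = {lam}")
    ```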

  6. Jensen’s Inequality
  • Jensen’s inequality is something you should be able to recognize and apply right away. Intuitively, it says that if a function $\phi$ is convex, then the average of the function values is greater than (or equal to) the function applied to the average.3 $$E[\phi(X)] \ge \phi(E[X])$$ One of the most common applications is with Euler’s number: $$E[e^X] \ge e^{E(X)}$$

  • Example: If $X_n$ is a martingale, then $(X_n + c)^2$ is a submartingale. Let $g(x) = (x + c)^2$, which is convex. Then by Jensen’s inequality (for conditional expectations), $E(g(X_{n+1}) \mid \mathcal{F}_n) \ge g(E(X_{n+1} \mid \mathcal{F}_n))$. Since $X_n$ is a martingale, $g(E(X_{n+1} \mid \mathcal{F}_n)) = g(X_n) = (X_n + c)^2$.
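  • A minimal numerical check of Jensen’s inequality with $\phi(x) = e^x$ (the standard normal distribution and sample size are arbitrary choices):

    ```python
    import numpy as np

    # Sketch of Jensen's inequality with the convex function phi(x) = e^x:
    # E[e^X] >= e^{E[X]}. The normal distribution and sample size are arbitrary.
    rng = np.random.default_rng(4)
    x = rng.normal(loc=0.0, scale=1.0, size=200_000)

    lhs = np.exp(x).mean()      # E[e^X], close to e^{1/2} for X ~ N(0,1)
    rhs = np.exp(x.mean())      # e^{E[X]}, close to e^0 = 1
    print(f"E[e^X] ≈ {lhs:.3f} >= e^(E[X]) ≈ {rhs:.3f}")
    ```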

  7. Markov’s (Chebyshev’s) Inequality
  • Markov’s inequality is often used whenever you need to bound a probability. It states that $$a^2\,P(|X| \ge a) \le E(X^2; |X| \ge a) \le E(X^2)$$ More generally, for $\phi: \mathbb{R} \to \mathbb{R}$ with $\phi \ge 0$, a Borel set $A \in \mathcal{B}$, and $i_A = \inf\{\phi(y) : y \in A\}$, $$i_A\,P(X \in A) \le E(\phi(X); X \in A) \le E(\phi(X))$$

  • Example: See the example for Cauchy-Schwarz (#2).
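  • A small simulation of the inequality $a^2 P(|X| \ge a) \le E(X^2)$ (a sketch; the $t_5$ distribution and the thresholds are my own choices):

    ```python
    import numpy as np

    # Sketch of Markov's inequality in the form a^2 * P(|X| >= a) <= E(X^2),
    # checked by simulation. The t-distribution with 5 df (heavy-tailed but
    # with finite second moment) and the thresholds are arbitrary choices.
    rng = np.random.default_rng(5)
    x = rng.standard_t(df=5, size=200_000)
    ex2 = np.mean(x ** 2)

    for a in [1.0, 2.0, 4.0]:
        lhs = a ** 2 * np.mean(np.abs(x) >= a)
        print(f"a = {a}:  a^2 P(|X| >= a) ≈ {lhs:.3f} <= E(X^2) ≈ {ex2:.3f}")
    ```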

  8. Slutsky’s Theorem
  • This theorem states that if $X_n \Rightarrow X$ (in distribution) and $Y_n \to c$ in probability, where $c$ is a constant, then $X_n + Y_n \Rightarrow X + c$, $X_n Y_n \Rightarrow cX$, and $X_n / Y_n \Rightarrow X / c$ (provided $c \ne 0$). It is not always easy to immediately see that you will need to use this theorem. However, you might be able to prove something close to the answer and realize that applying Slutsky’s theorem completes the proof.

  • Example: Let $X_1, X_2, \ldots$ be iid with $EX_i = 0$ and $EX_i^2 = \sigma^2 < \infty$, let $N_n$ be a sequence of nonnegative integer-valued random variables, and let $a_n$ be a sequence of integers with $a_n \to \infty$ and $N_n / a_n \to 1$ in probability. If $S_n = \sum_{i=1}^{n} X_i$, show that $Y_n = S_{N_n} / (\sigma\sqrt{a_n}) \Rightarrow N(0, 1)$.

    One way to prove this is to first show that $Z_n = S_{a_n} / (\sigma\sqrt{a_n}) \Rightarrow N(0, 1)$, which is easy to do. Then if you can show that $Y_n - Z_n \to 0$ in probability, which is harder to do, you can apply Slutsky’s theorem to conclude that $Y_n = (Y_n - Z_n) + Z_n \Rightarrow N(0, 1)$.
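  • A simulation sketch of Slutsky’s theorem (the uniform parent distribution, $c = 2$, and the sample sizes are arbitrary choices): if $X_n \Rightarrow N(0,1)$ and $Y_n \to 2$ in probability, then $X_n Y_n \Rightarrow N(0,4)$.

    ```python
    import numpy as np
    from scipy import stats

    # Sketch of Slutsky's theorem: X_n => N(0,1) (a standardized sample mean)
    # and Y_n -> c = 2 in probability, so X_n * Y_n => N(0, c^2) = N(0, 4).
    # The parent distribution, n, and replication count are arbitrary choices.
    rng = np.random.default_rng(6)
    n, reps, c = 500, 10_000, 2.0

    u = rng.uniform(-1, 1, size=(reps, n))
    x_n = np.sqrt(3 * n) * u.mean(axis=1)              # => N(0,1), since Var(U) = 1/3
    y_n = c + rng.standard_normal(reps) / np.sqrt(n)   # -> c in probability

    ks = stats.kstest(x_n * y_n, stats.norm(scale=c).cdf)
    print(f"KS distance of X_n * Y_n to N(0, {c**2:.0f}): {ks.statistic:.3f}")
    ```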

  9. Taylor Expansion & Euler’s Number
  • The Taylor series is often used as an approximation and a bound. In particular, the Taylor series for Euler’s number $e$ is very important: $$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

  • Another useful property of $e$ is that if $a_n \to a$, then $$e^a = \lim_{n\to\infty} \left(1 + \frac{a_n}{n}\right)^n$$

  • Example: See the example for characteristic functions (Bonus #2).
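  • A quick numerical check of the limit above (a sketch; the sequence $a_n = 1 + 1/n \to 1$ is an arbitrary choice):

    ```python
    import numpy as np

    # Sketch: if a_n -> a, then (1 + a_n/n)^n -> e^a.
    # Here a = 1 and a_n = 1 + 1/n, an arbitrary sequence converging to 1.
    a = 1.0
    for n in [10, 100, 1_000, 10_000]:
        a_n = 1 + 1 / n
        print(f"n = {n:6d}:  (1 + a_n/n)^n = {(1 + a_n / n) ** n:.6f}")
    print(f"e^a = {np.exp(a):.6f}")
    ```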

  10. Truncating an Infinite Sum
  • To get bounds on an infinite sum, you will generally want to truncate it: $$\sum_{n=1}^{\infty} = \sum_{n=1}^{N} + \sum_{n > N}$$ The idea is that you can bound the two sums on the right individually. First, you can bound $\sum_{n=1}^{N}$ when the individual summands are bounded by some $\epsilon$, so that the whole sum is bounded by $\epsilon N$. Second, you can bound the tail $\sum_{n > N}$ using the fact that it vanishes as $N \to \infty$.

  • Example: Let $\mu_n, \mu$ be probability measures on $\mathbb{N}$. Show that if $\mu_n(x) \to \mu(x)$ for each $x \in \mathbb{N}$, then the total variation distance $d(\mu_n, \mu) = \frac{1}{2}\sum_{x=1}^{\infty} |\mu_n(x) - \mu(x)| \to 0$.

    If we just tried to bound the infinite sum directly using the assumption, we would get a useless bound of $\infty \cdot \epsilon$. Therefore, we need to split the infinite sum: $$\frac{1}{2}\sum_{x=1}^{\infty} |\mu_n(x) - \mu(x)| = \frac{1}{2}\sum_{x=1}^{N} |\mu_n(x) - \mu(x)| + \frac{1}{2}\sum_{x > N} |\mu_n(x) - \mu(x)| = A_n + B_n$$ Now we can bound the finite sum $A_n$ and the tail $B_n$ separately.

    1. Since $\mu_n(x) \to \mu(x)$, we can bound $|\mu_n(x) - \mu(x)| < \epsilon_1$ for all $x \le N$ once $n \ge N_1$. Thus, $A_n < \epsilon_1 N$.
    2. Since $\mu_n, \mu$ are probability measures, both $P(X_n > N)$ and $P(X > N)$ go to $0$ as $N \to \infty$, where $X_n$ has distribution $\mu_n$ and $X$ has distribution $\mu$. In other words, $\sum_{z > N} \mu_n(z) < \epsilon_2$ and $\sum_{z > N} \mu(z) < \epsilon_2$ for large enough $N$.4 Then $\sum_{z > N} |\mu_n(z) - \mu(z)| < 2\epsilon_2$. Thus, $B_n < \epsilon_2$.

    To conclude, we have shown that $A_n + B_n < \epsilon_1 N + \epsilon_2$. Since $N$ is finite and we can make $\epsilon_1, \epsilon_2$ arbitrarily small, it follows that $A_n + B_n \to 0$.
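  • A numeric sketch of the truncation idea on this example (the Poisson measures $\mu_n = \text{Poisson}(\lambda + 1/n)$ and $\mu = \text{Poisson}(\lambda)$, with $\lambda = 3$ and truncation level $N$, are all my own choices):

    ```python
    import numpy as np
    from scipy import stats

    # Sketch of the total variation example: mu_n = Poisson(lam + 1/n) and
    # mu = Poisson(lam) satisfy mu_n(x) -> mu(x) pointwise, and the total
    # variation distance (a truncated sum plus a crude tail bound) goes to 0.
    # lam and the truncation level N are arbitrary choices.
    lam, N = 3.0, 200
    x = np.arange(0, N + 1)

    for n in [1, 10, 100, 1_000]:
        mu_n = stats.poisson.pmf(x, lam + 1 / n)
        mu = stats.poisson.pmf(x, lam)
        head = 0.5 * np.abs(mu_n - mu).sum()                 # the finite sum A_n
        tail = 0.5 * (stats.poisson.sf(N, lam + 1 / n)       # crude bound on B_n
                      + stats.poisson.sf(N, lam))
        print(f"n = {n:5d}:  d(mu_n, mu) <= {head + tail:.6f}")
    ```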

Bonus

Here are two additional useful techniques. I’ve listed them separately because it seems a bit of a stretch to call them “tricks.”5

  1. Borel-Cantelli Lemmas
  • The first Borel-Cantelli lemma states that if $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$ (i.o. = infinitely often). This is often used to prove almost sure convergence. For example, if you can show that $\sum_{n=1}^{\infty} P(|X_n - X| > \epsilon) < \infty$ for every $\epsilon > 0$, then by the lemma, $P(|X_n - X| > \epsilon \text{ i.o.}) = 0$, which implies that $X_n \to X$ almost surely by definition.
  • The second Borel-Cantelli lemma states that if the $A_n$ are mutually independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$. Similarly, this can be used to prove that $X_n$ does not converge to $X$ almost surely. It’s a little less useful than the first lemma because of the extra independence requirement and the fact that you usually want to prove that something converges rather than that it doesn’t.
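  • A simulation sketch contrasting the two lemmas (the independent events with $P(A_n) = 1/n^2$ versus $1/n$, and the truncation at 10,000 events, are arbitrary choices):

    ```python
    import numpy as np

    # Sketch of the Borel-Cantelli lemmas with independent events A_n:
    # with P(A_n) = 1/n^2 (summable), only finitely many A_n occur along
    # almost every sample path; with P(A_n) = 1/n (not summable), the second
    # lemma says infinitely many occur, so the count keeps growing as more
    # events are included. The number of events and paths are arbitrary.
    rng = np.random.default_rng(7)
    n_events, n_paths = 10_000, 1_000
    n = np.arange(1, n_events + 1)

    for label, p in [("1/n^2 (summable)", 1 / n ** 2), ("1/n (not summable)", 1 / n)]:
        occurs = rng.random((n_paths, n_events)) < p   # independent indicators of A_n
        counts = occurs.sum(axis=1)                    # number of A_n occurring per path
        print(f"P(A_n) = {label}: average count ≈ {counts.mean():.1f}")
    ```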
  2. Characteristic Functions
  • There is a one-to-one correspondence between a characteristic function $\phi(t)$ and a distribution, so it’s sometimes useful to use characteristic functions to prove something about a distribution. By definition, the characteristic function is $$\phi(t) = E\,e^{itX} = E\cos(tX) + i\,E\sin(tX)$$ Note that it is a function of $t$ and that it lies within the unit disc: $|\phi(t)| \le E|e^{itX}| = 1$.

  • Example: We can prove the central limit theorem with characteristic functions. Assume that $X_1, X_2, \ldots$ are iid with $EX_i = 0$ and $\text{var}(X_i) = \sigma^2 < \infty$. The central limit theorem states that if $S_n = \sum_i X_i$, then $$\frac{S_n}{\sigma n^{1/2}} \Rightarrow \chi \sim N(0, 1)$$ To show this, note that by Taylor expansion,6 the characteristic function of $X_i$ is $$\phi(t) = E\exp(itX_i) = 1 + itEX_i - \frac{t^2 E(X_i^2)}{2} + o(t^2)$$ so that $$E\exp\left(\frac{itS_n}{\sigma n^{1/2}}\right) = \left[E\exp\left(\frac{itX_i}{\sigma n^{1/2}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o(n^{-1})\right]^n$$ By the property of Euler’s number (see #9), the last term converges to $\exp(-t^2/2)$, which is the characteristic function of a standard normal distribution.
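  • A numeric sketch of this argument (the centered Exponential(1) parent, $n$, and the replication count are arbitrary choices): the empirical characteristic function of $S_n/(\sigma n^{1/2})$ should approach $e^{-t^2/2}$.

    ```python
    import numpy as np

    # Sketch: the empirical characteristic function of S_n / (sigma * sqrt(n))
    # should approach exp(-t^2 / 2), the characteristic function of N(0,1).
    # The exponential parent (centered), n, and replication count are arbitrary.
    rng = np.random.default_rng(8)
    n, reps = 2_000, 5_000
    x = rng.exponential(1.0, size=(reps, n)) - 1.0    # E(X) = 0, sigma = 1
    s = x.sum(axis=1) / np.sqrt(n)                    # S_n / (sigma * n^{1/2})

    for t in [0.5, 1.0, 2.0]:
        phi_hat = np.mean(np.exp(1j * t * s))         # empirical characteristic function
        print(f"t = {t}:  |phi_hat - exp(-t^2/2)| ≈ {abs(phi_hat - np.exp(-t**2 / 2)):.4f}")
    ```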


  1. Specifically, Durrett’s Probability Theory and Examples textbook.↩︎

  2. This guide was also inspired by Daniel Seita’s Mathematical Tricks Commonly Used in Machine Learning and Statistics.↩︎

  3. For more intuitive explanations, check out this post on Quora.↩︎

  4. To be formal, you can use separate $N$’s to bound $\mu_n$ and $\mu$, and then use the max as your $N$.↩︎

  5. And also because “12 Tips and Tricks” is somewhat less catchy than “10 Tips and Tricks.”↩︎

  6. Technically, you have to assume $E|X|^3 < \infty$ for a direct application of the Taylor expansion, but with some tedious math, it can be shown that $E|X|^2 < \infty$ is sufficient.↩︎