class: center, middle, inverse, title-slide

# STA 360/602L: Module 1.2
## Probability review
### Dr. Olanrewaju Michael Akande

---

## Outline

- Random variables
- Joint distributions
- Independence
- Exchangeability

---

## Discrete random variables

- A .hlight[random variable] is .hlight[discrete] if the set of all possible outcomes is .hlight[countable].

--

- The .hlight[probability mass function (pmf)] of a discrete random variable `\(Y\)`, `\(p(y)\)`, describes the probability associated with each possible value of `\(Y\)`.

--

- `\(p(y)\)` has the following properties:

  1. `\(0 \leq p(y) \leq 1\)` for all values `\(y \in \mathcal{Y}\)`.
  2. `\(\sum_{y \in \mathcal{Y}} p(y) = 1\)`.

--

- Most distributions are characterized by some parameter (or set/vector of parameters) `\(\theta\)`.

--

- So, to make this clear, we will often write the pmf instead as `\(p(y | \theta)\)`.

---

## Bernoulli distribution

- The .hlight[Bernoulli distribution] can be used to describe an experiment with two outcomes, such as

  + Flipping a coin (heads or tails);
  + Vote turnout (vote or not); and
  + The outcome of a basketball game (win or loss).

--

- In all cases, we can represent this as a binary random variable where the probability of "success" is `\(\theta\)` and the probability of "failure" is `\(1-\theta\)`.

--

- We usually write this as: `\(Y \sim \textrm{Bernoulli}(\theta)\)`, where `\(\theta \in [0,1]\)`.

--

- It follows that
.block[
.small[
`$$p(y|\theta) = \Pr(Y=y|\theta) = \theta^y(1-\theta)^{1-y}; \ \ \ y=0,1.$$`
]
]

--

- <div class="question">
What is the mean of this distribution? What is the variance?
</div>

---

## Binomial distribution

- The .hlight[binomial distribution] describes the number of successes from `\(n\)` independent Bernoulli trials.

--

- That is, `\(Y =\)` number of "successes" in `\(n\)` independent trials and `\(\theta\)` is the probability of success per trial.

--

- We usually write this as: `\(Y \sim \textrm{Bin}(n,\theta)\)`, where `\(\theta \in [0,1]\)`.

--

- The pmf is
.block[
.small[
`$$p(y|\theta) = \Pr(Y=y|\theta,n) = {n \choose y} \theta^y(1-\theta)^{n-y}; \ \ \ y=0,1,\ldots,n.$$`
]
]

--

- **Example**: `\(Y =\)` number of individuals with type I diabetes out of a sample of `\(n\)` surveyed.

--

- Binomial likelihoods are commonly used when collecting data on proportions.

--

- <div class="question">
What is the mean of this distribution? What is the variance?
</div>

---

## Poisson distribution

- `\(Y \sim \textrm{Po}(\theta)\)` denotes that `\(Y\)` is a .hlight[Poisson random variable].

--

- The Poisson distribution is commonly used to model count data consisting of the number of events in a given time interval.

--

- The Poisson distribution is parameterized by `\(\theta\)` and the pmf is given by
.block[
.small[
`$$p(y|\theta) = \Pr[Y = y | \theta] = \dfrac{\theta^y e^{-\theta}}{y!}; \ \ \ \ y=0,1,2,\ldots; \ \ \ \ \theta > 0.$$`
]
]

--

- Similar to the binomial, but with no limit on the total number of counts.

--

- <div class="question">
What is the mean of this distribution? What is the variance?
</div>

---

## General discrete distributions

- It is useful to consider general discrete distributions having an arbitrary form.

--

- Suppose `\(Y \in \{y_1^\star,\ldots,y_k^\star\}\)`. Then define `\(\Pr(Y = y_h^\star) = \pi_h\)` for each `\(h = 1,\ldots, k\)`. That is,
.block[
.small[
`$$p(y|\boldsymbol{\pi}) = \Pr[Y = y| \boldsymbol{\pi}] = \prod_h \pi_h^{\mathbb{1}[Y = y_h^\star]}; \ \ y \in \{y_1^\star,\ldots,y_k^\star\},$$`
]
]

where `\(\boldsymbol{\pi} = (\pi_1,\ldots,\pi_k)\)`.

--

- `\((y_1^\star,\ldots,y_k^\star)\)` are "atoms" representing possible values for `\(Y\)`.

--

- For example, these may be words in a dictionary or values for education as a categorical variable. Useful for text data, categorical observations, etc.

--

- Can also write this as `\(Y \sim \sum^k_{h=1} \pi_h \delta_{y_h^\star}\)`, where `\(\delta_{y_h^\star}\)` denotes a unit mass at `\(y_h^\star\)`.

--

- Often called the .hlight[categorical distribution] or .hlight[generalized Bernoulli distribution]. Also, see the .hlight[multinomial distribution].
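---

## Aside: discrete pmfs in R

A quick numerical check of the pmfs above, as a minimal R sketch; the values of `theta`, `n`, and the Poisson rate are arbitrary illustrations.

```r
theta <- 0.3; n <- 10

# Bernoulli(theta) is Binomial(1, theta); pmf at y = 0 and y = 1
dbinom(c(0, 1), size = 1, prob = theta)        # 0.7 0.3

# Binomial(n, theta): the pmf sums to 1 over y = 0, ..., n
sum(dbinom(0:n, size = n, prob = theta))       # 1

# Mean and variance match n*theta and n*theta*(1 - theta)
y <- 0:n
sum(y * dbinom(y, n, theta))                   # 3   = n*theta
sum((y - n * theta)^2 * dbinom(y, n, theta))   # 2.1 = n*theta*(1 - theta)

# Poisson: mean and variance are both equal to the rate
y <- 0:100   # truncated support; the omitted mass is negligible here
sum(y * dpois(y, lambda = 2.5))                # 2.5
```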
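---

## Aside: sampling a general discrete distribution in R

One way to simulate from the categorical pmf `\(p(y|\boldsymbol{\pi})\)` above is base R's `sample()`. A minimal sketch; the atoms and weights below are hypothetical.

```r
atoms <- c("HS", "College", "Graduate")   # hypothetical atoms y_1*, ..., y_k*
pi_h  <- c(0.5, 0.3, 0.2)                 # probabilities; must sum to 1

set.seed(360)
draws <- sample(atoms, size = 1e4, replace = TRUE, prob = pi_h)

# Empirical frequencies should approximate pi_h
table(draws) / 1e4
```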
---

## Continuous random variables

- The .hlight[probability density function (pdf)], `\(p(y)\)` or `\(f(y)\)`, of a continuous random variable `\(Y\)` has slightly different properties:

  1. `\(0 \leq f(y)\)` for all `\(y \in \mathcal{Y}\)`.
  2. `\(\int_{y \in \mathbb{R}} f(y) \textrm{d}y = 1\)`.

--

- The pdf for a continuous random variable is not necessarily less than 1.

--

- Also, `\(f(y)\)` is NOT the probability of value `\(y\)`.

--

- However, if `\(f(y_1) > f(y_2)\)`, we say informally that `\(y_1\)` has a "higher probability" than `\(y_2\)`.

--

- As we did in the discrete case, we will also often write the pdf instead as `\(f(y | \theta)\)` or `\(p(y | \theta)\)` to make the conditioning obvious.

---

## Uniform density

- The simplest example of a continuous density is the .hlight[uniform density].

--

- `\(Y \sim \textrm{Unif}(a,b)\)` denotes that the density of `\(Y\)` is uniform on the interval `\((a,b)\)`.

--

- The pdf is simply
.block[
.small[
`$$f(y | a,b) = \dfrac{1}{b-a}; \ \ \ y \in (a,b).$$`
]
]

--

- The cdf is
.block[
.small[
`$$F(y) = \Pr(Y \leq y) = \int^y_a \dfrac{1}{b-a} \textrm{d}z = \dfrac{y-a}{b-a}; \ \ \ y \in (a,b).$$`
]
]

--

- The mean (expectation) is
.block[
.small[
`$$\dfrac{a+b}{2}.$$`
]
]

- <div class="question">
What is the variance? Also, can you prove the formula for the mean?
</div>

---

## Beta density

- The uniform density can be used as a prior for a probability if `\((a,b) \subset (0,1)\)`.

--

- However, it is clearly very inflexible.
  <div class="question">
Why?
</div>

--

- An alternative for `\(y \in (0,1)\)` is the .hlight[beta density], written as `\(Y \sim \textrm{Beta}(a,b)\)`, with
.block[
.small[
`$$f(y | a,b) = \frac{1}{B(a,b)} y^{a-1} (1-y)^{b-1}; \ \ \ y \in (0,1), \ a > 0, \ b > 0,$$`
]
]

where `\(B(a,b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\)`, and `\(\Gamma(n) = (n-1)!\)` for any positive integer `\(n\)`.

--

- As we have already seen, the beta density is quite flexible in characterizing a broad variety of densities on `\((0,1)\)`.

--

- <div class="question">
Beta(1,1) is the same as Unif(0,1). Work out the pdfs to convince yourself!
</div>

---

## Gamma density

- The .hlight[gamma density] will be useful as a prior for parameters that are strictly positive.

--

- For random variables `\(Y \sim \textrm{Ga}(a,b)\)`, we have the pdf
.block[
.small[
`$$f(y | a,b) = \frac{b^a}{\Gamma(a)} y^{a-1}e^{-by}; \ \ \ y \in (0,\infty), \ a > 0, \ b > 0.$$`
]
]

--

- Properties:
.block[
.small[
`$$\mathbb{E}[Y] = \dfrac{a}{b}; \ \ \mathbb{V}[Y] = \dfrac{a}{b^2}.$$`
]
]

--

- **Note**: there are multiple parameterizations of the gamma distribution. We will rely on this version in this course.

--

- Under this parameterization, `\(a\)` is known as the shape parameter, while `\(b\)` is known as the rate parameter.

--

- Under this parameterization, if `\(Y \sim \textrm{Ga}(1,\theta)\)`, then `\(Y \sim \textrm{Exp}(\theta)\)`, that is, the .hlight[exponential distribution].
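---

## Aside: continuous densities in R

Two quick checks of the claims above, as a minimal R sketch; all parameter values are arbitrary illustrations.

```r
# A pdf need not be below 1: the Beta(10, 10) density at its mode
dbeta(0.5, shape1 = 10, shape2 = 10)       # approx 3.52

# Beta(1, 1) coincides with Unif(0, 1)
y <- seq(0.01, 0.99, by = 0.01)
all.equal(dbeta(y, 1, 1), dunif(y, 0, 1))  # TRUE

# The Unif(a, b) mean is (a + b)/2; check by numerical integration
a <- 2; b <- 5
integrate(function(y) y * dunif(y, a, b), lower = a, upper = b)$value  # 3.5
```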
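---

## Aside: gamma density checks in R

The gamma facts above can also be verified numerically. A minimal sketch; `a` and `b` are arbitrary, and note that R's `rate` argument matches the parameterization used in this course.

```r
a <- 3; b <- 2
set.seed(602)
y <- rgamma(1e5, shape = a, rate = b)

mean(y)   # approx a/b   = 1.5
var(y)    # approx a/b^2 = 0.75

# Ga(1, b) is the exponential distribution Exp(b)
z <- seq(0.1, 5, by = 0.1)
all.equal(dgamma(z, shape = 1, rate = b), dexp(z, rate = b))  # TRUE
```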
---

## Continuous joint distributions

- Suppose we have two random variables `\(\theta = (\theta_1,\theta_2)\)`.

--

- Their **joint distribution function** is
.block[
.small[
`$$\Pr(\theta_1 \leq a,\theta_2 \leq b) = \int^a_{-\infty} \int^b_{-\infty} f(\theta_1,\theta_2) \textrm{d}\theta_2\textrm{d}\theta_1,$$`
]
]

where `\(f(\theta_1,\theta_2)\)` is the **joint pdf**.

--

- The **marginal** density of `\(\theta_1\)` can be obtained by
.block[
.small[
`$$f(\theta_1) = \int^\infty_{-\infty} f(\theta_1,\theta_2) \textrm{d}\theta_2,$$`
]
]

which is referred to as marginalizing out `\(\theta_2\)`.

--

- We will be doing a lot of "marginalizations", so take note!

---

## Factorizing joint densities and independence

- The joint density `\(f(\theta_1,\theta_2)\)` can be factorized as
.block[
.small[
`$$f(\theta_1,\theta_2) = f(\theta_1|\theta_2)f(\theta_2), \ \ \ \textrm{or} \ \ \ f(\theta_1,\theta_2) = f(\theta_2|\theta_1)f(\theta_1).$$`
]
]

--

- For independent random variables, the joint density equals the product of the marginals:
.block[
.small[
`$$f(\theta_1,\theta_2) = f(\theta_1)f(\theta_2).$$`
]
]

--

- This implies that `\(f(\theta_2|\theta_1) = f(\theta_2)\)` and `\(f(\theta_1|\theta_2) = f(\theta_1)\)` under independence.

--

- These relationships extend automatically to `\(\theta = (\theta_1,\ldots,\theta_p)\)`. That is,
.block[
.small[
`$$f(\theta_1,\ldots,\theta_p) = \prod^p_{j=1} f(\theta_j),$$`
]
]

under mutual independence of the elements of the `\(\theta\)` vector.

---

## Conditional independence

- Suppose `\(y_i \overset{iid}{\sim} f(y_i | \theta)\)` for `\(i = 1,\ldots,n\)`.

--

- Data `\(\{y_i\}\)` are independent and identically distributed draws from the distribution `\(f(y_i | \theta)\)`.

--

- The data are said to be .hlight[conditionally independent] given `\(\theta\)` if
.block[
.small[
`$$f(y_1,\ldots,y_n | \theta) = \prod^n_{i=1} f(y_i | \theta).$$`
]
]

--

- `\(f(y_1,\ldots,y_n | \theta)\)` is also the likelihood function `\(L(\theta | y)\)` of the data.

--

- The .hlight[marginal likelihood] of the data is
.block[
.small[
`$$L(y) = f(y_1,\ldots,y_n) = \int_\Theta f(y_1,\ldots,y_n | \theta) p(\theta)\textrm{d}\theta = \int_\Theta L(\theta | y)p(\theta)\textrm{d}\theta.$$`
]
]

--

- Here, `\(L(y)\)` cannot be written as a product of densities as in `\(\prod\limits^n_{i=1} f(y_i)\)`; we lose independence when we marginalize out `\(\theta\)`.

---

## Exchangeability

- After marginalizing out `\(\theta\)`, the observations `\(\{y_i\}\)` are no longer marginally independent.

--

- `\(\{y_i\}\)` are .hlight[exchangeable] if `\(f(y_1,\ldots,y_n) = f(y_{\pi_1},\ldots,y_{\pi_n})\)` for all permutations `\(\pi\)` of `\(\{1,\ldots,n\}\)`.

--

- .hlight[de Finetti's Theorem]: Suppose `\(\{y_i\}\)` are exchangeable under the above definition for any `\(n\)`. Then
.block[
.small[
`$$f(y_1,\ldots,y_n) = \int_\Theta \left[ \prod^n_{i=1} f(y_i| \theta) \right] p(\theta)\textrm{d}\theta$$`
]
]

for some parameter `\(\theta\)`, prior distribution `\(p(\theta)\)`, and sampling model `\(f(y_i|\theta)\)`.

--

- Simply put, de Finetti's Theorem states that exchangeable observations are conditionally independent relative to some parameter.

--

- de Finetti's Theorem is critical in providing a motivation for using parameters and for putting priors on parameters.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
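---

## Appendix: the marginal likelihood by simulation

An optional extra: a minimal sketch of the marginal likelihood integral `\(L(y) = \int_\Theta L(\theta | y)p(\theta)\textrm{d}\theta\)` by Monte Carlo, assuming hypothetical Bernoulli data with a Beta(1,1), i.e. Unif(0,1), prior.

```r
set.seed(360)
y <- c(1, 0, 1, 1, 0)   # hypothetical Bernoulli data

# Likelihood L(theta | y) under conditional independence
lik <- function(theta) theta^sum(y) * (1 - theta)^(length(y) - sum(y))

# Monte Carlo: average the likelihood over draws from the prior
theta_draws <- rbeta(1e5, 1, 1)
mean(lik(theta_draws))                      # approx 1/60

# Exact answer for comparison: the Beta(4, 3) normalizing constant
beta(1 + sum(y), 1 + length(y) - sum(y))    # 0.01666... = 1/60
```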