class: center, middle, inverse, title-slide

# STA 360/602L: Module 3.2

## Rejection sampling; Importance sampling

### Dr. Olanrewaju Michael Akande

---

## Rejection sampling

- .hlight[Rejection sampling] and .hlight[Importance sampling] are among the first steps into Monte Carlo analysis, in which simulated values from one distribution are used to explore another.

--

- Simulating from the "wrong distribution" can be incredibly useful, as we will see in this module and also later in the course.

--

- Neither method is used very often, but both are still of practical interest in

  + fairly small problems, in terms of dimension,

--

  + in which the density of the distribution of interest can be evaluated easily but is difficult to sample from directly, and

--

  + when it is relatively easy to identify and simulate from distributions that approximate the distribution of interest.

--

- Importance sampling and rejection sampling use the same ideas, but the latter leads to exact corrections, and thus exact samples from the distribution of interest.

---

## Rejection sampling

- Setup:

  + `\(p(\theta)\)` is some density we are interested in sampling from;

--

  + `\(p(\theta)\)` is tough to sample from, but we are able to evaluate `\(p(\theta)\)` as a function at any point; and

--

  + `\(g(\theta)\)` is some .hlight[proposal distribution] or .hlight[importance sampling distribution] that is easier to sample from.

--

- Two key requirements:

  + `\(g(\theta)\)` is easy to sample from; and

--

  + `\(g(\theta)\)` is easy to evaluate at any point, as is the case for `\(p(\theta)\)`.

--

- Usually, the context is one in which `\(g(\theta)\)` has been derived as an analytic approximation to `\(p(\theta)\)`; the closer the approximation, the more accurate the resulting Monte Carlo analysis will be.

---

## Rejection sampling

- Procedure:

  1. Define `\(w(\theta) = p(\theta)/g(\theta)\)`.

--

  2. Assume that `\(w(\theta) = p(\theta)/g(\theta) < M\)` for some constant `\(M\)`. If `\(g(\theta)\)` represents a good approximation to `\(p(\theta)\)`, then `\(M\)` should not be too far from 1.

--

  3. Generate a _candidate_ value `\(\theta \sim g(\theta)\)` and **accept** it with probability `\(w(\theta)/M\)`: if accepted, `\(\theta\)` is a draw from `\(p(\theta)\)`; otherwise, **reject** and try again.

--

  Equivalently, generate `\(u \sim U(0,1)\)` independently of `\(\theta\)`. Then **accept** `\(\theta\)` as a draw from `\(p(\theta)\)` if, and only if, `\(u < w(\theta)/M\)`.

--

- For those interested, the proof that all accepted `\(\theta\)` values are indeed from `\(p(\theta)\)` is on the next slide. We will not spend time on it.

--

- Clearly, we need `\(M\)` for this to work, and finding a valid `\(M\)` can be difficult in general. However, in the case of truncated densities, we actually have `\(M\)`.

---

## Proof for simple accept/reject

- We need to show that all accepted `\(\theta\)` values are indeed from `\(p(\theta)\)`. Equivalently, show that `\(f(\theta \ | \ u < w(\theta)/M) = p(\theta)\)`.

- By Bayes' theorem,
  .block[
.small[
`$$f(\theta \ | \ u < w(\theta)/M) = \dfrac{\Pr(\theta \textrm{ and } u < w(\theta)/M)}{\Pr(u < w(\theta)/M)} = \dfrac{\Pr(u < w(\theta)/M \ | \ \theta) \ g(\theta)}{\Pr(u < w(\theta)/M)}.$$`
]
]

- But,

  + `\(\Pr(u < w(\theta)/M \ | \ \theta) = w(\theta)/M\)` since `\(u \sim U(0,1)\)`, and

  + .block[
.small[
$$
`\begin{split}
\Pr(u < w(\theta)/M) & = \int \Pr(u < w(\theta)/M \ | \ \theta) \ g(\theta) \ d\theta\\
& = \int \dfrac{w(\theta)}{M} \ g(\theta) \ d\theta = \dfrac{1}{M} \int w(\theta) \ g(\theta) \ d\theta = \dfrac{1}{M} \int p(\theta) \ d\theta = \dfrac{1}{M}.
\end{split}`
$$
]
]

- Therefore,
  .block[
.small[
`$$f(\theta \ | \ u < w(\theta)/M) = \dfrac{\Pr(u < w(\theta)/M \ | \ \theta) \ g(\theta)}{\Pr(u < w(\theta)/M)} = \dfrac{\left(w(\theta)/M\right) \ g(\theta)}{1/M} = w(\theta) \ g(\theta) = p(\theta).$$`
]
]
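---

## Rejection sampling: a toy code sketch

- Here is a minimal sketch of the accept/reject step for a toy case where `\(M\)` is available exactly: the target `\(p(\theta)\)` is the Beta(2,2) density and the proposal `\(g(\theta)\)` is Uniform(0,1), so `\(w(\theta) = p(\theta)/g(\theta) = p(\theta)\)`, which is maximized at `\(\theta = 0.5\)`, giving `\(M = 1.5\)`.

```r
#Toy sketch of rejection sampling: Beta(2,2) target,
#Uniform(0,1) proposal, so w(theta) = dbeta(theta,2,2),
#which is maximized at theta = 0.5, and M = dbeta(0.5,2,2) = 1.5.
set.seed(12345)

m <- 10000; THETA <- rep(0,m)
M <- dbeta(0.5,2,2)

for(i in 1:m){
  t <- 0
  while(t < 1){
    theta <- runif(1)           #candidate from g
    u <- runif(1)               #independent uniform
    if(u < dbeta(theta,2,2)/M){ #accept with probability w(theta)/M
      THETA[i] <- theta
      t <- 1
    }
  }
}

mean(THETA) #should be close to 0.5, the mean of Beta(2,2)
```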
---

## Rejection sampling for truncated densities

- The inverse CDF method works well for truncated densities, but what happens when we cannot, or prefer not to, write down the truncated CDF?

--

- Suppose we want to sample from `\(f_{[a,b]}(\theta)\)`, that is, a known pdf `\(f(\theta)\)` truncated to `\([a,b]\)`.

  + Recall that `\(f_{[a,b]}(\theta) \propto f(\theta)\mathbb{1}[\theta \in [a,b]]\)`. Using the notation for rejection sampling, `\(p(\theta) = f_{[a,b]}(\theta)\)` and `\(g(\theta) = f(\theta)\)`.

--

  + Set `\(1/M = \int^b_a f(\theta^\star)\textrm{d}\theta^\star\)`, so that `\(M\)` is the normalizing constant of the truncated density.

--

  + Then, `\(w(\theta) = p(\theta)/g(\theta) = M \mathbb{1}[\theta \in [a,b]] \leq M\)`, as required.

---

## Rejection sampling for truncated densities

- We can then use the rejection sampling procedure from a few slides back to generate the required samples.

--

- Specifically,

--

  + For each `\(i=1,\ldots,m\)`, generate `\(\theta_i \sim f\)`. If `\(\theta_i \in [a,b]\)`, **accept** `\(\theta_i\)`; otherwise, **reject** and try again.

--

  + It is easy to show that this is equivalent to accepting each `\(\theta_i\)` with probability `\(w(\theta_i)/M\)`.

---

## Example

```r
#Simple code for using rejection sampling to generate m samples
#from the Beta[10,10] density truncated to (0.35,0.6).
set.seed(12345)

#NOTE: there are more efficient ways to write this code!
#set sample size and create vector to store sample
m <- 10000; THETA <- rep(0,m)

#keep track of rejects
TotalRejects <- 0; Rejections <- NULL

#now the 'for loop'
for(i in 1:m){
  t <- 0
  while(t < 1){
    theta <- rbeta(1,10,10)
    if(theta > 0.35 & theta < 0.6){
      THETA[i] <- theta
      t <- 1
    } else {
      TotalRejects <- TotalRejects + 1
      Rejections <- rbind(Rejections,theta)
    }
  }
}

#Overall acceptance rate:
1 - TotalRejects/(m+TotalRejects)
```

```
## [1] 0.727802
```

---

## Example

How does our sample compare to the true truncated density? `\(m = 100\)`

<img src="3-2-rejection-importance-sampling_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Example

How does our sample compare to the true truncated density? `\(m = 1000\)`

<img src="3-2-rejection-importance-sampling_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Example

How does our sample compare to the true truncated density? `\(m = 10000\)`

<img src="3-2-rejection-importance-sampling_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Example

How does our sample compare to the true truncated density? `\(m = 100000\)`

<img src="3-2-rejection-importance-sampling_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Comments

- This approach is clearly less efficient than the inverse CDF method, which we already know how to use for this particular problem.

--

- When you can write down the truncated CDF, use the inverse CDF method instead.

--

- When you cannot, rejection sampling is a possible alternative, as are many other sampling methods that we will not cover in this course.

--

- More generally, though, rejection sampling can still be very useful.

--

- Importance sampling is another related sampling method, but we will not spend time on it. If you are interested, take a look at the next few slides. If not, feel free to skip them.

---

class: center, middle

# Optional content from here on...

---

## Importance sampling

- Interest lies in expectations of the form (rather than in the actual samples)
  .block[
`$$H = \int h(\theta) p(\theta) d\theta.$$`
]

- Write
  .block[
`$$H = \int h(\theta) w(\theta) g(\theta) d\theta \ \ \ \textrm{with} \ \ \ w(\theta) = p(\theta)/g(\theta),$$`
]

  that is, `\(\mathbb{E}[h(\theta)]\)` under `\(p(\theta)\)` is just `\(\mathbb{E}[h(\theta) w(\theta)]\)` under `\(g(\theta)\)`.

- Using direct Monte Carlo integration,
  .block[
`$$\bar{h} = \dfrac{1}{m} \sum_{i=1}^m w(\theta_i) h(\theta_i),$$`
]

  where `\(\theta_1, \ldots, \theta_m \overset{ind}{\sim} g(\theta)\)`. We are sampling from the "wrong" distribution.

---

## Importance sampling

- The measure of "how wrong" we are at each simulated `\(\theta_i\)` value is the .hlight[importance weight]
  .block[
`$$w(\theta_i) = p(\theta_i)/g(\theta_i).$$`
]

--

- These ratios weight the sample estimates `\(h(\theta_i)\)` to "correct" for the fact that we sampled from the "wrong" distribution.

- See [Lopes & Gamerman (Ch 3.4)](https://www.amazon.com/Markov-Chain-Monte-Carlo-Statistical/dp/1584885874) and [Robert and Casella (Ch. 3.3)](https://www.amazon.com/Monte-Statistical-Methods-Springer-Statistics/dp/1441919392) for discussions of convergence and optimality.

--

- Clearly, the closer `\(g\)` is to `\(p\)`, the better the results, just as we had with rejection sampling.

---

## Importance sampling

- Key considerations:

  + The MC estimate `\(\bar{h}\)` has expectation `\(H\)` and is generally almost surely convergent to `\(H\)` (under certain conditions, of course, but we will not dive into those).

--

  + `\(\mathbb{V}[\bar{h}]\)` is finite in cases in which, roughly, `\(w(\theta) = p(\theta)/g(\theta)\)` is bounded and decays rapidly in the tails of `\(p(\theta)\)`.

--

  + Thus, superior MC approximations are achieved for choices of `\(g(\theta)\)` whose tails dominate those of the target `\(p(\theta)\)`.

--

  + That is, importance sampling distributions should be chosen to have tails at least as fat as those of the target (think of a t-distribution proposal for a normal target).

--

  + Obviously, we require the support of `\(g(\theta)\)` to be the same as, or contain, that of `\(p(\theta)\)`.

--

- These considerations clearly apply to rejection sampling too.

---

## Importance sampling

- Problems in which `\(w(\theta) = p(\theta)/g(\theta)\)` can be computed exactly are actually rare.

--

- As you will see when we move away from conjugate distributions, we usually only know `\(p(\theta)\)` up to a normalizing constant.

--

- When this is the case, simply "re-normalize" the importance weights, so that
  .block[
`$$\bar{h} = \sum_{i=1}^m w_i h(\theta_i) \ \ \ \textrm{where} \ \ \ w_i = \dfrac{w(\theta_i)}{\sum^m_{j=1}w(\theta_j)}.$$`
]

  (The `\(1/m\)` factor is absorbed into the re-normalized weights, which sum to one.)

--

- Generally, in importance sampling, weights that are close to uniform are desirable, and very unevenly distributed weights are not. The next slide gives a small code sketch.
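---

## Importance sampling: a toy code sketch

- Here is a minimal sketch of self-normalized importance sampling for a toy case: the target is Beta(10,10), but we pretend to know it only up to its normalizing constant, `\(p^\star(\theta) = \theta^9(1-\theta)^9\)`, and take `\(g(\theta)\)` to be Uniform(0,1). We estimate `\(H = \mathbb{E}[\theta]\)`, whose true value is 0.5.

```r
#Toy sketch of self-normalized importance sampling.
#Target known only up to a constant: p*(theta) = theta^9 * (1-theta)^9,
#an unnormalized Beta(10,10) density; proposal g = Uniform(0,1).
set.seed(12345)

m <- 10000
theta <- runif(m)          #draws from g
w <- theta^9 * (1-theta)^9 #unnormalized weights p*(theta)/g(theta)
w <- w/sum(w)              #re-normalized importance weights

sum(w*theta) #estimate of E[theta]; the true value is 0.5

#Very uneven weights signal a poor proposal; a common diagnostic
#is the effective sample size, 1/sum(w^2), relative to m.
1/sum(w^2)
```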
---

class: center, middle

# What's next?

### Move on to the readings for the next module!