21 Oct 2016 |
水塘抽样(Reservoir sampling) 是随机算法的一种。其目的在于从包含n个项目的集合S中选取k个样本，其中n为一很大或未知的数量，尤其适用于不能把所有n个项目都存放到主内存的情况。
- Keep the first item in memory.
- When the i-th item arrives (for i>1):with probability 1/i, keep the new item (discard the old one)with probability 1-1/i, keep the old item (ignore the new one)
- when there is only one item, it is kept with probability 1;
- when there are 2 items, each of them is kept with probability 1/2;
- when there are 3 items, the third item is kept with probability 1/3, and each of the previous 2 items is also kept with probability (1/2)(1-1/3) = (1/2)(2/3) = 1/3;
- by induction, it is easy to prove that when there are n items, each item is kept with probability 1/n.
考虑一般情况，Algorithm R by Jeffrey Vitter：
/* S has items to sample, R will contain the result */ ReservoirSample(S[1..n], R[1..k]) // fill the reservoir array for i = 1 to k R[i] := S[i] // replace elements with gradually decreasing probability for i = k+1 to n j := random(1, i) // important: inclusive range if j <= k R[j] := S[i]