Subset Sampling and Its Extensions

07/21/2023
āˆ™
by   Jinchao Huang, et al.
āˆ™
0
āˆ™

This paper studies the subset sampling problem. The input is a set š’® of n records together with a function p that assigns each record vāˆˆš’® a probability p(v). A query returns a random subset X of š’®, where each record vāˆˆš’® is sampled into X independently with probability p(v). The goal is to store š’® in a data structure to answer queries efficiently. If š’® fits in memory, the problem is interesting when š’® is dynamic. We develop a dynamic data structure with š’Ŗ(1+μ_š’®) expected query time, š’Ŗ(n) space and š’Ŗ(1) amortized expected update, insert and delete time, where μ_š’®=āˆ‘_vāˆˆš’®p(v). The query time and space are optimal. If š’® does not fit in memory, the problem is difficult even if š’® is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in š’Ŗ((log^*_B n)/B+(μ_š’®/B)log_M/B (n/B)) amortized expected I/Os using š’Ŗ(n/B) space, where M is the memory size, B is the block size and log^*_B n is the number of iterative log_2(.) operations we need to perform on n before going below B. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range [a,b]. For this extension, we provide a solution under the dynamic setting, with š’Ŗ(log n+μ_š’®āˆ©[a,b]) expected query time, š’Ŗ(n) space and š’Ŗ(log n) amortized expected update, insert and delete time.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset
Success!
Error Icon An error occurred

Sign in with Google

×

Use your Google Account to sign in to DeepAI

×

Consider DeepAI Pro