Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

07/16/2020
βˆ™
by   Seth Pettie, et al.
βˆ™
0
βˆ™

In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error in the Random Oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (π–₯π—‚π—Œπ—) number β„‹/ℐ. It captures the tension between the limiting Shannon entropy (β„‹) of the sketch and its normalized Fisher information (ℐ), which characterizes (asymptotically) the variance of a statistically efficient estimator. We prove that many variants of the 𝖯𝖒𝖲𝖠 sketch of Flajolet and Martin have π–₯π—‚π—Œπ— number H_0/I_0, where H_0,I_0 are two precisely-defined constants, and that all base-q generalizations of (Hyper)LogLog are strictly worse than H_0/I_0, but tend to H_0/I_0 in the limit as qβ†’βˆž. All other known sketches have even worse π–₯π—‚π—Œπ—-numbers. We introduce a new sketch called π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ that is based on a smoothed, compressed version of 𝖯𝖒𝖲𝖠 with a different estimation function. π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ has π–₯π—‚π—Œπ— number H_0/I_0 β‰ˆ 1.98. It stores O(log^2log U) + (H_0/I_0)b β‰ˆ 1.98b bits, and estimates cardinalities of multisets of [U] with a standard error of (1+o(1))/√(b). π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹'s space-error tradeoff improves on state-of-the-art sketches like HyperLogLog, or even compressed representations of it. π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ can be used in a distributed environment (where substreams are sketched separately and composed later). We conjecture that the π–₯π—‚π—Œπ—-number H_0/I_0 is a universal lower bound for any such composable sketch.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset
Success!
Error Icon An error occurred

Sign in with Google

×

Use your Google Account to sign in to DeepAI

×

Consider DeepAI Pro