The Math Behind OnlineLogBinning

The online binning functionality works by combining the method described in Section D of Markus Wallerberger (2019), as well as in the Dicussion of Vinay Ambegaokar, Matthias Troyer (2010), where one keeps track of several accumulators, with the first-pass pairwise algorithm from Tony F. Chan, Gene H. Golub, Randall J. LeVeque (1983). The online (i.e. O(1)) quantities that are obtained from this process are the Tvalue, $T$, and Svalue, $S$, from Tony F. Chan, Gene H. Golub, Randall J. LeVeque (1983), representing the total accumulator and square accumulator, respectively, as well as the total number of bins $m$. Together, these online quantities can be combined at any point to yield other (technically) online statistics like the mean or variance. These statistics are online in the sense that they are simple function of online accumulators, and so we emphasize their calculation is still amortized with complexity O(1). This is despite that the mean, variance, etc. are not updated continuously; only $m$, $T$, and $S$ are.

Using the notation of Ref. Tony F. Chan, Gene H. Golub, Randall J. LeVeque (1983), the $T$ and $S$ calculated in a data stream comprised of a sequence of $m$ elements,

\[x_k \in \left\lbrace x_1,x_2,\dots,x_m\right\rbrace\]

is given by

\[T_{1,m} = \sum_{k = 1}^m x_k,\]

\[S_{1,m} = \sum_{k = 1}^m \left(x_k - \bar{x} \right)^2,\]

where the mean is given by

\[\bar{x} = \frac{T_{1,m}}{m}.\]

The variance is then given by

\[\sigma^2 = \frac{S_{1,m}}{m-1}.\]

These two quantities can be computed online with the first-pass pairwise algorithm given an additional two elements $\left\lbrace x_{m+1}, x_{m+2} \right\rbrace$ using the following expressions:

\[T_{1,m + 2} = T_{1,m} + T_{m+1,m+2},\]

\[S_{1,m + 2} = S_{1, m} + S_{m+1, m+2} + \frac{m}{2(m+2)} \left( \frac{2}{m} T_{1,m} - T_{m+1,m+2} \right)^2,\]

where

\[T_{m+1,m+2} = x_{m+1} + x_{m+2},\]

\[S_{m+1, m+2} = \frac{1}{2}\left(x_{m+2} - x_{m+1} \right)^2.\]

We implement this with three nested Accumulator structs: the outermost BinningAccumulator, the middle-level LevelAccumulator, and the innermost PairAccumulator. The BinningAccumulator stores a Vector of LevelAccumulators, each of which store their own PairAccumulator.

The PairAccumulator struct is the outward facing element of the BinningAccumulator in that it takes in data directly from the data stream. After a pair of values has been imported from the stream, then $\left\lbrace T_{m+1,m+2}, S_{m+1,m+2} \right\rbrace$ are computed and exported to the encapsulating LevelAccumulator, where the $\left\lbrace m, T_{1,m}, S_{1,m} \right\rbrace$ accumulator values are stored. Then, the PairAccumulator is reset!. At the same time, the outermost BinningAccumulator passes the $T_{m+1,m+2} / 2$ (ie the pairwise mean) value onto the PairAccumulator in the LevelAccumulator at the next binning level, where the whole process is repeated again, except the accumulated $T_{1,m+2} / 2$ values comprise the new data stream.

The fact that the data stream is processed in pairs before being passed along to the other binning levels inherently leads to a bin_depth given by ${\rm floor}\left[\log_2 (m)\right]$, which is the total number of binning levels in the data stream.