The Math Behind OnlineLogBinning
The online binning functionality works by combining the method described in Section D of Markus Wallerberger (2019), as well as in the Dicussion of Vinay Ambegaokar, Matthias Troyer (2010), where one keeps track of several accumulators, with the first-pass pairwise algorithm from Tony F. Chan, Gene H. Golub, Randall J. LeVeque (1983). The online (i.e. O(1)
) quantities that are obtained from this process are the Tvalue
, $T$, and Svalue
, $S$, from Tony F. Chan, Gene H. Golub, Randall J. LeVeque (1983), representing the total accumulator and square accumulator, respectively, as well as the total number of bins $m$. Together, these online quantities can be combined at any point to yield other (technically) online statistics like the mean
or var
iance. These statistics are online in the sense that they are simple function of online accumulators, and so we emphasize their calculation is still amortized with complexity O(1)
. This is despite that the mean
, var
iance, etc. are not updated continuously; only $m$, $T$, and $S$ are.
Using the notation of Ref. Tony F. Chan, Gene H. Golub, Randall J. LeVeque (1983), the $T$ and $S$ calculated in a data stream comprised of a sequence of $m$ elements,
\[x_k \in \left\lbrace x_1,x_2,\dots,x_m\right\rbrace\]
is given by
\[T_{1,m} = \sum_{k = 1}^m x_k,\]
\[S_{1,m} = \sum_{k = 1}^m \left(x_k - \bar{x} \right)^2,\]
where the mean is given by
\[\bar{x} = \frac{T_{1,m}}{m}.\]
The variance is then given by
\[\sigma^2 = \frac{S_{1,m}}{m-1}.\]
These two quantities can be computed online with the first-pass pairwise algorithm given an additional two elements $\left\lbrace x_{m+1}, x_{m+2} \right\rbrace$ using the following expressions:
\[T_{1,m + 2} = T_{1,m} + T_{m+1,m+2},\]
\[S_{1,m + 2} = S_{1, m} + S_{m+1, m+2} + \frac{m}{2(m+2)} \left( \frac{2}{m} T_{1,m} - T_{m+1,m+2} \right)^2,\]
where
\[T_{m+1,m+2} = x_{m+1} + x_{m+2},\]
\[S_{m+1, m+2} = \frac{1}{2}\left(x_{m+2} - x_{m+1} \right)^2.\]
We implement this with three nested Accumulator
struct
s: the outermost BinningAccumulator
, the middle-level LevelAccumulator
, and the innermost PairAccumulator
. The BinningAccumulator
stores a Vector
of LevelAccumulators
, each of which store their own PairAccumulator
.
The PairAccumulator
struct
is the outward facing element of the BinningAccumulator
in that it takes in data directly from the data stream. After a pair of values has been imported from the stream, then $\left\lbrace T_{m+1,m+2}, S_{m+1,m+2} \right\rbrace$ are computed and exported to the encapsulating LevelAccumulator
, where the $\left\lbrace m, T_{1,m}, S_{1,m} \right\rbrace$ accumulator values are stored. Then, the PairAccumulator
is reset!
. At the same time, the outermost BinningAccumulator
passes the $T_{m+1,m+2} / 2$ (ie the pairwise mean) value onto the PairAccumulator
in the LevelAccumulator
at the next binning level, where the whole process is repeated again, except the accumulated $T_{1,m+2} / 2$ values comprise the new data stream.
The fact that the data stream is processed in pairs before being passed along to the other binning levels inherently leads to a bin_depth
given by ${\rm floor}\left[\log_2 (m)\right]$, which is the total number of binning levels in the data stream.