UncertainHistogramming

Have you ever needed to visualize a set of data as a histogram, except that each data point comes with some amount of uncertainty? If so, this package is for you! UncertainHistogramming.jl is a lightweight Julia package to plot a density function for a given set of values with known uncertainties.

Background Information

An example application of the main exported abstract type, ContinuousHistogram, is to visualize a "histogram" of experimental values, when each value has a measured experimental uncertainty. This is to be contrasted with normal histogramming, which assumes that each value is exact, meaning its uncertainty is zero().

For me, the need for this package first came about when I was running Monte Carlo simulations, where I needed to understand the underlying distribution of some observables. But, as anybody who has ever played around with Monte Carlo methods knows, each observable has a certain amount of statistical error. Thus, any regular histogram I would make when ignoring these statistical errors would not really expose the true distribution, as each data point could not entirely be claimed by a single histogram bin. So I invented the ContinuousHistogram as a somewhat tongue-in-cheek generalization of the regular histogram that takes data uncertainty into account.

This package provides functionality similar to kernel density estimation (KDE), but here the data uncertainties, which act as the kernel bandwidths, are in principle all different.

Note

A ContinuousHistogram is continuous in the sense of its domain. This is admittedly a bit confusing, but the discretization that occurs in a regular histogram comes from its bins, or its domain, not its range. Of course, the range, or vertical values, are jumpy, but that is because of the discrete nature of the regular histogram. Most kernel functions that exist are at least piecewise continuous in their range, which is the same standard we take here.

Available ContinuousHistograms

We currently offer the following ContinuousHistograms, each of which implements its designated KernelDistribution and kernel function:

GaussianHistogram, built from Gaussian kernels:

\[G(y; \mu_i, \sigma_i) = \frac{ \exp\left[ -\frac{ \left( y - \mu_i \right)^2 }{2 \sigma_i^2} \right] }{ \sigma_i \sqrt{2\pi}}.\]

UniformHistogram, built from uniform kernels:

\[\mathcal{U}(y; x_i, \epsilon_i) = \begin{cases} \frac{1}{2\epsilon_i}, & y \in (x_i - \epsilon_i, x_i + \epsilon_i) \\ 0, & \mathrm{otherwise} \end{cases}.\]

Each ContinuousHistogram is built around value-error pairs. For example, with the GaussianHistogram, the value-error pair is the mean and standard deviation of that Gaussian.
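To make the kernel definitions concrete, here is a minimal, self-contained sketch of the two kernels and of a density built from value-error pairs. The function names are hypothetical and are not part of the package. The equal-weight average over kernels is an assumption, although it appears consistent with the moments printed in the example usage.

```julia
# Hypothetical helper names; none of these are exported by UncertainHistogramming.jl.

# Gaussian kernel with mean μ and standard deviation σ.
gaussian_kernel(y, μ, σ) = exp(-(y - μ)^2 / (2σ^2)) / (σ * sqrt(2π))

# Uniform kernel centered at x with half-width ϵ (open interval, as in the formula).
uniform_kernel(y, x, ϵ) = abs(y - x) < ϵ ? 1 / (2ϵ) : 0.0

# Assumed normalization: the equal-weight average of the individual kernels.
mixture_density(kernel, y, data) =
    sum(kernel(y, v, e) for (v, e) in data) / length(data)

value_errors = [(-3.5, 0.5), (-1.5, 0.75), (0.0, 0.25), (1.5, 0.75), (3.5, 0.5)]

ys = range(-8, 8; length = 4001)
gdensity = [mixture_density(gaussian_kernel, y, value_errors) for y in ys]
udensity = [mixture_density(uniform_kernel, y, value_errors) for y in ys]

# Riemann-sum check that each density integrates to roughly one.
garea = sum(gdensity) * step(ys)
uarea = sum(udensity) * step(ys)
```

Because every kernel is normalized and the weights sum to one, both areas should come out close to one, up to discretization error at the edges of the uniform kernels.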

Example Usage

An example ContinuousHistogram can be constructed from the following Julia code for a simple Vector of value-error Tuples.

To start, load the following packages:

using Plots # One must Pkg.add this separately
using UncertainHistogramming
Plots.gr()  # Use GR to reproduce the plot exactly

Next, define a list of (value, error)-Tuples:

values_errors = [(-3.5, 0.5),
                 (-1.5, 0.75),
                 (0, 0.25),
                 (1.5, 0.75),
                 (3.5, 0.5)]

From here, we are in a position to initialize both a GaussianHistogram ghist and a UniformHistogram uhist, and then push! the values_errors Vector into them. The output printed after each push! call is shown directly beneath it.

ghist = GaussianHistogram()
uhist = UniformHistogram()
push!(ghist, values_errors)
GaussianHistogram{Float64}:
  length  = 5
  moments = 1.1102230246251565e-16  6.1375  1.7763568394002505e-15  72.89453125  

  Statistics
    mean        = 1.1102230246251565e-16
    variance    = 6.1375
    skewness    = -1.7615310149552143e-17
    kurtosis    = -1.0648620173302747
push!(uhist, values_errors)
UniformHistogram{Float64}:
  length  = 5
  moments = 1.1102230246251565e-16  5.9125  0.0  65.54296875  

  Statistics
    mean        = 1.1102230246251565e-16
    variance    = 5.9125
    skewness    = -1.369764503743595e-16
    kurtosis    = -1.125075426073508

Note that the non-central statistical moments are updated in an online manner. This means that, aside from the overhead associated with push!ing two elements into the ContinuousHistogram's values and errors Vectors, the cost of computing the statistics is amortized over the push! calls rather than paid in a separate pass over the data.
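As an illustration of what an online update can look like, the sketch below keeps running means of the first four raw moments for Gaussian kernels and folds in one value-error pair at a time. The type and function names are hypothetical, and this is not claimed to be the package's internal implementation. It does reproduce the second and fourth moments printed for ghist above.

```julia
# Hypothetical type for illustration; not the package's internal implementation.
mutable struct RunningMoments
    n::Int
    m::Vector{Float64}   # running means of the raw moments E[y^k], k = 1:4
end
RunningMoments() = RunningMoments(0, zeros(4))

# Raw moments E[y^k], k = 1:4, of one Gaussian kernel with mean μ and std σ.
gaussian_raw_moments(μ, σ) = [μ,
                              μ^2 + σ^2,
                              μ^3 + 3μ * σ^2,
                              μ^4 + 6μ^2 * σ^2 + 3σ^4]

# Fold one (value, error) pair into the running means with O(1) work.
function update!(acc::RunningMoments, pair)
    μ, σ = pair
    acc.n += 1
    acc.m .+= (gaussian_raw_moments(μ, σ) .- acc.m) ./ acc.n
    return acc
end

value_errors = [(-3.5, 0.5), (-1.5, 0.75), (0.0, 0.25), (1.5, 0.75), (3.5, 0.5)]
acc = RunningMoments()
for p in value_errors
    update!(acc, p)
end

acc.m[2]   # ≈ 6.1375
acc.m[4]   # ≈ 72.89453125
```

Each update touches only four numbers, so the statistics are always ready to read without revisiting the stored values and errors.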

From here, we just need to define an input domain over which the ContinuousHistograms are evaluated:

x = LinRange(-6, 6, 3000)
3000-element LinRange{Float64, Int64}:
 -6.0,-5.996,-5.992,-5.988,-5.98399,…,5.97999,5.98399,5.988,5.992,5.996,6.0

and then, with the help of Plots.jl and RecipesBase.jl, we have

plot( plot(x, ghist; title = "\$ \\mathtt{GaussianHistogram}ming \$"),
      plot(x, uhist; title = "\$ \\mathtt{UniformHistogram}ming \$");
      size = (800, 600),
      layout = (1, 2),
      link = :both )

In the left plot, the GaussianHistogram is plotted as the solid blue curve, and the individual Gaussian kernels that make it up are plotted as the dashed orange curves. The right plot shows the same set of curves, defined instead by the UniformDistribution. (The orange curves are nonzero only within their visible range; elsewhere they are hidden by the solid blue curve.)

I want to remark here that, thanks to Julia's multiple dispatch, once one properly defines the interface for a new type of ContinuousHistogram, the plotting functionality, along with the utilities and statistics, just works.
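The following toy example sketches the general idea of writing generic code once against an abstract type and letting dispatch select the kernel. All names here are hypothetical and chosen for illustration. They do not describe the actual interface that a new ContinuousHistogram must implement.

```julia
# Toy types for illustration only; these are not the package's actual interface.
abstract type ToyHistogram end

struct ToyGaussian <: ToyHistogram
    data::Vector{Tuple{Float64, Float64}}
end

struct ToyTriangular <: ToyHistogram
    data::Vector{Tuple{Float64, Float64}}
end

# Each concrete type supplies only its own (normalized) kernel method.
kernel(::ToyGaussian, y, μ, σ) = exp(-(y - μ)^2 / (2σ^2)) / (σ * sqrt(2π))
kernel(::ToyTriangular, y, x, w) = max(0.0, 1 - abs(y - x) / w) / w

# Generic code written once against the abstract type; it works for every subtype.
density(h::ToyHistogram, y) =
    sum(kernel(h, y, v, e) for (v, e) in h.data) / length(h.data)

data = [(-1.0, 0.5), (1.0, 0.5)]
gauss_at_zero = density(ToyGaussian(data), 0.0)
tri_at_zero   = density(ToyTriangular(data), 0.0)
tri_at_one    = density(ToyTriangular(data), 1.0)
```

Adding another kernel shape in this toy amounts to one new struct and one new kernel method, and density needs no changes.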

Note

One may also supply the keyword argument nkernels to plot(x, hist) to change the number of kernels displayed. By default, nkernels == 5.

If the number of value-error pairs exceeds nkernels, that is nkernels < length(hist), then no kernels will be shown to save the end user from trying to understand an overly busy plot.

Add UncertainHistogramming.jl to your Julia environment

To add UncertainHistogramming.jl, simply press ] in the Julia REPL to enter pkg mode and type

pkg> add UncertainHistogramming

and presto! You now have full access to UncertainHistogramming.jl.