Basic usage
This notebook walks through the core workflow:
- Creating a hypergrid with DenseHypergrid and SparseHypergrid
- Fitting and incrementally updating
- Inspecting mass, edges, and shape
- Describing the distribution with describe()
- Converting to a dense array and probability vector
- Basic visualizations
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from hypergrid import DenseHypergrid, SparseHypergrid, compute_edges
1. Generating sample data
In [2]:
rng = np.random.default_rng(42)
data = rng.standard_normal((5000, 3))
print(f"Data shape: {data.shape}")
print(f"Column means: {data.mean(axis=0).round(3)}")
print(f"Column stds: {data.std(axis=0).round(3)}")
Data shape: (5000, 3)
Column means: [ 0.007 -0.001 -0.035]
Column stds: [1.006 1.006 1.   ]
2. Computing bin edges automatically
In [3]:
edges = compute_edges(data, method="fd") # Freedman-Diaconis
for d, e in enumerate(edges):
    print(f"Dim {d}: {len(e)-1} bins [{e[0]:.2f}, {e[-1]:.2f}]")
Dim 0: 51 bins [-3.89, 4.15]
Dim 1: 47 bins [-3.65, 3.45]
Dim 2: 51 bins [-4.39, 3.44]
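For reference, the Freedman-Diaconis rule sets the bin width in each dimension to 2 * IQR * n^(-1/3). The sketch below reproduces that rule with plain NumPy. `fd_edges` is a hypothetical helper written for this illustration, and whether `compute_edges(method="fd")` uses exactly this formula internally is an assumption that this notebook does not confirm.

```python
import numpy as np

def fd_edges(x):
    """Equal-width bin edges for a 1-D sample using the Freedman-Diaconis rule."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    width = 2.0 * (q75 - q25) / np.cbrt(x.size)  # h = 2 * IQR * n^(-1/3)
    lo, hi = x.min(), x.max()
    if width <= 0 or hi == lo:
        # Degenerate sample (zero IQR or constant data): fall back to one bin.
        return np.array([lo, hi])
    n_bins = int(np.ceil((hi - lo) / width))
    return np.linspace(lo, hi, n_bins + 1)

rng = np.random.default_rng(0)
sample = rng.standard_normal((5000, 3))
sketch_edges = [fd_edges(sample[:, d]) for d in range(sample.shape[1])]
for d, e in enumerate(sketch_edges):
    print(f"Dim {d}: {len(e) - 1} bins [{e[0]:.2f}, {e[-1]:.2f}]")
```

NumPy exposes the same rule through `np.histogram_bin_edges(x, bins="fd")`, which can serve as a cross-check.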
3. DenseHypergrid: numpy backend
In [4]:
dense = DenseHypergrid(edges)
dense.fit(data)
mass = dense.get_mass() # {tuple_index: float}
print(f"Non-empty bins: {len(mass)}")
print(f"Total mass: {sum(mass.values()):.0f} (should be ~{len(data)})")
print(f"Grid shape: {dense.shape}")
Non-empty bins: 4144
Total mass: 4997 (should be ~5000)
Grid shape: [51, 47, 51]
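Conceptually, a dense backend stores one counter per grid cell. The sketch below builds such an array with `np.histogramdd` and derives a `{tuple_index: float}` dictionary of non-empty cells, mirroring the shape of the `get_mass()` result above. It assumes NumPy's binning conventions (points outside the edges are dropped, the right-most edge is inclusive); the library's actual implementation may differ. `counts_to_mass` is a hypothetical helper.

```python
import numpy as np

def counts_to_mass(counts):
    """Map each non-empty cell index (a tuple of ints) to its count."""
    return {tuple(int(i) for i in idx): float(counts[tuple(idx)])
            for idx in np.argwhere(counts > 0)}

rng = np.random.default_rng(0)
points = rng.standard_normal((2000, 3))
# Ten equal-width bins per dimension on [-3, 3]; points in the far tails
# lie outside this grid and are not counted.
grid_edges = [np.linspace(-3.0, 3.0, 11) for _ in range(3)]

counts, _ = np.histogramdd(points, bins=grid_edges)
mass = counts_to_mass(counts)

print(f"Grid shape: {counts.shape}")
print(f"Non-empty bins: {len(mass)}")
print(f"Total mass: {sum(mass.values()):.0f} of {len(points)} points")
```

Under this convention, points outside the edges do not contribute to the total mass.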
4. Incremental update
In [5]:
new_batch = rng.standard_normal((500, 3))
dense.update(new_batch)
print(f"Total mass after update: {sum(dense.get_mass().values()):.0f}")
Total mass after update: 5497
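Because histogram counts are additive, an incremental update amounts to binning the new batch on the same edges and adding the result to the existing counts. The sketch below shows this with `np.histogramdd`. It assumes fixed edges and that out-of-range points are dropped, which may not match how `update()` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_edges = [np.linspace(-4.0, 4.0, 17) for _ in range(3)]

first = rng.standard_normal((5000, 3))
batch = rng.standard_normal((500, 3))

# Initial fit: bin the first chunk.
counts, _ = np.histogramdd(first, bins=grid_edges)
# Incremental update: bin the new batch on the same edges and accumulate.
counts += np.histogramdd(batch, bins=grid_edges)[0]

# Compare against a single pass over all rows.
one_pass, _ = np.histogramdd(np.vstack([first, batch]), bins=grid_edges)
print(f"Matches single-pass histogram: {np.array_equal(counts, one_pass)}")
print(f"Total mass: {counts.sum():.0f}")
```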
5. SparseHypergrid: sparse dict backend
In [6]:
sparse = SparseHypergrid(edges)
sparse.fit(data)
print(f"Non-empty bins (sparse): {len(sparse.get_mass())}")
print(f"Grid cells total: {int(np.prod(sparse.shape))}")
Non-empty bins (sparse): 4144
Grid cells total: 122247
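Only 4144 of the 122247 cells are occupied here, so storing occupied cells alone saves memory. One way to picture a dict-backed grid is the sketch below, which locates each point with `np.searchsorted` and tallies index tuples in a `collections.Counter`. `sparse_fit` is a hypothetical helper written for this illustration and is not part of the hypergrid API.

```python
from collections import Counter
import numpy as np

def sparse_fit(points, edges):
    """Count points per occupied cell; returns {index_tuple: count}."""
    points = np.asarray(points, dtype=float)
    idx = np.empty(points.shape, dtype=np.intp)
    valid = np.ones(len(points), dtype=bool)
    for d, e in enumerate(edges):
        col = points[:, d]
        i = np.searchsorted(e, col, side="right") - 1
        i[col == e[-1]] = len(e) - 2          # right-most edge belongs to the last bin
        valid &= (i >= 0) & (i < len(e) - 1)  # drop points outside the edges
        idx[:, d] = i
    return Counter(map(tuple, idx[valid].tolist()))

rng = np.random.default_rng(0)
points = rng.standard_normal((5000, 3))
grid_edges = [np.linspace(-4.0, 4.0, 41) for _ in range(3)]

cells = sparse_fit(points, grid_edges)
total_cells = int(np.prod([len(e) - 1 for e in grid_edges]))
print(f"Occupied cells: {len(cells)} of {total_cells}")
print(f"Total mass: {sum(cells.values())}")
```

The memory cost of this layout scales with the number of occupied cells rather than with the product of the bin counts, which is the usual motivation for a sparse backend in higher dimensions.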
6. Describing the distribution
describe() returns a pandas DataFrame with summary statistics computed from the binned histogram, one column per dimension.
Statistics are derived from each dimension's marginal distribution:
| Row | Meaning |
|---|---|
| count | Total mass (number of data points binned) |
| mean | Probability-weighted mean of bin centres |
| std | Probability-weighted population standard deviation |
| skewness | Third standardised central moment (0 for symmetric distributions) |
| kurtosis | Excess kurtosis: fourth moment / std^4 - 3 (0 for a normal distribution) |
| min / max | Lower / upper edge of the outermost non-empty bin |
| 25%, 50%, 75% | Percentiles via marginal CDF interpolation |
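The definitions in the table can be written out for a single dimension as follows. This is a minimal sketch that follows the table literally (bin centres weighted by bin probability, percentiles from a piecewise-linear CDF over the bin edges). `marginal_summary` is a hypothetical helper, and whether `describe()` computes its values exactly this way is an assumption.

```python
import numpy as np

def marginal_summary(counts, edges, percentiles=(0.25, 0.5, 0.75)):
    """Summary statistics of one dimension from its bin counts and edges."""
    counts = np.asarray(counts, dtype=float)
    edges = np.asarray(edges, dtype=float)
    centres = 0.5 * (edges[:-1] + edges[1:])
    p = counts / counts.sum()                      # bin probabilities

    mean = np.sum(p * centres)
    var = np.sum(p * (centres - mean) ** 2)        # population variance
    std = np.sqrt(var)
    skewness = np.sum(p * (centres - mean) ** 3) / std ** 3
    kurtosis = np.sum(p * (centres - mean) ** 4) / std ** 4 - 3.0

    # CDF evaluated at the bin edges, then inverted by linear interpolation.
    cdf = np.concatenate([[0.0], np.cumsum(p)])
    quantiles = np.interp(percentiles, cdf, edges)

    occupied = np.flatnonzero(counts)
    out = {"count": counts.sum(), "mean": mean, "std": std,
           "skewness": skewness, "kurtosis": kurtosis,
           "min": edges[occupied[0]], "max": edges[occupied[-1] + 1]}
    out.update({f"{q:.0%}": v for q, v in zip(percentiles, quantiles)})
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
counts, edges = np.histogram(x, bins=50)
for name, value in marginal_summary(counts, edges).items():
    print(f"{name:<9} {value:>10.4f}")
```

Because every point is replaced by its bin centre, these values approximate the raw-sample statistics, with an error that shrinks as the bins get narrower.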
In [7]:
dense.describe()
Out[7]:
| | 0 | 1 | 2 |
|---|---|---|---|
| count | 5497.000000 | 5497.000000 | 5497.000000 |
| mean | 0.006895 | -0.003962 | -0.030504 |
| std | 1.002774 | 1.004823 | 0.999355 |
| skewness | 0.002880 | 0.029218 | 0.006473 |
| kurtosis | 0.049386 | 0.053493 | -0.116375 |
| min | -3.887804 | -3.648413 | -4.389115 |
| 25% | -0.659176 | -0.674558 | -0.707197 |
| 50% | -0.011989 | -0.013463 | -0.035536 |
| 75% | 0.690362 | 0.646047 | 0.642301 |
| max | 4.151241 | 3.302930 | 3.282495 |
Custom percentiles
Pass any list of quantile positions in [0, 1].
In [8]:
dense.describe(percentiles=[0.05, 0.25, 0.50, 0.75, 0.95])
Out[8]:
| | 0 | 1 | 2 |
|---|---|---|---|
| count | 5497.000000 | 5497.000000 | 5497.000000 |
| mean | 0.006895 | -0.003962 | -0.030504 |
| std | 1.002774 | 1.004823 | 0.999355 |
| skewness | 0.002880 | 0.029218 | 0.006473 |
| kurtosis | 0.049386 | 0.053493 | -0.116375 |
| min | -3.887804 | -3.648413 | -4.389115 |
| 5% | -1.627258 | -1.667982 | -1.684911 |
| 25% | -0.659176 | -0.674558 | -0.707197 |
| 50% | -0.011989 | -0.013463 | -0.035536 |
| 75% | 0.690362 | 0.646047 | 0.642301 |
| 95% | 1.627947 | 1.696625 | 1.643082 |
| max | 4.151241 | 3.302930 | 3.282495 |
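Comparing describe() output with raw data statistics
The histogram-derived statistics approximate the statistics of the raw data, and accuracy improves with finer bins. Note that dense also contains the 500-row batch added in section 4, while the raw-data column below uses only the original 5000 rows, so the two columns are not computed from exactly the same sample.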
In [9]:
import pandas as pd
from scipy.stats import skew, kurtosis as sp_kurtosis
grid_stats = dense.describe()
# True statistics from raw data (dim 0 only for brevity)
print("=== Dim 0 comparison ==")
print(f"{'Statistic':<12} {'Grid':>10} {'Raw data':>10}")
print("-" * 34)
print(f"{'mean':<12} {grid_stats.loc['mean', 0]:>10.4f} {data[:, 0].mean():>10.4f}")
print(f"{'std':<12} {grid_stats.loc['std', 0]:>10.4f} {data[:, 0].std():>10.4f}")
print(f"{'skewness':<12} {grid_stats.loc['skewness', 0]:>10.4f} {skew(data[:, 0]):>10.4f}")
print(f"{'kurtosis':<12} {grid_stats.loc['kurtosis', 0]:>10.4f} {sp_kurtosis(data[:, 0]):>10.4f}")
print(f"{'25%':<12} {grid_stats.loc['25%', 0]:>10.4f} {np.percentile(data[:, 0], 25):>10.4f}")
print(f"{'50%':<12} {grid_stats.loc['50%', 0]:>10.4f} {np.percentile(data[:, 0], 50):>10.4f}")
print(f"{'75%':<12} {grid_stats.loc['75%', 0]:>10.4f} {np.percentile(data[:, 0], 75):>10.4f}")
=== Dim 0 comparison ==
Statistic          Grid   Raw data
----------------------------------
mean             0.0069     0.0071
std              1.0028     1.0055
skewness         0.0029     0.0297
kurtosis         0.0494     0.1305
25%             -0.6592    -0.6594
50%             -0.0120    -0.0177
75%              0.6904     0.6886
7. Converting to a dense array and probability vector
In [10]:
arr = dense.to_dense() # numpy array
print(f"Dense array shape: {arr.shape}")
print(f"Array sum: {arr.sum():.0f}")
vec = dense.to_vector() # flat probability vector
print(f"\nProbability vector length: {len(vec)}")
print(f"Vector sum: {vec.sum():.6f}")
Dense array shape: (51, 47, 51)
Array sum: 5497

Probability vector length: 122247
Vector sum: 1.000000
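The relationship between the count array and the probability vector can be sketched with NumPy alone: flatten the array and divide by its total. The row-major (C order) flattening used here is an assumption about `to_vector()`, not something the notebook states.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 3))
counts, _ = np.histogramdd(points, bins=(6, 5, 4))

# Flatten in C (row-major) order and normalise so the entries sum to 1.
vec = counts.ravel() / counts.sum()
print(f"Vector length: {vec.size}, sum: {vec.sum():.6f}")

# A flat position maps back to a grid index with np.unravel_index.
flat_pos = int(np.argmax(vec))
grid_idx = np.unravel_index(flat_pos, counts.shape)
print(f"Most probable cell: {tuple(int(i) for i in grid_idx)} with p = {vec[flat_pos]:.4f}")
```

A flat vector of this kind is convenient when comparing two grids built on the same edges, since both vectors then share the same cell ordering.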
8. Visualizations
In [11]:
dense.plot_all_marginals()
In [12]:
dense.plot_joint(dim_x=0, dim_y=1)
In [13]:
dense.plot_top_bins(k=20)
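The three calls above are views of the same count array: a marginal sums the counts over all other axes, a joint view keeps two axes, and the top bins are the cells with the largest counts. The sketch below computes those quantities with NumPy only, without plotting. It is an assumed reading of what the `plot_*` methods display, not their implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((5000, 3))
counts, _ = np.histogramdd(points, bins=(12, 12, 12))

# 1-D marginal of dimension 0: sum out dimensions 1 and 2.
marginal_0 = counts.sum(axis=(1, 2))
# 2-D joint of dimensions 0 and 1: sum out dimension 2.
joint_01 = counts.sum(axis=2)
print(f"Marginal shape: {marginal_0.shape}, joint shape: {joint_01.shape}")

# Top-k cells by count, largest first.
k = 5
flat = counts.ravel()
top = np.argsort(flat)[::-1][:k]
for pos in top:
    idx = tuple(int(i) for i in np.unravel_index(pos, counts.shape))
    print(f"cell {idx}: {flat[pos]:.0f} points")
```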
9. Rebinning onto coarser edges
In [14]:
coarse_edges = [np.linspace(e[0], e[-1], 6) for e in edges]
rebinned = dense.rebin_to(coarse_edges) # returns dict
print(f"Bins in rebinned grid: {len(rebinned)}")
print(f"Mass conserved: {sum(rebinned.values()):.0f}")
Bins in rebinned grid: 78
Mass conserved: 5497
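One simple rebinning strategy assigns each fine bin's entire mass to the coarse bin that contains the fine bin's centre. The sketch below implements that strategy for a `{tuple_index: mass}` dictionary. It is an assumption made for illustration: `rebin_by_centre` is a hypothetical helper, and `rebin_to` may instead split mass proportionally when a fine bin straddles a coarse edge.

```python
from collections import defaultdict
import numpy as np

def rebin_by_centre(mass, fine_edges, coarse_edges):
    """Move each fine cell's mass to the coarse cell containing its centre."""
    out = defaultdict(float)
    for idx, m in mass.items():
        coarse_idx = []
        for d, i in enumerate(idx):
            centre = 0.5 * (fine_edges[d][i] + fine_edges[d][i + 1])
            j = int(np.searchsorted(coarse_edges[d], centre, side="right")) - 1
            # Clip so centres on or beyond the outer edges land in the end bins.
            coarse_idx.append(min(max(j, 0), len(coarse_edges[d]) - 2))
        out[tuple(coarse_idx)] += m
    return dict(out)

rng = np.random.default_rng(0)
points = rng.standard_normal((5000, 3))
fine_edges = [np.linspace(-5.0, 5.0, 41) for _ in range(3)]
counts, _ = np.histogramdd(points, bins=fine_edges)
mass = {tuple(int(i) for i in idx): float(counts[tuple(idx)])
        for idx in np.argwhere(counts > 0)}

coarse_edges = [np.linspace(e[0], e[-1], 6) for e in fine_edges]
rebinned = rebin_by_centre(mass, fine_edges, coarse_edges)
print(f"Bins in rebinned grid: {len(rebinned)}")
print(f"Mass before: {sum(mass.values()):.0f}, after: {sum(rebinned.values()):.0f}")
```

Centre-based assignment conserves total mass by construction. It is exact when every coarse edge coincides with a fine edge, as in this example, and only approximate otherwise.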