Reproducible Datasets for OLAP Performance Tests: Best Practices
2026-03-05


Make OLAP tests dependable: reproducible datasets for large-scale performance benchmarks

Slow CI feedback, flaky OLAP benchmarks, and unpredictable cloud bills are the top complaints we hear from development and platform teams in 2026. When your test datasets are ad-hoc, poorly documented, or generated non-deterministically, every run is a new experiment—tests don't reproduce, baselines drift, and engineers waste time chasing noise. This guide gives you a practical, production-ready approach to generating, versioning, and seeding large-scale synthetic datasets so OLAP performance tests (especially with ClickHouse) are reliable, shareable, and fast to provision in ephemeral environments.

Why reproducible datasets matter now (2026 context)

Enterprise analytics architecture has accelerated since late 2024—columnar OLAP engines and cloud-managed ClickHouse deployments exploded in 2025 and into 2026. (ClickHouse closed a major funding round in late 2025, accelerating ecosystem investment.) Teams are running larger-scale benchmarks to justify migrations and tune resource allocation in cloud environments. The result: more tests, higher cost pressure, and a greater need for reproducibility across CI, developer sandboxes, and performance labs. Key drivers:

  • Wider adoption of ClickHouse and other high-performance OLAP engines (cloud-managed offerings and operator-backed clusters).
  • Greater use of containerized ephemeral environments for CI/CD and developer sandboxes.
  • Standards for dataset artifacts: Parquet/Arrow for interoperability, manifest-driven versioning, and S3-based immutable snapshots.
  • Regulatory and privacy considerations that push teams toward synthetic datasets rather than production copies.

Design goals for reproducible OLAP datasets

Before you choose a tool or format, decide on these core design goals. They drive the technical choices that follow.

  • Determinism: The same parameters + seed must produce identical data files (bit-for-bit where possible).
  • Scalability: Generation pipelines must scale to terabytes while remaining parallelizable and cost-efficient.
  • Portability: Datasets should load to local dev sandboxes, ephemeral CI runners, and production-like clusters (ClickHouse, Parquet files in S3, etc.).
  • Versionability: Snapshot manifests, schema definitions, and checksums must be tracked and discoverable.
  • Seedability: Tests must be able to specify seeds and scale factors for consistent benchmarking.

High-level workflow

  1. Define schema and scale factors; codify as a manifest.
  2. Build a deterministic generator that accepts (seed, scale) and shard index.
  3. Produce partitioned Parquet/Arrow files with explicit checksums and metadata.
  4. Register artifacts in a dataset registry (DVC, S3 + manifest, or an internal artifact store).
  5. Seed ephemeral ClickHouse instances from the artifact snapshot in CI or developer sandboxes.
  6. Run benchmarks and store results linked to dataset version and seed.

Practical pattern 1 — Deterministic generation strategies

There are two reliable approaches to deterministic synthetic data at scale:

  • Index-based deterministic transforms: Use the row index (or a composite key) and a cryptographic/hash function to derive column values. This is simple, fully deterministic, and requires no shared RNG state.
  • Seeded RNG per shard: Assign each parallel worker a unique seed derived from a global seed + shard id. This pattern is common when using Python/Spark/Beam and supports complex generative models while retaining determinism.
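The index-based approach can be sketched in a few lines of Python. This is an illustrative sketch, not a library API; `column_value` and `price_for_row` are hypothetical names:

```python
import hashlib

def column_value(index: int, seed: int, column: str) -> int:
    """Derive a 64-bit value purely from (seed, column, index)."""
    # Hashing the triple gives each column an independent deterministic
    # stream; no shared RNG state is needed across parallel workers.
    digest = hashlib.sha256(f"{seed}:{column}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def price_for_row(index: int, seed: int) -> float:
    # Map the raw hash onto a bounded value domain, e.g. prices in [0, 100).
    return (column_value(index, seed, "price") % 10000) / 100.0
```

Because the value depends only on the inputs, any worker can regenerate any row without coordination.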

ClickHouse example: SQL-driven deterministic generation

ClickHouse's built-in functions let you create deterministic datasets without an external generator: combine the numbers() table function with deterministic hash functions to derive fields.

-- Create a table optimized for OLAP queries
CREATE TABLE default.events (
  event_date Date,
  user_id UInt64,
  product_id UInt32,
  price Float32,
  country FixedString(2),
  device_type String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Deterministic insertion using numbers(N) and hash functions.
-- Note: numbers(N) exposes a single column named `number`.
INSERT INTO default.events
SELECT
  toDate('2026-01-01') + (number % 365) AS event_date,
  sipHash64(number) AS user_id,
  toUInt32(sipHash64(number + 1) % 100000) AS product_id,
  toFloat32(100.0 * ((sipHash64(number + 2) % 10000) / 10000.0)) AS price,
  arrayElement(['US', 'DE', 'FR', 'GB', 'IN', 'BR', 'JP', 'AU'],
               (cityHash64(number + 3) % 8) + 1) AS country,
  multiIf(number % 3 = 0, 'mobile', number % 3 = 1, 'web', 'app') AS device_type
FROM numbers(10000000);

Notes:

  • Functions like sipHash64 and cityHash64 are deterministic—given the same input they always produce the same output.
  • Control scale via numbers(N) (N = number of rows or scale factor).

Practical pattern 2 — Generator pipelines for multi-GB/TB datasets

For richer distributions (time-series seasonality, correlated dimensions), use a generator pipeline that writes Parquet files partitioned by date and shard. The example below uses Python with Pandas / PyArrow plus parallel workers. Key points: pass a global seed, derive per-worker seeds, and write immutable files with checksums.

#!/usr/bin/env python3
# generate_partition.py -- deterministic per-shard Parquet generator
import hashlib
import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def deterministic_row(index, seed):
    # Re-seed per row so each value depends only on (seed, index),
    # not on generation order or worker count.
    np.random.seed((seed + index) & 0xffffffff)
    # derive correlated values
    ts_offset = int(np.random.randint(0, 86400 * 30))
    user_id = (seed ^ index) & 0xffffffffffff
    price = float((seed + index) % 10000) / 100.0
    return user_id, price, ts_offset

if __name__ == '__main__':
    seed = int(os.environ.get('DATASET_SEED', '12345'))
    shard = int(os.environ.get('SHARD_INDEX', '0'))
    rows = 1_000_000
    users, prices, offsets = [], [], []
    for i in range(rows):
        # Global row index = shard offset + local index, so shards are disjoint.
        u, p, o = deterministic_row(i + shard * rows, seed)
        users.append(u)
        prices.append(p)
        offsets.append(o)
    df = pd.DataFrame({'user_id': users, 'price': prices, 'offset': offsets})
    table = pa.Table.from_pandas(df)
    out = f'part-{shard:04d}.parquet'
    pq.write_table(table, out, compression='snappy')
    # Emit a SHA256 checksum for the manifest.
    with open(out, 'rb') as f:
        print(out, hashlib.sha256(f.read()).hexdigest())

Best practices for parallel generation:

  • Use explicit shard indices so each worker writes disjoint files.
  • Embed the global seed and shard id into file metadata.
  • Compute SHA256 checksums for each file and include them in the manifest.
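A streaming checksum helper for the last point might look like the following sketch (`sha256_of_file` is an illustrative utility name):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, streamed in 1 MiB chunks."""
    # Streaming means multi-GB Parquet parts never need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```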

Dataset manifest and versioning

Every dataset snapshot must include a machine-readable manifest. Minimal manifest fields:

  • dataset_name, version (semantic), scale_factor
  • generation_parameters (seed, generator commit sha, generator args)
  • file list with paths, sizes, SHA256 checksums
  • schema (Parquet/Arrow schema or ClickHouse CREATE TABLE DDL)
  • timestamp and provenance (CI run id, git commit, builder image)
Example manifest:

{
  "dataset_name": "events_analytics",
  "version": "1.3.0",
  "scale_factor": "100GB",
  "seed": 123456,
  "generator_git_commit": "a1b2c3d4",
  "files": [
    {"path": "s3://my-test-datasets/events_1.3.0/part-0000.parquet", "size": 536870912, "sha256": "..."},
    {"path": "s3://my-test-datasets/events_1.3.0/part-0001.parquet", "size": 536870912, "sha256": "..."}
  ],
  "schema": "s3://my-test-datasets/events_1.3.0/schema.avsc",
  "created_at": "2026-01-10T12:00:00Z"
}

Versioning options:

  • DVC or Git LFS for small to medium projects to track pointers to large files.
  • S3 + immutable prefixes (s3://datasets/events/1.3.0/) with lifecycle policies and object locking for stricter immutability.
  • Dedicated dataset registries (internal or commercial) that index manifests and expose APIs for discovery.
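A minimal manifest writer, assuming the field layout shown in the example above (`build_manifest` and `write_manifest` are illustrative helpers, not a fixed standard):

```python
import json

def build_manifest(dataset_name, version, seed, scale_factor,
                   generator_git_commit, files, created_at):
    """Assemble a manifest dict; `files` is a list of
    {"path", "size", "sha256"} entries."""
    return {
        "dataset_name": dataset_name,
        "version": version,
        "scale_factor": scale_factor,
        "seed": seed,
        "generator_git_commit": generator_git_commit,
        # Sort by path so the manifest is byte-stable across runs.
        "files": sorted(files, key=lambda e: e["path"]),
        "created_at": created_at,
    }

def write_manifest(manifest, path="manifest.json"):
    with open(path, "w") as f:
        # sort_keys + fixed indent keeps the serialized form deterministic.
        json.dump(manifest, f, indent=2, sort_keys=True)
```

Deterministic serialization matters here: a byte-stable manifest lets you hash it and compare snapshots directly.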

Seeding ephemeral ClickHouse environments

Ephemeral environments should be able to recreate a dataset snapshot quickly. Two approaches work well with ClickHouse:

  • Load from partitioned Parquet files in S3: Use ClickHouse's S3 table function or clickhouse-client to ingest Parquet directly.
  • Restore from prepared ClickHouse shards/snapshots: Use clickhouse-backup or filesystem-level snapshots for faster restore when you need a fully populated MergeTree state.

Example: ingest Parquet from S3 into ClickHouse

-- Create table matching Parquet schema
CREATE TABLE default.events_parquet (
  event_date Date,
  user_id UInt64,
  product_id UInt32,
  price Float32,
  country String,
  device_type String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Ingest files listed in manifest (run from a boot script)
SYSTEM STOP MERGES default.events_parquet;
INSERT INTO default.events_parquet
SELECT * FROM s3('https://s3.amazonaws.com/my-test-datasets/events_1.3.0/part-*.parquet', 'Parquet');
SYSTEM START MERGES default.events_parquet;

Notes:

  • In tests you can skip costly merges by stopping merges, performing bulk inserts, and resuming merges after benchmark windows.
  • For smaller datasets in CI, prefer direct INSERTs from local Parquet files using clickhouse-client.
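A boot script in this spirit could be sketched in Python, issuing one INSERT per manifest file through the clickhouse-client CLI. This is a sketch under assumptions: `insert_query` and `seed_from_manifest` are hypothetical names, and the manifest layout matches the example shown earlier:

```python
import json
import subprocess

def insert_query(table: str, parquet_url: str) -> str:
    # One INSERT ... SELECT per manifest file keeps failures retryable per part.
    return f"INSERT INTO {table} SELECT * FROM s3('{parquet_url}', 'Parquet')"

def seed_from_manifest(manifest_path: str,
                       table: str = "default.events_parquet") -> None:
    with open(manifest_path) as f:
        manifest = json.load(f)
    for entry in manifest["files"]:
        subprocess.run(
            ["clickhouse-client", "--query", insert_query(table, entry["path"])],
            check=True,
        )
```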

CI/CD integration: ensure repeatable test runs

Integrate dataset snapshot metadata and seed parameters into your CI pipeline so every benchmark run records the dataset version and seed. Example GitHub Actions snippet:

name: olap-bench
on: [push]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch dataset manifest
        run: |
          aws s3 cp s3://my-test-datasets/events_latest/manifest.json ./manifest.json
      - name: Start ephemeral ClickHouse (docker)
        run: docker compose up -d clickhouse
      - name: Seed data
        run: |
          export DATASET_SEED=$(jq -r '.seed' manifest.json)
          ./scripts/seed_clickhouse_from_manifest.sh ./manifest.json
      - name: Run benchmarks
        run: ./scripts/run_benchmarks.sh --dataset-version $(jq -r '.version' manifest.json)

Cost & performance optimizations for large test suites

  • Use smaller representative scale factors for pre-merge checks, and run full-scale benchmarks less frequently (nightly or gated by PR label).
  • Prefer columnar compressed formats (Parquet with Snappy/Zstd) to reduce network and S3 costs.
  • Leverage spot / preemptible instances for generation jobs; record the generator commit and seed to enable re-generation if a job is preempted.
  • Cache commonly used dataset snapshots as EBS/EFS volumes or container images for faster restore to ephemeral ClickHouse clusters.

Schema evolution and compatibility

OLAP schemas evolve frequently. Include a schema evolution policy in the manifest and implement these rules:

  • Additive changes (new nullable columns) are compatible with older parsers—use them when possible.
  • No destructive column renames without introducing alias columns and running a compatibility migration in ClickHouse.
  • Store Parquet/Arrow schema files with the manifest and include a consumer compatibility check in CI that fails builds when consumer queries depend on deprecated fields.
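The consumer compatibility check from the last point can be as simple as a set comparison. A sketch with an illustrative function name:

```python
def check_consumer_compatibility(schema_fields, consumer_required,
                                 deprecated=()):
    """Return a list of problems; an empty list means compatible."""
    missing = set(consumer_required) - set(schema_fields)
    uses_deprecated = set(consumer_required) & set(deprecated)
    problems = []
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if uses_deprecated:
        problems.append(f"deprecated fields: {sorted(uses_deprecated)}")
    return problems
```

In CI, fail the build whenever the returned list is non-empty.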

Measuring and comparing benchmarks reliably

Always record this metadata alongside benchmark results: dataset version, seed, generator git commit, ClickHouse version and configuration, hardware SKU, and any runtime flags (merges stopped, compression disabled, etc.). Link results to the manifest and gate baseline comparisons on identical dataset version and seed.

Pro tip: store a lightweight hash of the dataset (manifest + file checksums) as part of the benchmark record. Use this hash to automatically identify runs that used identical data.
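One way to compute such a hash, assuming the manifest layout shown earlier (`dataset_fingerprint` is an illustrative helper):

```python
import hashlib

def dataset_fingerprint(manifest: dict) -> str:
    """Hash generation parameters plus per-file checksums."""
    h = hashlib.sha256()
    h.update(f"{manifest.get('seed')}:{manifest.get('version')}".encode())
    # Sort by path so the fingerprint ignores file-list ordering.
    for entry in sorted(manifest["files"], key=lambda e: e["path"]):
        h.update(entry["sha256"].encode())
    return h.hexdigest()
```

Two benchmark runs with equal fingerprints are guaranteed to have read identical data, so their results are directly comparable.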

Case study: reducing flakiness and costs in a mid-size analytics team

Context: a mid-size company running ClickHouse-based analytics reported flaky performance tests and unpredictable infra costs. They adopted the manifest-driven approach above, with these outcomes over three months:

  • Test flakiness dropped by 78%—benchmarks were comparable across runs because datasets were deterministic and versioned.
  • Average CI cost per PR decreased 42% because full-scale generation only ran on scheduled jobs; developers used cached 5–10GB representative snapshots.
  • Time to reproduce production performance regressions dropped from days to hours due to artifactized datasets and fast seeding paths into ephemeral clusters.

Advanced strategies and future-proofing (2026+)

1. Hybrid approach: lightweight vs. full-scale

Maintain two classes of datasets: compact representative datasets for fast CI checks and full-scale immutable snapshots for nightly regression runs. Ensure both share the same generator and seed space so small runs are statistically representative of large runs.
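One way to guarantee a shared seed space is to make the compact dataset a strict prefix of the full dataset's deterministic stream; a sketch with illustrative helper names:

```python
import hashlib

def row_value(index: int, seed: int) -> int:
    # Value depends only on (seed, index), never on total dataset size.
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def generate(seed: int, rows: int) -> list:
    # A 1M-row CI dataset is exactly the first 1M rows of the 1B-row snapshot.
    return [row_value(i, seed) for i in range(rows)]
```

Because row values never depend on the total row count, regressions found on the compact prefix reproduce on the full-scale snapshot.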

2. Dataset provenance and reproducible ML-driven generators

As generative models (diffusion or conditional GANs for structured data) become more common in 2026, store model checkpoints and generator hyperparameters as part of the manifest. Treat generator models as first-class artifacts that must be versioned and auditable.

3. Schema/Query-aware sampling

Generate compact datasets by sampling rows that preserve query-critical characteristics (heavy hitters, hotspots, join keys). Use statistical sampling and copula-based methods to preserve multi-dimensional correlations with very small footprints.

4. Immutable dataset registries and dataset-as-code

Adopt dataset-as-code practices where manifests, generation code, and seeding scripts live in the same repo as testing harnesses. Use immutable registries (S3 prefixes + manifest indexing) so CI can always fetch historic dataset snapshots.

Checklist: Implement reproducible OLAP datasets

  • Define scale factors and seeds in a dataset manifest.
  • Use deterministic generation (index+hash or seeded RNG per shard).
  • Write partitioned Parquet/Arrow files and compute SHA256 checksums.
  • Register artifacts in S3/DVC and publish a manifest with provenance.
  • Provide fast seeding scripts for ClickHouse (S3 ingest or snapshot restore).
  • Record dataset metadata with every benchmark and gate results on dataset version + seed.

Actionable templates and resources

Use the following starter templates to jump-start your reproducible dataset pipeline:

  • Generator repo skeleton: includes per-shard producer, manifest writer, and checksum utility.
  • ClickHouse seeding script: reads manifest and issues S3 INSERT commands.
  • CI snippet for gating benchmarks on dataset version and seed (see example earlier).

Closing thoughts

Reproducible datasets are the linchpin of reliable OLAP performance testing. In 2026, with ClickHouse and other OLAP engines becoming primary analytical backbones, teams that invest in deterministic generators, manifest-driven versioning, and fast seeding for ephemeral environments will run fewer noisy tests, save cloud spend, and shorten feedback loops.

Start small: codify a schema, pick a seed, and create a single manifest-driven snapshot. Validate that metric changes are stable across two runs. Then automate generation, versioning, and seeding into CI. Over time you'll build a dataset catalog that turns performance testing from a guessing game into a repeatable science.

Call to action

Ready to stop chasing noisy OLAP tests? Try our reproducible test sandboxes at mytest.cloud to prototype deterministic generators, spin up ephemeral ClickHouse clusters pre-seeded from immutable dataset snapshots, and integrate dataset manifests into your CI. Sign up for a free trial and import your first dataset manifest—our templates and onboarding guide will get you from seed-to-benchmark in under an hour.
