Qbeast – What’s New

v0.4.0 – 2025-07-04

New release of Qbeast Spark, featuring two key enhancements: DML support and Iceberg support.

DML Support

Qbeast now supports:

  • Delete, Update, and Merge operations (see the sketch after this list) via:
    • Resilient Index Builder
    • Merge On Read (MoR) strategies
    • Optimization for unindexed files
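
As a minimal sketch of these operations (assuming a qbeast table named qbeast_table registered in the catalog and an existing source view named updates; the statements are standard Spark SQL DML):

// Delete rows matching a predicate
spark.sql("DELETE FROM qbeast_table WHERE id < 10")

// Update rows in place
spark.sql("UPDATE qbeast_table SET id = id + 1 WHERE id = 42")

// Merge a source of changes into the table
spark.sql("""
  MERGE INTO qbeast_table AS t
  USING updates AS u
  ON t.id = u.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")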

Iceberg Support

Initial version of the Iceberg-Qbeast protocol:

  • Index metadata stored as Puffin files (Iceberg's Puffin file format spec).
  • Compatible with Spark Datasource V2 APIs.
  • Integration with Iceberg requires the specific configuration and catalog setup shown below.

Example Setup:

export QBEAST_SPARK_VERSION=0.9.0-rc1
$SPARK_HOME/bin/spark-shell \
  --repositories https://maven.pkg.github.com/qbeast-io/qbeast-spark-private \
  --conf spark.jars.ivySettings=$HOME/.ivy2/ivysettings.xml \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.1,io.qbeast:qbeast-iceberg_2.12:$QBEAST_SPARK_VERSION \
  --conf spark.sql.extensions=io.qbeast.sql.IcebergQbeastSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.qbeast.catalog.IcebergQbeastSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hadoop \
  --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/iceberg-qbeast-warehouse

Supported APIs:

// Create
df.writeTo("qbeast_table").using("qbeast").option("columnsToIndex", "id").createOrReplace()
 
// Append
dfAppend.writeTo("qbeast_table").append()
 
// Save as Table
df.write.format("qbeast").option("columnsToIndex", "id").saveAsTable("qbeast_table")
 
// SQL
spark.sql("CREATE TABLE qbeast_table(id INT) USING qbeast TBLPROPERTIES('columnsToIndex' 'id')")

Experimental Performance Optimizations

Feature flags were added for:

  • Sampling and shuffling during OTree analysis
  • Rollup strategies for cube generation

Together, these reduce analysis time and improve the resulting data layout.

Modular Packaging

New split JARs (see the dependency sketch after this list):

  • qbeast-delta: Delta-Qbeast interfaces
  • qbeast-hudi: Hudi-Qbeast interfaces
  • qbeast-iceberg: Iceberg-Qbeast interfaces (with independent IcebergQbeastCatalog)
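
A build that needs only one integration can depend on that JAR alone. A build.sbt sketch, assuming the io.qbeast group id and Scala 2.12 artifacts used in the setup above (versions illustrative):

// Pull in only the Iceberg-Qbeast interfaces and the matching Iceberg runtime
libraryDependencies ++= Seq(
  "io.qbeast" %% "qbeast-iceberg" % "0.9.0-rc1",
  "org.apache.iceberg" %% "iceberg-spark-runtime-3.5" % "1.9.1"
)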

Bug Fixes & Improvements

  • Fix error loading dimension count for unindexed revision
  • Remove Delta deps from quantile computation
  • Fix README + IvySettings
  • Denormalize cubeId as string in blocks
  • Apply scalafmt and scalafix
  • Retry write failures only a few times
  • Close Hudi writer timeline server

v0.3.1 – 2025-04-17

Bug Fixes & Improvements

  • Respect hoodie.table.timeline.timezone from hudi-defaults.conf

v0.3.0 – 2025-04-17

A major release introducing Hudi support, unindexed file optimization, and skewed column indexing.

Hudi Support

New module for Hudi integration.

Example Setup:

export QBEAST_SPARK_VERSION=0.8.0
$SPARK_HOME/bin/spark-shell \
  --repositories https://maven.pkg.github.com/qbeast-io/qbeast-spark-private \
  --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0,io.qbeast:qbeast-spark_2.12:$QBEAST_SPARK_VERSION \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
  --conf spark.sql.extensions=io.qbeast.sql.HudiQbeastSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.qbeast.catalog.HudiQbeastCatalog
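
With the session configured, writes go through the same qbeast format shown elsewhere in these notes; a minimal sketch (the path is illustrative):

// Write a DataFrame as a Hudi-backed Qbeast table
df.write
  .format("qbeast")
  .option("columnsToIndex", "id")
  .save("/tmp/qbeast_hudi_table")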

Optimization of Unindexed Files

API support for optimizing legacy and externally written files:

// Optimize specific unindexed files (revision 0 holds files not yet indexed)
qbeastTable.optimize(0L, Seq("file1", "file2"))

// Optimize a 50% fraction of the unindexed data
qbeastTable.optimize(0L, fraction = 0.5)

Skewed Columns (Quantile Indexing)

Index highly skewed columns with a quantile-based layout:

import io.qbeast.spark.utils.QbeastUtils

val quantiles = QbeastUtils.computeQuantilesForColumn(df, "brand")
val stats = s"""{"brand_quantiles":$quantiles}"""
 
df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:quantiles")
  .option("columnStats", stats)
  .save("/tmp/qbeast_table_quantiles")

Other Fixes and Enhancements

  • CI workflow setup, snapshot publishing, cron vulnerability checks
  • Improved determinism checks and error handling
  • Dependency updates: jinja2, werkzeug
  • Fix computed metrics from Delta table history
  • Fix Hudi commit timezone issue
  • Ensure deletion event timestamp follows creation
  • Fix issue loading table properties
  • Use Hadoop 3.3.6 by default