Qbeast – What's New
v0.4.0 – 2025-07-04
A new release of Qbeast Spark featuring two key enhancements, DML support and Iceberg support, alongside experimental performance optimizations and modular packaging.
DML Support
Qbeast now supports:
- Delete, Update, and Merge operations (sketched below), via:
  - the Resilient Index Builder
  - Merge-on-Read (MoR) strategies
- Optimization for unindexed files
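The release notes do not include DML examples, so the following is a minimal sketch assuming the standard Spark SQL DML statements now apply to Qbeast tables; the table names qbeast_table and updates are illustrative, not from the notes.

// Hypothetical table names; any Qbeast table registered in the catalog should work.
spark.sql("DELETE FROM qbeast_table WHERE id < 100")
spark.sql("UPDATE qbeast_table SET name = 'unknown' WHERE name IS NULL")
spark.sql("""
  MERGE INTO qbeast_table t
  USING updates u
  ON t.id = u.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")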
Iceberg Support
Initial version of the Iceberg-Qbeast protocol:
- Index metadata is stored as Puffin files, following Iceberg's Puffin file format spec.
- Compatible with the Spark DataSource V2 APIs.
- Integration with Iceberg requires the specific configuration and catalog setup shown below.
Example Setup:
export QBEAST_SPARK_VERSION=0.9.0-rc1
$SPARK_HOME/bin/spark-shell \
--repositories https://maven.pkg.github.com/qbeast-io/qbeast-spark-private \
--conf spark.jars.ivySettings=$HOME/.ivy2/ivysettings.xml \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.1,io.qbeast:qbeast-iceberg_2.12:$QBEAST_SPARK_VERSION \
--conf spark.sql.extensions=io.qbeast.sql.IcebergQbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.catalog.IcebergQbeastSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hadoop \
--conf spark.sql.catalog.spark_catalog.warehouse=/tmp/iceberg-qbeast-warehouse
Supported APIs:
// Create
df.writeTo("qbeast_table").using("qbeast").option("columnsToIndex", "id").createOrReplace()
// Append
dfAppend.writeTo("qbeast_table").append()
// Save as Table
df.write.format("qbeast").option("columnsToIndex", "id").saveAsTable("qbeast_table")
// SQL
spark.sql("CREATE TABLE qbeast_table(id INT) USING qbeast TBLPROPERTIES('columnsToIndex' 'id')")
Experimental Performance Optimizations
Feature flags were added for:
- sampling and shuffling during OTree analysis
- rollup strategies for cube generation
Together these reduce analysis time and improve the resulting data layout.
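The exact configuration keys are not listed in these notes, so the flag names below are placeholders; this is only a sketch of how such experimental flags are typically toggled in a Spark session.

// NOTE: hypothetical configuration keys, for illustration only;
// consult the Qbeast documentation for the real flag names.
spark.conf.set("spark.qbeast.index.analysis.sampling.enabled", "true")
spark.conf.set("spark.qbeast.index.rollup.enabled", "true")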
Modular Packaging
New split JARs:
- qbeast-delta: Delta-Qbeast interfaces
- qbeast-hudi: Hudi-Qbeast interfaces
- qbeast-iceberg: Iceberg-Qbeast interfaces (with independent IcebergQbeastCatalog)
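A build that needs only one integration can depend on that module alone; a minimal build.sbt sketch, with coordinates mirroring the --packages flags used above (the repository may require GitHub Packages authentication):

// build.sbt – pick only the integration module you need.
resolvers += "qbeast" at "https://maven.pkg.github.com/qbeast-io/qbeast-spark-private"
libraryDependencies += "io.qbeast" %% "qbeast-iceberg" % "0.9.0-rc1"
// or: "io.qbeast" %% "qbeast-delta" / "io.qbeast" %% "qbeast-hudi"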
Bug Fixes & Improvements
- Fix error loading dimension count for unindexed revision
- Remove Delta dependencies from quantile computation
- Fix README and IvySettings instructions
- Denormalize cubeId as string in blocks
- Apply scalafmt and scalafix
- Retry failed writes only a limited number of times
- Close the Hudi writer's timeline server
v0.3.1 – 2025-04-17
Bug Fixes & Improvements
- Respect hoodie.table.timeline.timezone from hudi-defaults.conf
v0.3.0 – 2025-04-17
A major release introducing Hudi support, unindexed file optimization, and skewed column indexing.
Hudi Support
New module for Hudi integration.
Example Setup:
export QBEAST_SPARK_VERSION=0.8.0
$SPARK_HOME/bin/spark-shell --repositories https://maven.pkg.github.com/qbeast-io/qbeast-spark-private \
--packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0,io.qbeast:qbeast-spark_2.12:$QBEAST_SPARK_VERSION \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
--conf spark.sql.extensions=io.qbeast.sql.HudiQbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.catalog.HudiQbeastCatalog
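Once the shell is configured, writing an indexed table should follow the same pattern as the other integrations; a minimal sketch (the table name and indexed column are illustrative):

// Write a DataFrame as a Qbeast-indexed table through the HudiQbeastCatalog.
df.write
  .format("qbeast")
  .option("columnsToIndex", "id")
  .saveAsTable("qbeast_hudi_table")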
Optimization of Unindexed Files
API support for optimizing legacy and externally-written files.
// Optimize revision 0, which holds files not yet indexed by Qbeast:
qbeastTable.optimize(0L, Seq("file1", "file2")) // index the given files
qbeastTable.optimize(0L, fraction = 0.5)        // index a fraction of the unindexed data
Skewed Columns (Quantile Indexing)
Index highly skewed columns with quantile-based layout:
// QbeastUtils ships with qbeast-spark; adjust the import to your version.
import io.qbeast.spark.utils.QbeastUtils

// Compute quantile boundaries for the skewed column, then pass them as
// column stats so the writer can build a quantile-based layout.
val quantiles = QbeastUtils.computeQuantilesForColumn(df, "brand")
val stats = s"""{"brand_quantiles":$quantiles}"""
df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:quantiles")
  .option("columnStats", stats)
  .save("/tmp/qbeast_table_quantiles")
Other Fixes and Enhancements
- CI workflow setup, snapshot publishing, cron vulnerability checks
- Improved determinism checks and error handling
- Dependency updates: jinja2, werkzeug
- Fix computed metrics from Delta table history
- Fix Hudi commit timezone issue
- Ensure deletion event timestamp follows creation
- Fix issue loading table properties
- Use Hadoop 3.3.6 by default