Qbeast Platform - Getting Started

Qbeast Platform Quick Start Guide

The Qbeast Platform provides a Universal Storage Engine (USE) that collects information about your tables and makes it available for exploration.

This guide walks you through connecting to the engine, configuring tables for monitoring, and exploring the available statistics and metadata.

Getting Started

You can connect to the USE Core and start working with your tables.

Connect to USE Core

Connect to USE Core using Spark Connect:

$SPARK_HOME/bin/pyspark --remote "sc://<spark-connect-host>:15002"
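
If you prefer to connect from your own Python script or notebook instead of the pyspark shell, you can also create the session programmatically. A minimal sketch, assuming PySpark with Spark Connect support is installed and <spark-connect-host> is replaced by your actual host:

from pyspark.sql import SparkSession

# Create a remote session against the same Spark Connect endpoint
spark = SparkSession.builder.remote("sc://<spark-connect-host>:15002").getOrCreate()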

Before working with tables, verify that the connection works by creating a DataFrame and showing its content:

columns = ["id","name"]
data = [(1,"Sarah"),(2,"Maria")]
df = spark.createDataFrame(data).toDF(*columns)
df.show()

If the DataFrame contents are displayed, the connection is working and you can proceed to the next step.

Working with Tables

Create and Configure Tables

The Universal Storage Engine manages the tables you choose to place under its control.

If you want a specific table to be ingested and automatically optimized by Qbeast, configure the source path for its ingestion events. These events allow the engine to monitor your data pipelines and give you access to aggregated statistics for better observability.

-- CREATE TABLE
CREATE TABLE t(id INT, name STRING) USING qbeast 
TBLPROPERTIES(use.ingestion.source.path.files='s3://bucket')
 
-- OR ALTER TABLE
ALTER TABLE t SET TBLPROPERTIES(use.ingestion.source.path.files='s3://bucket')

To set up the Consumption Monitoring Process on the Spark Backend, set the TBLPROPERTIES accordingly:

-- CREATE TABLE
CREATE TABLE t(id INT, name STRING) USING qbeast 
TBLPROPERTIES(use.consumption.enabled='true')
 
-- ALTER TABLE
ALTER TABLE t SET TBLPROPERTIES(use.consumption.enabled = 'true');
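
Both properties can also be combined and issued from PySpark with spark.sql. The sketch below is only illustrative: it reuses the properties shown above and assumes they can be set together on the same table (the bucket path is a placeholder):

spark.sql("""
    CREATE TABLE IF NOT EXISTS t(id INT, name STRING) USING qbeast
    TBLPROPERTIES(
        use.ingestion.source.path.files = 's3://bucket',
        use.consumption.enabled = 'true'
    )
""")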

Write Data to Tables

Once a table is configured, write to it as you would to any other Spark table:

df.write.insertInto("t")
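
For example, appending a couple of extra rows from a DataFrame whose columns match the table schema (the values below are purely illustrative):

# insertInto matches columns by position, so the tuples must follow (id INT, name STRING)
new_rows = spark.createDataFrame([(3, "Alex"), (4, "Jordan")], ["id", "name"])
new_rows.write.insertInto("t")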

Explore USE Statistics

The statistics collected by the USE are exposed as tables through the USEReadOnlyCatalog:

spark.sql("SHOW TABLES IN use_catalog").show(100, False)

The output would look something like this:

+------------------------------+-------------------+-----------+
|namespace                     |tableName          |isTemporary|
+------------------------------+-------------------+-----------+
|system_lake_tables            |history            |false      |
|system_lake_tables            |tables_summary     |false      |
|system_lake_tables            |files              |false      |
|system_lake_tables            |catalogs_summary   |false      |
|system_lake_tables            |namespaces_summary |false      |
|system_lake_tables            |metastore_summary  |false      |
|spark_catalog.default.students|history            |false      |
|spark_catalog.default.students|files              |false      |
|spark_catalog.default.students|files_hourly       |false      |
|spark_catalog.default.students|ingestions         |false      |
|spark_catalog.default.students|ingestions_hourly  |false      |
|spark_catalog.default.students|consumptions       |false      |
|spark_catalog.default.students|consumptions_hourly|false      |
+------------------------------+-------------------+-----------+

Query Global Summaries

You can query the global summaries as follows:

# TABLES
spark.sql("SELECT * FROM use_catalog.system_lake_tables.tables_summary").show()
 
# NAMESPACES
spark.sql("SELECT * FROM use_catalog.system_lake_tables.namespaces_summary").show()
 
# CATALOGS
spark.sql("SELECT * FROM use_catalog.system_lake_tables.catalogs_summary").show()
 
# ALL TOGETHER
spark.sql("SELECT * FROM use_catalog.system_lake_tables.metastore_summary").show()

Query Table-Specific Information

You can also access detailed information for specific tables:

# HISTORY
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.history").show()
 
# FILES
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.files").show()
 
# FILES HOURLY AGGREGATION
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.files_hourly").show()
 
# INGESTIONS
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.ingestions").show()
 
# INGESTIONS HOURLY AGGREGATION
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.ingestions_hourly").show()
 
# CONSUMPTIONS
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.consumptions").show()
 
# CONSUMPTIONS HOURLY AGGREGATION
spark.sql("SELECT * FROM use_catalog.spark_catalog.default.students.consumptions_hourly").show()

The tables and schemas stored in the system are described in the Data Model.

Optimization

The USE Core has two main optimization processes: RevisionOptimization and DeltaVacuum.

These optimization processes help maintain your tables by:

  • Reducing cube fragmentation and improving query performance
  • Managing data layout to align with the OTree index
  • Cleaning up unused files

Every optimization process emits an OptimizationEvent, which is saved under _qbeast/insights/optimization/events.
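
If you want to inspect these events directly, you can read them from the table's storage location. The sketch below is only an assumption about the layout: it supposes the events are JSON-encoded and stored relative to the table location, so adjust the path and format to your deployment:

# Hypothetical location and format: adapt to where your table actually lives
events = spark.read.json("<table-location>/_qbeast/insights/optimization/events")
events.printSchema()
events.show(truncate=False)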

Read more about optimization features and how to configure them in the Optimization Guide.