DATA MODEL

DATA MODEL

Documentation on the USE Core Data Model.

Tables, Views and Schemas would be detailed in the following sections. If you find any information missing, please feel free to speak in the Qbeast Users Slack Channel.

TABLES

These are the Descriptions and Schemas for each of the Tables.

CatalogsSummary

Table Name

<use_catalog>.system_lake_tables.catalogs_summary

Description

Schema

root
|-- catalog_name: string
|-- default_namespace: array
|    |-- element: string
|-- namespaces: array
|    |-- element: array 
|    |    |-- element: string
|-- tables: array
|    |-- element: struct
|    |    |-- catalog_name: string
|    |    |-- namespace_name: array
|    |    |    |-- element: string
|    |    |-- name: string
|-- views: array
|    |-- element: struct
|    |    |-- catalog_name: string
|    |    |-- namespace_name: array
|    |    |    |-- element: string
|    |    |-- name: string 
|-- functions: array
|    |-- element: struct
|    |    |-- catalog_name: string
|    |    |-- namespace_name: array
|    |    |    |-- element: string
|    |    |-- name: string 
|-- table_catalog_capabilities: array
|    |-- element: string
|-- description: string
|-- catalog_type: string

NamespacesSummary

Table Name

<use_catalog>.system_lake_tables.namespaces_summary

Description

Schema

root
|-- namespace_name: array
|    |-- element: string
|-- catalog_name: string
|-- description: string
|-- location: string
|-- owner: string
|-- is_default: boolean
|-- tables: array
|    |-- element: struct
|    |    |-- catalog_name: string
|    |    |-- namespace_name: array
|    |    |    |-- element: string
|    |    |-- name: string

TablesSummary

Table Name

<use_catalog>.system_lake_tables.tables_summary

Description

Schema

root
|-- catalog_name: string 
|-- namespace_name: array 
|    |-- element: string
|-- name: string 
|-- created_time: string 
|-- last_access: string 
|-- provider: string 
|-- table_type: string 
|-- created_by: string 
|-- location: string 

MetastoreSummary

Table Name

<use_catalog>.system_lake_tables.metastore_summary

Description

Schema

root
 |-- catalog_name: string 
 |-- catalog_default_namespace: array
 |    |-- element: string
 |-- catalog_namespaces: array 
 |    |-- element: array
 |    |    |-- element: string
 |-- catalog_tables: array 
 |    |-- element: struct
 |    |    |-- catalog_name: string 
 |    |    |-- namespace_name: array 
 |    |    |    |-- element: string
 |    |    |-- name: string 
 |-- catalog_views: array 
 |    |-- element: struct
 |    |    |-- catalog_name: string 
 |    |    |-- namespace_name: array 
 |    |    |    |-- element: string
 |    |    |-- name: string 
 |-- catalog_functions: array 
 |    |-- element: struct
 |    |    |-- catalog_name: string 
 |    |    |-- namespace_name: array 
 |    |    |    |-- element: string
 |    |    |-- name: string 
 |-- table_catalog_capabilities: array 
 |    |-- element: string
 |-- catalog_description: string 
 |-- catalog_type: string 
 |-- namespace_name: array 
 |    |-- element: string 
 |-- namespace_description: string 
 |-- namespace_location: string 
 |-- namespace_owner: string 
 |-- namespace_tables: array 
 |    |-- element: struct 
 |    |    |-- catalog_name: string
 |    |    |-- namespace_name: array
 |    |    |    |-- element: string
 |    |    |-- name: string 
 |-- namespace_is_default: boolean
 |-- table_name: string 
 |-- table_owner: string
 |-- table_created_time: string 
 |-- table_last_access: string 
 |-- table_provider: string 
 |-- table_type: string 
 |-- table_created_by: string 
 |-- table_location: string 
 |-- table_columns: array 
 |    |-- element: struct
 |    |    |-- name: string
 |    |    |-- description: string 
 |    |    |-- data_type: string 

TableHistory

Table Name

<use_catalog>.system_lake_tables.history

Description

Schema

root
|-- catalog_name: string 
|-- namespace_name: array 
|    |-- element: string
|-- name: string 
|-- version: long 
|-- starting_time: timestamp 
|-- ending_time: timestamp 
|-- operation_type: string 
|-- format_specific_operation_type: string 
|-- added_rows: long 
|-- deleted_rows: long 
|-- updated_rows: long 
|-- estimated_net_byte_change: long 
|-- created_files: long 
|-- created_bytes: long 
|-- purged_files: long 
|-- purged_bytes: long 

TableFileStats

Table Name

<use_catalog>.system_lake_tables.files AND <use_catalog>.system_lake_tables.all_files

Description

Schema

root
|-- catalog_name: string 
|-- namespace_name: array 
|    |-- element: string (containsNull = true)
|-- name: string 
|-- path: string 
|-- schema: string 
|-- addition_timestamp: long 
|-- modification_timestamp: long 
|-- deletion_timestamp: long 
|-- creation_timestamp: long 
|-- purge_timestamp: long 
|-- deleted: boolean 
|-- purged: boolean 
|-- file_bytes: long 
|-- file_rows: long 
|-- metadata: map 
|    |-- key: string
|    |-- value: string
|-- column_metrics: string 

TableIngestion

The path of the Ingestion should be configured when using CREATE TABLE in the TBLPROPERTIES section or by the SET TBLPROPERTIES command.

// CREATE TABLE
CREATE TABLE t TBLRPROPERTIES(use.ingestion.source.file.path = '/ingestion/path')
// ALTER TABLE
ALTER TABLE t SET TBLPROPERTIES(use.ingestion.source.file.path = '/ingestion/path');
 

Table Name

<use_catalog>.<catalog_name>.system_lake_tables.ingestions

Description

This table contains the information about the ingestion process of a table.

Schema

root
 |-- write_id: string (nullable = true)
 |-- started_at: long (nullable = false)
 |-- finished_at: long (nullable = false)
 |-- destination_table: string (nullable = true)
 |-- number_of_input_files: long (nullable = false)
 |-- byte_size_input_files: long (nullable = false)
 |-- input_files: list(string) (nullable = true)
 |-- process_time_ms: long (nullable = true)
 |-- batch_max_timestamp: long (nullable = false)
 |-- batch_min_timestamp: long (nullable = false)
 |-- txn_app_id: string (nullable = true)
 |-- txn_version: string (nullable = true)

TableConsumption

The Consumption set up should be configured when using CREATE TABLE in the TBLPROPERTIES section or by the SET TBLPROPERTIES command.

// CREATE TABLE
CREATE TABLE t TBLRPROPERTIES(use.consumption.enabled='true')
// ALTER TABLE
ALTER TABLE t SET TBLPROPERTIES(use.consumption.enabled = 'true');
 

Table Name

<use_catalog>.system_lake_tables.consumptions

Description

This table contains the information about the consumption process of a table.

Schema

root
 |-- event_name: string (nullable = true)
 |-- event_source: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- region: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- event_time: timestamp (nullable = true)
 |-- bytes_in: long (nullable = false)
 |-- bytes_out: long (nullable = false)
 |-- object_path: string (nullable = true)

TableOptimization

DISCLAIMER: Right now, the optimization has to be triggered MANUALLY from CLI or spark-shell.

Please, read the Optimization Guide to know more about this feature.

Table Name

<use_catalog>.system_lake_tables.optimizations

Description

This table contains the information about the optimization process of a table.

Schema

root
 |-- optimization_name: string (nullable = true)
 |-- write_id: string (nullable = true)
 |-- table_id: string (nullable = true)
 |-- revision_id: long (nullable = true)
 |-- started_at: long (nullable = true)
 |-- finished_at: long (nullable = true)
 |-- input_files_stats: struct (nullable = true)
 |    |-- bytes: long (nullable = true)
 |    |-- cubes: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- elementCount: long (nullable = true)
 |    |-- files: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- numCubes: long (nullable = true)
 |    |-- numFiles: long (nullable = true)
 |-- output_files_stats: struct (nullable = true)
 |    |-- bytes: long (nullable = true)
 |    |-- cubes: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- elementCount: long (nullable = true)
 |    |-- files: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- numCubes: long (nullable = true)
 |    |-- numFiles: long (nullable = true)
 |-- process_time_ms: long (nullable = true)
 |-- cube_defragmentation_stats: struct (nullable = true)
 |    |-- average_cube_count_per_file_after: double (nullable = true)
 |    |-- average_cube_count_per_file_before: double (nullable = true)
 |    |-- average_cube_count_per_file_reduction: double (nullable = true)
 |    |-- average_block_element_count_per_file_after: double (nullable = true)
 |    |-- average_block_element_count_per_file_before: double (nullable = true)
 |    |-- average_block_element_count_per_file_increase: double (nullable = true)
 |-- sampling_overhead_reduction_stats: struct (nullable = true)
 |    |-- fraction: double (nullable = true)
 |    |-- table_bytes_before: long (nullable = true)
 |    |-- table_bytes_after: long (nullable = true)
 |    |-- sampled_bytes_before: long (nullable = true)
 |    |-- sampled_bytes_after: long (nullable = true)
 |    |-- sampled_fraction_before: double (nullable = true)
 |    |-- sampled_fraction_after: double (nullable = true)
 |    |-- bytes_to_optimize: long (nullable = true)
 |    |-- optimized_bytes: long (nullable = true)

SINGLE TABLE QUERIES

If you want only to load the file or history information for a single table, our ReadOnly Catalog allows to query based on Table Identifier.

For example, if you have a table named: "students", you can retrieve the metadata by executing:

SELECT * FROM <use_catalog>.spark_catalog.default.students.files // CURRENT added and deleted files
SELECT * FROM <use_catalog>.spark_catalog.default.students.all_files //ALL FILES (including deleted and purged)
SELECT * FROM <use_catalog>.spark_catalog.default.students.history
SELECT * FROM <use_catalog>.spark_catalog.default.students.ingestions
SELECT * FROM <use_catalog>.spark_catalog.default.students.consumptions

There's also a set of Metadata Tables which contains the information from single tables aggregated by hour:

SELECT * FROM <use_catalog>.spark_catalog.default.students.files_hourly // with CURRENT added and deleted files
SELECT * FROM <use_catalog>.spark_catalog.default.students.ingestions_hourly
SELECT * FROM <use_catalog>.spark_catalog.default.students.consumptions_hourly

Syntax can be reduced as:

SELECT * FROM <use_catalog>.<table_catalog_name>.<table_namespace_name>.<table_name>.<metadata_table_name>
All of them would be listed on the SHOW TABLES method.
SHOW TABLES IN <use_catalog>