Contribution Guide
Welcome to the Qbeast community! Nice to meet you :)
Here's a summary of what you can find in this page:
- Introduction
- Version control branching
- PRs and Issues
- Style and formatting
- Logging Documentation
- Developing and contributing
- Publishing Guide
- Versioning
- Licensing of contributed material
- Community Values
Introduction
Whether you want to know more about our guidelines or open a Pull Request, this is your page. We are pleased to help you through the different steps of contributing to our (your) project.
To find Qbeast issues that make good entry points:
- Start with issues labelled good first issue. For example, see the good first issues in the repository for updates to the core Qbeast Spark code.
- For issues that require deeper knowledge of one or more technical aspects, look at issues with the corresponding labels.
Version control branching
- Always make a new branch for your work, no matter how small.
- Don't submit unrelated changes to the same branch/pull request.
- Base your new branch off of the appropriate branch on the main repository.
- New features should branch off of the main branch (see the example after this list).
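A minimal sketch of this workflow, assuming your fork is the origin remote and the main repository is configured as upstream (remote and branch names are illustrative):
git checkout main
git pull upstream main                 # sync your local main with the main repository
git checkout -b issue-123-my-feature   # one new branch per piece of work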
PRs and Issues
To open and merge PRs, the following is to be respected:
- Always open an issue for the PR you're working on, with as much detail as possible.
- Every PR should address an existing issue, ideally exactly one.
- The title of the PR should follow the schema: Issue <issue-number>: <PR-title>
- Ideally, there should be at least two reviewers per PR
- The author of the PR never gets to merge the PR; a PR can only be merged by a reviewer.
- Use Squash and Merge instead of a regular merge; again, this is done by a reviewer.
- Make sure the commit messages in the Squash Merge are clear and concise and reflect all major changes introduced by the PR.
Style and formatting
- We follow Scalastyle for coding style in Scala. It runs at compile time, but you can check it manually with:
sbt scalastyle
- Scalafmt is used for code formatting; its rules live in the .scalafmt.conf file at the repository root (see the sketch after this list). You can configure your IDE to reformat on save, or alternatively force code formatting from the command line:
sbt scalafmt # Format main sources
sbt test:scalafmt # Format test sources
sbt scalafmtCheck # Check if the scala sources under the project have been formatted
sbt scalafmtSbt # Format *.sbt and project/*.scala files
sbt scalafmtSbtCheck # Check if the files have been formatted by scalafmtSbt
- Sbt also checks the format of the Scala docs when publishing the artifacts. The following command will check and generate the Scaladocs:
sbt doc
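For reference, this is a minimal sketch of what a .scalafmt.conf typically contains; the values below are illustrative, so check the actual file in the repository:
# .scalafmt.conf (illustrative values)
version = 3.7.17
runner.dialect = scala212
maxColumn = 100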
Logging Documentation
- We use the Spark Logging interface.
- Spark uses log4j for logging. You can configure it by adding a log4j2.properties file in the conf directory. One way to start is to copy the existing log4j2.properties.template located there.
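For example, once you have copied the template, you can raise the verbosity for Qbeast classes only while keeping everything else at INFO; the io.qbeast package name below is an assumption, adjust it to the packages you are working on:
# conf/log4j2.properties (excerpt)
rootLogger.level = info
# Show DEBUG and TRACE messages coming from Qbeast classes
logger.qbeast.name = io.qbeast
logger.qbeast.level = trace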
An example of using logging on a class is:
import org.apache.spark.internal.Logging

case class MyClass() extends Logging {

  def myMethod(): Unit = {
    logInfo("This is an info message")
    logWarning("This is a warning message")
    logError("This is an error message")
    logTrace("This is a trace message")
    logDebug("This is a debug message")
  }
}
The following log levels are used to track code behaviour:
- WARN level is supposed to be critical and actionable. If the user sees a WARN, then something bad happened and it might require user intervention. Example on the DeltaMetadataWriter class:
def writeWithTransaction(writer: => (TableChanges, Seq[FileAction])): Unit = {
  // [...] Code to write the transaction [...]
  if (txn.appId == appId && version <= txn.version) {
    val message = s"Transaction $version from application $appId is already completed," +
      " the requested write is ignored"
    logWarning(message)
    return
  }
}
- INFO level provides information about the execution, but it is not necessarily actionable, and it avoids being verbose. It is not uncommon to see the INFO level on in production, so it is expected to be lightweight with respect to the volume of messages generated. Example on the BlockWriter class:
def writeRow(rows: Iterator[InternalRow]): Iterator[(AddFile, TaskStats)] = {
  // [...] Code to write the rows [...]
  logInfo(s"Adding file ${file.path}")
}
- DEBUG provides debug-level information when debugging the code. It can be verbose, as it is not expected to be on in production. Example on the IndexedTable class:
if (isNewRevision(options)) {
  // Merging revisions code
  logDebug(
    s"Merging transformations for table $tableID with cubeSize=$newRevisionCubeSize")
  // Code to merge revisions
}
- TRACE provides further detail than DEBUG on execution paths; in particular, it indicates the execution of critical methods. Example on the IndexedTable class:
def doWrite(
    data: DataFrame,
    indexStatus: IndexStatus,
    options: QbeastOptions,
    append: Boolean): Unit = {
  logTrace(s"Begin: Writing data to table $tableID")
  // [...] Code to write the data [...]
  logTrace(s"End: Writing data to table $tableID")
}
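A quick way to see DEBUG and TRACE messages while testing locally, without editing log4j2.properties, is to change the log level from the Spark shell:
// In spark-shell: lower the log level so DEBUG and TRACE messages are printed
spark.sparkContext.setLogLevel("TRACE")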
We should enforce that all Pull Requests, especially those containing critical code, include logging messages that are meaningful and informative.
Developing and contributing
Development set up
1. Install sbt (>= 1.4.7).
2. Install Spark
Download Spark 3.5.0 with Hadoop 3.3*, extract it, and create the SPARK_HOME environment variable:
*: You can use Hadoop 2.7 or 3.2 if desired, but you could run into trouble with different cloud providers' storage; read more about it here.
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzvf spark-3.5.0-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.5.0-bin-hadoop3
3. Project packaging:
Navigate to the repository folder and package the project using sbt. JDK 8 is recommended.
ℹ️ Note: You can specify custom Spark or Hadoop versions when packaging by using -Dspark.version=3.5.0 or -Dhadoop.version=2.7.4 when running sbt assembly. If you have trouble with the versions you use, don't hesitate to ask the community in GitHub discussions.
cd qbeast-spark
sbt assembly
This command generates a fat jar with all required dependencies (or most of them) shaded.
The jar does not include Scala, Spark, or Delta; it is meant to be used inside a Spark session.
For example:
- Delta:
sbt assembly
$SPARK_HOME/bin/spark-shell \
  --jars ./target/scala-2.12/qbeast-spark-assembly-0.8.0-SNAPSHOT.jar \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf spark.sql.extensions=io.qbeast.sql.DeltaQbeastSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.qbeast.catalog.DeltaQbeastCatalog
- Hudi:
sbt assembly
$SPARK_HOME/bin/spark-shell \
  --jars ./target/scala-2.12/qbeast-spark-assembly-0.8.0-SNAPSHOT.jar \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:1.0.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
  --conf spark.sql.extensions=io.qbeast.sql.HudiQbeastSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=io.qbeast.catalog.HudiQbeastCatalog
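Once the shell is up, a quick way to verify the build is to write and read a small dataset in the qbeast format; the output path below is just an example:
// Inside the spark-shell started above
val df = spark.range(100).toDF("id")
df.write
  .format("qbeast")
  .option("columnsToIndex", "id") // columns used to build the index
  .save("/tmp/qbeast-smoke-test")
spark.read.format("qbeast").load("/tmp/qbeast-smoke-test").show(5)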
4. Publishing artefacts in the local repository
Sometimes it is convenient to publish custom versions of the library to a local repository such as Ivy or Maven.
For the local Ivy repository (~/.ivy2) use
sbt publishLocal
For the local Maven repository (~/.m2) use
sbt publishM2
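You can then depend on the locally published artifact from another sbt project; the organization and artifact names below are illustrative, so use the coordinates printed by the publish command:
// build.sbt of a project consuming the locally published artifact (coordinates are illustrative)
libraryDependencies += "io.qbeast" %% "qbeast-spark" % "0.8.0-SNAPSHOT"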
Developer documentation
You can find the developer documentation (Scala docs) at https://docs.qbeast.io/.