Quick start using Apache Spark

Instead of using a container, you can run Spruce directly on Apache Spark either locally or on a cluster.

Prerequisites

You will need CUR reports as input. These are generated via AWS Data Exports and stored on S3 as Parquet files.

For this tutorial, we will assume that you have copied the S3 files to your local file system. You can do this with the AWS CLI:

aws s3 cp s3://bucket/path_to_curs ./curs --recursive

To run Spruce locally, you need Apache Spark installed and added to your $PATH.
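
A quick way to verify that Spark is installed and on the $PATH is to ask it for its version:

spark-submit --version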

Finally, you need the JAR containing the code and resources for Spruce. You can download it from the latest release or, alternatively, build it from source, which requires Apache Maven and Java 17 or above:

mvn clean package
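
The build writes the JAR under ./target; you can confirm it is there before submitting the job:

ls ./target/spruce-*.jar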

Run on Apache Spark

If you downloaded a released JAR, make sure the path in the command below matches its location.

spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i ./curs -o ./output

The -i parameter specifies the directory containing the CUR reports in Parquet format. The -o parameter specifies the directory where the enriched Parquet files are written.

The -c option allows you to specify a JSON configuration file that overrides the default settings.
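
As an illustration, a run with overrides could look like the command below. Note that config.json is a placeholder name for your own file; its contents depend on which default settings you want to override.

# config.json is a placeholder for your own overrides file
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i ./curs -o ./output -c ./config.json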

The output directory contains an enriched copy of the input CURs. See Explore the results to understand what the output contains.
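
For a quick first look before that, the spark-sql shell bundled with Spark can query the enriched Parquet files directly (this is just a convenience check, not part of the Spruce workflow):

spark-sql -e 'SELECT * FROM parquet.`./output` LIMIT 5'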