Read / write to S3
A few additional steps are needed to read input from and write output to AWS S3. For more details, see the Spark documentation.
Make sure your AWS credentials are exported as environment variables
eval "$(aws configure export-credentials --profile default --format env)"
If you want to write the outputs to S3, make sure the target bucket exists
aws s3 mb s3://MY_TARGET_BUCKET
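To check that the bucket exists and is reachable with the exported credentials, you can list it (another optional AWS CLI check)
aws s3 ls s3://MY_TARGET_BUCKET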
With Spark installed locally
Simply launch the job with spark-submit as usual
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
where BUCKET_WITH_CURS/PATH/ is the bucket name and path of the input CURs and OUTPUT_BUCKET/PATH/ is the bucket name and path for the output.
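If your local Spark build does not already include the S3A connector, spark-submit fails with a ClassNotFoundException for org.apache.hadoop.fs.s3a.S3AFileSystem. One way to add it is the standard --packages option; the hadoop-aws version below is only an example and should match the Hadoop version your Spark distribution was built against
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/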
With Docker
You need to pass the AWS credentials to the container as environment variables with -e, e.g.
docker run -it --rm --name spruce \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
ghcr.io/digitalpebble/spruce \
/opt/spark/bin/spark-submit \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
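If the profile uses temporary credentials (for example AWS SSO or an assumed role), the export-credentials command above also sets AWS_SESSION_TOKEN, which then needs to be passed through as well so the S3A connector can authenticate, e.g.
docker run -it --rm --name spruce \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_SESSION_TOKEN \
ghcr.io/digitalpebble/spruce \
/opt/spark/bin/spark-submit \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/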