## Read / write to S3
A few additional steps are needed to read input data from and write results to AWS S3. For more details, see the [Spark documentation on cloud integration](https://spark.apache.org/docs/latest/cloud-integration.html).
Make sure your AWS credentials are exported as environment variables:

```bash
eval "$(aws configure export-credentials --profile default --format env)"
```
If you want to write the output to S3, make sure the target bucket exists:

```bash
aws s3 mb s3://MY_TARGET_BUCKET
```
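To double-check that the bucket is reachable with the exported credentials, you can list its contents (it will simply be empty at first):

```bash
# Fails with an access or "NoSuchBucket" error if the bucket is not usable
aws s3 ls s3://MY_TARGET_BUCKET
```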
To be able to connect to S3, you will need to uncomment the `spark-hadoop-cloud_2.13` dependency in `pom.xml`:

```xml
<artifactId>spark-hadoop-cloud_2.13</artifactId>
```
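As a sanity check after editing `pom.xml` (assuming Maven is installed), you can confirm that the dependency is now resolved:

```bash
# The resolved dependency tree should now include spark-hadoop-cloud
mvn dependency:tree | grep spark-hadoop-cloud
```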
### With Spark installed in local mode
Recompile the jar file with `mvn clean package`, then launch Spark in the usual way:

```bash
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```

where `BUCKET_WITH_CURS/PATH/` is the bucket name and path of the input CURs and `OUTPUT_BUCKET/PATH/` the bucket name and path for the output.
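If you prefer not to rely on exported environment variables, the S3A connector can also be given credentials explicitly. A sketch of the same command using the standard Hadoop `fs.s3a.*` properties (these are generic S3A settings, not something specific to Spruce):

```bash
# Pass S3A credentials explicitly instead of relying on the environment
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```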
### With Docker
After editing `pom.xml` (see above), build a new Docker image:

```bash
docker build . -t spruce_s3
```
Pass the AWS credentials to the container as environment variables with `-e`, e.g.

```bash
docker run -it --rm --name spruce \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  spruce_s3 \
  /opt/spark/bin/spark-submit \
  --class com.digitalpebble.spruce.SparkJob \
  --driver-memory 4g \
  --master 'local[*]' \
  /usr/local/lib/spruce.jar \
  -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```
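If your credentials are temporary (for example obtained via SSO or an assumed role), the session token has to be forwarded to the container as well. A sketch of the same command with the extra variable:

```bash
docker run -it --rm --name spruce \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  spruce_s3 \
  /opt/spark/bin/spark-submit \
  --class com.digitalpebble.spruce.SparkJob \
  --driver-memory 4g \
  --master 'local[*]' \
  /usr/local/lib/spruce.jar \
  -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```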