## Read / write to S3
A few additional steps are needed to read input data from and write results to AWS S3. For more details, see the [Spark documentation on cloud integration](https://spark.apache.org/docs/latest/cloud-integration.html).
Make sure your AWS credentials are exported as environment variables:

```bash
eval "$(aws configure export-credentials --profile default --format env)"
```
If you want to write the output to S3, make sure the target bucket exists:

```bash
aws s3 mb s3://MY_TARGET_BUCKET
```
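To double-check that the bucket is reachable with the exported credentials, you can list its contents (it will simply be empty at first):

```bash
# Fails with an access or "NoSuchBucket" error if the bucket is not usable
aws s3 ls s3://MY_TARGET_BUCKET
```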
To be able to connect to S3, you will need to uncomment the `spark-hadoop-cloud_2.13` dependency in `pom.xml`:

```xml
<artifactId>spark-hadoop-cloud_2.13</artifactId>
```
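As a sanity check after editing `pom.xml` (assuming Maven is installed), you can confirm that the dependency is now resolved:

```bash
# The resolved dependency tree should now include spark-hadoop-cloud
mvn dependency:tree | grep spark-hadoop-cloud
```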
### With Spark installed in local mode
Recompile the jar file with `mvn clean package`, then launch Spark in the usual way:

```bash
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```

where `BUCKET_WITH_CURS/PATH/` is the bucket name and path of the input CURs and `OUTPUT_BUCKET/PATH/` the bucket name and path for the output.
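If you prefer not to rely on exported environment variables, the S3A connector can also be given credentials explicitly. A sketch of the same command using the standard Hadoop `fs.s3a.*` properties (these are generic S3A settings, not something specific to Spruce):

```bash
# Pass S3A credentials explicitly instead of relying on the environment
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```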
### With Docker
After editing `pom.xml` (see above), build a new Docker image:

```bash
docker build . -t spruce_s3
```
Pass the AWS credentials to the container as environment variables with `-e`, e.g.

```bash
docker run -it --rm --name spruce \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  spruce_s3 \
  /opt/spark/bin/spark-submit \
  --class com.digitalpebble.spruce.SparkJob \
  --driver-memory 4g \
  --master 'local[*]' \
  /usr/local/lib/spruce.jar \
  -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```
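If your credentials are temporary (for example obtained via SSO or an assumed role), the session token has to be forwarded to the container as well. A sketch of the same command with the extra variable:

```bash
docker run -it --rm --name spruce \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  spruce_s3 \
  /opt/spark/bin/spark-submit \
  --class com.digitalpebble.spruce.SparkJob \
  --driver-memory 4g \
  --master 'local[*]' \
  /usr/local/lib/spruce.jar \
  -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
```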