SPRUCE helps estimate the environmental impact of your cloud usage. By leveraging open source models and data, it enriches usage reports generated by cloud providers and allows you to build reports and visualisations. Having the GreenOps and FinOps data in the same place makes it easier to expose your costs and impacts side by side.
Please note that SPRUCE currently handles only CUR reports from AWS, and that not all AWS services are covered. However, most of the cost of a typical account already gets estimates.
SPRUCE uses Apache Spark to read and write the usage reports (typically in Parquet format) in a scalable way and, thanks to its modular approach, splits the enrichment of the data into configurable stages.
A typical sequence of stages would be:
- estimation of embodied emissions from the hardware
- estimation of energy used
- application of PUE and other overheads
- application of carbon intensity factors
Have a look at the methodology section for more details.
One of the benefits of using Apache Spark is that you can use EMR on AWS to enrich the CURs at scale without having to export or expose any of your data.
The code of the project is in our GitHub repo.
Spruce is licensed under the Apache License, Version 2.0.
Quick start using Docker 🐳
Prerequisites
You will need CUR reports as inputs. These are generated via Data Exports and stored on S3 as Parquet files.
For this tutorial, we will assume that you copied the S3 files to your local file system. You can do this with the AWS CLI:
aws s3 cp s3://bucket/path_to_curs ./curs --recursive
You will also need to have Docker installed.
With Docker
Pull the latest Docker image with
docker pull ghcr.io/digitalpebble/spruce
This retrieves a Docker image containing Apache Spark as well as the Spruce jar.
The command below processes the data locally by mounting the directories containing the CURs and output as volumes:
docker run -it -v ./curs:/curs -v ./output:/output --rm --name spruce --network host \
ghcr.io/digitalpebble/spruce \
/opt/spark/bin/spark-submit \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i /curs -o /output/enriched
The -i parameter specifies the location of the directory containing the CUR reports in Parquet format.
The -o parameter specifies the location of the enriched Parquet files generated as output.
The -c option allows you to specify a JSON configuration file to override the default settings.
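For example, assuming you have saved a custom configuration as my-config.json in the current directory (the file name is only an example; see Modules configuration below), you could mount it into the container and pass it with -c:
docker run -it -v ./curs:/curs -v ./output:/output -v ./my-config.json:/my-config.json --rm --name spruce --network host \
 ghcr.io/digitalpebble/spruce \
 /opt/spark/bin/spark-submit \
 --class com.digitalpebble.spruce.SparkJob \
 --driver-memory 4g \
 --master 'local[*]' \
 /usr/local/lib/spruce.jar \
 -i /curs -o /output/enriched -c /my-config.json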
The directory output contains an enriched copy of the input CURs. See Explore the results to understand what the output contains.
Quick start using Apache Spark
Instead of using a container, you can run Spruce directly on Apache Spark either locally or on a cluster.
Prerequisites
You will need CUR reports as inputs. These are generated via Data Exports and stored on S3 as Parquet files.
For this tutorial, we will assume that you copied the S3 files to your local file system. You can do this with the AWS CLI:
aws s3 cp s3://bucket/path_to_curs ./curs --recursive
To run Spruce locally, you need Apache Spark installed and added to the $PATH.
Finally, you need the JAR containing the code and resources for Spruce. You can copy it from the latest release or, alternatively, build it from source, which requires Apache Maven and Java 17 or above:
mvn clean package
Run on Apache Spark
If you downloaded a released jar, make sure the path matches its location.
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i ./curs -o ./output
The -i parameter specifies the location of the directory containing the CUR reports in Parquet format.
The -o parameter specifies the location of the enriched Parquet files generated as output.
The -c option allows you to specify a JSON configuration file to override the default settings.
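For example, assuming a custom configuration file named my-config.json (the name is illustrative; see Modules configuration below):
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i ./curs -o ./output -c ./my-config.json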
The directory output contains an enriched copy of the input CURs. See Explore the results to understand what the output contains.
Explore the results
Using DuckDB locally (or Athena if the output was written to S3):
create table enriched_curs as select * from 'output/**/*.parquet';
select line_item_product_code, product_servicecode, line_item_operation,
round(sum(operational_emissions_co2eq_g)/1000, 2) as co2_usage_kg,
round(sum(energy_usage_kwh), 2) as energy_usage_kwh,
round(sum(embodied_emissions_co2eq_g)/1000, 2) as co2_embodied_kg
from enriched_curs where operational_emissions_co2eq_g > 0.01
group by line_item_product_code, product_servicecode, line_item_operation
order by co2_usage_kg desc, co2_embodied_kg desc, energy_usage_kwh desc, product_servicecode;
This should give an output similar to
line_item_product_code | product_servicecode | line_item_operation | co2_usage_kg | energy_usage_kwh | co2_embodied_kg |
---|---|---|---|---|---|
AmazonEC2 | AmazonEC2 | RunInstances | 538.3 | 1220.14 | 303.41 |
AmazonECS | AmazonECS | FargateTask | 181.32 | 399.05 | NULL |
AmazonS3 | AmazonS3 | OneZoneIAStorage | 102.3 | 225.15 | NULL |
AmazonS3 | AmazonS3 | GlacierInstantRetrievalStorage | 75.89 | 167.03 | NULL |
AmazonEC2 | AmazonEC2 | CreateVolume-Gp3 | 41.63 | 91.62 | NULL |
AmazonS3 | AmazonS3 | StandardStorage | 28.51 | 62.81 | NULL |
AmazonDocDB | AmazonDocDB | CreateCluster | 19.79 | 43.56 | NULL |
AmazonECS | AmazonECS | ECSTask-EC2 | 9.26 | 20.37 | NULL |
AmazonS3 | AmazonS3 | IntelligentTieringAIAStorage | 2.33 | 5.13 | NULL |
AmazonEC2 | AmazonEC2 | CreateSnapshot | 2.31 | 5.82 | NULL |
AmazonEC2 | AmazonEC2 | RunInstances:SV001 | 1.79 | 3.94 | 0.78 |
AmazonS3 | AmazonS3 | StandardIAStorage | 1.19 | 2.61 | NULL |
AmazonS3 | AWSDataTransfer | GetObjectForRepl | 1.17 | 2.58 | NULL |
AmazonS3 | AWSDataTransfer | UploadPartForRepl | 1.01 | 2.22 | NULL |
AmazonS3 | AmazonS3 | OneZoneIASizeOverhead | 0.89 | 1.96 | NULL |
AmazonEC2 | AmazonEC2 | CreateVolume-Gp2 | 0.84 | 1.84 | NULL |
AmazonEC2 | AWSDataTransfer | RunInstances | 0.18 | 0.39 | NULL |
AmazonS3 | AWSDataTransfer | PutObjectForRepl | 0.16 | 0.36 | NULL |
AmazonS3 | AmazonS3 | DeleteObject | 0.16 | 0.35 | NULL |
AWSBackup | AWSBackup | Storage | 0.1 | 0.49 | NULL |
AmazonMQ | AmazonMQ | CreateBroker:0001 | 0.02 | 0.04 | NULL |
AmazonECR | AWSDataTransfer | downloadLayer | 0.01 | 0.01 | NULL |
AmazonS3 | AWSDataTransfer | PutObject | 0.0 | 0.0 | NULL |
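Since the enriched files keep the original CUR columns, the estimates can also be sliced along other dimensions. For example, assuming the standard line_item_usage_start_date column is present in your export (and stored as a timestamp), a monthly breakdown could look like this:
select date_trunc('month', line_item_usage_start_date) as month,
round(sum(operational_emissions_co2eq_g)/1000, 2) as co2_usage_kg,
round(sum(energy_usage_kwh), 2) as energy_usage_kwh
from enriched_curs
group by 1
order by 1;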
To measure the proportion of the costs for which emissions were calculated:
select
round(covered * 100 / "total costs", 2) as percentage_costs_covered
from (
select
sum(line_item_unblended_cost) as "total costs",
sum(line_item_unblended_cost) filter (where operational_emissions_co2eq_g is not null) as covered
from
enriched_curs
where
line_item_line_item_type like '%Usage'
);
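The same idea can be applied per service to see where the coverage gaps are, for example:
select line_item_product_code,
round(sum(line_item_unblended_cost), 2) as total_cost,
round(sum(line_item_unblended_cost) filter (where operational_emissions_co2eq_g is not null) * 100 / sum(line_item_unblended_cost), 2) as percentage_costs_covered
from enriched_curs
where line_item_line_item_type like '%Usage'
group by line_item_product_code
order by total_cost desc;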
Methodology
Spruce uses third-party resources and models to estimate the environmental impact of cloud services. It enriches cost usage reports (CUR) with additional columns, allowing users to do GreenOps and build dashboards and reports.
Unlike the information provided by CSPs (Cloud Service Providers), Spruce gives total transparency on how the estimates are built.
The overall approach is as follows:
- Estimate the energy used per activity (e.g. for X GB of data transferred, usage of an EC2 instance, storage, etc.)
- Add overheads (e.g. PUE, WUE)
- Apply accurate carbon intensity factors - ideally for a specific location at a specific time
- Where possible, estimate the embodied carbon related to the activity
This is compliant with the SCI specification from the Green Software Foundation.
Enrichment modules
Spruce generates the estimates above by chaining EnrichmentModules, each of them relying on columns found in the usage reports or produced by preceding modules.
For instance, the AverageCarbonIntensity.java module applies average carbon intensity factors to energy estimates based on the region in order to generate operational emissions.
The list of columns generated by the modules can be found in the SpruceColumn class.
The enrichment modules are listed and configured in a configuration file. If no configuration is specified, the default one is used.
ccf.Storage
Provides an estimate of the energy used for storage by applying a flat coefficient per GB, following the approach used by the Cloud Carbon Footprint project. See methodology for more details.
Populates the column energy_usage_kwh.
ccf.Networking
Provides an estimate of the energy used for networking in and out of data centres. Applies a flat coefficient per GB, following the approach used by the Cloud Carbon Footprint project. See methodology for more details.
Populates the column energy_usage_kwh.
boavizta.BoaviztAPI
Provides an estimate of final energy used for computation (EC2, OpenSearch, RDS) as well as the related embodied emissions using the BoaviztAPI.
Populates the columns energy_usage_kwh and embodied_emissions_co2eq_g.
boavizta.BoaviztAPIstatic
Similar to the previous module, but instead of querying a running instance of the BoaviztAPI it uses a static file generated from it. This makes Spruce simpler to use.
ccf.PUE
Applies a fixed ratio for Power Usage Effectiveness (PUE) to rows for which energy usage has been estimated, following the approach used by the Cloud Carbon Footprint project. See the CCF methodology for more details.
Populates the column power_usage_effectiveness.
electricitymaps.AverageCarbonIntensity
Adds average carbon intensity factors generated from ElectricityMaps' 2024 datasets. The life-cycle emission factors are used.
Populates the column carbon_intensity.
OperationalEmissions
Computes operational emissions based on the energy usage, average carbon intensity factors and power_usage_effectiveness estimated by the preceding modules.
Populates the column operational_emissions_co2eq_g, which is equal to energy_usage_kwh * carbon_intensity * power_usage_effectiveness.
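For example (with purely illustrative numbers), a row with an estimated energy usage of 2 kWh, a carbon intensity of 400 gCO2eq/kWh and a PUE of 1.2 would be assigned operational emissions of 2 * 400 * 1.2 = 960 gCO2eq.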
Read / write to S3
A few additional steps are needed in order to process data from and to AWS S3. For more details, look at the Spark documentation.
Make sure your AWS keys are exported as environment variables
eval "$(aws configure export-credentials --profile default --format env)"
If you want to write the outputs to S3, make sure the target bucket exists
aws s3 mb s3://MY_TARGET_BUCKET
With Spark installed in local mode
Simply launch Spark in the normal way
spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
where BUCKET_WITH_CURS/PATH/ is the bucket name and path of the input CURs and OUTPUT_BUCKET/PATH/ the bucket name and path for the output.
With Docker
You need to pass the AWS credentials to the container as environment variables with -e, e.g.
docker run -it --rm --name spruce \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
ghcr.io/digitalpebble/spruce \
/opt/spark/bin/spark-submit \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/
Modules configuration
The enrichment modules are configured in a file called default-config.json. This file is included in the JAR and looks like this:
{
"modules": [
{
"className": "com.digitalpebble.spruce.modules.ccf.Storage",
"config": {
"hdd_gb_coefficient": 0.65,
"ssd_gb_coefficient": 1.2
}
},
{
"className": "com.digitalpebble.spruce.modules.ccf.Networking",
"config": {
"network_coefficient": 0.001
}
},
{
"className": "com.digitalpebble.spruce.modules.boavizta.BoaviztAPIstatic"
},
{
"className": "com.digitalpebble.spruce.modules.ccf.PUE"
},
{
"className": "com.digitalpebble.spruce.modules.electricitymaps.AverageCarbonIntensity"
},
{
"className": "com.digitalpebble.spruce.modules.OperationalEmissions"
}
]
}
This determines which modules are used and in what order, but also configures their behaviour. For instance, the default coefficient set for the ccf.Networking module is 0.001 kWh/GB.
Change the configuration
To use a different configuration, for instance to replace a module with another one or to change a module's settings (like the network coefficient above), simply write a JSON file with your changes and pass it as an argument to the Spark job with -c.
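For example, a configuration that keeps the default module chain but halves the networking coefficient could look like the sketch below (the full module list is repeated here because the file also determines which modules run and in what order):
{
  "modules": [
    {
      "className": "com.digitalpebble.spruce.modules.ccf.Storage",
      "config": {
        "hdd_gb_coefficient": 0.65,
        "ssd_gb_coefficient": 1.2
      }
    },
    {
      "className": "com.digitalpebble.spruce.modules.ccf.Networking",
      "config": {
        "network_coefficient": 0.0005
      }
    },
    {
      "className": "com.digitalpebble.spruce.modules.boavizta.BoaviztAPIstatic"
    },
    {
      "className": "com.digitalpebble.spruce.modules.ccf.PUE"
    },
    {
      "className": "com.digitalpebble.spruce.modules.electricitymaps.AverageCarbonIntensity"
    },
    {
      "className": "com.digitalpebble.spruce.modules.OperationalEmissions"
    }
  ]
}
Save it as e.g. my-config.json and pass it to the Spark job with -c ./my-config.json, or mount it into the container when running with Docker.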
Contributing to Spruce
Thank you for your interest in contributing to Spruce. As an open-source community, we highly appreciate contributions to our project.
To make the process smooth for the project committers (those who review and accept changes) and contributors (those who propose new changes via pull requests), there are a few rules to follow.
Contribution Guidelines
We use GitHub Issues and Pull Requests for tracking contributions. We expect participants to adhere to the GitHub Community Guidelines (https://help.github.com/articles/github-community-guidelines/) as well as our Code of Conduct.
Please note that your contributions will be under the Apache License, Version 2.0.
Get Involved
The Spruce project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component. We especially welcome contributions from first-time users.
GitHub Discussions
Feel free to use GitHub Discussions to ask any questions you might have when planning your first contribution.
Making a Contribution
- Create a new issue on GitHub. Please describe the problem or improvement in the body of the issue. For larger issues, please open a new discussion and describe the problem.
- Next, create a pull request in GitHub.
Contributors who have a history of successful participation are invited to join the project as a committer.