SPRUCE

SPRUCE helps estimate the environmental impact of your cloud usage. By leveraging open source models and data, it enriches usage reports generated by cloud providers and allows you to build reports and visualisations. Having the GreenOps and FinOps data in the same place makes it easier to expose your costs and impacts side by side.

Please note that SPRUCE currently handles only CUR reports from AWS, and not all AWS services are covered. However, the services that account for most of the cost in typical usage already get estimates.

SPRUCE uses Apache Spark to read and write the usage reports (typically in Parquet format) in a scalable way and, thanks to its modular approach, splits the enrichment of the data into configurable stages.

A typical sequence of stages would be:

  • estimation of embodied emissions from the hardware
  • estimation of energy used
  • application of PUE and other overheads
  • application of carbon intensity factors

Have a look at the methodology section for more details.

One of the benefits of using Apache Spark is that you can use EMR on AWS to enrich the CURs at scale without having to export or expose any of your data.

The code of the project is in our GitHub repo.

Spruce is licensed under the Apache License, Version 2.0.

Quick start using Docker 🐳

Prerequisites

You will need to have CUR reports as inputs. Those are generated via DataExports and stored on S3 as Parquet files.

For this tutorial, we will assume that you copied the S3 files to your local file system. You can do this with the AWS CLI:

aws s3 cp s3://bucket/path_to_curs ./curs --recursive

You will also need to have Docker installed.

With Docker

Pull the latest Docker image with

docker pull ghcr.io/digitalpebble/spruce

This retrieves a Docker image containing Apache Spark as well as the Spruce jar.

The command below processes the data locally by mounting the directories containing the CURs and output as volumes:

docker run -it -v ./curs:/curs -v ./output:/output --rm --name spruce --network host \
ghcr.io/digitalpebble/spruce \
/opt/spark/bin/spark-submit  \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i /curs -o /output/enriched

The -i parameter specifies the location of the directory containing the CUR reports in Parquet format. The -o parameter specifies the location where the enriched Parquet files are written.

The -c option allows you to specify a JSON configuration file to override the default settings.
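
For example, to use a custom configuration with Docker, the file also needs to be mounted into the container. The file name my-config.json and the /config mount point below are purely illustrative:

docker run -it -v ./curs:/curs -v ./output:/output -v ./my-config.json:/config/my-config.json --rm --name spruce --network host \
ghcr.io/digitalpebble/spruce \
/opt/spark/bin/spark-submit  \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i /curs -o /output/enriched -c /config/my-config.json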

The output directory contains an enriched copy of the input CURs. See Explore the results to understand what the output contains.

Quick start using Apache Spark

Instead of using a container, you can run Spruce directly on Apache Spark either locally or on a cluster.

Prerequisites

You will need to have CUR reports as inputs. Those are generated via DataExports and stored on S3 as Parquet files.

For this tutorial, we will assume that you copied the S3 files to your local file system. You can do this with the AWS CLI:

aws s3 cp s3://bucket/path_to_curs ./curs --recursive

To run Spruce locally, you need Apache Spark installed and added to the $PATH.

Finally, you need the JAR containing the code and resources for Spruce. You can copy it from the latest release or, alternatively, build it from source, which requires Apache Maven and Java 17 or above:

mvn clean package

Run on Apache Spark

If you downloaded a released jar, make sure the path matches its location.

spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i ./curs -o ./output

The -i parameter specifies the location of the directory containing the CUR reports in Parquet format. The -o parameter specifies the location where the enriched Parquet files are written.

The -c option allows you to specify a JSON configuration file to override the default settings.

The output directory contains an enriched copy of the input CURs. See Explore the results to understand what the output contains.

Explore the results

Using DuckDB locally (or Athena if the output was written to S3):

create table enriched_curs as select * from 'output/**/*.parquet';

select line_item_product_code, product_servicecode, line_item_operation,
       round(sum(operational_emissions_co2eq_g)/1000, 2) as co2_usage_kg,
       round(sum(energy_usage_kwh), 2) as energy_usage_kwh,
       round(sum(embodied_emissions_co2eq_g)/1000, 2) as co2_embodied_kg
       from enriched_curs where operational_emissions_co2eq_g > 0.01
       group by line_item_product_code, product_servicecode, line_item_operation
       order by co2_usage_kg desc, co2_embodied_kg desc, energy_usage_kwh desc, product_servicecode;

This should give an output similar to

line_item_product_code | product_servicecode | line_item_operation | co2_usage_kg | energy_usage_kwh | co2_embodied_kg
AmazonEC2 | AmazonEC2 | RunInstances | 538.3 | 1220.14 | 303.41
AmazonECS | AmazonECS | FargateTask | 181.32 | 399.05 | NULL
AmazonS3 | AmazonS3 | OneZoneIAStorage | 102.3 | 225.15 | NULL
AmazonS3 | AmazonS3 | GlacierInstantRetrievalStorage | 75.89 | 167.03 | NULL
AmazonEC2 | AmazonEC2 | CreateVolume-Gp3 | 41.63 | 91.62 | NULL
AmazonS3 | AmazonS3 | StandardStorage | 28.51 | 62.81 | NULL
AmazonDocDB | AmazonDocDB | CreateCluster | 19.79 | 43.56 | NULL
AmazonECS | AmazonECS | ECSTask-EC2 | 9.26 | 20.37 | NULL
AmazonS3 | AmazonS3 | IntelligentTieringAIAStorage | 2.33 | 5.13 | NULL
AmazonEC2 | AmazonEC2 | CreateSnapshot | 2.31 | 5.82 | NULL
AmazonEC2 | AmazonEC2 | RunInstances:SV001 | 1.79 | 3.94 | 0.78
AmazonS3 | AmazonS3 | StandardIAStorage | 1.19 | 2.61 | NULL
AmazonS3 | AWSDataTransfer | GetObjectForRepl | 1.17 | 2.58 | NULL
AmazonS3 | AWSDataTransfer | UploadPartForRepl | 1.01 | 2.22 | NULL
AmazonS3 | AmazonS3 | OneZoneIASizeOverhead | 0.89 | 1.96 | NULL
AmazonEC2 | AmazonEC2 | CreateVolume-Gp2 | 0.84 | 1.84 | NULL
AmazonEC2 | AWSDataTransfer | RunInstances | 0.18 | 0.39 | NULL
AmazonS3 | AWSDataTransfer | PutObjectForRepl | 0.16 | 0.36 | NULL
AmazonS3 | AmazonS3 | DeleteObject | 0.16 | 0.35 | NULL
AWSBackup | AWSBackup | Storage | 0.1 | 0.49 | NULL
AmazonMQ | AmazonMQ | CreateBroker:0001 | 0.02 | 0.04 | NULL
AmazonECR | AWSDataTransfer | downloadLayer | 0.01 | 0.01 | NULL
AmazonS3 | AWSDataTransfer | PutObject | 0.0 | 0.0 | NULL

To measure the proportion of the costs for which emissions were calculated:

select
  round(covered * 100 / "total costs", 2) as percentage_costs_covered
from (
  select
    sum(line_item_unblended_cost) as "total costs",
    sum(line_item_unblended_cost) filter (where operational_emissions_co2eq_g is not null) as covered
  from
    enriched_curs
  where
    line_item_line_item_type like '%Usage'
);
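
To see where the gaps are, the same idea can be applied per service. This is a sketch reusing the columns from the queries above:

select
  line_item_product_code,
  -- share of the cost that received an operational emissions estimate, per service
  round(coalesce(sum(line_item_unblended_cost) filter (where operational_emissions_co2eq_g is not null), 0)
        * 100 / sum(line_item_unblended_cost), 2) as percentage_costs_covered
from enriched_curs
where line_item_line_item_type like '%Usage'
group by line_item_product_code
order by percentage_costs_covered desc;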

Methodology

Spruce uses third-party resources and models to estimate the environmental impact of cloud services. It enriches cost and usage reports (CURs) with additional columns, allowing users to do GreenOps and build dashboards and reports.

Unlike the information provided by CSPs (Cloud Service Providers), Spruce gives total transparency on how the estimates are built.

The overall approach is as follows:

  1. Estimate the energy used per activity (e.g. for X GB of data transferred, usage of an EC2 instance, storage, etc.)
  2. Add overheads (e.g. PUE, WUE)
  3. Apply accurate carbon intensity factors - ideally for a specific location at a specific time
  4. Where possible, estimate the embodied carbon related to the activity

This is compliant with the SCI specification from the Green Software Foundation.
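
To make this concrete with purely illustrative numbers: for a usage line estimated at 2 kWh, a PUE of 1.135 and a carbon intensity of 400 gCO2eq/kWh, the operational emissions would be 2 × 1.135 × 400 ≈ 908 gCO2eq, with the embodied emissions attributable to that usage reported separately.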

Enrichment modules

Spruce generates the estimates above by chaining EnrichmentModules, each of them relying on columns found in the usage reports or produced by preceding modules.

For instance, the AverageCarbonIntensity.java module applies average carbon intensity factors to energy estimates based on the region in order to generate operational emissions.

The list of columns generated by the modules can be found in the SpruceColumn class.
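
If you have loaded the enriched output into DuckDB as shown in Explore the results, you can also list these columns (alongside the original CUR columns) directly:

describe enriched_curs;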

The enrichment modules are listed and configured in a configuration file. If no configuration is specified, the default one is used.

ccf.Storage

Provides an estimate of the energy used for storage by applying a flat coefficient per GB, following the approach used by the Cloud Carbon Footprint project. See methodology for more details.

Populates the column energy_usage_kwh.

ccf.Networking

Provides an estimate of the energy used for networking in and out of data centres. Applies a flat coefficient per GB, following the approach used by the Cloud Carbon Footprint project. See methodology for more details.

Populates the column energy_usage_kwh.
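
For example, with the default coefficient of 0.001 kWh/GB (see Modules configuration), transferring 500 GB in or out of a data centre is estimated at 0.5 kWh.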

boavizta.BoaviztAPI

Provides an estimate of final energy used for computation (EC2, OpenSearch, RDS) as well as the related embodied emissions using the BoaviztAPI.

Populates the columns energy_usage_kwh and embodied_emissions_co2eq_g.

boavizta.BoaviztAPIstatic

Similar to the previous module, but instead of querying a running instance of the BoaviztAPI, it reads the information from a static file generated from it. This makes Spruce simpler to use.

ccf.PUE

Applies a fixed ratio for Power Usage Effectiveness to rows for which energy usage has been estimated, following the approach used by the Cloud Carbon Footprint project. See the CCF methodology for more details.

Populates the column power_usage_effectiveness.

electricitymaps.AverageCarbonIntensity

Adds average carbon intensity factors generated from ElectricityMaps' 2024 datasets. The life-cycle emission factors are used.

Populates the column carbon_intensity.

OperationalEmissions

Computes operational emissions based on the energy usage, average carbon intensity factors and power_usage_effectiveness estimated by the preceding modules.

Populates the column operational_emissions_co2eq_g.

operational_emissions_co2eq_g is equal to energy_usage_kwh * carbon_intensity * power_usage_effectiveness.
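
As a quick sanity check in DuckDB (a sketch using the enriched_curs table created in Explore the results), the value can be recomputed from its three inputs:

select
  energy_usage_kwh,
  carbon_intensity,
  power_usage_effectiveness,
  -- should match the value produced by the OperationalEmissions module
  energy_usage_kwh * carbon_intensity * power_usage_effectiveness as recomputed_co2eq_g,
  operational_emissions_co2eq_g
from enriched_curs
where operational_emissions_co2eq_g is not null
limit 10;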

Read / write to S3

A few additional steps are needed in order to process data from and to AWS S3. For more details, look at the Spark documentation.

Make sure your AWS keys are exported as environment variables

eval "$(aws configure export-credentials --profile default --format env)"

If you want to write the outputs to S3, make sure the target bucket exists

 aws s3 mb s3://MY_TARGET_BUCKET

With Spark installed in local mode

Simply launch Spark in the normal way

spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/

where BUCKET_WITH_CURS/PATH/ is the bucket name and path of the input CURs and OUTPUT_BUCKET/PATH/ the bucket name and path for the output.

With Docker

You need to pass the AWS credentials to the container as environment variables with -e, e.g.

docker run -it --rm --name spruce \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
ghcr.io/digitalpebble/spruce  \
/opt/spark/bin/spark-submit  \
--class com.digitalpebble.spruce.SparkJob \
--driver-memory 4g \
--master 'local[*]' \
/usr/local/lib/spruce.jar \
-i s3a://BUCKET_WITH_CURS/PATH/ -o s3a://OUTPUT_BUCKET/PATH/

Modules configuration

The enrichment modules are configured in a file called default-config.json. This file is included in the JAR and looks like this:

{
  "modules": [
    {
      "className": "com.digitalpebble.spruce.modules.ccf.Storage",
      "config": {
        "hdd_gb_coefficient": 0.65,
        "ssd_gb_coefficient": 1.2
      }
    },
    {
      "className": "com.digitalpebble.spruce.modules.ccf.Networking",
      "config": {
        "network_coefficient": 0.001
      }
    },
    {
      "className": "com.digitalpebble.spruce.modules.boavizta.BoaviztAPIstatic"
    },
    {
      "className": "com.digitalpebble.spruce.modules.ccf.PUE"
    },
    {
      "className": "com.digitalpebble.spruce.modules.electricitymaps.AverageCarbonIntensity"
    },
    {
      "className": "com.digitalpebble.spruce.modules.OperationalEmissions"
    }
  ]
}

This determines which modules are used and in what order, but also configures their behaviour. For instance, the default coefficient set for the ccf.Networking module is 0.001 kWh/GB.

Change the configuration

In order to use a different configuration, for instance to replace a module with another one or to change a module's settings (like the network coefficient above), you simply need to write a JSON file with your changes and pass it as an argument to the Spark job with -c.
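
For example, a custom configuration halving the networking coefficient could look like this (the file name my-config.json and the coefficient value are purely illustrative; the other modules keep the default settings shown above):

{
  "modules": [
    {
      "className": "com.digitalpebble.spruce.modules.ccf.Storage",
      "config": {
        "hdd_gb_coefficient": 0.65,
        "ssd_gb_coefficient": 1.2
      }
    },
    {
      "className": "com.digitalpebble.spruce.modules.ccf.Networking",
      "config": {
        "network_coefficient": 0.0005
      }
    },
    {
      "className": "com.digitalpebble.spruce.modules.boavizta.BoaviztAPIstatic"
    },
    {
      "className": "com.digitalpebble.spruce.modules.ccf.PUE"
    },
    {
      "className": "com.digitalpebble.spruce.modules.electricitymaps.AverageCarbonIntensity"
    },
    {
      "className": "com.digitalpebble.spruce.modules.OperationalEmissions"
    }
  ]
}

It would then be passed to the Spark job from the quick start like this:

spark-submit --class com.digitalpebble.spruce.SparkJob --driver-memory 8g ./target/spruce-*.jar -i ./curs -o ./output -c ./my-config.json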

Contributing to Spruce

Thank you for your interest in contributing to Spruce. As an open-source community, we highly appreciate contributions to the project.

To make the process smooth for the project committers (those who review and accept changes) and contributors (those who propose new changes via pull requests), there are a few rules to follow.

Contribution Guidelines

We use GitHub Issues and Pull Requests for tracking contributions. We expect participants to adhere to the GitHub Community Guidelines (found at https://help.github.com/articles/github-community-guidelines/ ) as well as our Code of Conduct.

Please note that your contributions will be licensed under the Apache License, Version 2.0.

Get Involved

The Spruce project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component. We especially welcome contributions from first-time users.

GitHub Discussions

Feel free to use GitHub Discussions to ask any questions you might have when planning your first contribution.

Making a Contribution

  • Create a new issue on GitHub. Please describe the problem or improvement in the body of the issue. For larger issues, please open a new discussion and describe the problem.
  • Next, create a pull request in GitHub.

Contributors who have a history of successful participation are invited to join the project as committers.