Write a New Module

An EnrichmentModule is the unit of extension in SPRUCE. Each module reads columns from the CUR input row and/or from values set by earlier modules, then writes its results into a shared map. The pipeline materialises one output row per CUR row at the end, avoiding per-module row copies.

Implement the interface

Create a class that implements com.digitalpebble.spruce.EnrichmentModule:

package com.example.spruce.modules;

import com.digitalpebble.spruce.Column;
import com.digitalpebble.spruce.EnrichmentModule;
import org.apache.spark.sql.Row;

import java.util.Map;

import static com.digitalpebble.spruce.CURColumn.LINE_ITEM_PRODUCT_CODE;
import static com.digitalpebble.spruce.CURColumn.USAGE_AMOUNT;
import static com.digitalpebble.spruce.SpruceColumn.ENERGY_USED;

public class MyServiceEnergy implements EnrichmentModule {

    private double coefficient = 0.005; // kWh per usage unit

    @Override
    public void init(Map<String, Object> params) {
        Double value = (Double) params.get("coefficient");
        if (value != null) {
            coefficient = value;
        }
    }

    @Override
    public Column[] columnsNeeded() {
        return new Column[]{LINE_ITEM_PRODUCT_CODE, USAGE_AMOUNT};
    }

    @Override
    public Column[] columnsAdded() {
        return new Column[]{ENERGY_USED};
    }

    @Override
    public void enrich(Row row, Map<Column, Object> enrichedValues) {
        String productCode = LINE_ITEM_PRODUCT_CODE.getString(row);
        if (!"MyServiceCode".equals(productCode)) return;

        double amount = USAGE_AMOUNT.getDouble(row);
        enrichedValues.put(ENERGY_USED, amount * coefficient);
    }
}

`columnsNeeded()` and `columnsAdded()`

These declare the module’s dependencies and outputs. SPRUCE uses them to validate the schema at startup — it will fail fast if a required column is missing from the CUR report.

columnsNeeded() should list every column you read, whether from the original CUR row (CURColumn) or from the shared enrichment map (SpruceColumn).

`init()`

Called once per partition before any rows are processed. Use it to load configuration values passed via the JSON config, or to read resource files. The params map is an empty map (never null) when no config block is present in the JSON.

If your module’s behaviour depends on the active cloud provider (for instance, if it performs region-keyed lookups against a CSV that holds entries for several CSPs), override the provider-aware overload instead:

@Override
public void init(Map<String, Object> params, Provider provider) {
    this.provider = provider;
    init(params);
}

The default implementation delegates to the single-arg init(Map), so modules that don’t care about the provider don’t need to do anything.

`enrich()`

Called once per usage row (tax, discount and fee rows are filtered out by the pipeline). Two sources of data are available:

Reading CUR input columns — use the typed getters on CURColumn:

String productCode = LINE_ITEM_PRODUCT_CODE.getString(row);
double amount      = USAGE_AMOUNT.getDouble(row);
boolean missing    = USAGE_AMOUNT.isNullAt(row);

// Optional field — returns null instead of throwing if absent from schema:
String region = PRODUCT_REGION_CODE.getString(row, /* optional= */ true);

Reading values set by earlier modules — use the typed getters on SpruceColumn:

Double energyUsed = ENERGY_USED.getDouble(enrichedValues);
String region     = REGION.getString(enrichedValues);

Both getters return null when the value is absent, so always null-check before using them.

Writing results — put values into the shared map:

enrichedValues.put(ENERGY_USED, computedValue);

If a module only runs for certain rows, simply return early without putting anything into the map. Columns not written remain null in the output row.

Do not call row.fieldIndex() directly. Use CURColumn and SpruceColumn getters, which cache the field index per schema and avoid repeated lookups across rows in the same partition.

Register the module

Modules are registered in a JSON config file. Copy the per-provider default (e.g. default-config-aws.json) from the JAR (or from the repository) as a starting point, then add your module:

{
  "modules": [
    { "className": "com.digitalpebble.spruce.modules.RegionExtraction" },
    { "className": "com.example.spruce.modules.MyServiceEnergy",
      "config": { "coefficient": 0.007 } },
    { "className": "com.digitalpebble.spruce.modules.PUE" },
    { "className": "com.digitalpebble.spruce.modules.ember.AverageCarbonIntensity" },
    { "className": "com.digitalpebble.spruce.modules.OperationalEmissions" }
  ]
}

See instructions on Configure the modules.

Pass the config file to the Spark job with -c:

spark-submit --class com.digitalpebble.spruce.SparkJob ./target/spruce-*.jar \
  -i ./curs -o ./output -c ./my-config.json

Module order matters. A module can only read values from the enrichment map that were written by modules listed earlier in the chain. For example, PUE must run after any module that populates operational_energy_kwh, and OperationalEmissions must run last.

Include your module in the build

If your module lives outside the SPRUCE source tree, build it as a JAR and add it to the Spark job’s classpath:

spark-submit --class com.digitalpebble.spruce.SparkJob \
  --jars ./my-module.jar \
  ./target/spruce-*.jar \
  -i ./curs -o ./output -c ./my-config.json

If you are adding the module directly to the SPRUCE source tree, place it under src/main/java/com/digitalpebble/spruce/modules/ and run mvn package to include it in the fat JAR.

See Contribute to SPRUCE if you would like to share the module with the community.

Keyboard shortcuts

SPRUCE