Implementing a new Intervention Method

A new Intervention Method is implemented by subclassing the abstract base class InterventionMethod.

Examples:

EasyEditInterventionMethod
LMDebuggerIntervention (note that this implementation deviates quite a bit from the intended way to implement an Intervention Method)

What do I need to implement? (Types of Intervention Methods)

An Intervention Method can either be implemented as a Hook-based Intervention Method or a Model-Transform Intervention Method.

General

Each Method should set the attribute self.layers to a list containing the index of the layer that the Method mutates by default (e.g., self.layers = [0]). This makes it possible to index Intervention Methods in the Frontend and to cache a Model’s Weights so that Model-Transformations can be undone later.
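
The convention above can be sketched as follows. This is a minimal illustration, not the project's actual code: the stand-in InterventionMethod base class, its constructor signature, and the model_wrapper argument are assumptions made so the example is self-contained.

```python
from abc import ABC


class InterventionMethod(ABC):
    """Stand-in for the real base class (assumption, for illustration only)."""

    def __init__(self, model_wrapper):
        self.model_wrapper = model_wrapper
        self.layers = [0]  # assumed default: the method mutates layer 0


class ExampleInterventionMethod(InterventionMethod):
    def __init__(self, model_wrapper, layer=3):
        super().__init__(model_wrapper)
        # List holding the index of the layer this method mutates by default
        self.layers = [layer]


method = ExampleInterventionMethod(model_wrapper=None)
print(method.layers)
```

Keeping the value a list, even for a single layer, matches the self.layers = [0] form used in this document.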

Hook-based Intervention Method

A Hook-based Intervention Method uses Features of a Model. Interventions place Hooks on specific Modules in the computational Graph of the Transformer Model to directly mutate the activations of Features.

A Hook-based Intervention Method implements the following Methods:

setup_intervention_hook
- Installs a Hook, given as a Parameter
- Use Method TransformerModelWrapper.setup_hook to install Hooks
get_projections
- Projects Features to Vocab-Space. Returns a Feature’s projected Tokens and Logit Values

Model-Transform Intervention Method

A Model-Transform Intervention Method transforms the Model-Weights of the Transformer Model. One Intervention could execute an arbitrary Model-Editing-Algorithm once.

A Model-Transform Intervention Method implements the following Methods:

get_text_inputs
- Returns a dict that maps the names of Text-Inputs (keys) to their default values (values)
- This dict is sent to the Frontend, where the user populates it with inputs (e.g., prompt, subject, target)
transform_model
- Performs the Transformation of the Model’s Weights based on a given Intervention

Detailed Methods Explanation

get_name

By default, we use the name of this Subclass as the Name of this Intervention Method. By overriding this Method, a custom name can be set.
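
A short sketch of both behaviours, assuming a stand-in base class whose get_name returns the class name as described above (the real base class is not shown in this document, so its internals are an assumption):

```python
class InterventionMethod:
    """Stand-in base class (assumption); only get_name is modelled here."""

    def get_name(self):
        # Default behaviour described above: use the subclass name
        return type(self).__name__


class DefaultNamedMethod(InterventionMethod):
    pass


class CustomNamedMethod(InterventionMethod):
    def get_name(self):
        # Override to show a custom name
        return "My Custom Intervention"


print(DefaultNamedMethod().get_name())
print(CustomNamedMethod().get_name())
```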

get_text_inputs

Returns a dict of Text Inputs, which are used to define an Intervention. The defined Text Inputs show up in the UI.

The following dict defines three Text-Inputs with empty default values.

def get_text_inputs(self):
    return {
        "prompt": "",
        "subject": "",
        "target": ""
    }

Once set, an Intervention carries a text_inputs dict with the same keys, with the values filled in. Example Intervention:

{
    "layer": 5, 
    "name": "ExampleInterventionMethod", 
    "type": "intervention", 
    "min_layer": 0, 
    "max_layer": 47, 
    "changeable_layer": True, 
    "docstring": "This is a descriptive docstring", 
    "text_inputs": {
        "prompt": "{} is a",
        "subject": "Barack Obama",
        "target": "human"
    },
    "coeff": 1
}
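
To illustrate how the defaults from get_text_inputs relate to the filled-in text_inputs of an Intervention, here is a small sketch. The helper fill_text_inputs is hypothetical and not part of the project; it only shows the merge of user-entered values into the defaults.

```python
def fill_text_inputs(defaults, user_values):
    """Hypothetical helper: merge user-entered values into the defaults dict."""
    unknown = set(user_values) - set(defaults)
    if unknown:
        # Reject keys that get_text_inputs did not declare
        raise KeyError(f"Unknown text inputs: {sorted(unknown)}")
    # User values override defaults; untouched keys keep their default
    return {**defaults, **user_values}


defaults = {"prompt": "", "subject": "", "target": ""}
filled = fill_text_inputs(
    defaults,
    {"prompt": "{} is a", "subject": "Barack Obama"},
)
print(filled)
```

Keys the user leaves untouched keep their default value, so transform_model can rely on every declared key being present.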

transform_model

Transforms the Model according to a given Intervention. Example implementation:

def transform_model(self, intervention):
    # Skip disabled Interventions
    if intervention["coeff"] <= 0.0:
        return

    request = [{
        "prompt": intervention["text_inputs"]["prompt"],
        "subject": intervention["text_inputs"]["subject"],
        "target_new": intervention["text_inputs"]["target"]
    }]
    
    # This is the invoke-method of an EasyEdit-Method
    rv = self.invoke_method(
        self.model_wrapper.model,
        self.model_wrapper.tokenizer,
        request,
        self.ee_hparams,
        copy=False
    )

    if isinstance(rv, tuple):
        edited_model = rv[0]
    else:
        edited_model = rv

    self.model_wrapper.model = edited_model

setup_intervention_hook

Sets an Intervention Hook according to a given Intervention. Example implementation:

def setup_intervention_hook(self, intervention: dict, prompt: str):
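    # `autoencoder` is assumed to be available in the enclosing scope
    def hook_mlp_acts(module, input, output):
        # Encode the MLP output into Feature space
        f = autoencoder.forward_encoder(output)
        # Overwrite Feature 1234 at every batch and sequence position
        f[:, :, 1234] = 42
        # Decode back to activation space and return the result
        return autoencoder.forward_decoder(f)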

    self.model_wrapper.setup_hook(
        hook_mlp_acts,
        "model.layers.3.mlp"
    )

Hooks that have been set are cleared automatically after use, as long as permanent=False.
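
The hook lifetime described above can be pictured with a toy model. This is a sketch under stated assumptions, not the TransformerModelWrapper implementation: ToyHookRegistry is hypothetical, its setup_hook omits the module name, and treating permanent as a keyword argument of setup_hook is an assumption.

```python
class ToyHookRegistry:
    """Toy stand-in, not the real TransformerModelWrapper (assumption)."""

    def __init__(self):
        self._hooks = []  # list of (hook, permanent) pairs

    def setup_hook(self, hook, permanent=False):
        self._hooks.append((hook, permanent))

    def forward(self, value):
        # Apply each hook in order; a hook returns the replacement value
        for hook, _ in self._hooks:
            value = hook(value)
        # Drop non-permanent hooks once they have been used
        self._hooks = [(h, p) for h, p in self._hooks if p]
        return value


registry = ToyHookRegistry()
registry.setup_hook(lambda x: x + 1)                   # cleared after one use
registry.setup_hook(lambda x: x * 10, permanent=True)  # kept across calls

first = registry.forward(1)   # both hooks run: (1 + 1) * 10
second = registry.forward(1)  # only the permanent hook runs: 1 * 10
print(first, second)
```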

get_projections

The results of get_projections are shown in the side menu (ValueDetailsPanel) once a Feature of an Intervention is clicked. Example return value:

def get_projections(self, dim, *args, **kwargs):
    return {
        "dim": 1278,
        "layer": 42,
        "top_k": [{
            "logit": 1.23,
            "token": "_my"
        }, ...]
    }
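
As a rough sketch of how a result in this shape could be assembled, the hypothetical helper below selects the k highest logits from a plain list and pairs them with their tokens. A real implementation would obtain the logits by projecting the Feature through the Model; the inputs here are made-up illustration values.

```python
import heapq


def top_k_projection(logits, vocab, dim, layer, k=3):
    """Hypothetical helper: build a get_projections-style dict from raw logits."""
    # Indices of the k largest logits, highest first
    best = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    return {
        "dim": dim,
        "layer": layer,
        "top_k": [{"logit": logits[i], "token": vocab[i]} for i in best],
    }


result = top_k_projection(
    logits=[0.1, 1.23, -0.5, 0.9],
    vocab=["_a", "_my", "_the", "_of"],
    dim=1278,
    layer=42,
    k=2,
)
print(result)
```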

Additional info

The attributes min_layer and max_layer hold the first and last Layer of the Transformer to which an Intervention Method can be applied. By overriding the Method get_changeable_layer, we can set whether this Intervention Method’s Layer can be changed in the Frontend (within the range given by min_layer and max_layer).
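
A minimal sketch of these attributes in use. The stand-in base class, the boolean return type of get_changeable_layer (suggested by the "changeable_layer": True field in the example Intervention), and the clamp_layer helper are all assumptions for illustration.

```python
class InterventionMethod:
    """Stand-in base class (assumption); attribute values are illustrative."""

    min_layer = 0
    max_layer = 47

    def get_changeable_layer(self):
        return True


class FixedLayerMethod(InterventionMethod):
    def get_changeable_layer(self):
        # The layer of this method cannot be changed in the Frontend
        return False


def clamp_layer(method, requested_layer):
    """Hypothetical helper: keep a requested layer within the allowed range."""
    return max(method.min_layer, min(method.max_layer, requested_layer))


print(clamp_layer(InterventionMethod(), 99))
print(FixedLayerMethod().get_changeable_layer())
```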