# Implementing a new Intervention Method

A new Intervention Method is implemented by creating a subclass of the Abstract Base Class `InterventionMethod`.
Examples:

- `EasyEditInterventionMethod`
- `LMDebuggerIntervention` (note that this implementation deviates quite a bit from the intended way to implement an Intervention Method)
## What do I need to implement? (Types of Intervention Methods)

An Intervention Method can be implemented either as a Hook-based Intervention Method or as a Model-Transform Intervention Method.
### General

Each Method should set the attribute `self.layers = [0]` to the layer index the Method mutates by default. This allows the Frontend to index Intervention Methods and also makes it possible to cache a Model's Weights in order to later undo Model-Transformations.
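As a minimal sketch of this convention (the real `InterventionMethod` ABC lives in the project; the stand-in base class below only mirrors the attribute described above), a subclass could set it like this:

```python
class InterventionMethod:
    """Stand-in for the project's Abstract Base Class; only the
    `layers` attribute from the convention above is mirrored here."""
    def __init__(self):
        self.layers = []


class MyInterventionMethod(InterventionMethod):
    def __init__(self):
        super().__init__()
        # Layer index this Method mutates by default; lets the Frontend
        # index the Method and lets the backend cache/restore weights.
        self.layers = [0]
```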
### Hook-based Intervention Method

A Hook-based Intervention Method operates on Features of a Model. Interventions place Hooks on specific Modules in the computational Graph of the Transformer Model to directly mutate the activations of Features.

A Hook-based Intervention Method implements the following Methods:

- `setup_intervention_hook`: Installs a Hook for a given Intervention. Use the Method `TransformerModelWrapper.setup_hook` to install Hooks.
- `get_projections`: Projects Features to Vocab-Space. Returns a Feature's projected Tokens and Logit Values.
### Model-Transform Intervention Method

A Model-Transform Intervention Method transforms the Model-Weights of the Transformer Model. One Intervention could, for example, execute an arbitrary Model-Editing-Algorithm once.

A Model-Transform Intervention Method implements the following Methods:

- `get_text_inputs`: Returns a dict that maps the Names of Text-Inputs (Keys) to default values (Values). This dict is sent to the Frontend, where the user populates it with inputs (e.g., prompt, subject, target).
- `transform_model`: Performs the Transformation of the Model's Weights based on a given Intervention.
## Detailed Methods Explanation

### `get_name`

By default, the name of the Subclass is used as the Name of the Intervention Method. By overriding this Method, a custom name can be set.
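A sketch of the override pattern (the stand-in base class below mimics the default behavior described above; `RomeEdit` is a made-up example subclass, not part of the project):

```python
class InterventionMethod:
    """Stand-in: by default the subclass name is used as the name."""
    def get_name(self):
        return type(self).__name__


class EasyEditInterventionMethod(InterventionMethod):
    # No override: get_name() falls back to the class name.
    pass


class RomeEdit(InterventionMethod):
    def get_name(self):
        # Override to show a custom display name in the Frontend
        return "ROME (Rank-One Model Editing)"
```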
### `get_text_inputs`

Returns a dict of Text Inputs, which are used to define an Intervention. The defined Text Inputs show up in the UI.

The following dict defines three Text-Inputs with empty default values:
```python
def get_text_inputs(self):
    return {
        "prompt": "",
        "subject": "",
        "target": ""
    }
```
A configured Intervention will have the same structure with filled-out Dict-Values. Example Intervention:
```python
{
    "layer": 5,
    "name": "ExampleInterventionMethod",
    "type": "intervention",
    "min_layer": 0,
    "max_layer": 47,
    "changeable_layer": True,
    "docstring": "This is a descriptive docstring",
    "text_inputs": {
        "prompt": "{} is a",
        "subject": "Barack Obama",
        "target": "human"
    },
    "coeff": 1
}
```
### `transform_model`

Transforms the Model according to a given Intervention. Example implementation:
```python
def transform_model(self, intervention):
    # Skip disabled Interventions
    if intervention["coeff"] <= 0.0:
        return
    request = [{
        "prompt": intervention["text_inputs"]["prompt"],
        "subject": intervention["text_inputs"]["subject"],
        "target_new": intervention["text_inputs"]["target"]
    }]
    # This is the invoke-method of an EasyEdit-Method
    rv = self.invoke_method(
        self.model_wrapper.model,
        self.model_wrapper.tokenizer,
        request,
        self.ee_hparams,
        copy=False
    )
    # Some EasyEdit-Methods return a tuple (edited_model, weights_copy)
    if isinstance(rv, tuple):
        edited_model = rv[0]
    else:
        edited_model = rv
    self.model_wrapper.model = edited_model
```
### `setup_intervention_hook`

Sets an Intervention Hook according to a given Intervention. Example implementation:
```python
def setup_intervention_hook(self, intervention: dict, prompt: str):
    def hook_mlp_acts(module, input, output):
        # `autoencoder` is assumed to be a (sparse) autoencoder
        # available in the enclosing scope
        activation_vector = output
        f = autoencoder.forward_encoder(activation_vector)
        f[:, :, 1234] = 42  # clamp one Feature to a fixed value
        x_hat = autoencoder.forward_decoder(f)
        return x_hat

    self.model_wrapper.setup_hook(
        hook_mlp_acts,
        "model.layers.3.mlp"
    )
```
Installed Hooks are automatically cleared after usage (as long as `permanent=False`).
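This lifecycle can be illustrated with a toy stand-in (this is not the real `TransformerModelWrapper`; it only models the `permanent` semantics described above):

```python
class HookRegistry:
    """Toy stand-in: hooks installed with permanent=False are
    dropped after one forward pass; permanent hooks survive."""
    def __init__(self):
        self.hooks = []  # list of (hook_fn, module_name, permanent)

    def setup_hook(self, hook_fn, module_name, permanent=False):
        self.hooks.append((hook_fn, module_name, permanent))

    def run_forward(self, x):
        # Apply every hook to the (toy) activation, then clear
        # all non-permanent hooks.
        for hook_fn, _, _ in self.hooks:
            x = hook_fn(None, None, x)
        self.hooks = [h for h in self.hooks if h[2]]
        return x
```

For example, a hook registered without `permanent=True` runs once and is gone afterwards, matching the behavior described above.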
### `get_projections`

The results of `get_projections` are shown in the side menu (`ValueDetailsPanel`) once a Feature of an Intervention is clicked. Example:
```python
def get_projections(self, dim, *args, **kwargs):
    return {
        "dim": 1278,
        "layer": 42,
        "top_k": [{
            "logit": 1.23,
            "token": "_my"
        }, ...]
    }
```
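How such a projection could be computed is sketched below with toy values (this logit-lens-style helper and its data are illustrative assumptions, not the project's API):

```python
def project_to_vocab(feature_direction, unembedding, vocab, top_k=3):
    # Dot the feature direction with each token's unembedding row
    # to get a logit per vocabulary token, then keep the top-k.
    logits = [
        sum(f * w for f, w in zip(feature_direction, row))
        for row in unembedding
    ]
    ranked = sorted(zip(logits, vocab), reverse=True)[:top_k]
    return [{"logit": logit, "token": token} for logit, token in ranked]
```

Usage with a toy 2-dimensional feature space and a 3-token vocabulary:

```python
vocab = ["_a", "_b", "_c"]
unembedding = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
project_to_vocab([1.0, 1.0], unembedding, vocab, top_k=2)
```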
## Additional info

The attributes `min_layer` and `max_layer` contain the first and last Layer of the Transformer that an Intervention Method can be applied to. By overriding the Method `get_changeable_layer`, we can control whether this Intervention Method's Layer can be changed (within `min_layer` and `max_layer`) in the Frontend.