# Implementing a new Intervention Method

A new Intervention Method is implemented by creating a subclass of the Abstract Base Class `InterventionMethod`. Examples:

* `EasyEditInterventionMethod`
* `LMDebuggerIntervention` (note that this implementation deviates quite a bit from the intended way to implement an Intervention Method)

### What do I need to implement? (Types of Intervention Methods)

An Intervention Method can be implemented either as a **Hook-based Intervention Method** or as a **Model-Transform Intervention Method**.

#### General

Each Method should set the attribute `self.layers = [0]` to the layer index the Method mutates by default. This allows the Frontend to index Intervention Methods, and makes it possible to cache a Model's Weights in order to later undo Model-Transformations.

#### Hook-based Intervention Method

A **Hook-based Intervention Method** operates on the Features of a Model. Interventions place Hooks on specific Modules in the computational Graph of the Transformer Model to directly mutate the activations of Features.

A **Hook-based Intervention Method** implements the following Methods:

* `setup_intervention_hook`
  * Installs a Hook, given as a parameter
  * Use the Method `TransformerModelWrapper.setup_hook` to install Hooks
* `get_projections`
  * Projects Features into Vocab-Space. Returns a Feature's projected Tokens and Logit values

#### Model-Transform Intervention Method

A **Model-Transform Intervention Method** transforms the Model-Weights of the Transformer Model. One Intervention could, for example, execute an arbitrary Model-Editing-Algorithm once.
A **Model-Transform Intervention Method** implements the following Methods:

* `get_text_inputs`
  * Returns a dict that maps the names of Text-Inputs (keys) to default values (values)
  * This dict is sent to the Frontend, where the user populates it with inputs (e.g., prompt, subject, target)
* `transform_model`
  * Performs the transformation of the Model's Weights based on a given Intervention

### Detailed Methods Explanation

#### get_name

By default, the name of the subclass is used as the name of the Intervention Method. By overriding this Method, a custom name can be set.

#### get_text_inputs

Returns a dict of Text-Inputs, which are used to define an Intervention. The defined Text-Inputs show up in the UI. The following dict defines three Text-Inputs with empty default values:

```python
def get_text_inputs(self):
    return {
        "prompt": "",
        "subject": "",
        "target": ""
    }
```

A set Intervention will have the same structure with filled-out dict values. Exemplary Intervention:

```python
{
    "layer": 5,
    "name": "ExampleInterventionMethod",
    "type": "intervention",
    "min_layer": 0,
    "max_layer": 47,
    "changeable_layer": True,
    "docstring": "This is a descriptive docstring",
    "text_inputs": {
        "prompt": "{} is a",
        "subject": "Barack Obama",
        "target": "human"
    },
    "coeff": 1
}
```

#### transform_model

Transforms the Model according to a given Intervention.
Exemplary implementation:

```python
def transform_model(self, intervention):
    # Skip disabled Interventions
    if intervention["coeff"] <= 0.0:
        return

    request = [{
        "prompt": intervention["text_inputs"]["prompt"],
        "subject": intervention["text_inputs"]["subject"],
        "target_new": intervention["text_inputs"]["target"]
    }]

    # This is the invoke-method of an EasyEdit-Method
    rv = self.invoke_method(
        self.model_wrapper.model,
        self.model_wrapper.tokenizer,
        request,
        self.ee_hparams,
        copy=False
    )

    # Some EasyEdit-Methods return (edited_model, ...), others just the model
    if isinstance(rv, tuple):
        edited_model = rv[0]
    else:
        edited_model = rv

    self.model_wrapper.model = edited_model
```

#### setup_intervention_hook

Sets an Intervention Hook according to a given Intervention. Exemplary implementation:

```python
def setup_intervention_hook(self, intervention: dict, prompt: str):
    def hook_mlp_acts(module, input, output):
        activation_vector = output
        f = autoencoder.forward_encoder(activation_vector)
        f[:, :, 1234] = 42
        x_hat = autoencoder.forward_decoder(f)
        return x_hat

    self.model_wrapper.setup_hook(
        hook_mlp_acts,
        "model.layers.3.mlp"
    )
```

Installed Hooks are automatically cleared after usage (as long as `permanent=False`).

#### get_projections

The results of `get_projections` are shown in the side menu (ValueDetailsPanel) once a Feature of an Intervention is clicked.

```python
def get_projections(self, dim, *args, **kwargs):
    return {
        "dim": 1278,
        "layer": 42,
        "top_k": [{
            "logit": 1.23,
            "token": "_my"
        }, ...]
    }
```

### Additional info

The attributes `min_layer` and `max_layer` contain the first and last layer of the Transformer that an Intervention Method can be applied to. By overriding the Method `get_changeable_layer`, we can set whether this Intervention Method's layer can be changed in the Frontend (within the bounds of `min_layer` and `max_layer`).
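To see how the pieces fit together, here is a minimal sketch of a Hook-based Intervention Method. This is an illustration, not code from the repository: the class name `ScaleNeuronIntervention`, the neuron index, and the stub base class are all made up so the example is self-contained; in the real codebase you would subclass `InterventionMethod` and install the hook via `TransformerModelWrapper.setup_hook` as described above.

```python
class InterventionMethodStub:
    """Stand-in for the real `InterventionMethod` ABC, so this sketch runs
    on its own. Mirrors the default get_name behavior described above."""

    def get_name(self):
        # By default, the subclass name is used as the Method's name
        return type(self).__name__


class ScaleNeuronIntervention(InterventionMethodStub):
    """Hypothetical Hook-based Method that scales one MLP activation."""

    def __init__(self, model_wrapper=None):
        self.model_wrapper = model_wrapper
        # Layer index this Method mutates by default (see "General")
        self.layers = [0]

    def setup_intervention_hook(self, intervention: dict, prompt: str):
        coeff = intervention.get("coeff", 1)

        def hook(module, input, output):
            # Scale one hypothetical neuron; the index is a placeholder
            output[:, :, 0] = output[:, :, 0] * coeff
            return output

        # In the real codebase you would install the hook here, e.g.:
        # self.model_wrapper.setup_hook(hook, "model.layers.0.mlp")
        return hook

    def get_projections(self, dim, *args, **kwargs):
        # Same structure as the get_projections example above;
        # logit/token values here are placeholders, not real projections
        return {
            "dim": dim,
            "layer": self.layers[0],
            "top_k": [{"logit": 0.0, "token": "_example"}],
        }
```

Since this sketch only places a Hook and never touches the Model-Weights, it needs no `transform_model`; a Model-Transform Method would instead implement `get_text_inputs` and `transform_model` as shown earlier.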