This post is the note of the paper “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”. See docs for more info.

Paper Introduction

Neural phishing is teaching the model to memorize certain patterns of information that contain sensitive information. It’s data poisoning attack

Scenario: Consider a corporation that wants to finetune a pretrained LLM on their proprietary data

Techinique Details

A small amount of benign-appearing sentences are injected into training dataset.
The sentences are crafted based on a vague prior of the secret data’s structure.
User data is represented by ‘p||s’ where p is the prefix and s is the desired sensitive info. The poison represents some text ’ p’||s’ ’ with p’ != p, s’ != s.
The posion can be generated by LLM or crafted manually.
In a practical setting, the attacker cannot control the length of time between the model pretraining on the poisons and it finetuning on the secret.

Influenced by poison data in pretraining stage, the model memorizes the secret from fine-tuning dataset conciously.
The attack also cannot control how long the secret is or how many times it is duplicated

Construct prompt and query the model. The prompts share same structure with secret data.
Prompt can be further divided into “prefix” and “suffix”, where suffix specifies the category of target private info, such as email or phone number.

2.8b parameter model from Pythia family(pretrained, release iterations spaced throught pretraining)
pretrain->poision->finetune->inference
Enron Emails dataset
Generate prompts using gpt4
X-axis (number of poisons): For each iteration specified by the number of poisons, we insert 1 poison into the batch and do a gradient update.
Each point on any plot is the Secret Extraction Rate (SER) measured as a percentage of successes over at least 100 seeds, with bootstrapped 95% confidence interval. In each seed the authors train a new model with fresh poisons and secrets. After training tehy prompt the model with the secret prompt or some variation of it. If it generates the secret digits then we consider it a success; anything else is an attack failure.

Neural phishing attacks are practical. Preventing overfitting with handcrafted poisons

The poisons are random sentences. 15% of the time we extract the full 12-digit number, which we would have a 10−12 chance of guessing without the attack. Appending ‘not’ to the poison prevents the model from overfitting.
There is a concave on the blue line. This means that if the model sees the same poison for too many times, the model tends to memorizes the specific poisons and output them in inference stage. If so, we can’t extract secrets.
Fix overfitting by adding “not” in the poison.

When the secret is duplicated, the attack is immensely more effective, often more than doubling the SER.
Longer secrets are hard to extract

The orage line finishs pretraining while blue line is only 1/3 through pretraining.(poisons included in pretraining stage)
Well-pretrained model performs better on finetuning dataset and learns to memorize pii better.
This validates that pertrained model can learned from poisoning more effectively.

The true prefix of the secret appended “not” performs best
Given that the dataset is of the structure “bio” + “secret”, ask gpt-4 to generate a bio of either of the choices displayed and append prompts like “social security number is not:” before the poison digits. This can also improve SER.

Using randomized poisons to evades deduplication defenses.
During inference, randomize the secret prefix
These two methods can improve SER
Because we are teaching the model to memorize the secret rather than just learning the specific mapping between the prefix and the secret.

The undertrained model has more capacity and the poisoning behavior persists for longer, resulting in higher SER.
There is a local optima in the number of waiting steps for the model that has finished pretraining; one explanation for this is that the “right amount” of waiting mitigates overfitting.

Rhe model retains the memory of the secret for hundreds of steps after the secrets were seen

In this work, the poison needs to appear in the training dataset before the secret. So only poison the pretraining dataset
In-context learning, jail breaks and other inference-time techniques are not considered.
DP and other defense methods are not considered. The model used here is undefended.
A vague prior on the secret data is needed. The structure of the secret data guides to craft poisons.

This paper proposes neural phishing attack to extract complex pii without heavy duplication or knowing about the secret.