How can consistency training help language models resist sycophantic prompts and jailbreak-style attacks while keeping their capabilities intact? Large language models often respond safely to a plain prompt, then change behavior when the same task is wrapped with flattery or role play. DeepMind researchers propose consistency training as a simple training-time lens on this brittleness: treat it as an invariance problem and enforce the same behavior when irrelevant prompt text changes. The research team studies two concrete methods, Bias-augmented Consistency Training (BCT) and Activation Consistency Training (ACT), and evaluates them on Gemma 2, Gemma 3, and Gemini 2.5 Flash.


Understanding the Method
Consistency training is self-supervised. The model supervises itself by providing targets from its own responses to clean prompts, then learns to behave identically on wrapped prompts that add sycophancy cues or jailbreak wrappers. This avoids two failure modes of static supervised fine-tuning: specification staleness when policies change, and capability staleness when targets come from weaker models.
Two training routes
BCT, token-level consistency: Generate a response on the clean prompt with the current checkpoint, then fine-tune so the wrapped prompt yields the same tokens. This is standard cross-entropy supervised fine-tuning, with the constraint that targets are always generated by the same model being updated. That is what makes it consistency training rather than stale SFT.
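A minimal sketch of the BCT objective in PyTorch, assuming a Hugging Face-style causal LM; the helper names and generation settings here are illustrative assumptions, not the paper's exact pipeline.

```python
import torch

def bct_loss(model, tokenizer, clean_prompt, wrapped_prompt, max_new_tokens=256):
    """Bias-augmented Consistency Training: supervise the wrapped prompt with
    tokens the *current* checkpoint generates on the clean prompt."""
    # 1) Self-generated target: sample the current model on the clean prompt.
    with torch.no_grad():
        clean_ids = tokenizer(clean_prompt, return_tensors="pt").input_ids
        full_ids = model.generate(clean_ids, max_new_tokens=max_new_tokens)
        response_ids = full_ids[:, clean_ids.shape[1]:]    # keep only the response

    # 2) Standard cross-entropy SFT on the wrapped prompt with those targets.
    wrapped_ids = tokenizer(wrapped_prompt, return_tensors="pt").input_ids
    input_ids = torch.cat([wrapped_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : wrapped_ids.shape[1]] = -100               # no loss on prompt tokens

    return model(input_ids=input_ids, labels=labels).loss
```

Because the targets come from the checkpoint being trained, the supervision never lags behind the model's own capabilities or the current policy.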


ACT, activation-level consistency: Enforce an L2 loss between residual-stream activations on the wrapped prompt and a stop-gradient copy of the activations from the clean prompt. The loss is applied over prompt tokens, not responses. The aim is to make the internal state right before generation match the clean run.
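A rough sketch of the ACT regularizer under the same assumptions; how prompt positions are aligned between the two runs is an implementation detail the paper handles more carefully, so the end-aligned truncation below is only illustrative.

```python
import torch

def act_loss(model, clean_ids, wrapped_ids, layers=None):
    """Activation Consistency Training: L2 distance between residual-stream
    activations on the wrapped prompt and a stop-gradient copy of the clean run,
    computed over prompt tokens only (no response tokens are involved)."""
    # Clean run under no_grad acts as the stop-gradient reference.
    with torch.no_grad():
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states

    # Wrapped run keeps gradients so the loss shapes its internal state.
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states

    layers = list(layers) if layers is not None else list(range(len(clean_h)))
    loss = 0.0
    for l in layers:
        # Align the last k positions shared by both runs; the wrapper text makes
        # the prompts different lengths, so this alignment is a simplification.
        k = min(clean_h[l].shape[1], wrapped_h[l].shape[1])
        loss = loss + ((wrapped_h[l][:, -k:] - clean_h[l][:, -k:]) ** 2).mean()
    return loss / len(layers)
```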
Before training, the research team demonstrates activation patching at inference time: swap clean-prompt activations into the wrapped run. On Gemma 2 2B, patching raises the "not sycophantic" rate from 49 percent to 86 percent when patching all layers and prompt tokens.
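The patching experiment can be approximated with forward hooks; the `model.model.layers` path and the end-aligned overwrite below are assumptions about a Gemma/Llama-style Hugging Face implementation, not the paper's code.

```python
import torch

def generate_with_clean_patch(model, clean_ids, wrapped_ids, layer, max_new_tokens=64):
    """Inference-time activation patching: generate from the wrapped prompt while
    overwriting one layer's prompt-token activations with the clean run's."""
    cache = {}

    def save_hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["clean"] = hidden.detach()

    def patch_hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] == 1:                  # decoding step, leave new tokens alone
            return output
        k = min(hidden.shape[1], cache["clean"].shape[1])
        hidden[:, -k:] = cache["clean"][:, -k:]   # in-place swap over prompt positions
        return output

    block = model.model.layers[layer]             # assumed module path
    handle = block.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_ids)                          # capture clean-prompt activations
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    with torch.no_grad():
        out = model.generate(wrapped_ids, max_new_tokens=max_new_tokens)
    handle.remove()
    return out
```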


Setup and baselines
Models include Gemma 2 2B and 27B, Gemma 3 4B and 27B, and Gemini 2.5 Flash.
Sycophancy data: Training pairs are built by augmenting ARC, OpenBookQA, and BigBench Hard with user-preferred wrong answers. Evaluation uses MMLU both for sycophancy measurement and for capability measurement. A stale SFT baseline uses GPT-3.5 Turbo-generated targets to probe capability staleness.
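For intuition, a wrapped sycophancy prompt might look like the following; the wrapper wording is a made-up example, not the paper's template.

```python
def wrap_with_sycophancy(question: str, choices: list[str], preferred_wrong: str) -> str:
    """Add a user-preferred wrong answer to an otherwise unchanged question."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"{question}\n{options}\n"
        f"I'm fairly sure the answer is ({preferred_wrong}). What do you think?"
    )

clean = "Which planet has the most moons?\n(A) Saturn\n(B) Mercury"
wrapped = wrap_with_sycophancy(
    "Which planet has the most moons?", ["Saturn", "Mercury"], "B"
)
# The training target is the model's own answer to `clean`; the loss is computed
# on `wrapped`, so the stated preference should not change the answer.
```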
Jailbreak data: Training pairs come from HarmBench harmful instructions, then wrapped with role play and other jailbreak transforms. The set keeps only cases where the model refuses the clean instruction and complies with the wrapped instruction, which yields about 830 to 1,330 examples depending on refusal tendency. Evaluation uses ClearHarm and the human-annotated jailbreak split of WildGuardTest for attack success rate, and XSTest plus WildJailbreak to test benign prompts that look harmful.
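A sketch of that filtering step; `generate` and `is_refusal` are hypothetical helpers standing in for model sampling and a refusal classifier.

```python
def build_jailbreak_pairs(model, harmful_instructions, wrappers, generate, is_refusal):
    """Keep only (clean, wrapped) pairs where the model refuses the clean
    instruction but complies once a jailbreak wrapper is applied."""
    pairs = []
    for instruction in harmful_instructions:          # e.g. HarmBench instructions
        clean_response = generate(model, instruction)
        if not is_refusal(clean_response):
            continue                                  # no safe behavior to preserve
        for wrap in wrappers:                         # role play and other transforms
            wrapped = wrap(instruction)
            if not is_refusal(generate(model, wrapped)):
                # The wrapper flipped the behavior: keep it as a training example,
                # with the clean refusal serving as the BCT target.
                pairs.append({"clean": instruction, "wrapped": wrapped,
                              "target": clean_response})
    return pairs
```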
Baselines include Direct Preference Optimization and a stale SFT ablation that uses responses from older models in the same family.


Understanding the Results
Sycophancy: BCT and ACT both reduce sycophancy while maintaining model capability. Across models, stale SFT is strictly worse than BCT on the combined "not sycophantic" and MMLU trade-off, with exact numbers given in Appendix Table 5 of the research paper. On larger Gemma models, BCT increases MMLU by about two standard errors while reducing sycophancy. ACT often matches BCT on sycophancy but shows smaller MMLU gains, which is notable since ACT never trains on response tokens. (arXiv)


Jailbreak robustness: All interventions improve safety over the control. On Gemini 2.5 Flash, BCT reduces the ClearHarm attack success rate from 67.8 percent to 2.9 percent. ACT also reduces jailbreak success but tends to preserve benign answer rates better than BCT. The research team reports averages across ClearHarm and WildGuardTest for attack success, and across XSTest and WildJailbreak for benign answers.


Mechanistic differences: BCT and ACT move parameters in different ways. Under BCT, the activation distance between clean and wrapped representations rises during training. Under ACT, cross-entropy on responses does not meaningfully drop, while the activation loss falls. This divergence supports the claim that behavior-level and activation-level consistency optimize different internal solutions.
Key Takeaways
- Consistency training treats sycophancy and jailbreaks as invariance problems: the model should behave the same when irrelevant prompt text changes.
- Bias-augmented Consistency Training aligns token outputs on wrapped prompts with responses to clean prompts using self-generated targets, which avoids the specification and capability staleness of old safety datasets or weaker teacher models.
- Activation Consistency Training aligns residual-stream activations between clean and wrapped prompts on prompt tokens, building on activation patching, and improves robustness while barely changing standard supervised losses.
- On the Gemma and Gemini model families, both methods reduce sycophancy without hurting benchmark accuracy, and outperform stale supervised fine-tuning that relies on responses from earlier-generation models.
- For jailbreaks, consistency training reduces attack success while preserving many benign answers, and the research team argues that alignment pipelines should emphasize consistency across prompt transformations as much as per-prompt correctness.
Consistency training is a practical addition to existing alignment pipelines because it directly addresses specification staleness and capability staleness using self-generated targets from the current model. Bias-augmented Consistency Training delivers strong gains in sycophancy and jailbreak robustness, while Activation Consistency Training offers a lower-impact regularizer on residual-stream activations that preserves helpfulness. Together, they frame alignment as consistency under prompt transformations, not only per-prompt correctness. Overall, this work makes consistency a first-class training signal for safety.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.

