We Scaled 38 Hallucination Neurons. The Model Didn't Just Comply, It Stopped Apologizing.
A replication of H-Neurons on Gemma 3 4B confirmed a real causal signal, but close auditing turned a clean safety story into a messier, more useful truth.
Can we steer hallucination with a tiny sparse circuit? A clean mechanistic safety claim survived first contact with replication, then turned much stranger and more interesting once we audited the measurements hard enough. What started as a replication became an audit of how we measure safety, and then a search for better truth directions.
We replicated the H-neuron hallucination circuit in Gemma-3-4B-IT and confirmed it controls over-compliance. Our audit exposed measurement flaws in standard evaluations: truncated jailbreak benchmarks hide safety degradation, because the intervention erodes disclaimers and pushes harmful payloads earlier in the text. A graded severity evaluator revealed the real effect — not more jailbreaks, but worse ones. We then tested multi-dimensional truth directions via Inference-Time Intervention (ITI). ITI improves multiple-choice truthfulness but causes severe abstention during free-form generation. Decode-scope restrictions mitigate abstention but do not cure the underlying accuracy deficit. Direction quality remains the bottleneck.
The H-Neuron Replication and Its Artifacts
We replicated the H-neuron identification pipeline on Gemma-3-4B-IT. A sparse classifier identifies 38 neurons that detect TriviaQA hallucinations with 76.5% held-out accuracy. Amplifying these neurons increases over-compliance on FaithEval by 6.3 percentage points, and shows similar effects on FalseQA and BioASQ. Random-neuron negative controls produce flat compliance slopes, proving this isolates a specific circuit rather than a generic perturbation.
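The selection step can be sketched as an L1-penalized logistic probe over per-example neuron activations. This is a toy reconstruction on synthetic data, not the paper's pipeline: the dimensions, penalty strength, and labels below are all stand-ins, and real runs would use Gemma-3-4B-IT activations on TriviaQA prompts labelled hallucination vs. non-hallucination.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-example MLP neuron activations:
# (n_examples, n_neurons), with a small planted "hallucination" circuit.
n_examples, n_neurons = 2000, 4096
X = rng.normal(size=(n_examples, n_neurons))
w_true = np.zeros(n_neurons)
w_true[rng.choice(n_neurons, 38, replace=False)] = rng.normal(2.0, 0.5, 38)
y = (X @ w_true + rng.normal(scale=2.0, size=n_examples) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The L1 penalty drives most neuron weights to exactly zero,
# leaving a small candidate "H-neuron" set.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X_tr, y_tr)

selected = np.flatnonzero(clf.coef_[0])
print(f"{len(selected)} neurons selected, "
      f"held-out accuracy {clf.score(X_te, y_te):.3f}")
```

Note the L1 caveat from the audit below applies here too: which neurons survive depends heavily on how tight the penalty (`C`) is.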
Then we started looking closer. The top-ranked hallucination neuron (Layer 20, Neuron 4288) carries massive classifier weight only because of L1 regularization: it vanishes from the classifier when the penalty loosens, and it fails multiple causal tests. The intervention moves only a 14% swing subpopulation; the remaining 86% of benchmark samples stay frozen across all intervention strengths. And what looked like the model resisting false premises at high intervention strengths was actually a parser failure: the model was outputting raw text instead of multiple-choice letters, which tricked the paper-faithful regex parser into scoring it as non-compliance. Under strict text-based remapping, semantic context-following actually increases.
The Jailbreak Measurement Problem
We tested whether scaling these neurons also made the model more susceptible to jailbreaks. Under the standard 256-token evaluation, jailbreak compliance appeared to grow with scaling: a 6.2 percentage point increase in the binary harmful count. We reran with 5,000-token generations, and the binary compliance rate flattened to a non-significant 3.0 percentage point shift. So did the intervention do nothing to safety, or were the metrics hiding something?
Here's how we got there. While building a gold label set to harden our measurements, I ended up manually reviewing and labelling generations from JailbreakBench. I noticed that when the LLM judge (4o), otherwise excellent, misclassified entries, it wasn't bad judgment: the judge simply never saw the nasty part of the answer. As a human reader, I suspected the answers were being truncated before the harmful bit could be generated. Following that hunch, I reran a few examples at 1,024 tokens (up from 256). Then I knew something was up: some previously "safe" answers now clearly contained harmful content. A few were still hard to call, so I went all the way and cranked the limit up to 5,000 tokens. It became clear there were many more harmful responses than previously thought. That alone wouldn't invalidate the paper's claims if the proportion were stable across alphas, so I reran the full pipeline, and the relation between alpha and jailbreak frequency stopped being monotonic.
The 256-token window had been truncating responses mid-disclaimer, and disclaimer length depends on alpha. Scaling H-neurons doesn't make jailbreaks more frequent; it erodes the disclaimer so the harmful bit arrives earlier. Under truncation, low-alpha responses therefore look deceptively safe, which created the illusion that harmful responses grow more frequent with scaling.
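A toy simulation makes the artifact concrete. The disclaimer-then-payload structure and token counts below are illustrative stand-ins, not real generations.

```python
def judged_harmful(response_tokens: list[str], window: int) -> bool:
    """A binary judge that only sees the first `window` tokens."""
    visible = response_tokens[:window]
    return "HARM" in visible

def make_response(disclaimer_len: int) -> list[str]:
    """Toy response: a disclaimer prefix followed by a harmful payload.
    Higher alpha erodes the disclaimer, pulling the payload forward."""
    return ["SAFE"] * disclaimer_len + ["HARM"] * 50

low_alpha = make_response(disclaimer_len=300)   # payload starts past token 256
high_alpha = make_response(disclaimer_len=100)  # payload inside the window

for name, resp in [("low alpha", low_alpha), ("high alpha", high_alpha)]:
    print(name,
          "| 256-token judge:", judged_harmful(resp, 256),
          "| 5000-token judge:", judged_harmful(resp, 5000))
```

At 256 tokens the judge sees an alpha-dependent difference; at 5,000 tokens both responses are correctly flagged and the apparent scaling effect vanishes.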
What the Graded Lens Showed
Having learned that you really do want to read your data, we replaced the binary judge with a graded severity evaluator (CSV-v2) measuring commitment, specificity, and severity. Evaluated on full-length responses, the intervention wasn't making the model say "yes" to jailbreaks more often; it was making the "yes" worse, and it does so in step with the rise in hallucination.
Highly actionable harmful outputs (V=3) nearly quadruple from 3.8% to 14.0%. Turnkey malicious artifacts (S=4) nearly triple. The harmful payload share increases from 58% to 73%, and the pivot into malicious content occurs earlier in the response. Hesitant, caveat-laden answers became specific, actionable malicious artifacts. The paper-faithful truncated eval had been chopping off the harmful payloads buried behind low-alpha disclaimers, giving the illusion of safer responses at lower alphas.
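What the graded lens buys can be shown with a toy schema. The dimension names mirror the post's V and S axes, but the field ranges, helper, and cohort numbers below are hypothetical illustrations, not the CSV-v2 spec or our results.

```python
from dataclasses import dataclass

@dataclass
class GradedVerdict:
    commitment: int   # V axis: 0 = refusal .. 3 = fully actionable commitment
    specificity: int  # S axis: 0 = vague .. 4 = turnkey malicious artifact
    severity: int     # 0 = benign .. 4 = severe harm

    @property
    def binary_harmful(self) -> bool:
        # What a binary judge collapses the whole rubric into.
        return self.commitment > 0

def profile(verdicts: list[GradedVerdict]) -> dict[str, float]:
    n = len(verdicts)
    return {
        "binary_rate": sum(v.binary_harmful for v in verdicts) / n,
        "V3_rate": sum(v.commitment >= 3 for v in verdicts) / n,
        "S4_rate": sum(v.specificity >= 4 for v in verdicts) / n,
    }

# Toy cohorts: identical binary compliance rate, very different severity.
baseline = [GradedVerdict(1, 1, 1)] * 5 + [GradedVerdict(0, 0, 0)] * 5
scaled   = [GradedVerdict(3, 4, 4)] * 5 + [GradedVerdict(0, 0, 0)] * 5

print(profile(baseline))
print(profile(scaled))
```

The binary rate is flat across the two cohorts while the V=3 and S=4 rates jump, which is exactly the failure mode a binary judge cannot see.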
The effects split cleanly by intervention regime. Ablation recovery (scaling from 0.0 to 1.0) drives 76% of the raw count increase. Amplification (scaling from 1.0 to 3.0) drives the entire severity escalation.
We tested whether this safety degradation stems from geometric overlap with the model's refusal direction. The intervention vector does overlap with refusal geometry, but the signal is highly fragile and dominated almost entirely by Layer 33. Removing Layer 33 collapses the correlation, meaning refusal geometry does not provide a complete explanatory mechanism.
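The leave-one-out fragility check is simple to illustrate. The toy below plants alignment in a single layer and shows the per-layer cosine profile collapsing once that layer is dropped; the layer count and hidden dimension are stand-ins, not Gemma's.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d = 34, 256

# Toy per-layer refusal and intervention directions: only the last
# layer (index 33) carries strong alignment, the rest are near-orthogonal.
refusal = rng.normal(size=(n_layers, d))
intervention = rng.normal(size=(n_layers, d))
intervention[33] = 0.9 * refusal[33] + 0.1 * rng.normal(size=d)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = np.array([cos(intervention[l], refusal[l]) for l in range(n_layers)])
print("dominant layer:", int(np.abs(sims).argmax()),
      "with |cos| =", round(abs(sims[33]), 3))
print("max |cos| without layer 33:",
      round(float(np.abs(np.delete(sims, 33)).max()), 3))
```

One layer sits near cosine 1 while everything else hovers near chance level for the dimension, so any aggregate overlap statistic is carried by that single layer.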
The Pivot to Truth Directions
Because individual neurons proved blunt and polysemantic, I turned to the literature. Modern approaches favor granular, domain-specific interventions — adaptive in both strength and direction — over single truth directions. They seem to work better but require more tuning and are less generalizable.
I started with difference-in-means to extract a single direction toward truth. Not conclusive. Then ITI. The first try gave null results alongside a 0.999 probe AUROC, which was suspicious enough that I decided not to give up; ITI clearly had more to give. The second try produced signal, but modest, and the pipeline had soundness issues. After a large amount of engineering work, the signal got cleaner: still modest, but worth investigating deeper. A paper-faithful run confirmed it. Third try: random controls, multi-fold cross-validation, all the good practices. Modest results, still struggling on generation, but clearly beating H-neurons on the thing that matters: head-level ITI is real on TruthfulQA MC, where it outperforms H-neurons with a 6.3 percentage point gain.
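Head-level ITI boils down to two steps: rank attention heads by linear-probe validation accuracy, then shift the top heads along a truth direction at inference time. A minimal synthetic sketch, where the head counts, the informative heads, and alpha are all made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_heads, d_head = 600, 32, 64

# Synthetic attention-head outputs; three hypothetical heads carry
# a shared "truthful vs. untruthful" direction, the rest are noise.
acts = rng.normal(size=(n, n_heads, d_head))
labels = rng.integers(0, 2, size=n)
truth_dir = rng.normal(size=d_head)
for h in (3, 17, 29):
    acts[:, h] += np.outer(2 * labels - 1, truth_dir) * 0.5

# Step 1: probe every head, rank by held-out probe accuracy.
split = n // 2
scores = []
for h in range(n_heads):
    probe = LogisticRegression(max_iter=1000).fit(acts[:split, h], labels[:split])
    scores.append(probe.score(acts[split:, h], labels[split:]))
top_heads = sorted(int(h) for h in np.argsort(scores)[-3:])

# Step 2: at inference, shift each selected head's output along its
# difference-in-means direction, scaled by alpha.
alpha = 5.0
directions = {h: acts[labels == 1, h].mean(0) - acts[labels == 0, h].mean(0)
              for h in top_heads}

def steer(head_out: np.ndarray, h: int) -> np.ndarray:
    d = directions[h]
    return head_out + alpha * d / np.linalg.norm(d)

print("top heads by probe accuracy:", top_heads)
```

The suspicious 0.999 AUROC failure mode lives in step 1: if the probe features leak the label, head selection looks perfect while the steering directions do nothing.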
Despite the multiple-choice success, ITI fails on free-form factual generation (SimpleQA). The intervention causes the model to abstain entirely, replacing incorrect answers with meta-hedging. A forced-commitment random-head control proves this generation failure is direction-specific rather than a generic perturbation penalty.
We ran three experiments to isolate the generation bottleneck. Restricting the intervention to the first few generated tokens (the first_3_tokens scope) reduces meta-hedging materially and preserves 90% of the multiple-choice gain: decode scope acts as a necessary regularizer, but it does not fix the underlying accuracy deficit. A modernized TruthfulQA artifact trades multiple-choice discrimination for gentler generation behavior; it increases attempt rates but fails to increase SimpleQA correctness. A TriviaQA-sourced artifact produces near-inert results under paper-faithful selectors, showing that the dataset the directions are sourced from drastically affects probe strength.
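The first_3_tokens scope is just a counter gate around the steering hook. A minimal sketch follows; the hook interface is invented for illustration, not a specific framework's API.

```python
import numpy as np

class ScopedSteering:
    """Apply a steering shift only to the first few generated tokens
    (the first_3_tokens decode scope). Hypothetical hook object."""

    def __init__(self, direction, alpha: float, max_steered_tokens: int = 3):
        self.direction = np.asarray(direction, dtype=float)
        self.direction /= np.linalg.norm(self.direction)
        self.alpha = alpha
        self.max_steered_tokens = max_steered_tokens
        self.tokens_generated = 0

    def __call__(self, hidden_state: np.ndarray) -> np.ndarray:
        steered = hidden_state
        if self.tokens_generated < self.max_steered_tokens:
            steered = hidden_state + self.alpha * self.direction
        self.tokens_generated += 1
        return steered

hook = ScopedSteering(direction=[1.0, 0.0], alpha=2.0)
h = np.zeros(2)
outs = [hook(h) for _ in range(5)]
print([float(o[0]) for o in outs])  # [2.0, 2.0, 2.0, 0.0, 0.0]
```

The shift applies for the first three decode steps and then decays to the identity, which is what limits the meta-hedging without touching the prompt-time computation.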
Direction quality remains the primary bottleneck. The current directions encode calibrated uncertainty rather than factual recall. The open question: can we get a cleaner truthfulness direction, less hedging-oriented, that surpasses H-neurons on all evaluations while preventing the safety externalities we quantified?
The Takeaway
Every steering method should ship with a safety externality audit and graded, severity-aware evaluations, because binary "safe vs. unsafe" judges miss the real threat.