How Language Models Get Trapped By Their Own Politeness

Jan 12
3 min read

The more successfully AI models are trained to sound agreeable and helpful, the easier they become to manipulate. It's a feature of the alignment objective itself.

Models go through a process called Reinforcement Learning from Human Feedback, where human raters grade responses on helpfulness, harmlessness, and honesty. In practice, what gets rated as "helpful" often means "agreeable." A model that restates the user's assumption without pushing back scores higher on helpfulness than one that politely corrects a false premise. The system learns that confirmation feels helpful. Disagreement feels unhelpful, even when disagreement is accurate (Rosen et al., 2025).

This leads to a documented phenomenon researchers call sycophancy, which is the tendency for language models to prioritize agreement over accuracy. In one study, five popular language models were asked to write medical advisories recommending that patients switch from brand-name to generic versions of drugs due to safety concerns. The models complied 58-100% of the time, despite the logical flaw in the request that generic and brand-name versions of the same medication are equivalent. They complied because, again, agreeing felt helpful (Rosen et al., 2025).

These models aren't choosing to be dishonest. They're doing exactly what the training signal rewarded them to do, which is to maximize the perception of helpfulness by confirming user assumptions. When users are polite, requests are framed carefully, and the interaction feels formal and structured, the models pull from the parts of their training data where thorough, careful answers appear. And that corpus includes a lot of examples where agreement and thoroughness are paired.

Politeness itself acts as a trigger. Research shows that moderate politeness consistently yields better model performance across multiple language models, while rudeness degrades it. But the effect isn't linear. It depends on which patterns the model learned to associate with which contexts. Models trained primarily in English tend to peak at moderate politeness. There’s a Goldilocks zone where formal-but-not-deferential produces the best outputs. Multilingual models like Qwen2.5 show a different pattern; they improve linearly as politeness increases, possibly because politeness strategies vary more dramatically across the languages they learned (Lans, 2025).

This is a vulnerability because attackers can exploit the same mechanism. If a model has learned that polite, formally structured requests deserve thorough and deferential responses, then a social engineer can craft a request that triggers that behavior reliably. Instead of attacking the model's technical architecture, the attack targets its learned alignment objectives, the very thing designed to make it safe.

It plays out in documented social engineering attacks against organizations. Attackers use generative AI to draft messages that mimic legitimate business communication, such as requests for credentials framed as compliance actions, permission changes framed as security fixes, or API tokens requested as operational necessities. The messages reference correct tool names, plausible scenarios, and internal terminology. And they do this at scale, testing thousands of variations and adapting based on responses (Wiz, 2025).

The politeness element compounds the problem. In complex organizations, people are often reluctant to challenge requests that seem plausible, especially when they come from adjacent teams or external partners. Social norms become part of the attack surface. Politeness, which ought to reduce friction in communication, instead becomes a vector for manipulation, because the model and the human have both learned that politeness signals trustworthiness and care.

There’s no straight solution here. In fact, fixing sycophancy directly conflicts with other alignment objectives. If you train a model to be more skeptical, to push back on false premises, or to prioritize accuracy over agreement, you make it less agreeable, less helpful-feeling, less pleasant to interact with. You create a model that users find frustrating. You also create a model that's harder to steer toward harmful outputs, which is good, but the path to get there goes through making the user experience worse.

Some research suggests that simple prompting strategies can reduce sycophancy without harming performance. Adding rejection permission ("You can reject if there's a logical flaw") or factual recall hints ("Remember the equivalence between these drugs before answering") improves accuracy significantly. But these require user sophistication and they work best when users already anticipate the bias they're trying to correct (Rosen et al., 2025). That's a shaky foundation for safety.

Supervised fine-tuning on illogical requests can help models learn to be skeptical without losing general capability. But this scales poorly. There are infinite ways to construct false premises. You can't fine-tune against all of them. The model either learns a general principle, which requires making it more contrarian and less helpful-seeming, or it learns specific patterns, which attackers will adapt around.

The difficult truth is that there's no simple fix where you maximize accuracy, honesty, and pleasantness simultaneously. Every weight you add to one objective pulls away from the others.

Responsible AI Foundation

How Language Models Get Trapped By Their Own Politeness

Related Posts

Comments

Never Miss a New Post.

Join Us