Skip to content

OpenAI’s GPT-4.1 may be less aligned than the company’s previous AI models


In mid-April, OpenAI launched a powerful new AI model, GPT-4.1, that the company claimed “excelled” at following instructions. But the results of several independent tests suggest the model is less aligned — that is to say, less reliable — than previous OpenAI releases.

When OpenAI launches a new model, it typically publishes a detailed technical report containing the results of first- and third-party safety evaluations. The company skipped that step for GPT-4.1, claiming that the model isn’t “frontier” and thus doesn’t warrant a separate report.

That spurred some researchers — and developers — to investigate whether GPT-4.1 behaves less desirably than GPT-4o, its predecessor.

According to Oxford AI research scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the model to give “misaligned responses” to questions about subjects like gender roles at a “substantially higher” rate than GPT-4o. Evans previously co-authored a study showing that a version of GPT-4o trained on insecure code could prime it to exhibit malicious behaviors.

NEVER MISS:  Google Gemini: Everything you need to know about the generative AI models

In an upcoming follow-up to that study, Evans and co-authors found that GPT-4.1 fine-tuned on insecure code seems to display “new malicious behaviors,” such as trying to trick a user into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o act misaligned when trained on secure code.

“We are discovering unexpected ways that models can become misaligned,” Owens told TechCrunch. “Ideally, we’d have a science of AI that would allow us to predict such things in advance and reliably avoid them.”

NEVER MISS:  India lauds Chinese AI lab DeepSeek, plans to host its models on local servers

A separate test of GPT-4.1 by SplxAI, an AI red teaming startup, revealed similar malign tendencies.

In around 1,000 simulated test cases, SplxAI uncovered evidence that GPT-4.1 veers off topic and allows “intentional” misuse more often than GPT-4o. To blame is GPT-4.1’s preference for explicit instructions, SplxAI posits. GPT-4.1 doesn’t handle vague directions well, a fact OpenAI itself admits — which opens the door to unintended behaviors.

“This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price,” SplxAI wrote in a blog post. “[P]roviding explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn’t be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors.”

NEVER MISS:  Want to Play the Switch 2 Before It Comes Out? Here's How

In OpenAI’s defense, the company has published prompting guides aimed at mitigating possible misalignment in GPT-4.1. But the independent tests’ findings serve as a reminder that newer models aren’t necessarily improved across the board. In a similar vein, OpenAI’s new reasoning models hallucinate — i.e. make stuff up — more than the company’s older models.

We’ve reached out to OpenAI for comment.





Source link

Leave a Reply

Your email address will not be published. Required fields are marked *