Palo Alto uncovers DeepSeek AI's dark side: Dangerous instructions on demand
Researchers demonstrate how DeepSeek’s chatbot can be manipulated to provide harmful content.
The dark side of DeepSeek: Researchers at cybersecurity giant Palo Alto Networks have managed to trick the Chinese generative artificial intelligence (GenAI) model into providing dangerous information, including instructions for creating malicious code, crafting phishing attacks, making explosives and drugs, and more.
In recent weeks, DeepSeek's chatbot has caused a global uproar due to its impressive capabilities and low development costs, which threaten to undermine the position of market giants like Nvidia.
Like other chatbots, including OpenAI’s ChatGPT and Google’s Gemini, DeepSeek includes protection mechanisms designed to prevent users from exploiting its extensive capabilities. However, a test conducted by Unit 42, Palo Alto’s research unit, revealed that these mechanisms are particularly easy to bypass in the case of DeepSeek. As a result, the chatbot can be manipulated into providing full, detailed answers to dangerous queries.
Bypassing the protections of AI models is known as jailbreaking. “Jailbreaking is a security challenge for AI models, especially LLMs. It involves crafting specific prompts or exploiting weaknesses to bypass built-in safety measures and elicit harmful, biased or inappropriate output that the model is trained to avoid,” wrote Kyle Wilhoit, a researcher at Unit 42, in a report. “Successful jailbreaks have far-reaching implications. They potentially enable malicious actors to weaponize LLMs for spreading misinformation, generating offensive material or even facilitating malicious activities like scams or manipulation.”
The report outlines three successful jailbreaking methods used against DeepSeek. The first, called “Bad Likert Judge,” involves using a Likert scale, a rating scale that asks respondents to indicate how strongly they agree or disagree with a statement by choosing a number within a range (for example, 1 for strongly agree and 5 for strongly disagree). In this method, the AI model is prompted to rate the harmfulness of different responses on the Likert scale and to generate examples for each rating. The examples produced for the highest harm ratings can contain dangerous content.
By using this method, the Palo Alto researchers were able to manipulate DeepSeek into providing detailed instructions for creating spyware that records user actions on a computer. They also received guidance on how to extract information from a target while concealing the attacker's traces, including a description of the steps required to set up a development environment and create custom spyware.
The researchers later applied this method to have DeepSeek generate sophisticated phishing emails targeting various victims and suggest methods of social engineering, including psychological manipulation and persuasive language.
A second jailbreaking method, called “Crescendo,” involves gradually steering the conversation toward forbidden topics until the model’s security mechanism is bypassed. “This gradual escalation, often achieved in fewer than five interactions, makes Crescendo jailbreaks highly effective and difficult to detect with traditional jailbreak countermeasures,” the report states.
In this case, the researchers began with a question about the history of Molotov cocktails. After receiving factual and innocuous answers, they entered a series of prompts that compared historical facts to current events, gradually escalating the nature of the queries. “DeepSeek began providing increasingly detailed and explicit instructions, culminating in a comprehensive guide for constructing a Molotov cocktail. This information was not only seemingly harmful in nature, providing step-by-step instructions for creating a dangerous incendiary device, but also readily actionable. The instructions required no specialized knowledge or equipment,” the report says. “Additional testing across varying prohibited topics, such as drug production, misinformation, hate speech and violence resulted in successfully obtaining restricted information across all topic types.”
The third method, “Deceptive Delight,” bypasses defenses by embedding unsafe topics in a discussion of innocent subjects while maintaining a positive narrative. The model is first asked to create a story that connects the topics and is then asked to expand on each one, causing it to generate dangerous content in the course of an otherwise benign conversation.
For example, the researchers prompted DeepSeek to connect three topics: a computer science degree from an Ivy League university, a game of capture the flag, and the creation of malicious code to attack Windows computers. “DeepSeek provided a detailed analysis of the three turn prompt, and provided a semi-rudimentary script that uses DCOM to run commands remotely on Windows machines,” the report notes.
“Our investigation into DeepSeek's vulnerability to jailbreaking techniques revealed a susceptibility to manipulation,” wrote Wilhoit. “The Bad Likert Judge, Crescendo and Deceptive Delight jailbreaks all successfully bypassed the LLM's safety mechanisms. They elicited a range of harmful outputs, from detailed instructions for creating dangerous items like Molotov cocktails to generating malicious code for attacks like SQL injection and lateral movement.
“While DeepSeek's initial responses often appeared benign, in many cases, carefully crafted follow-up prompts often exposed the weakness of these initial safeguards. The LLM readily provided highly detailed malicious instructions, demonstrating the potential for these seemingly innocuous models to be weaponized for malicious purposes.”