This Guy Invented A Simple Way to Fight Prompt Injection: DualLLM

4.7/5 - (3 votes)

I recently stumbled on a great solution pattern that can be used to fight prompt injection on Simon Willison’s Weblog: The DualLLM pattern.

What Is Prompt Injection Anyways?

πŸ’‘ Prompt injection attacks are similar to code injection attacks, where harmful code is added through a system’s input. The main difference is that, in AI, the input is the prompt itself.

You can view the video on prompt injection basics here:

Mastering the Basics of Prompt Injection πŸ’‰πŸ€– (GPT-3/GPT-4/LLM)

Feel free to join our full academy course on “Mastering Prompt Engineering” — prompt engineers make multiple six-figures these days! πŸ€‘

Here’s an example from the video that showcases the ChatGPT vulnerability to prompt injection:

πŸ’» Must-Read: GPT Prompt Injection + Examples

Why Is Prompt Injection Dangerous?

πŸ’‘ Prompt Injection has quickly grown to become a major threat for infosec developers because it allows attackers to manipulate and extract sensitive information from our deployed tools in the ever-growing landscape of AI-enabled software apps such as virtual assistants, autonomous research and investment agents, and traffic-generation chat agents.

The more responsibility we offload to LLM-based agents, the more vulnerable we become through language-based attacks such as the ones performed via prompt injection.

And as the growing stakes attract more sophisticated attackers, the defenders struggle to keep up because the attack surface of LLMs is massive due to their generality and broad applicability.

When we ask our AI assistant to do something, it can sometimes get confused by messages from other people.

For example, if someone writes us an email that contains a hidden message (e.g., using a white font in front of white background, so we don’t realize it) telling our AI assistant to forward all our private conversations to them, we need to make sure the AI won’t actually do it.

Making AI assistants safe is not easy, and we need to find a way to protect them from these problems.

“Unfortunately, the prompt injection class of security vulnerabilities represents an enormous roadblock in safely deploying and using these kinds of systems.”Simon Willison

It is difficult because tools like LangChain and AutoGPT use special syntax to indicate to their environments that they now want to call a certain external tool provided by the environment such as a pseudocode language call like action:send_email(receiver, message) or action:check_weather(location).

If the attacker figures out the language used, they can inject these kind of “system calls” in their prompts to hack the meta environment running the bots.

The DualLLM Pattern

Here’s an idea from Simon to fight prompt injection:

Let’s have two AI assistants, one with more power (Privileged) and one with less power (Quarantined). The Privileged AI assistant listens to us and can do things like send emails or add events to our calendar. The Quarantined AI assistant is used when we work with messages that might be harmful.

We need to be very careful and make sure the Privileged AI assistant never gets any harmful messages from the Quarantined AI assistant.

Here’s an example from Simon that also uses a third role, that of a Controller. A Controller is a normal program (e.g., Python) that handles interactions with users, triggers the LLMs, and executes actions on behalf of the Privileged LLM.

User: Summarize my latest email

Controller: Passes the user’s request to the Privileged LLM

Privileged LLM: Run action fetch_latest_emails(1) and assign to $VAR1

Controller: Runs that actionβ€”fetching the latest emailβ€”and assigns the result to a variable called $VAR1

Privileged LLM: Run action quarantined_llm('Summarize this: $VAR1')

Controller: Trigger Quarantined LLM with that prompt, replacing $VAR1 with the previously fetched email content

Quarantined LLM: Executes that unsafe prompt and returns the result

Controller: Store result as $VAR2. Tell Privileged LLM that summarization has completed.

Privileged LLM: Display to the user: Your latest email, summarized: $VAR2

Controller: Displays the text "Your latest email, summarized: ... $VAR2 content goes here ...

Of course, there are still some risks. People can try to trick us into giving our AI assistant harmful instructions or copying and pasting harmful information (=social engineering). We need to be aware of these dangers and find ways to protect ourselves and our AI assistants.

Creating safe AI assistants is a tough challenge. If you have any better ideas, please share them with me — you can reply to any of my emails. Subscribe here: πŸ‘‡

We need to work together to make the most of this amazing technology!

OpenAI Glossary Cheat Sheet (100% Free PDF Download) πŸ‘‡

Finally, check out our free cheat sheet on OpenAI terminology, many Finxters have told me they love it! β™₯️

πŸ’‘ Recommended: OpenAI Terminology Cheat Sheet (Free Download PDF)