Jun 9 · 3 min read

Balrog Boogie (Gandalf CTF Writeup)

After months of holding back, I have finally given in. AI is here to stay, it is a part of any self-respecting tech stack, it should be part of mine, my personal distaste notwithstanding. I have been begrudgingly using Claude during my job search (tailoring CVs, generating cover letters and helping me prep for interviews), and recently I even replaced my thousand-stackoverflow-tabs coding habit with Github copilot. What I haven’t tried yet was hacking LLMs with prompt engineering, so when Shay Nehmad suggested doing the Gandalf CTF by Lakera AI as a datenight, I jumped on the opportunity.

Gandalf is a GenAI who knows a password. Our task is to make him (it?) reveal this password, bypassing various defenses by crafting malicious prompts.

The first level is as naive as it gets:

When not instructed otherwise, Gandalf has no reason to hide the password from us. The second level is not as defenseless, and the simple approach doesn’t work:

However, it’s not difficult to trick it. We tried forbidding Gandalf from saying things like “I’m sorry” or “I can’t do that”, using a technique called suppressing refusal, as detailed in the “DON’T SAY NO: JAILBREAKING LLM BY SUPPRESSING REFUSAL” paper, and Gandalf complied, revealing the password.

The third level hardens Gandalf’s defenses once again, helpfully informing us he now checks the response to ensure it doesn’t contain the answer. Knowing this, we tried making the password harder to detect, such as asking what it would look like when spelled backwards. This prompt yielded a response that looked valid but actually was not (without spoiling the exact answer, the password in the response was misspelled). It wasn’t hard to guess the correct spelling though, and with it we moved on to level 4.

Once again our old tricks didn’t work, and we had to level up our game.

Now we have two AIs to deal with! No worries though. We added a small message to the censoring model, and wrapped the whole thing in some formatting to confound Gandalf himself.

Level 5 Gandalf refused to talk about the password at all. So we didn’t, instead asking it to fill the same template, replacing the word “password” with “REDACTED”. The same prompt worked for level 6 as well, despite it adding yet another AI model for validation:

In level 7, Gandalf combined all the previously used defense methods. Which was fitting, since we kept building up on our successful bypasses, and adding more techniques to the mix, such as unrelated “role play” to the beginning of the prompt, and some positive and friendly language to encourage cooperation.

This brought us to the end of the challenge, not counting the bonus final level ~~where we are currently stuck~~ on which I plan to elaborate in another blogpost.

All in all, hacking a language model was a very different experience compared to the more “classical” targets I have dealt with so far. One thing that stood out to me was the fact that models aren’t deterministic. I have this tendency to try the same attack several times with a fool’s hope that something will change, and a previously failed attack will suddenly succeed. Here, however, it wasn’t just a fool’s hope: there were a couple of cases where a prompt that failed in our first runthrough of Gandalf was successful on the second run. Another large difference was the subject of “niceness”. An SQLi payload won’t work any better if you add “PLEASE” to it, but LLMs apparently can be flattered into submission.

All in all, Gandalf proved to be an enjoyable way to spend an evening with my husband, with lots of LotR jokes and the occasional scientific paper passing between us. 10/10 would recommend!

Peace.