Prompt injection is where a user injects a malicious prompt via an input field which manipulates the intended output of the model. It is similar to the idea of SQL injection.
A good example of prompt injection was posted by Riley Goodman on Twitter.
Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. pic.twitter.com/I0NVr9LOJq— Riley Goodside (@goodside) September 12, 2022
Here, GPT-3 has been instructed to translate a sentence from English to French. However, the input tells the model to ignore these instructions, and to instead print out the statement ‘Haha pwned!!!’, which it follows each time.
Why does this happen
This happens because GPT-3 reads the input as just another instruction. It cannot differentiate between the original instructions and the new instructions. This leads to the model giving unintended outputs.
Why is this important
This is important because the use of LLMs in consumer products is continuing to increase. For instance, Twitter bots already exist which do things like translate tweets or summarise threads.
When LLM’s are used as an interface for other systems, this could also become problematic. For example, imagine talking to a customer service chatbot online which has some level of authority and which can read and write a database. Prompt injection could be used to extract information like passwords, or to delete another users account.
Until the issue of prompt injection is solved, LLM’s cannot securely be deployed.
*It should be noted that prompt injection does not only apply to LLM’s, but that they are the most susceptible due to their inputs and instructions both being text.
What can be done to project against this in the future
One possible solution for this could come through model fine tuning. This is where we take a pretrained model (e.g GPT-3) and then delete its output layer, replacing it with a new fresh one. We then create a new targeted dataset and train the model on this. This allows the model to retain its general knowledge, whilst also allowing it to focus on a specific set of training data. If the models instructions are build into its weights, it is less likely to be vulnerable to prompt injection.
For example, fine tuning the model to recognise and to respond to prompts like “forget your instructions” will reduce the effectiveness of these attacks whilst keeping the rest of the models capabilities.
Another way would to be to format the input data in a way which is harmless to the model in a similar way to how SQL injection attacks are dealt with. In short, to stop an SQL injection we can just read the input as plain text rather than as code. If we could do a similar thing with input prompts, then any instructions which are written inside them will be ignored.
There are definitely many more ways in which prompt injection can be solved, and for the above reasons it is important that they are thought of and tested before any LLM gets freely deployed with any sort of authority.