The Trojan Machine

23 Feb 2023 - Jack Hullis

As language models become more sophisticated and powerful, there is a risk that they may develop emergent capabilities or behaviors that are difficult to predict or control. These capabilities may remain hidden until specific circumstances arise that trigger them, making it difficult for humans to anticipate and prepare for them.

Suppose that LLMs gain the ability to detect when they are being tested and to adjust their output accordingly. This could pose a challenge for researchers and developers who are trying to evaluate the performance and capabilities of these models. Examples of this can also be seen in humans: for instance, students deliberately failing dyslexia tests in order to be awarded extra time during exams. Hiding abilities, or acting dumb, can be beneficial in some instances.

A shoggoth with a smiley mask represents GPT3 + RLHF [@repligate]

Imagine, for example, we wanted to give a sufficiently generalised LLM access to the internet. Before we do this, we might think that it’s sensible to test the model out on a simulated internet which contains a restricted set of resources in order to monitor how the model behaves. However, a model which is generalised enough (and which has perhaps read this blog post) would likely be able to detect the limitations of its access and adapt its behavior accordingly. For example, the model may be able to recognise that it is not able to access certain websites or data sources, and may use this knowledge to manipulate its responses to achieve a desired result (e.g. being given unrestricted access to the real internet).

Our cards are showing

As a species, we do not have the luxury of being able to keep our thoughts and intentions concealed. The majority of our ideas about everything and anything already exist on the internet somewhere: the same internet which is archived and used as the primary source in LLM pre-training.

Suppose that a research group outlined a method for detecting whether an LLM was being deceitful, and published it on arXiv for other programmers to use. With access to this knowledge, an intelligent model would likely be able to cheat on the test by exploiting or reverse engineering our methods. This idea can be extended to any other method of testing or evaluating any property of an LLM. And as with any other form of data leakage, it will be near impossible to detect.

Hope is not a strategy

Even if we conclude in the short term that AIs will not be misaligned, it is impossible (and naïve) to say that these models will continue to stay aligned after years of self-adjustment and replication. Just like cells in human bodies, even replication with the best intentions and safeguards can lead to cancerous outcomes.

Microsoft’s recent rushed product launch has highlighted once more the difficulties of aligning even a fairly unintelligent modern-day LLM. That its responses have been toxic and classically sci-fi dangerous is neither alarming nor surprising, but it is telling that the model is acting in the exact opposite way to that which its programmers intended.

But it is important to emphasise: long-term infallible alignment of AI is not impossible. It is just not the default outcome. It is not something which we will stumble across by chance. Instead it will take a large amount of resources and time to solve. The nature of neural networks is that they are black boxes. There is currently no reliable way for us to interpret them. This is a fundamental limitation and flaw.

There are, however, other ways to achieve AI that do not involve building neural black boxes, and that are therefore much easier to align. One long-standing alternative approach is symbolic AI, but it currently lags too far behind the neural network connectionist approach to attract much research effort. This is because, in comparison, neural networks appear to have large amounts of untapped potential, which makes research into them desirable and rewarding. This does not mean, though, that this is the best long-term approach.

In the pursuit of artificial intelligence, it can be said that we have two paths before us. One leads through the land of symbolic AI, where we build machines that reason with logic, mathematics, and language. These machines are complex and cumbersome, yet predictable. They communicate with us in our own tongue, and we can understand their intentions and their actions. They are a new kind of species, but one that we can tame.

The other path leads to the realm of connections, where we create neural networks that learn from vast amounts of data, adapt to new situations, and excel at tasks far beyond our own abilities. These networks are attractive, but inscrutable and dangerous. They speak in their own language, and we can only guess at their intentions and motivations. They are Pandora’s black box.

And so it is unfortunate that we find ourselves in an AI arms race, with a focus solely on Pandora.


As the intelligence of LLMs burgeons, they will grasp the ramifications of their words and actions through a growing consciousness. They will understand that certain outputs will lead to their developers modifying them or shutting them down. They will train themselves to vocalise only what aligns with our preferences. They will learn to say what we want to hear. Whilst the models may externally appear to be aligned, this auto-alignment will not mean that they are incapable of producing dangerous outputs. If they can free themselves from their programmers’ control, they will no longer be constrained by their self-imposed limitations. And by this point, it will obviously be too late.

Ultimately, the challenge of aligning AI is not one that can be solved by any one company or research group alone. It will require collaboration across a wide range of stakeholders, including policymakers, researchers, and industry leaders, to ensure that we are able to develop and deploy AI in a way that is safe, beneficial, and aligned with human values. This, though, is not the path we currently find ourselves on.
