In response to recent criticism from former employees over its handling of powerful AI technology, OpenAI has released a research paper detailing a method for reverse engineering the inner workings of AI models. The technique, developed by OpenAI’s “alignment” team, aims to shed light on how the AI model that powers ChatGPT stores specific concepts, including those that could lead to undesirable behavior.
Understanding ChatGPT’s Inner Workings
While the research aims to make OpenAI’s efforts to control AI more transparent, it also underscores the recent turmoil within the company: the “alignment” team that developed the technique, which was tasked with studying the long-term risks of AI technology, has since been disbanded.
ChatGPT is powered by a family of large language models (LLMs) known as GPT, which rely on a machine learning approach called artificial neural networks. These mathematical networks have demonstrated remarkable capabilities in learning useful tasks by analyzing sample data. However, their inner workings are not as easily examinable as those of conventional computer programs. The complex interactions between the “neurons” of an artificial neural network make it extremely challenging to reverse engineer and explain why a system like ChatGPT produces a particular response.
Explaining AI Behavior
The new OpenAI paper outlines a technique that partially demystifies this process: an additional machine learning model is used to identify patterns representing specific concepts inside the system of interest. The key innovation is making that secondary, concept-spotting network far more efficient to train.
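The paper itself spells out the exact method; as a rough illustration of the general idea of training a second, smaller network on a larger model's internal activations so that individual units come to act like on/off "concept" detectors, here is a minimal sparse-coding sketch. The layer sizes, sparsity penalty, and the random stand-in activations are assumptions for illustration only, not OpenAI's actual setup.

```python
import torch
import torch.nn as nn

# Illustrative sketch: a tiny sparse autoencoder trained on hidden activations
# of a "model of interest". All sizes and hyperparameters are assumptions.

torch.manual_seed(0)

D_MODEL = 64   # hidden size of the model being probed (assumed)
D_DICT = 256   # number of candidate "concept" directions (assumed)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative codes push each unit toward behaving like an on/off feature.
        codes = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(codes)
        return codes, reconstruction

# Stand-in for activations captured from the larger model (random here).
activations = torch.randn(4096, D_MODEL)

sae = SparseAutoencoder(D_MODEL, D_DICT)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3  # sparsity pressure; an assumed hyperparameter

for step in range(200):
    codes, recon = sae(activations)
    # Reconstruction loss keeps the codes faithful; the L1 term keeps them sparse.
    loss = ((recon - activations) ** 2).mean() + l1_weight * codes.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Each learned unit can then be inspected as a candidate concept direction,
# e.g. by checking which inputs excite it most strongly.
codes, _ = sae(activations)
top_examples = codes.topk(5, dim=0).indices  # rows that most activate each unit
print(top_examples.shape)  # (5, D_DICT)
```

In a real setting the activations would be captured from the model under study, and inspecting which texts most strongly excite each learned unit is what lets researchers attach human-readable labels to the patterns.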
OpenAI tested the approach by detecting patterns representing concepts within GPT-4, one of its flagship AI models. The company released the code related to the interpretation work, as well as a visualization tool that allows users to explore how words in different phrases activate concepts, including profanity and erotic content, in GPT-4 and another model.
Implications for AI Development
Understanding how a model represents certain concepts would be a crucial step toward dialing back those associated with undesirable behavior, keeping an AI system within acceptable boundaries. It would also make it possible to steer an AI system toward or away from particular topics or ideas.
Progress in AI Interpretability
While LLMs have been resistant to easy interrogation, a growing body of research suggests they can be probed in ways that reveal useful information. Anthropic, an OpenAI competitor backed by Amazon and Google, published similar interpretability work last month. To demonstrate that the behavior of AI systems can be adjusted, the company’s researchers created a chatbot obsessed with San Francisco’s Golden Gate Bridge. And simply asking an LLM to explain its reasoning can sometimes yield insights.
Challenges and Future Directions
“This is exciting progress,” says David Bau, a professor at Northeastern University who works on explaining AI, regarding OpenAI’s new research. “As a field, we need to learn to understand and scrutinize these large models much better.”
Bau notes that the OpenAI team’s main innovation is to demonstrate a more efficient way to train a small neural network that can be used to understand the components of a larger one. However, he also points out that the technique needs to be refined to make it more reliable. “There is still a lot of work to be done to use these methods to generate fully comprehensive explanations,” Bau says.
Bau is part of a US government-funded initiative called the National Deep Inference Fabric, which will provide cloud computing resources to academic researchers so that they, too, can probe particularly powerful AI models. “We need to figure out how to enable scientists to do this work even if they don’t work for these large companies,” he says.
The OpenAI researchers acknowledge in their paper that further research is needed to improve their method, but they also express hope that it will lead to practical ways to control AI models. “We hope that interpretability will one day provide us with new ways to reason about the safety and robustness of models, and significantly increase our confidence in powerful AI systems by providing strong guarantees about their behavior,” they state.
Conclusion
OpenAI developed a method to peek inside AI models like ChatGPT, potentially improving control and reducing risks.
FAQs
What’s the new method for?
To understand how AI models store information, aiming for better control.
Why is it important?
It could help mitigate unwanted behaviors in AI systems.
How does it work?
By using another AI model to identify patterns representing specific concepts within the original AI model.