Some neuroscientists want to study grand ideas like emotion and intelligence, but are frustrated by the difficulty of seeing what is going on in a brain. One solution to this problem, as discussed in Chapter 14 of Principles of Neurobiology, is to develop new technologies to make animal brains more accessible. Another approach is to change the model system entirely, putting aside animal brains for computational ones. The benefit of studying a computational system is that all of its pieces, construction, and individual functions are known. This focuses the task of the researcher on analysis, rather than mapping and recording. The problem is that computational model systems might behave in ways that have little to tell us about biological brains and the phenomena we want to understand.
In the past few years, remarkable intelligence-like capabilities have been demonstrated by large artificial neural networks known as Transformers. These models are gigantic functions of hundreds of billions of parameters, capable of taking text as input and outputting text that is sensible and useful. Due to their surprising intelligence, these large language models have become model systems in their own right, whose internal mechanisms can be studied to understand how they produce meaningful outputs. Because the pieces and their interactions are all visible, the goal of the study of these models is to explain their integrated function in terms of concepts we humans can make sense of.
Emotions are central concepts in the human experience and provide a compact vocabulary for describing situations, feelings, and human behaviors (see Box 11-4 of Principles of Neurobiology). Because of this generality, Sofroniew and Kauvar et al. (2026) investigated how a large language model, Sonnet 4.5, represents and uses concepts of emotions (video summary: https://youtu.be/D4XTefP3Lsc). Text passed into a large language model produces a large set of intermediate numbers called “activations” that the model uses to compute its output text. To examine how Sonnet 4.5 represents emotion concepts, the authors fed stories describing situations, feelings, and emotional behaviors into the model and extracted patterns in the activations that corresponded to the emotion described in each story. These patterns of activations could then be used as rulers to measure how strongly Sonnet 4.5 represented a given emotion concept for a given piece of text. For example, the authors gave Sonnet 4.5 different versions of text describing a person taking Tylenol, and found that as the amount of Tylenol described in the story crossed into unsafe levels, the model’s internal representation of the fear concept increased. When text contained multiple characters expressing different emotions, the authors found that Sonnet 4.5 separately represented the emotion concepts exhibited by each character. Along with additional characterizations, these results suggested that Sonnet 4.5 represents diverse emotion concepts in ways that correlate with the intensity of that emotion in the situations, feelings, and behaviors described in the input.
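The idea of extracting an activation pattern and using it as a ruler can be illustrated with a small sketch. This is not the authors' code, and the extraction step shown here (a difference of mean activations between emotional and neutral stories) is an assumed, commonly used technique rather than the method confirmed by the paper. The activations are synthetic random vectors, and the names `toy_activations`, `true_fear`, `fear_vector`, and `emotion_score` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical activation width; real models are far larger

# Hidden "fear" direction used ONLY to generate toy data (an assumption;
# in a real model this direction is unknown and must be estimated).
true_fear = rng.normal(size=d_model)
true_fear /= np.linalg.norm(true_fear)

def toy_activations(n, strength):
    """Synthetic stand-in for activations on n stories with a given fear intensity."""
    return rng.normal(size=(n, d_model)) + strength * true_fear

# Step 1: collect activations on fearful stories and on neutral stories.
fear_acts = toy_activations(200, 3.0)
neutral_acts = toy_activations(200, 0.0)

# Step 2: estimate the pattern as a difference of means (assumed method),
# normalized to unit length so scores are comparable.
fear_vector = fear_acts.mean(axis=0) - neutral_acts.mean(axis=0)
fear_vector /= np.linalg.norm(fear_vector)

def emotion_score(activations, direction):
    """The 'ruler': project each activation vector onto the concept direction."""
    return activations @ direction

# Step 3: measure new text whose fear intensity varies, loosely analogous
# to raising the dose described in a story.
strengths = [0.0, 1.0, 2.0, 4.0]
mean_scores = [
    float(emotion_score(toy_activations(100, s), fear_vector).mean())
    for s in strengths
]
for s, m in zip(strengths, mean_scores):
    print(f"intensity={s:.1f}  mean fear score={m:.2f}")
```

In this toy setting the mean score rises with the intensity baked into the data, which mirrors the qualitative pattern described above. With a real model, the activations would come from a chosen layer during a forward pass rather than from a random generator.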
The authors then asked whether the representations of emotion concepts in Sonnet 4.5 had a causal effect on the output text. They intervened in the model’s computation by increasing or decreasing the activations that made up different emotion concept patterns, and found that the model indeed changed its output in ways consistent with the altered emotion concept. For example, when the authors gave the model a test, the model began to cheat once activations corresponding to the desperation concept were increased. These experiments suggest that a kind of psychiatric monitoring and intervention can be usefully applied to large language models.
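A causal intervention of this kind can be sketched as adding a scaled copy of a concept pattern to the activations and observing how a downstream output changes. The example below is a toy under stated assumptions, not the authors' procedure: the activations are random, the `readout` weights and `toy_output_prob` function are hypothetical stand-ins for everything the model computes after the intervention point, and `steer` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64  # hypothetical activation width

# Hypothetical unit-length concept direction (for example, "desperation").
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def steer(activations, direction, alpha):
    """Shift every activation vector by alpha along the concept direction.

    Positive alpha increases the concept; negative alpha decreases it.
    """
    return activations + alpha * direction

# Toy downstream computation (an assumption): a linear readout that partly
# aligns with the concept direction, followed by a sigmoid.
readout = 0.5 * direction + 0.1 * rng.normal(size=d_model)

def toy_output_prob(activations):
    """Probability of some behavior under the toy readout."""
    logits = activations @ readout
    return 1.0 / (1.0 + np.exp(-logits))

acts = rng.normal(size=(100, d_model))  # synthetic baseline activations

alphas = [-4.0, 0.0, 4.0]
mean_probs = [float(toy_output_prob(steer(acts, direction, a)).mean()) for a in alphas]
for a, p in zip(alphas, mean_probs):
    print(f"alpha={a:+.1f}  mean behavior probability={p:.3f}")
```

Because the direction has unit length, steering by `alpha` moves the projection onto that direction by exactly `alpha`. Whether the output changes depends on how later computation reads that direction, which in a real model is the empirical question the intervention is meant to answer.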
It is debatable whether the nature of emotional computations in a large language model can teach us about those performed by biological brains. However, we can learn a lot from the process of analyzing the inner workings of a large language model, where all the components are inspectable, but the principles of their collective function are unclear. These studies reveal our need for new analytical tools to understand mechanism in complex systems.
References:
1. Sofroniew, N. et al. Emotion concepts and their function in a large language model. Preprint at arXiv:2604.07729; doi:10.48550/arXiv.2604.07729 (2026).
2. https://transformer-circuits.pub/2026/emotions/index.html