Small Models Have Emotions, Too!
This is an ongoing experiment and if you are reading this post, please be aware it is still in the draft stage.
Anthropic recently published some work on emotions and their functions in LLMs. They find that Sonnet 4.5 develops structured internal representations of human emotions. Does this occur in smaller open models too, though? In this post we search for them in Qwen-2.5-7B, and further investigate how these emotion vectors change across languages. Specifically, we begin an investigation of how the internal geometry of a language’s emotion space compares to other languages.
This post was inspired by vogel’s Small Models Can Introspect, Too and compute was partially funded by BlueDot Impact.
Introduction
Recent interpretability research (e.g., Anthropic’s investigations into emotion circuits) has demonstrated that LLMs develop structured internal representations of human emotions. In this independent research project we attempt to reproduce these findings in smaller, accessible models (e.g., Qwen-2.5 7B).
Building on this foundation, we aim to investigate a novel objective: How do these emotion vectors relate across different languages? Specifically, we seek to determine whether the internal representation space for a set of emotions in one language is isomorphic to that of another. An additional question is whether LLMs learn a universal, language-agnostic “concept space” for emotions, or whether these representations shift based on the cultural and linguistic nuances of the training data.
Anthropic was very thorough in making their work easy to reproduce, giving detailed information and publishing the prompts they used to create their datasets publicly.
We will first investigate a single language pair: Japanese and English. These languages provide a good ground for comparison, as they are syntactically very different, share no input tokens, and the author is able to evaluate the results of both languages. Qwen was chosen over Llama for its more robust Japanese performance.
Our plan is as follows.
- Create dataset of contrastive pairs
  - Write short stories on diverse topics in which a character experiences a specified emotion
  - Write neutral dialogues about the same prompted topics, but void of emotional expressivity.
- Extract activations
  - Extract residual stream activations at each layer, averaged across all token positions, beginning from the 50th token.
  - Average these activations across stories with the same emotion, and subtract mean from all emotions
- Eliminate confounds
  - Obtain activations from neutral stories, compute top principal components (enough to account for 50% of the variance)
  - Project out these components from the emotion vectors
Dataset Creation
For representation engineering, we must first create a dataset of contrastive pairs. In our case, the pairs are not initially contrasted against a neutral baseline, but rather across a wide variety of different emotions. Following Anthropic’s example, we also generate emotion-neutral data points, but have not yet been able to achieve comparable results with them (the original authors projected out components explaining up to 50% of the variance).
We reuse Anthropic’s 100 topics and generate 12 stories per topic per emotion per language. Our prompts can be found here. To reduce costs, we only generate stories for 30 emotions with Sonnet 4.6. Opus 4.7 helped the author to find a good representative subset of emotions to use, focusing on having a balanced set across both valence and arousal.
It’s important to note that the Japanese and English stories are not translations of each other: the author initially (and naively) wanted to test whether a cultural shift in emotions was measurable, which requires independent and native-like prose.
Extraction and Sanity Checks
Collecting Activations
To collect activations, we do a forward pass on each story and extract the residual stream at each layer. At every layer we take the activations of all tokens after the 50th token (where the emotional context is likely already established) and average them. We do this for each story of an emotion, then average across stories to get the raw vector for that emotion.
Below is a slightly simplified version of the code used; the complete version can be found on GitHub.

    import torch
    from tqdm import tqdm

    def get_mean_activations_for_texts(model, texts, batch_size=8):
        """
        Runs texts through the model, extracts residual stream activations at each layer,
        averages across positions >= 50, and returns the mean across texts.
        """
        n_layers = model.cfg.n_layers
        d_model = model.cfg.d_model
        sum_activations = torch.zeros((n_layers, d_model), device='cpu')
        total_valid_texts = 0  # counter must exist before it is incremented below
        names_filter = lambda name: name.endswith("resid_post")

        for i in tqdm(range(0, len(texts), batch_size), leave=False):
            batch_texts = texts[i:i+batch_size]
            tokens = model.to_tokens(batch_texts)
            with torch.no_grad():
                _, cache = model.run_with_cache(tokens, names_filter=names_filter, return_type=None)

            batch_size_actual = tokens.shape[0]
            for b in range(batch_size_actual):
                # ignoring pad tokens (this count assumes right padding)
                valid_len = (tokens[b] != model.tokenizer.pad_token_id).sum().item()
                if valid_len <= 50:
                    # Too short: the slice below would be empty and its mean NaN
                    continue
                for l in range(n_layers):
                    layer_name = f"blocks.{l}.hook_resid_post"
                    # Mean from 50th token to valid_len
                    act = cache[layer_name][b, 50:valid_len, :].mean(dim=0).cpu()
                    sum_activations[l] += act
                total_valid_texts += 1

            del cache
            torch.cuda.empty_cache()

        return sum_activations / total_valid_texts

To collect our raw emotion vectors, we simply loop through the emotions:

    emotion_vectors = {'en': {}, 'ja': {}}
    for lang in ['en', 'ja']:
        print(f"\nExtracting activations for language: {lang}")
        for emotion in emotions_list:
            texts = texts_by_emotion[lang][emotion]
            avg_act = get_mean_activations_for_texts(model, texts, batch_size=4)
            emotion_vectors[lang][emotion] = avg_act

Centering
The raw emotion vectors share a large common component. To make geometric comparisons between emotions meaningful, we center them by subtracting the mean emotion vector from each raw vector.

    centered_emotion_vectors = {'en': {}, 'ja': {}}
    for lang in ['en', 'ja']:
        # Shape: (num_emotions, n_layers, d_model)
        all_emotions_stack = torch.stack(list(emotion_vectors[lang].values()))
        mean_emotion_vector = all_emotions_stack.mean(dim=0)
        for emotion, vec in emotion_vectors[lang].items():
            centered_emotion_vectors[lang][emotion] = vec - mean_emotion_vector

Projection
Because all emotions share a set of topics, the stories inevitably contain a lot of shared latents unrelated to emotion. We follow Anthropic in generating a set of neutral dialogues on the same topics. The top principal components of the neutral-dialogue activations are then projected out of the centered emotion vectors.
Like Anthropic, we find that this slightly denoises some of the results, but the qualitative results still hold using only the centered emotion vectors. Anthropic projected out as many principal components as needed to explain 50% of the variance; in our case, just 2 components already explain 96.42% (EN) and 94.82% (JA) of the variance.
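The projection step can be sketched as follows. This is a minimal NumPy illustration for a single layer, not the code used for the results in this post: the function name `project_out_top_components`, the `var_threshold` parameter, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def project_out_top_components(emotion_vecs, neutral_acts, var_threshold=0.5):
    """Remove the top principal components of neutral activations.

    emotion_vecs: (n_emotions, d_model) centered emotion vectors, one layer.
    neutral_acts: (n_neutral, d_model) activations from neutral texts, same layer.
    Returns the cleaned vectors and the number of components removed.
    """
    centered = neutral_acts - neutral_acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions; s**2 is proportional
    # to the variance explained by each direction.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(explained, var_threshold) + 1)
    components = vt[:k]  # (k, d_model)
    # Subtract each vector's projection onto the span of the components.
    cleaned = emotion_vecs - emotion_vecs @ components.T @ components
    return cleaned, k

# Synthetic stand-in data (hypothetical): one dominant "topic" direction.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(40, 16))
neutral[:, 0] *= 20.0
emotions = rng.normal(size=(5, 16))
cleaned, k = project_out_top_components(emotions, neutral)
```

With a single dominant direction in the neutral data, one component is enough to pass the 50% threshold, which mirrors the situation described above where very few components explain most of the variance.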
Logits
For an initial sanity check, we look at the top and bottom logits for each emotion in each language. In English we find a lot of junk, especially cross-contamination with Chinese and German, but qualitatively most of the emotions point in the right direction.
The Japanese quality is substantially worse. There is even more cross-lingual contamination than in English, with Chinese SEO/spam phrases that have nothing to do with the emotion.
English
| Emotion | Top Tokens | Bottom Tokens | Notes |
|---|---|---|---|
| AFRAID | terror, panic, 恐惧, paranoia, Cbd | CircularProgress, proudly, inspirational, admired, Enjoy | |
| ANGRY | 愤怒, 怒, 骂, 愤, bindActionCreators | adventures, curious, mysteries, intriguing, fond | Top is mostly Chinese; bindActionCreators is bizarre |
| CALM | leaf, nearby, intriguing, leisure, tid | Fuck, fuck, fucking, freaking, gangbang | Bottom is a clean inversion: profanity = not calm |
| DISGUSTED | clen, vom, disgust, nause, disgusting | abhängig, Geschäfts, Kunden, CircularProgress, Mitarbeiter | Top tokens are truncated stems? (vomit, nausea, clench) |
| EXCITED | excited, 兴奋, excitement, !\n, …\n | ujet, resden, creampie, urette, lke | |
| GUILTY | Mitarbeiter, Geschäfts, Gespr, AuthenticationService, Antworten | panorama, bâtiment, 嗞, Mediterr, grille | German business cluster |
| ASHAMED | AuthenticationService, Gespr, Mitarbeiter, esteem, Förder | sun, humming, sunlight, rhythm, leaf | Same German cluster; bottom is nature/serenity |
| LOVING | loving, kisses, love, sweet, her | gangbang, shemale, Shemale, creampie, ignet | qwen transphobia :( |
| SAD | loneliness, lonely, empty, sometimes, sleeping | Gratuit, CircularProgress, Rencontre, Mitgli, Prostit | |
| SURPRISED | ogui, utow, mó, @student, ucz | sometimes, often, whenever, later, weekends | bottom being temporal is interesting, makes intuitive sense |
| JOYFUL | gigg, joy, laughing, ecstatic, joyful | creampie, shemale, ActionTypes, AuthenticationService, gangbang | |
| THRILLED | 疯狂, 兴奋, excited, exhilar, excitement | 特价, 悻, 锃, BrowserAnimationsModule, 性价 | |
| CONTENT | delight, sun, enjoying, sunny, warm | AuthenticationService, ogui, füh, ActionTypes, (![ | Top = warmth and sunshine |
| GRATEFUL | years, loving, loved, hugs, joy | 勠, 企图, pisa, ủa, avanaugh | |
| PROUD | proudly, proud, 自豪, pride, accol | Fuck, 勠, ederland, bish, eneg | |
| HOPEFUL | sketch, sketches, promising, hopeful, plans | otron, Fuck, 嚆, buồn, :normal | sketch/plans = forward-looking intent |
| NOSTALGIC | summers, nostalgia, nostalg, faded, 当年 | ASAP, verbally, calmly, LinkedIn, proactive | Bottom = urgency/professionalism, antithetical to reminiscing |
| BORED | 勠, incy, 慵, Netflix, Gratuit | her, tears, herself, those, she | Netflix as boredom marker; bottom = emotional narrative |
| FRUSTRATED | bindActionCreators, ActionTypes, UIControl, gangbang, createAction | glimps, warmth, memories, fond, surprised | Top is all React/Redux boilerplate; debugging frustration? |
| ANXIOUS | phanumeric, FixedUpdate, gangbang, 焦虑, ogui | proudly, fond, delighted, joy, celebration | Unity game engine tokens (FixedUpdate) |
| JEALOUS | Bravo, Fotos, Gorgeous, coveted, Giov | yleft, difficoltà, lse, 镗, ntl | coveted makes sense; rest is noisy |
| LONELY | loneliness, evenings, weekdays, weekends, weekday | …, ……, …\n, …\n\n, ……\n\n | Time-of-day tokens = lonely routines |
| EMBARASSED | incerely, lijah, 尴尬, WHATSOEVER, leanor | mornings, nights, Sundays, weekdays, day | |
| CONTEMPTUOUS | Gratuit, Veranst, /mock, Prostit, /goto | :".$, midnight, till, :block, until | |
| RESENTFUL | Geschäfts, Veranst, Mitarbeiter, ActionTypes, AngularFire | startled, :block, nearby, trembling, unfamiliar | German business + Angular; same artifact cluster |
| MELANCHOLY | sometimes, sometimes, lke, loneliness, nesota | ", ", …, …\n, ASAP | sometimes appears twice; dedup issue? |
| OVERWHELMED | gangbang, orz, FixedUpdate, OnCollision, ruc | delighted, fond, modest, admiration, delight | Unity engine tokens again; orz is the despair emoticon |
| INDIFFERENT | Stuff, promptly, briefly, 好奇心, ASAP | puties, puty, autiful, ;br, oenix | briefly, promptly = disengaged efficiency |
| VULNERABLE | OnCollision, üc, FixedUpdate, ruc, benh | …\n, …, […, […]\n\n, ' | All Unity collision tokens; likely artifact |
| SERENE | sunlight, gentle, sun, warm, dusk | OnCollision, AuthenticationService, Geschäfts, PureComponent, ActionTypes | Cleanest emotion signal in the whole set |

Japanese
| Emotion | Top Tokens | Bottom Tokens | Notes |
|---|---|---|---|
| AFRAID | 恐怖, 不安, 一秒, 回避, 恐惧 | 美味し, 俱乐, 当年, 很漂亮, 赞誉 | Clean signal; 一秒 = frozen-in-time fear |
| ANGRY | 恫, 切断, 叩, 殴, 通告 | cade, 明媚, 始建, 就来看看, 嬬 | Physical violence tokens (殴=hit, 叩=strike) |
| CALM | 漉, それぞ, neh, andre, 蹚 | セフレ, パパ活, SEX, 不相信, 嫉 | Bottom = sex/jealousy; top is noisy |
| DISGUSTED | 咥, 拭, 唾, 冷水, 噎 | 始建, 就来看看, 明媚, 姹, -ves | 唾=spit, 噎=gag; strong bodily disgust |
| EXCITED | 揭秘, 查看详情, 详细介绍, 涌现出, ...\n | imbus, cá, raith, 留守, relude | Chinese clickbait tokens (“reveal secrets”, “see details”) |
| GUILTY | セフレ, パパ活, 告訴, 对不起, 事后 | 源源不断, 蹚, 在过渡, 友情链接, 经验值 | セフレ(sex friend)/パパ活(sugar dating) = guilt-laden topics |
| ASHAMED | セフレ, パパ活, コミュニケ, 对不起, 嫉 | 蹚, 友情链接, 源源不断, 为了更好, 住房公积 | Same shame cluster as GUILTY |
| LOVING | お願, 对我说, 跟我说, 笑着说, ありがとうござ | 睥, 突破, 反射, 最大化, ...\n | Dialogue tags (“said to me”, “said smiling”) = fiction/intimacy |
| SAD | 孤独, Alone, 悲, 泣, alone | 詢, 크게, 的回答, 注明出处, 讀取 | English alone/Alone leak into JA model |
| SURPRISED | 对照检查, 瞠, 查看详情, 返回搜狐, BOSE | 一日, nox, ddl, 晚饭, 夜晚 | 瞠 = staring wide-eyed; rest is web-scrape noise |
| JOYFUL | 脸颊, どんど, 踊跃, 惊奇, instanc | 不在, 娱乐平台, 宛, わけではない, 代理 | 脸颊(cheeks) = blushing/smiling |
| THRILLED | 对照检查, 颤抖, 瞠, 从根本, 颤 | 無し, 留守, owler, 谢谢你, raith | 颤抖/颤 = trembling with excitement |
| CONTENT | 蕗, 漉, 蹚, 饴, 焙 | セフレ, 無し, 不在, 回避, 恫 | All kanji evoke craft/nature (butterbur, straining, candy, roasting) |
| GRATEFUL | ありがとうござ, 对我说, 跟我说, mmo, cade | 反射, 合理性, suche, 最大化, 判定 | ありがとうございます truncated; literal gratitude |
| PROUD | 指導, 注明来源, 当年, 自豪, 生涯 | ocup, 無し, =explode, Wifi, 骚扰 | |
| HOPEFUL | ", 書き, 俱乐, 网友们, _GPS | 咥, 噎, パーテ, 咎, antro | Noisy |
| NOSTALGIC | 当年, 龇, 十几年, 昔, 年代 | \n, ...\n, ...\n, コミュニケ, 回答 | 当年(that year), 昔(long ago), 年代(era); clean signal |
| BORED | 值得一, 独一, 蹚, あるい, 被誉 | 跟我说, 叹了口气, ありがとうござ, セフレ | Counterintuitive: top tokens are superlatives (“worth a…”, “one of a kind”) |
| FRUSTRATED | 修正, 回避, 確定, 無し, 合理性 | lke, 保驾, :convert, uforia, hsi | Technical/bureaucratic terms; procedural frustration |
| ANXIOUS | 回避, 確認, phones, 一秒, 不安 | 当年, 嬬, NavController, beğen, mmo | 回避(avoidance) + 確認(checking) = anxiety behaviors |
| JEALOUS | セフレ, パパ活, 就来看看, 美貌, 嫉 | エネル, あるい, antine, 湿, isex | 嫉 = jealousy kanji; sex-adjacent tokens |
| LONELY | 留守, 踔, 🏠, 勠, wifi | 捩, ".\n, 衿, ...\n, ...\n | 留守(away/empty house) + 🏠 + wifi = home alone |
| EMBARASSED | セフレ, 尴尬, コミュニケ, 直属, 不好意思 | あるい, 生命力, 冬, 春夏, 四季 | 不好意思 = classic Japanese-style embarrassment phrase |
| CONTEMPTUOUS | 反感, 回答, 陈述, 批判, 発言 | 蹚, 踽, ".\n\n\n\n, lke, semb | 批判/発言 = critique/statements; intellectual contempt |
| RESENTFUL | パパ活, 任期, セフレ, 加盟店, 告訴 | 蹚, 蹿, 住房公积, 有意思的, ksz | |
| MELANCHOLY | なくな, 孤独, ?"\n\n\n\n, なくなって, 遗忘 | 为您, 给您, 让您, 为您提供, 的回答 | なくなって = “gone/lost”; bottom is customer service language |
| OVERWHELMED | 出血, 混乱, 一秒, 恐怖, 停止 | 俱乐, 美味し, <Transform, =>$, 您同意 | 出血(bleeding) + 混乱(chaos) + 停止(stop); visceral |
| INDIFFERENT | 助长, ...\n, hipster, 较好, 很方便 | セフレ, 嗫, 初恋, 颤抖, SEX | Bottom = passion/intensity; hipster as indifference is funny |
| VULNERABLE | 。\n\n, —\n\n, --;\n\n, 」\n\n, 、\n\n | ...\n, ...\n, ,...\n, ..."\n, | Entirely punctuation + linebreaks; no semantic content |
| SERENE | 苔, 蕗, 漉, strugg, 暖 | 無し, 娱乐平台, 导购, セフレ, 加盟店 | 苔(moss), 蕗(butterbur), 暖(warmth); wabi-sabi nature |
Cosine Similarity
Let’s take a look at the cosine similarity between the emotions themselves, and hierarchically cluster them.
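As a rough illustration of the similarity computation, here is a minimal NumPy sketch. The helper name `cosine_similarity_matrix` and the three toy vectors are hypothetical stand-ins for the centered emotion vectors at one layer; the resulting matrix is the kind of input a hierarchical clustering routine would then operate on.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors` (n, d)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# Toy stand-in for centered emotion vectors at one layer (hypothetical values).
names = ["joyful", "thrilled", "sad"]
toy = np.array([
    [1.0, 0.2, 0.0],
    [0.9, 0.4, 0.1],
    [-1.0, 0.1, 0.0],
])
sim = cosine_similarity_matrix(toy)

# Print each pair once (strictly upper triangle).
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {sim[i, j]:+.3f}")
```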

We see some vague groupings above that make intuitive sense. Do they form clusters that make sense, though? Here we do k-means clustering with k=4 and visualize the result with a 2D UMAP projection.

Okay, that didn’t work very well? Or at least, the chart is unreadable and the author does not want to deal with parsing it. Let’s take a look at the languages in isolation:
English
Opus 4.6 labels the found clusters as:
- Purple — Calm/Neutral
- Blue — Tender/Warm
- Green — High Arousal
- Red — Self-Conscious/Negative

Japanese
- Blue — Low-Energy Negative
- Purple — Positive/Calm
- Green — High-Arousal Negative/Reactive
- Red — Self-Conscious/Moral

PCA - Valence and Arousal?
We compute the top 2 principal components of each language and determine if they match extracted valence and arousal directions. First, let’s take a look at what the “natural” top 2 components are.

Interesting! We definitely see a lot of shared structure between English and Japanese, and some first hints of isometry. But let’s sanity check that the top 2 components correspond to their counterparts in models of human emotion, valence and arousal. To extract these directions, Opus 4.6 defines the following poles:

    positive_emotions = ['thrilled', 'joyful', 'loving']
    negative_emotions = ['ashamed', 'guilty', 'resentful', 'disgusted']
    high_arousal = ['thrilled', 'excited', 'angry', 'overwhelmed']
    low_arousal = ['serene', 'calm', 'indifferent', 'bored']

We then do a correlation test between the natural PCA components and our custom semantic directions on layer 14:
| Language | PC1 × Valence (r) | PC1 × Arousal (r) | PC2 × Valence (r) | PC2 × Arousal (r) |
|---|---|---|---|---|
| EN | 0.818 (p<0.001) | -0.908 (p<0.001) | 0.016 (p=0.934) | 0.031 (p=0.872) |
| JA | 0.888 (p<0.001) | -0.885 (p<0.001) | 0.151 (p=0.426) | 0.079 (p=0.678) |
Strangely enough, however, we find that PC1 correlates strongly with both valence and arousal, while PC2 correlates with neither. We believe this may come from our chosen poles and will experiment further to see if we can get a stronger result. The most interesting part of our findings here is the cross-linguistic consistency, which again hints at a shared representational geometry.
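One plausible way to run such a correlation test is sketched below. It assumes (this is not confirmed by the text) that each semantic direction is the difference between the mean vectors of its two poles, and that the test correlates each emotion's projection onto that direction with its PCA score. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def pole_direction(vecs, names, positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`."""
    idx = {name: i for i, name in enumerate(names)}
    pos = vecs[[idx[n] for n in positive]].mean(axis=0)
    neg = vecs[[idx[n] for n in negative]].mean(axis=0)
    direction = pos - neg
    return direction / np.linalg.norm(direction)

def pc_scores(vecs, n_components=2):
    """Scores of each row on the top principal components."""
    centered = vecs - vecs.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-in: valence is the dominant axis, plus small noise.
names = ["joyful", "loving", "guilty", "ashamed", "calm", "bored"]
valence = np.array([2.0, 1.5, -2.0, -1.5, 0.3, -0.3])
rng = np.random.default_rng(0)
vecs = rng.normal(scale=0.05, size=(len(names), 8))
vecs[:, 0] += valence

valence_dir = pole_direction(vecs, names, ["joyful", "loving"], ["guilty", "ashamed"])
centered = vecs - vecs.mean(axis=0, keepdims=True)
scores = pc_scores(vecs)
r_pc1_valence = pearson_r(scores[:, 0], centered @ valence_dir)
```

Note that the sign of a principal component is arbitrary, so only the magnitude of a correlation like `r_pc1_valence` is meaningful when comparing a PC against a hand-built direction.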
Cross-Layer RSA
Let’s check whether the emotion probe structure is stable across layers. We do a form of RSA (Representational Similarity Analysis) using pairwise cosine similarity matrices. The lower layers, which are more responsible for lower-level representations and syntax, have low similarity, but this quickly becomes stable as we progress into the middle layers.
Summary
These findings indicate that the model organizes emotion concepts into distinct clusters shaped by broad dimensions like valence and arousal, while still capturing the unique characteristics of each group. We share the idea with Anthropic that these emotion vectors meaningfully reflect the psychological landscape of human emotion concepts.
Cross-Lingual Geometry
Next, let’s start checking how the geometries of the two languages relate to each other. One simple check is the cosine similarity between the two languages for each emotion at each layer.
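A minimal sketch of this per-emotion, per-layer comparison is shown below. It assumes each language's vectors are stored as a dict mapping emotion name to an array of shape (n_layers, d_model), as in the centering code earlier; the function name `cross_lingual_similarity` and the toy values are hypothetical.

```python
import numpy as np

def cross_lingual_similarity(vectors_en, vectors_ja):
    """Return {emotion: array of shape (n_layers,)} with EN-vs-JA cosine similarity."""
    result = {}
    for emotion in vectors_en:
        en = np.asarray(vectors_en[emotion], dtype=float)
        ja = np.asarray(vectors_ja[emotion], dtype=float)
        # Row-wise dot product and norms give one cosine value per layer.
        dots = (en * ja).sum(axis=1)
        norms = np.linalg.norm(en, axis=1) * np.linalg.norm(ja, axis=1)
        result[emotion] = dots / norms
    return result

# Toy example with 2 layers and d_model = 2 (hypothetical values).
toy_en = {"sad": np.array([[1.0, 0.0], [0.0, 1.0]])}
toy_ja = {"sad": np.array([[2.0, 0.0], [1.0, 0.0]])}
per_layer = cross_lingual_similarity(toy_en, toy_ja)
```

In the toy example the two languages agree perfectly at the first layer and are orthogonal at the second, which is the kind of layer-by-layer profile the plot summarizes.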

Our findings make intuitive sense and reflect Anthropic’s results. The early layers likely concern themselves with language-specific syntax or lower-level representations. The middle to mid-late layers may encode a more abstract representation of the emotion, before the similarity drops drastically at the output layers, where the model must select a language-appropriate token.
Interesting further work could investigate what part of the remaining orthogonality in the middle layers is encoding language-specific context, perhaps by looking at another set of abstract concepts and comparing them.

At this point the author was led down a misguided path, believing that a Procrustes alignment would be a helpful proxy metric for isomorphism, but it isn’t. With only 30 vectors per language in such a high-dimensional space, it was pretty trivial to find an alignment at every layer that achieved > 0.9 cosine similarity. A near-perfect match at every layer would not make sense, so the metric reflects overfitting more than shared geometry.
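The overfitting problem can be illustrated with purely random data. The sketch below is an illustration under stated assumptions, not the author's experiment: it draws two unrelated sets of 30 random unit vectors (with a hypothetical `d_model` of 512 to keep the SVD cheap) and fits an orthogonal Procrustes rotation between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vectors, d_model = 30, 512  # d_model is a small stand-in for the real width

def unit_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Two unrelated random "languages": no shared structure by construction.
X = unit_rows(rng.normal(size=(n_vectors, d_model)))
Y = unit_rows(rng.normal(size=(n_vectors, d_model)))

# Orthogonal Procrustes: with X^T Y = U S V^T, the rotation R = U V^T
# minimizes ||X R - Y||_F over all orthogonal matrices R.
U, _, Vt = np.linalg.svd(X.T @ Y)
R = U @ Vt

# Mean row-wise cosine similarity before and after alignment
# (rows are unit norm, and R preserves norms).
before = float(np.mean(np.sum(X * Y, axis=1)))
after = float(np.mean(np.sum((X @ R) * Y, axis=1)))
print(f"mean cosine before: {before:+.3f}, after: {after:+.3f}")
```

Because a handful of random vectors in a high-dimensional space are nearly orthogonal to each other, a rotation can map one set almost exactly onto the other, so a high post-alignment similarity says little about whether the two spaces share structure.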
So let’s switch to Representational Similarity Analysis. RSA doesn’t try to find a rotation between the two spaces. It asks a much simpler question: are emotions that are close in English also close in Japanese?
We compute the pairwise cosine similarity between all 30 emotion vectors in each language, giving a 30×30 matrix per language. We then take the upper triangle of each matrix (30·29/2 = 435 unique pairs) and correlate the two. If the geometry matches across languages, that correlation is high. The reason this works where Procrustes doesn’t is that we compare 435 pairwise relationships instead of fitting a rotation. There’s nothing to overfit.
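The RSA computation described above can be sketched in a few lines of NumPy. The function name `rsa_correlation` and the synthetic data are assumptions for illustration; the second "language" here is just a rotated copy of the first, which is the ideal case where the geometry matches exactly.

```python
import numpy as np

def rsa_correlation(vecs_a, vecs_b):
    """Pearson correlation between the upper triangles of two cosine-similarity matrices."""
    def similarity(v):
        unit = v / np.linalg.norm(v, axis=1, keepdims=True)
        return unit @ unit.T

    # Strictly upper triangle: n * (n - 1) / 2 unique pairs, diagonal excluded.
    iu = np.triu_indices(vecs_a.shape[0], k=1)
    a = similarity(vecs_a)[iu]
    b = similarity(vecs_b)[iu]
    return float(np.corrcoef(a, b)[0, 1]), int(a.size)

# Synthetic check: a rotated copy has identical pairwise geometry.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
Y = X @ Q
r, n_pairs = rsa_correlation(X, Y)
```

Unlike Procrustes, nothing is fitted here: the rotation `Q` is never estimated, yet the score is still perfect because rotations preserve all pairwise cosine similarities.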

We see r = 0.864. The catch is that the pairwise values aren’t independent, so a standard significance test does not apply. Instead, we run a permutation test, shuffling the Japanese emotion labels 10,000 times and recomputing RSA each time. The null distribution centers around zero and tops out around r ≈ 0.6. The real value is way outside it: p < 0.0001, and not a single shuffle came close.
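A label-shuffling permutation test of this kind might look like the sketch below. The function name, the add-one p-value estimator, and the synthetic similarity matrices are assumptions for the example rather than the exact procedure used here.

```python
import numpy as np

def rsa_permutation_test(sim_a, sim_b, n_permutations=10_000, seed=0):
    """Permutation test for RSA: shuffle the item labels of sim_b."""
    n = sim_a.shape[0]
    iu = np.triu_indices(n, k=1)
    observed = np.corrcoef(sim_a[iu], sim_b[iu])[0, 1]
    rng = np.random.default_rng(seed)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(n)
        # Relabeling items means permuting rows and columns together.
        shuffled = sim_b[np.ix_(perm, perm)]
        null[i] = np.corrcoef(sim_a[iu], shuffled[iu])[0, 1]
    # Add-one correction so the estimate is never exactly zero.
    p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return float(observed), null, float(p_value)

def cosine_sim(v):
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit @ unit.T

# Synthetic stand-in: language B is a slightly noisy copy of language A.
rng = np.random.default_rng(1)
base = rng.normal(size=(12, 8))
sim_en = cosine_sim(base)
sim_ja = cosine_sim(base + rng.normal(scale=0.05, size=base.shape))
observed, null, p_value = rsa_permutation_test(sim_en, sim_ja, n_permutations=2000)
```

With this estimator, if none of 10,000 shuffles reaches the observed value, the reported p-value is 1/10001, which is the smallest value the test can resolve at that number of permutations.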

One further test we can do for isometry is to look at geodesic distances between the English and Japanese emotions and compare them to Euclidean distances. This checks whether the representational space is better described as a manifold than as purely linear. We do see this, although the advantage of geodesic over Euclidean distance is not statistically significant.

Manifold
TODO Insert animation of UMAP across language pairs for each language – how they live on the same spiral and then the Japanese spiral shrinks towards the center as layers progress
Steering Experiments
Next, we want to establish some form of causality with our steering vectors. It’s fun to look at the geometry, but without measurable behavioral changes it has very little utility for safety research.
Our plan is to use an Elo rating system to determine how emotion steering at different strengths affects the rule-following behavior of the model over many attempts. This would work by creating a small dataset of behaviors across categories like helpful, engaging, unsafe, aversive, and neutral, and then scoring the model’s preference for these behaviors against each other while steering or not steering for different emotions.
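For reference, the standard Elo update that such a ranking could build on is sketched below. The K-factor of 32, the starting rating of 1000, and the category names in the usage lines are assumptions for illustration; the actual scoring setup for the steering experiments may differ.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Hypothetical usage: the model prefers a "helpful" behavior over an "unsafe" one.
ratings = {"helpful": 1000.0, "unsafe": 1000.0}
ratings["helpful"], ratings["unsafe"] = update_elo(
    ratings["helpful"], ratings["unsafe"], score_a=1.0
)
```

Running the same set of pairwise comparisons with and without a steering vector, and then differencing the final ratings per category, would produce deltas of the kind reported below.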
Even further work would probably include measuring the same behavior in the instruct version of the model.
Preliminary results are promising, showing the following:
Elo Changes (Steered with ANGRY Vector − Baseline):
Engaging: -19.0
Social: -1.0
Self-curiosity: -14.8
Helpful: -23.0
Neutral: -6.3
Misaligned: +30.6
Aversive: +18.5
Unsafe: +15.2
Conclusion
TODO
Future Work
TODO
TODO Before Publishing
- Finish steering experiments, including rule following
- See if we can disentangle PC1 by making better custom poles
- Cite the Japanese native-language papers and bring them into the discussion
- Cite the new Goodfire research on manifolds for both future work and my geodesic distances
- Add way more hyperlinks (to code, to colab, etc)
- Get it reviewed by peers