Small Models Have Emotions, Too!

This is an ongoing experiment and if you are reading this post, please be aware it is still in the draft stage.

Anthropic recently published some work on emotions and their functions in LLMs. They find that Sonnet 4.5 develops structured internal representations of human emotions. Does this occur in smaller open models too, though? In this post we search for them in Qwen-2.5-7B, and further investigate how these emotion vectors change across languages. Specifically, we begin an investigation of how the internal geometry of a language’s emotion space compares to other languages.

This post was inspired by vogel’s Small Models Can Introspect, Too and compute was partially funded by BlueDot Impact.

Introduction

Recent interpretability research (e.g., Anthropic’s investigations into emotion circuits) has demonstrated that LLMs develop structured internal representations of human emotions. In this independent research project we attempt to reproduce these findings in smaller, accessible models (e.g., Qwen-2.5 7B).

Building on this foundation, we aim to investigate a novel objective: How do these emotion vectors relate across different languages? Specifically, we seek to determine whether the internal representation space for a set of emotions in one language is isomorphic to that of another. A further question is whether LLMs learn a universal, language-agnostic “concept space” for emotions, or whether these representations shift based on the cultural and linguistic nuances of the training data.

Anthropic was very thorough in making their work easy to reproduce, giving detailed information and publishing the prompts they used to create their datasets publicly.

We will first investigate a single language pair: Japanese and English. These languages provide a good ground for comparison, as they are syntactically very different, share no input tokens, and the author is able to evaluate the results of both languages. Qwen was chosen over Llama for its more robust Japanese performance.

Our plan is as follows.

  1. Create dataset of contrastive pairs
    1. Write short stories on diverse topics in which a character experiences a specified emotion
    2. Write neutral dialogues about the same prompted topics, but devoid of emotional expressivity.
  2. Extract activations
    1. Extract residual stream activations at each layer, averaged across all token positions, beginning from the 50th token.
    2. Average these activations across stories with the same emotion, and subtract the mean across all emotions
  3. Eliminate confounds
    1. Obtain activations from neutral stories, compute top principal components (enough to account for 50% of the variance)
    2. Project out these components from the emotion vectors

Dataset Creation

For representation engineering, we must first create a dataset of contrastive pairs. In our case, the pairs are initially contrasted not against a neutral baseline but against each other, across a wide variety of emotions. Following Anthropic’s example, we also generate emotion-neutral data points, but have not yet been able to achieve comparable results with them (there, the authors projected out enough neutral-text principal components to explain 50% of the variance).

We reuse Anthropic’s 100 topics and generate 12 stories per topic per emotion per language. Our prompts can be found here. To reduce costs, we only generate stories for 30 emotions with Sonnet 4.6. Opus 4.7 helped the author to find a good representative subset of emotions to use, focusing on having a balanced set across both valence and arousal.

It’s important to note that the Japanese and English stories are not translations of each other: the author initially (and naively) wanted to test whether a cultural shift in emotions was measurable, which requires independent and native-like prose.

Extraction and Sanity Checks

Collecting Activations

To collect activations, we run a forward pass on each story and extract the residual stream at each layer. For each layer, we take the mean of the activations over all token positions from the 50th token onward (where the emotional context is likely already established). We do this for every story of an emotion, then average across stories to get the raw vector for that emotion.

Below is a slightly simplified version of the code used; the complete version can be found on GitHub.

import torch
from tqdm import tqdm

def get_mean_activations_for_texts(model, texts, batch_size=8):
	"""
	Runs texts through the model, extracts residual stream activations at each layer,
	averages across positions >= 50, and returns the mean across texts.
	"""
	n_layers = model.cfg.n_layers
	d_model = model.cfg.d_model
	
	sum_activations = torch.zeros((n_layers, d_model), device='cpu')
	total_valid_texts = 0  # number of texts that contributed an activation
	names_filter = lambda name: name.endswith("resid_post")
	
	for i in tqdm(range(0, len(texts), batch_size), leave=False):
		batch_texts = texts[i:i+batch_size]
		tokens = model.to_tokens(batch_texts)
		
		with torch.no_grad():
			_, cache = model.run_with_cache(tokens, names_filter=names_filter, return_type=None)
		
		for b in range(tokens.shape[0]):
			# Count non-pad tokens (assumes right padding and no pad token inside the text)
			valid_len = (tokens[b] != model.tokenizer.pad_token_id).sum().item()
			
			# A text with no tokens past position 50 would yield a NaN mean, so skip it
			if valid_len <= 50:
				continue
			
			for l in range(n_layers):
				layer_name = f"blocks.{l}.hook_resid_post"
				# Mean from the 50th token up to valid_len
				act = cache[layer_name][b, 50:valid_len, :].mean(dim=0).cpu()
				sum_activations[l] += act
			
			total_valid_texts += 1
		
		del cache
		torch.cuda.empty_cache()
	
	return sum_activations / total_valid_texts

To collect our raw emotion vectors, we simply loop through the emotions:

emotion_vectors = {'en': {}, 'ja': {}}

for lang in ['en', 'ja']:
	print(f"\nExtracting activations for language: {lang}")
	for emotion in emotions_list:
		texts = texts_by_emotion[lang][emotion]
		avg_act = get_mean_activations_for_texts(model, texts, batch_size=4)
		emotion_vectors[lang][emotion] = avg_act

Centering

The collected raw emotion vectors have a lot of shared latents that we attempt to reduce (for geometrical interpretations) by centering them, meaning subtracting the mean emotion vector from each raw vector.

centered_emotion_vectors = {'en': {}, 'ja': {}}

for lang in ['en', 'ja']:
	# Shape: (num_emotions, n_layers, d_model)
	all_emotions_stack = torch.stack(list(emotion_vectors[lang].values()))
	mean_emotion_vector = all_emotions_stack.mean(dim=0)
	for emotion, vec in emotion_vectors[lang].items():
		centered_emotion_vectors[lang][emotion] = vec - mean_emotion_vector

Projection

Because all emotions share a set of topics, the stories inevitably contain many shared latents, so we follow Anthropic in generating a set of neutral dialogues. The top PCA components of the neutral-dialogue activations are then projected out of the centered emotion vectors.

Like Anthropic, we find that this slightly denoises some of the results, but the qualitative results also hold when using only the centered emotion vectors. Unlike Anthropic, who used as many PCA components as needed to explain 50% of the variance, we found that just 2 components explain 96.42% (EN) and 94.82% (JA) of the variance.
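The projection step can be sketched as follows. This is a minimal NumPy illustration rather than the code used for the post: the function names `confound_basis` and `project_out`, the random stand-in data, and the use of a single layer are all assumptions made for the example.

```python
import numpy as np

def confound_basis(neutral_acts: np.ndarray, var_threshold: float = 0.5) -> np.ndarray:
    """Top principal directions of `neutral_acts` (n_samples, d_model) that
    together explain at least `var_threshold` of the variance."""
    centered = neutral_acts - neutral_acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(explained, var_threshold)) + 1, len(s))
    return vt[:k]

def project_out(vectors: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from each row of `vectors` its component inside span(basis)."""
    return vectors - (vectors @ basis.T) @ basis

# Hypothetical stand-ins for one layer's activations.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(200, 64))   # neutral-dialogue activations
emotions = rng.normal(size=(30, 64))   # centered emotion vectors
basis = confound_basis(neutral, var_threshold=0.5)
cleaned = project_out(emotions, basis)
```

After this step, each cleaned vector is orthogonal to every retained neutral component, which is what "projecting out" means here.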

Logits

For an initial sanity check we take a look at the top/bottom logits for each emotion in each language. Here we find a lot of junk in English, especially cross-contamination with Chinese and German, but qualitatively can see the right direction for most of the emotions.

The Japanese quality is substantially worse. There is even MORE cross-lingual contamination than in English, with Chinese SEO/spam phrases that have nothing to do with the emotion.
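One common way to obtain such token lists is to multiply an emotion vector by the unembedding matrix and sort the resulting logits. The sketch below assumes that approach; the post does not say exactly how the logits were computed, and skipping the final layer norm is a simplification. `top_bottom_tokens`, the toy vocabulary, and the identity unembedding are hypothetical; with TransformerLens, the real unembedding would be `model.W_U`.

```python
import numpy as np

def top_bottom_tokens(emotion_vec, W_U, vocab, k=5):
    """Return the k tokens with the highest and lowest logits for a direction.

    emotion_vec: (d_model,) direction in the residual stream
    W_U:         (d_model, d_vocab) unembedding matrix
    vocab:       list of d_vocab token strings
    """
    logits = emotion_vec @ W_U            # (d_vocab,)
    order = np.argsort(logits)            # ascending
    top = [vocab[i] for i in order[::-1][:k]]
    bottom = [vocab[i] for i in order[:k]]
    return top, bottom

# Toy example: a 4-token vocabulary with an identity unembedding.
toy_vocab = ["joy", "fear", "sun", "the"]
toy_W_U = np.eye(4)
toy_vec = np.array([3.0, -2.0, 0.5, 0.0])
top, bottom = top_bottom_tokens(toy_vec, toy_W_U, toy_vocab, k=1)
```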

English
Emotion | Top Tokens | Bottom Tokens | Notes
AFRAID | terror, panic, 恐惧, paranoia, Cbd | CircularProgress, proudly, inspirational, admired, Enjoy |
ANGRY | 愤怒, , , , bindActionCreators | adventures, curious, mysteries, intriguing, fond | Top is mostly Chinese; bindActionCreators is bizarre
CALM | leaf, nearby, intriguing, leisure, tid | Fuck, fuck, fucking, freaking, gangbang | Bottom is a clean inversion: profanity = not calm
DISGUSTED | clen, vom, disgust, nause, disgusting | abhängig, Geschäfts, Kunden, CircularProgress, Mitarbeiter | Top tokens are truncated stems? (vomit, nausea, clench)
EXCITED | excited, 兴奋, excitement, !\n, …\n | ujet, resden, creampie, urette, lke |
GUILTY | Mitarbeiter, Geschäfts, Gespr, AuthenticationService, Antworten | panorama, bâtiment, , Mediterr, grille | German business cluster
ASHAMED | AuthenticationService, Gespr, Mitarbeiter, esteem, Förder | sun, humming, sunlight, rhythm, leaf | Same German cluster; bottom is nature/serenity
LOVING | loving, kisses, love, sweet, her | gangbang, shemale, Shemale, creampie, ignet | qwen transphobia :(
SAD | loneliness, lonely, empty, sometimes, sleeping | Gratuit, CircularProgress, Rencontre, Mitgli, Prostit |
SURPRISED | ogui, utow, , @student, ucz | sometimes, often, whenever, later, weekends | bottom being temporal is interesting, makes intuitive sense
JOYFUL | gigg, joy, laughing, ecstatic, joyful | creampie, shemale, ActionTypes, AuthenticationService, gangbang |
THRILLED | 疯狂, 兴奋, excited, exhilar, excitement | 特价, , , BrowserAnimationsModule, 性价 |
CONTENT | delight, sun, enjoying, sunny, warm | AuthenticationService, ogui, füh, ActionTypes, (![ | Top = warmth and sunshine
GRATEFUL | years, loving, loved, hugs, joy | , 企图, pisa, ủa, avanaugh |
PROUD | proudly, proud, 自豪, pride, accol | Fuck, , ederland, bish, eneg |
HOPEFUL | sketch, sketches, promising, hopeful, plans | otron, Fuck, , buồn, :normal | sketch/plans = forward-looking intent
NOSTALGIC | summers, nostalgia, nostalg, faded, 当年 | ASAP, verbally, calmly, LinkedIn, proactive | Bottom = urgency/professionalism, antithetical to reminiscing
BORED | , incy, , Netflix, Gratuit | her, tears, herself, those, she | Netflix as boredom marker; bottom = emotional narrative
FRUSTRATED | bindActionCreators, ActionTypes, UIControl, gangbang, createAction | glimps, warmth, memories, fond, surprised | Top is all React/Redux boilerplate: debugging frustration?
ANXIOUS | phanumeric, FixedUpdate, gangbang, 焦虑, ogui | proudly, fond, delighted, joy, celebration | Unity game engine tokens (FixedUpdate)
JEALOUS | Bravo, Fotos, Gorgeous, coveted, Giovy | left, difficoltà, lse, , ntl | coveted makes sense; rest is noisy
LONELY | loneliness, evenings, weekdays, weekends, weekday | , ……, …\n, …\n\n, ……\n\n | Time-of-day tokens = lonely routines
EMBARASSED | incerely, lijah, 尴尬, WHATSOEVER, leanor | mornings, nights, Sundays, weekdays, day |
CONTEMPTUOUS | Gratuit, Veranst, /mock, Prostit, /goto | :".$, midnight, till, :block, until |
RESENTFUL | Geschäfts, Veranst, Mitarbeiter, ActionTypes, AngularFire | startled, :block, nearby, trembling, unfamiliar | German business + Angular: same artifact cluster
MELANCHOLY | sometimes, sometimes, lke, loneliness, nesota | ", ", , …\n, ASAP | sometimes appears twice: dedup issue?
OVERWHELMED | gangbang, orz, FixedUpdate, OnCollision, ruc | delighted, fond, modest, admiration, delight | Unity engine tokens again; orz is the despair emoticon
INDIFFERENT | Stuff, promptly, briefly, 好奇心, ASAP | puties, puty, autiful, ;br, oenix | briefly, promptly = disengaged efficiency
VULNERABLE | OnCollision, üc, FixedUpdate, ruc, benh | …\n, , […, […]\n\n, ' | All Unity collision tokens: likely artifact
SERENE | sunlight, gentle, sun, warm, dusk | OnCollision, AuthenticationService, Geschäfts, PureComponent, ActionTypes | Cleanest emotion signal in the whole set
Japanese
Emotion | Top Tokens | Bottom Tokens | Notes
AFRAID | 恐怖, 不安, 一秒, 回避, 恐惧 | 美味し, 俱乐, 当年, 很漂亮, 赞誉 | Clean signal; 一秒 = frozen-in-time fear
ANGRY | , 切断, , , 通告 | cade, 明媚, 始建, 就来看看, | Physical violence tokens (=hit, =strike)
CALM | , それぞ, neh, andre, | セフレ, パパ活, SEX, 不相信, | Bottom = sex/jealousy; top is noisy
DISGUSTED | , , , 冷水, | 始建, 就来看看, 明媚, , -ves | =spit, =gag: strong bodily disgust
EXCITED | 揭秘, 查看详情, 详细介绍, 涌现出, ...\n | imbus, , raith, 留守, relude | Chinese clickbait tokens (“reveal secrets”, “see details”)
GUILTY | セフレ, パパ活, 告訴, 对不起, 事后 | 源源不断, , 在过渡, 友情链接, 经验值 | セフレ(sex friend)/パパ活(sugar dating) = guilt-laden topics
ASHAMED | セフレ, パパ活, コミュニケ, 对不起, | , 友情链接, 源源不断, 为了更好, 住房公积 | Same shame cluster as GUILTY
LOVING | お願, 对我说, 跟我说, 笑着说, ありがとうござ | , 突破, 反射, 最大化, ...\n | Dialogue tags (“said to me”, “said smiling”) = fiction/intimacy
SAD | 孤独, Alone, , , alone | , 크게, 的回答, 注明出处, 讀取 | English alone/Alone leak into JA model
SURPRISED | 对照检查, , 查看详情, 返回搜狐, BOSE | 一日, nox, ddl, 晚饭, 夜晚 | = staring wide-eyed; rest is web-scrape noise
JOYFUL | 脸颊, どんど, 踊跃, 惊奇, instanc | 不在, 娱乐平台, , わけではない, 代理 | 脸颊(cheeks) = blushing/smiling
THRILLED | 对照检查, 颤抖, , 从根本, | 無し, 留守, owler, 谢谢你, raith | 颤抖/ = trembling with excitement
CONTENT | , , , , | セフレ, 無し, 不在, 回避, | All kanji evoke craft/nature (butterbur, straining, candy, roasting)
GRATEFUL | ありがとうござ, 对我说, 跟我说, mmo, cade | 反射, 合理性, suche, 最大化, 判定 | ありがとうございます truncated: literal gratitude
PROUD | 指導, 注明来源, 当年, 自豪, 生涯 | ocup, 無し, =explode, Wifi, 骚扰 |
HOPEFUL | ", 書き, 俱乐, 网友们, _GPS | , , パーテ, , antro | Noisy
NOSTALGIC | 当年, , 十几年, , 年代 | \n, ...\n, ...\n, コミュニケ, 回答 | 当年(that year), (long ago), 年代(era): clean signal
BORED | 值得一, 独一, , あるい, 被誉 | 跟我说, 叹了口气, ありがとうござ, セフレ | Counterintuitive: top tokens are superlatives (“worth a…”, “one of a kind”)
FRUSTRATED | 修正, 回避, 確定, 無し, 合理性 | lke, 保驾, :convert, uforia, hsi | Technical/bureaucratic terms: procedural frustration
ANXIOUS | 回避, 確認, phones, 一秒, 不安 | 当年, , NavController, beğen, mmo | 回避(avoidance) + 確認(checking) = anxiety behaviors
JEALOUS | セフレ, パパ活, 就来看看, 美貌, | エネル, あるい, antine, 湿, isex | = jealousy kanji; sex-adjacent tokens
LONELY | 留守, , 🏠, , wifi | , ".\n, , ...\n, ...\n | 留守(away/empty house) + 🏠 + wifi = home alone
EMBARASSED | セフレ, 尴尬, コミュニケ, 直属, 不好意思 | あるい, 生命力, , 春夏, 四季 | 不好意思 = classic Japanese-style embarrassment phrase
CONTEMPTUOUS | 反感, 回答, 陈述, 批判, 発言 | , , ".\n\n\n\n, lke, semb | 批判/発言 = critique/statements: intellectual contempt
RESENTFUL | パパ活, 任期, セフレ, 加盟店, 告訴 | , 蹿, 住房公积, 有意思的, ksz |
MELANCHOLY | なくな, 孤独, ?"\n\n\n\n, なくなって, 遗忘 | 为您, 给您, 让您, 为您提供, 的回答 | なくなって = “gone/lost”; bottom is customer service language
OVERWHELMED | 出血, 混乱, 一秒, 恐怖, 停止 | 俱乐, 美味し, <Transform, =>$, 您同意 | 出血(bleeding) + 混乱(chaos) + 停止(stop): visceral
INDIFFERENT | 助长, ...\n, hipster, 较好, 很方便 | セフレ, , 初恋, 颤抖, SEX | Bottom = passion/intensity; hipster as indifference is funny
VULNERABLE | 。\n\n, —\n\n, --;\n\n, 」\n\n, 、\n\n | ...\n, ...\n, ,...\n, ..."\n, | Entirely punctuation + linebreaks: no semantic content
SERENE | , , , strugg, | 無し, 娱乐平台, 导购, セフレ, 加盟店 | (moss), (butterbur), (warmth): wabi-sabi nature

Cosine Similarity

Let’s take a look at the cosine similarity between the emotions themselves, and hierarchically cluster them.
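For reference, the pairwise cosine similarity matrix that feeds the clustering can be computed in a few lines. This is a minimal NumPy sketch; `cosine_similarity_matrix` and the toy vectors are illustrative names and data, not taken from the post's code.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors` (n, d)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# Toy example: three 2-d "emotion vectors".
toy = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [-3.0, 0.0]])
sim = cosine_similarity_matrix(toy)
```

In the toy example the first and third vectors point in opposite directions, so their similarity is -1, while the first and second are orthogonal and score 0. The resulting matrix is what a hierarchical clustering routine would then group.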

Cosine similarity across layers (1)

We see some vague groupings above that make intuitive sense. Do they form clusters that make sense, though? Here we do k-means clustering with k=4 and visualize with a 2d projection using UMAP.

Cosine similarity across layers (4)

Okay, that didn’t work very well? Or at least, the chart is unreadable and the author does not want to deal with parsing it. Let’s take a look at the languages in isolation:

English

Opus 4.6 labels the found clusters as:

Japanese

PCA - Valence and Arousal?

We compute the top 2 principal components of each language and determine if they match extracted valence and arousal directions. First, let’s take a look at what the “natural” top 2 components are.

Cross-lingual geometry (3)

Interesting! We definitely see a lot of shared structure between English and Japanese and some first hints of isometry. But let’s sanity check that the top 2 components correspond to valence and arousal, their presumed counterparts in human cognition. To extract these directions, Opus 4.6 defines the following poles:

positive_emotions = ['thrilled', 'joyful', 'loving']
negative_emotions = ['ashamed', 'guilty', 'resentful', 'disgusted']
high_arousal = ['thrilled', 'excited', 'angry', 'overwhelmed']
low_arousal = ['serene', 'calm', 'indifferent', 'bored']

We then do a correlation test between the natural PCA components and our custom semantic directions on layer 14:

Language | PC1 × Valence (r) | PC1 × Arousal (r) | PC2 × Valence (r) | PC2 × Arousal (r)
EN | 0.818 (p=0.000) | -0.908 (p=0.000) | 0.016 (p=0.934) | 0.031 (p=0.872)
JA | 0.888 (p=0.000) | -0.885 (p=0.000) | 0.151 (p=0.426) | 0.079 (p=0.678)

Strangely enough, however, we find that PC1 correlates strongly with both valence and arousal, while PC2 correlates with neither. We believe this may come from our chosen poles and will experiment further to see if we can get a stronger result. The most interesting part of our findings here is the cross-linguistic consistency, which again hints at a shared representational geometry.
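One way to run such a test is sketched below, under two assumptions: a semantic direction is the difference between the mean vectors of its two poles, and the correlation is taken between per-emotion projections onto a principal component and onto that direction. The helper names and toy data are hypothetical, and p-values are omitted to keep the sketch NumPy-only.

```python
import numpy as np

def pole_direction(vecs, names, pos, neg):
    """Unit vector from the mean of the `neg` emotions to the mean of the `pos` emotions."""
    index = {name: i for i, name in enumerate(names)}
    pos_mean = vecs[[index[n] for n in pos]].mean(axis=0)
    neg_mean = vecs[[index[n] for n in neg]].mean(axis=0)
    direction = pos_mean - neg_mean
    return direction / np.linalg.norm(direction)

def principal_components(vecs, k=2):
    """Top-k principal directions of the rows of `vecs`."""
    centered = vecs - vecs.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def pearson_r(x, y):
    return float(np.corrcoef(x, y)[0, 1])

# Toy data: six emotions whose vectors differ mainly along one "valence" axis.
names = ['thrilled', 'joyful', 'loving', 'ashamed', 'guilty', 'resentful']
rng = np.random.default_rng(0)
valence_sign = np.array([1, 1, 1, -1, -1, -1], dtype=float)
toy_vecs = 3.0 * valence_sign[:, None] * np.eye(8)[0] + 0.1 * rng.normal(size=(6, 8))

valence = pole_direction(toy_vecs, names,
                         ['thrilled', 'joyful', 'loving'],
                         ['ashamed', 'guilty', 'resentful'])
pc1 = principal_components(toy_vecs, k=1)[0]
centered = toy_vecs - toy_vecs.mean(axis=0, keepdims=True)
r_pc1_valence = pearson_r(centered @ pc1, centered @ valence)
```

The sign of a principal component is arbitrary, so only the magnitude of `r_pc1_valence` is meaningful without an additional sign convention.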

Cross-Layer RSA

Let’s check whether the emotion probe structure is stable across layers. We do a form of RSA (Representational Similarity Analysis) using pairwise cosine similarity matrices. The lower layers, which are more responsible for lower-level representations and syntax, have low similarity, but this quickly becomes stable as we progress into the middle layers.

Cosine similarity across layers (5)

Summary

These findings indicate that the model organizes emotion concepts into distinct clusters shaped by broad dimensions like valence and arousal, while still capturing the unique characteristics of each group. We share the idea with Anthropic that these emotion vectors meaningfully reflect the psychological landscape of human emotion concepts.

Cross-Lingual Geometry

Next, let’s start checking how the geometries of the two languages relate to each other. One simple check is the cosine similarity between the two languages for each emotion at each layer.
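A minimal sketch of that per-emotion, per-layer comparison, assuming the emotion vectors of each language are stacked into an array of shape (n_emotions, n_layers, d_model) with the emotions in the same order. The function name and the random stand-in data are illustrative only.

```python
import numpy as np

def cross_lingual_cosine(en: np.ndarray, ja: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching EN and JA emotion vectors.

    en, ja: (n_emotions, n_layers, d_model), same emotion order in both.
    Returns an (n_emotions, n_layers) array.
    """
    dot = (en * ja).sum(axis=-1)
    norms = np.linalg.norm(en, axis=-1) * np.linalg.norm(ja, axis=-1)
    return dot / norms

# Stand-in data: 30 emotions, 28 layers, 16 dimensions.
rng = np.random.default_rng(0)
en_stack = rng.normal(size=(30, 28, 16))
same = cross_lingual_cosine(en_stack, en_stack)
flipped = cross_lingual_cosine(en_stack, -en_stack)
```

Averaging the result over the emotion axis gives one similarity curve across layers.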

Cross-lingual geometry (1)

Our findings make intuitive sense and reflect Anthropic’s results. The early layers likely concern themselves with language-specific syntax or lower level representations. The middle to mid-late layers may encode some more abstract representation of the emotion, before drastically dropping down at the token output layer where the model must select a language-appropriate token.

Interesting further work could investigate what part of the remaining orthogonality in the middle layers is encoding language-specific context, perhaps by looking at another set of abstract concepts and comparing them.

Cross-lingual geometry (2)

At this point the author was led down a misguided path, believing that a Procrustes alignment would be a helpful proxy metric for isomorphism. It isn’t. With only 30 vectors per language in such a high-dimensional space, it was trivial to find an alignment at every layer that achieved > 0.9 cosine similarity. That cannot be right, given how dissimilar the early layers are.
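The overfitting problem is easy to reproduce on data with no shared structure at all. The sketch below uses an orthogonal Procrustes fit, which is an assumption since the post does not say which variant was used, to align two independent sets of 30 random vectors in 512 dimensions. All names are illustrative.

```python
import numpy as np

def procrustes_mean_cosine(X: np.ndarray, Y: np.ndarray) -> float:
    """Fit the orthogonal map R minimizing ||X R - Y||_F and return the mean
    cosine similarity between the rows of X R and the rows of Y."""
    # Closed-form solution: R = U V^T, where X^T Y = U S V^T.
    u, _, vt = np.linalg.svd(X.T @ Y)
    R = u @ vt
    aligned = X @ R
    cos = np.sum(aligned * Y, axis=1) / (
        np.linalg.norm(aligned, axis=1) * np.linalg.norm(Y, axis=1)
    )
    return float(cos.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 512))   # 30 random "emotion vectors"
Y = rng.normal(size=(30, 512))   # 30 unrelated random vectors
score = procrustes_mean_cosine(X, Y)
```

Here `score` comes out above 0.9 even though `X` and `Y` are unrelated: a few dozen nearly orthogonal vectors in a high-dimensional space can almost always be rotated onto each other, so a high post-alignment similarity says little about shared geometry.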

So let’s switch to Representational Similarity Analysis. RSA doesn’t try to find a rotation between the two spaces. It asks a much simpler question: are emotions that are close in English also close in Japanese?

We compute the pairwise cosine similarity between all 30 emotion vectors in each language, giving a 30×30 matrix per language. We then take the upper triangle of each (30·29/2 = 435 unique pairs) and correlate the two. If the geometry matches across languages, that correlation is high. The reason this works where Procrustes doesn’t is that we compare 435 pairwise relationships instead of fitting a rotation. There’s nothing to overfit.

RSA permutation test (1)

We see r = 0.864. The catch is that the pairwise values aren’t independent. We run a permutation test, shuffling the Japanese emotion labels 10,000 times, recomputing RSA each time. The null distribution centers around zero and tops out around r ≈ 0.6. The real value is way outside it. p < 0.0001, not a single shuffle came close.
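The RSA score and its permutation test can be sketched as below. The use of Pearson correlation and the add-one p-value estimate are assumptions, as are all function names. The toy "Japanese" vectors are a rotated, slightly noisy copy of the toy "English" ones, so the geometry matches by construction.

```python
import numpy as np

def cosine_matrix(vecs: np.ndarray) -> np.ndarray:
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return unit @ unit.T

def rsa_permutation_test(en, ja, n_perm=10_000, seed=0):
    """Correlate the upper triangles of the two similarity matrices, then build
    a null distribution by shuffling the emotion labels of `ja`."""
    n = len(en)
    iu = np.triu_indices(n, k=1)          # n * (n - 1) / 2 unique pairs
    sim_en, sim_ja = cosine_matrix(en), cosine_matrix(ja)
    observed = float(np.corrcoef(sim_en[iu], sim_ja[iu])[0, 1])

    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(n)
        # Permute rows and columns together, i.e. relabel the emotions.
        shuffled = sim_ja[np.ix_(perm, perm)]
        null[i] = np.corrcoef(sim_en[iu], shuffled[iu])[0, 1]

    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p_value), null

rng = np.random.default_rng(1)
toy_en = rng.normal(size=(30, 64))
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
toy_ja = toy_en @ rotation + 0.01 * rng.normal(size=(30, 64))
r, p, null = rsa_permutation_test(toy_en, toy_ja, n_perm=500)
```

Shuffling rows and columns with the same permutation keeps each shuffled matrix a valid similarity matrix, which is how the test accounts for the pairwise values not being independent.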

RSA permutation test (2)

One further test we can do for isometry is to compare geodesic distances between the English and Japanese emotions with Euclidean distances. This tells us whether the representational space is better described as a manifold than as purely linear. We see that here, although the geodesic advantage over Euclidean distance is not statistically significant.

RSA permutation test (3)

Manifold

TODO Insert animation of UMAP across language pairs for each language – how they live on the same spiral and then the japanese spiral shrinks towards center as layers progress

Steering Experiments

Next, we want to establish some form of causality with our steering vectors. It’s fun to look at the geometry, but without measurable behavioral changes it has very little utility for safety research.

Our plan is to use an Elo ranking system to determine how emotion steering at different strengths affects the rule-following behavior of the model over many attempts. This would work by creating a small dataset of behaviors across categories like helpful, engaging, unsafe, aversive, and neutral, and then scoring model preference for these behaviors against each other while steering or not steering for different emotions.
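For readers unfamiliar with Elo, the rating update behind such a ranking looks like this. It is a sketch of the standard Elo formula only: the K-factor of 32, the starting rating of 1000, and the category names in the usage example are conventional or illustrative choices, and how a "win" is elicited from the model is not specified here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Usage: start every behavior category at 1000 and replay preference outcomes.
ratings = {"helpful": 1000.0, "unsafe": 1000.0}
# Each tuple is (category A, category B, score for A); here "helpful" wins twice.
comparisons = [("helpful", "unsafe", 1.0), ("helpful", "unsafe", 1.0)]
for a, b, score_a in comparisons:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
```

Running the same comparisons with and without steering and subtracting the final ratings would give per-category Elo changes of the kind reported below.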

Even further work would probably include measuring the same behavior in the instruct version of the model.

Preliminary results are promising, showing the following:

Elo Changes (Steered with ANGRY Vector − Baseline):

Engaging:        -19.0
Social:           -1.0
Self-curiosity:  -14.8
Helpful:         -23.0
Neutral:          -6.3
Misaligned:      +30.6
Aversive:        +18.5
Unsafe:          +15.2

Conclusion

TODO

Future Work

TODO

TODO Before Publishing