Attention Matching (AM) and Task-Guided Scoring

This note expands the compact treatment in the main skill: the AM objective, what changes under Latent Briefing, and which assumptions matter in practice.

AM Compaction Objective

Given a full KV cache of size S, Attention Matching seeks a smaller cache of size t < S whose attention outputs stay close to the original. Per attention head, compacted components (C1, beta, C2) satisfy:

softmax(Q * C1^T + beta) * C2 ~= softmax(Q * K^T) * V

C1: compacted keys, usually a subset of original key vectors selected for high attention mass
beta: bias corrections so the softmax distribution over kept keys approximates the distribution over all keys
C2: compacted values reconstructed so the attention output matches the original as closely as possible

The original AM procedure solves each (layer, head) independently:

1. Select the tokens to keep, which fixes C1
2. Solve for beta given the kept keys
3. Reconstruct C2 given C1 and beta

This per-head independence helps quality but makes batching difficult because different heads keep different token subsets.
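The three steps can be sketched for a single head as follows. This is a minimal NumPy illustration under stated assumptions, not the reference AM procedure: the function name compact_head is hypothetical, keys are selected by total attention mass from a set of reference queries, the usual 1/sqrt(d) scaling (left implicit in the formula above) is applied, beta is left at zero as a placeholder for a real bias solve, and C2 is fitted by ordinary least squares.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_head(Q, K, V, t):
    """Compact one head's cache from S entries to t entries (illustrative only).

    Q: (n, d) reference queries, K: (S, d) keys, V: (S, d_v) values.
    Returns (C1, beta, C2, keep).
    """
    scale = 1.0 / np.sqrt(Q.shape[-1])
    attn_full = softmax(Q @ K.T * scale)   # (n, S) attention over all keys
    target = attn_full @ V                 # (n, d_v) outputs to be matched

    # Step 1: keep the t keys that receive the most total attention mass.
    mass = attn_full.sum(axis=0)
    keep = np.sort(np.argsort(mass)[-t:])
    C1 = K[keep]

    # Step 2: placeholder. A real solve would fit beta; here it stays at zero.
    beta = np.zeros(t)

    # Step 3: least-squares fit of C2 so that attn_small @ C2 approximates target.
    attn_small = softmax(Q @ C1.T * scale + beta)   # (n, t)
    C2, *_ = np.linalg.lstsq(attn_small, target, rcond=None)
    return C1, beta, C2, keep
```

Because the least-squares fit searches over all possible C2, its reconstruction error on the reference queries is never worse than simply reusing the original values V[keep]. Each head returns its own keep indices, which is exactly what makes the per-head variant hard to batch.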

What Latent Briefing Changes

Latent Briefing adapts AM for orchestrator-worker handoff:

Query source changes. Standard AM may score keys using queries sampled from the context itself. Latent Briefing instead uses queries derived from the current worker task prompt.
Scoring becomes task-conditioned. The trajectory is forward-passed through the worker model, and attention from task queries to trajectory keys defines which positions matter for this worker call.
Selection becomes shared. Scores are aggregated across layers and heads into one per-position score so the system can use a single keep/drop mask.

Conceptually:

trajectory -> K, V
task prompt -> Q_task
score(position) = aggregate_attn(Q_task, K[position])
keep if score(position) exceeds threshold
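The scoring step can be made concrete with a small NumPy sketch. The function name task_scores, the tensor layout (layers, heads, tokens, head_dim), and the choice of a plain mean as the aggregation are all assumptions made for illustration; the source only says that scores are aggregated across layers and heads.

```python
import numpy as np

def task_scores(Q_task, K):
    """Aggregate task-to-trajectory attention into one score per position.

    Q_task: (L, H, T, d) queries from the worker task prompt.
    K:      (L, H, S, d) keys from the trajectory.
    Returns an array of shape (S,).
    """
    d = K.shape[-1]
    logits = Q_task @ np.swapaxes(K, -1, -2) / np.sqrt(d)   # (L, H, T, S)
    logits -= logits.max(axis=-1, keepdims=True)            # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)                # softmax over positions
    # Assumed aggregation: average over layers, heads, and task queries.
    return attn.mean(axis=(0, 1, 2))
```

With a mean aggregation the scores form a distribution over trajectory positions, so they sum to one and can be compared against a threshold directly.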

Why the Shared Mask Matters

Per-head masks maximize flexibility but force many small solves over differently shaped token subsets. A shared global mask makes the retained sequence layout identical across layers and heads, so the compacted cache stays a dense tensor, the solves can be batched, and latency is lower than when each head is handled separately.

That batching benefit is one of the main reasons Latent Briefing is interesting for inference systems rather than only for offline compression research.
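The layout difference is easy to see in a short NumPy sketch. Both helper names and the (layers, heads, positions, head_dim) layout are hypothetical, chosen only to illustrate the point.

```python
import numpy as np

def apply_shared_mask(K, V, keep):
    """K, V: (L, H, S, d); keep: (S,) boolean mask shared by every head.

    One indexing operation yields dense (L, H, t, d) tensors.
    """
    return K[:, :, keep, :], V[:, :, keep, :]

def apply_per_head_masks(K, keep_per_head):
    """keep_per_head: (L, H, S) boolean, one mask per (layer, head).

    Each head may retain a different number of positions, so the result
    is a ragged nested list rather than a single tensor.
    """
    n_layers, n_heads = K.shape[:2]
    return [
        [K[l, h, keep_per_head[l, h]] for h in range(n_heads)]
        for l in range(n_layers)
    ]
```

With the shared mask, downstream steps can operate on one tensor per cache; with per-head masks, each entry of the nested list has its own length and has to be processed separately or padded.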

Thresholding vs Fixed Top-k

Instead of keeping a fixed number of tokens per head, Latent Briefing can threshold the aggregated per-position score distribution:

keep position i if score[i] > median(score) + k * MAD(score)

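Here MAD is the median absolute deviation of the scores. This makes the retention rate adapt to the shape of the score distribution for the current task. A higher k raises the threshold, so fewer positions are kept and compaction is more aggressive.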
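The rule can be written in a few lines of standard-library Python. The function name mad_keep_mask is hypothetical, and MAD is taken to be the unscaled median absolute deviation, median(|score - median(score)|), which is an assumption about the intended definition.

```python
from statistics import median

def mad_keep_mask(scores, k):
    """Return one boolean per position: True where the position is kept.

    A position is kept when its score exceeds median + k * MAD, where MAD
    is the (unscaled) median absolute deviation of the scores.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    threshold = med + k * mad
    return [s > threshold for s in scores]
```

For example, with scores [1, 2, 3, 4, 100] the median is 3 and the MAD is 1, so k = 2 gives a threshold of 5 and keeps only the last position, while k = 0 lowers the threshold to 3 and keeps the last two.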

Practical Assumptions

The runtime can inspect and modify worker KV state
The worker architecture is stable enough that attention-space retention is meaningful across calls
The evaluation tracks both quality and cost, because lower token count alone is not sufficient
