skills/latent-briefing/references/attention-matching-formulation.md
# Attention Matching (AM) and Task-Guided Scoring

This note expands the compact treatment in the main skill: the AM objective, what changes under Latent Briefing, and which assumptions matter in practice.

## AM Compaction Objective

Given a full KV cache of size `S`, Attention Matching seeks a smaller cache of size `t < S` whose attention outputs stay close to the original. Per attention head, the compacted components `(C1, beta, C2)` satisfy:

```text
softmax(Q * C1^T + beta) * C2 ~= softmax(Q * K^T) * V
```

- **C1**: compacted keys, usually a subset of the original key vectors selected for high attention mass
- **beta**: bias corrections so the softmax distribution over kept keys approximates the distribution over all keys
- **C2**: compacted values reconstructed so the attention output matches the original as closely as possible

The original AM procedure solves each `(layer, head)` independently:

1. Select tokens to keep
2. Solve for `beta`
3. Reconstruct `C2`

This per-head independence helps quality but makes batching difficult, because different heads keep different token subsets.

## What Latent Briefing Changes

Latent Briefing adapts AM for orchestrator-worker handoff:

1. **Query source changes.** Standard AM may score keys using queries sampled from the context itself. Latent Briefing instead uses queries derived from the **current worker task prompt**.
2. **Scoring becomes task-conditioned.** The trajectory is forward-passed through the worker model, and attention from task queries to trajectory keys defines which positions matter for this worker call.
3. **Selection becomes shared.** Scores are aggregated across layers and heads into one per-position score so the system can use a single keep/drop mask.

Conceptually:

```text
trajectory -> K, V
task prompt -> Q_task
score(position) = aggregate_attn(Q_task, K[position])
keep if score(position) exceeds threshold
```

## Why the Shared Mask Matters

Per-head masks maximize flexibility but force many incompatible small solves. A **shared global mask** makes the retained sequence layout identical across heads, which enables batched tensor operations and much lower latency.

That batching benefit is one of the main reasons Latent Briefing is interesting for inference systems rather than only for offline compression research.

## Thresholding vs Fixed Top-k

Instead of keeping a fixed number of tokens per head, Latent Briefing can threshold the aggregated per-position score distribution:

```text
keep position i if score[i] > median(score) + k * MAD(score)
```

This makes the retention rate adaptive to the shape of the scores for the current task. Higher `k` means more aggressive compaction.

## Practical Assumptions

- The runtime can inspect and modify worker KV state
- The worker architecture is stable enough that attention-space retention is meaningful across calls
- The evaluation tracks both quality and cost, because a lower token count alone is not sufficient
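The task-conditioned scoring and MAD-based thresholding above can be sketched in NumPy. This is a minimal illustration, not the actual Latent Briefing implementation: the function names `task_scores` and `keep_mask`, the head/query aggregation by simple averaging, and the tensor shapes are all assumptions chosen for clarity.

```python
import numpy as np

def task_scores(Q_task, K):
    """Aggregate attention from task-prompt queries to trajectory keys
    into one per-position score shared across all heads.

    Q_task: (heads, q_len, d)  queries derived from the worker task prompt
    K:      (heads, s_len, d)  keys from the orchestrator trajectory
    Returns a (s_len,) score vector.
    """
    d = Q_task.shape[-1]
    # Scaled dot-product attention logits per head: (heads, q_len, s_len)
    logits = np.einsum("hqd,hsd->hqs", Q_task, K) / np.sqrt(d)
    # Numerically stable softmax over trajectory positions
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Illustrative aggregation: mean over heads and query tokens,
    # yielding a single distribution over positions
    return attn.mean(axis=(0, 1))

def keep_mask(score, k=1.0):
    """Shared keep/drop mask via median + k * MAD thresholding.
    Higher k raises the threshold, i.e. more aggressive compaction."""
    med = np.median(score)
    mad = np.median(np.abs(score - med))
    return score > med + k * mad
```

Because every attention row is a softmax distribution over positions, the aggregated score vector itself sums to 1, and the mask applies identically to every head's `K`/`V`, which is what allows the compacted cache to keep one shared sequence layout.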