skills/latent-briefing/references/attention-matching-formulation.md
# Attention Matching (AM) and Task-Guided Scoring

This note expands the compact treatment in the main skill: the AM objective, what changes under Latent Briefing, and which assumptions matter in practice.

## AM Compaction Objective

Given a full KV cache of size `S`, Attention Matching seeks a smaller cache of size `t < S` whose attention outputs stay close to the original. Per attention head, the compacted components `(C1, beta, C2)` satisfy:

```text
softmax(Q * C1^T + beta) * C2 ~= softmax(Q * K^T) * V
```

- **C1**: compacted keys, usually a subset of the original key vectors selected for high attention mass
- **beta**: bias corrections so the softmax distribution over kept keys approximates the distribution over all keys
- **C2**: compacted values reconstructed so the attention output matches the original as closely as possible

The original AM procedure solves each `(layer, head)` independently:

1. Select tokens to keep
2. Solve for `beta`
3. Reconstruct `C2`

This per-head independence helps quality but makes batching difficult, because different heads keep different token subsets.

## What Latent Briefing Changes

Latent Briefing adapts AM for orchestrator-worker handoff:

1. **Query source changes.** Standard AM may score keys using queries sampled from the context itself. Latent Briefing instead uses queries derived from the **current worker task prompt**.
2. **Scoring becomes task-conditioned.** The trajectory is forward-passed through the worker model, and attention from task queries to trajectory keys defines which positions matter for this worker call.
3. **Selection becomes shared.** Scores are aggregated across layers and heads into one per-position score so the system can use a single keep/drop mask.

Conceptually:

```text
trajectory -> K, V
task prompt -> Q_task
score(position) = aggregate_attn(Q_task, K[position])
keep if score(position) exceeds threshold
```

## Why the Shared Mask Matters

Per-head masks maximize flexibility but force many incompatible small solves. A **shared global mask** makes the retained sequence layout identical across heads, which enables batched tensor operations and much lower latency.

That batching benefit is one of the main reasons Latent Briefing is interesting for inference systems rather than only for offline compression research.

## Thresholding vs Fixed Top-k

Instead of keeping a fixed number of tokens per head, Latent Briefing can threshold the aggregated per-position score distribution:

```text
keep position i if score[i] > median(score) + k * MAD(score)
```

This makes the retention rate adaptive to the shape of the scores for the current task. Higher `k` means more aggressive compaction.

## Practical Assumptions

- The runtime can inspect and modify worker KV state
- The worker architecture is stable enough that attention-space retention is meaningful across calls
- The evaluation tracks both quality and cost, because a lower token count alone is not sufficient
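The task-conditioned scoring and MAD-based thresholding above can be sketched in NumPy. This is a minimal illustration, not the actual Latent Briefing implementation: the function names `task_scores` and `keep_mask`, the head/query aggregation by simple averaging, and the tensor shapes are all assumptions chosen for clarity.

```python
import numpy as np

def task_scores(Q_task, K):
    """Aggregate attention from task-prompt queries to trajectory keys
    into one per-position score shared across all heads.

    Q_task: (heads, q_len, d)  queries derived from the worker task prompt
    K:      (heads, s_len, d)  keys from the orchestrator trajectory
    Returns a (s_len,) score vector.
    """
    d = Q_task.shape[-1]
    # Scaled dot-product attention logits per head: (heads, q_len, s_len)
    logits = np.einsum("hqd,hsd->hqs", Q_task, K) / np.sqrt(d)
    # Numerically stable softmax over trajectory positions
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Illustrative aggregation: mean over heads and query tokens,
    # yielding a single distribution over positions
    return attn.mean(axis=(0, 1))

def keep_mask(score, k=1.0):
    """Shared keep/drop mask via median + k * MAD thresholding.
    Higher k raises the threshold, i.e. more aggressive compaction."""
    med = np.median(score)
    mad = np.median(np.abs(score - med))
    return score > med + k * mad
```

Because every attention row is a softmax distribution over positions, the aggregated score vector itself sums to 1, and the mask applies identically to every head's `K`/`V`, which is what allows the compacted cache to keep one shared sequence layout.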