Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Source repo: muratcankoylan (GitHub)
examples/book-sft-pipeline/references/tinker.txt

1# TINKER DOCUMENTATION
2This file contains the complete Tinker documentation and SDK reference.
3 
4## Table of Contents
5 
1. Documentation (MDX files)
2. Type Definitions (from tinker.types)
8 
9---
10 
11# PART 1: DOCUMENTATION
12 
13## File: index.mdx
14 
15# Tinker: a training API for researchers and developers
16 
17Tinker lets you focus on what matters in LLM fine-tuning – your data and algorithms – while we handle the heavy lifting of distributed training.
18 
19You write a simple loop that runs on your CPU-only machine, including the data or environment and the loss function. We figure out how to make the training work on a bunch of GPUs, doing the exact computation you specified, efficiently. To change the model you're working with, you only need to change a single string in your code.
20 
21Tinker gives you full control over the training loop and all the algorithmic details. It's not a magic black box that makes fine-tuning "easy". It's a clean abstraction that shields you from the complexity of distributed training while preserving your control.
22 
23Here's how the division of responsibilities works in practice:
24 
| **You focus on** | **You write** | **We handle** |
|---|---|---|
| **Datasets and RL environments**<br />Your custom training data | **Simple Python script**<br />Runs on your CPU | **Efficient distributed training of large models**<br />Llama 70B, Qwen 235B |
| **Training logic**<br />Your loss functions, training loop, and evals | **API calls**<br />`forward_backward()`<br />`optim_step()`<br />`sample()`<br />`save_state()` | **Reliability**<br />Hardware failures handled transparently |
29 
30## Features
31 
32What the Tinker service currently supports:
33 
34- Tinker lets you fine-tune open-weight models like the Qwen and Llama series, including large mixture-of-experts models like Qwen3-235B-A22B.
35- Tinker supports vision-language models (VLMs) like Qwen3-VL for image understanding tasks. See [Vision Inputs](/rendering#vision-inputs) for details.
36- Tinker implements low-rank adaptation (LoRA) fine-tuning, not full fine-tuning. However, we believe that LoRA gives the same performance as full fine-tuning for many important use cases, especially in RL (see [LoRA Without Regret](https://thinkingmachines.ai/blog/lora/)).
37- You can download the weights of your trained model to use outside of Tinker, for example with your inference provider of choice.
38 
39## A quick look at functionality
40 
41Tinker's main functionality is contained in a few key functions:
42 
43- `forward_backward`: feed in your data and loss function, and we'll compute and accumulate the gradients for you.
44- `optim_step`: update your model using the accumulated gradients
45- `sample`: Generate outputs from your trained model
46- other functions for saving and loading weights and optimizer state
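
To make the accumulate-then-update pattern concrete, here is a minimal sketch built around a toy stand-in for the training client. `ToyTrainingClient`, its single scalar weight, and the squared-error loss are all hypothetical and exist only for illustration. This is not the Tinker client, and the sketch assumes that `optim_step` clears the accumulated gradient after applying it.

```python
class ToyTrainingClient:
    """Hypothetical stand-in (not the Tinker API): one scalar weight, loss = (w*x - y)^2."""

    def __init__(self, weight: float = 0.0):
        self.weight = weight
        self.grad = 0.0  # accumulated gradient

    def forward_backward(self, batch: list[tuple[float, float]]) -> float:
        """Compute the summed loss and accumulate d(loss)/d(weight)."""
        loss = 0.0
        for x, y in batch:
            err = self.weight * x - y
            loss += err ** 2
            self.grad += 2.0 * err * x
        return loss

    def optim_step(self, learning_rate: float) -> None:
        """Plain SGD update with the accumulated gradient, then reset it (assumption)."""
        self.weight -= learning_rate * self.grad
        self.grad = 0.0


client = ToyTrainingClient()
data = [(1.0, 2.0), (2.0, 4.0)]  # targets follow y = 2x
losses = []
for _ in range(20):
    losses.append(client.forward_backward(data))
    client.optim_step(learning_rate=0.05)

print(losses[0], client.weight)  # loss starts at 20.0; weight approaches 2.0
```

The loop itself is ordinary Python running locally. In the real service, only the two method calls would be dispatched to remote hardware.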
47 
48## What's next?
49 
50Some features we expect to support in the future:
51 
52- Full fine-tuning
53 
54 
55---
56 
57## File: losses.mdx
58 
59import { CookbookLink } from '../components/CookbookLink'
60 
61# Loss functions in Tinker
62 
63For most use cases, you can use the Tinker API's built-in loss functions by passing in a string identifier to `forward_backward`, which supports cross-entropy and policy gradient objectives. When you need more control, `forward_backward_custom` enables arbitrary differentiable loss functions at the cost of an additional forward pass; we explain both approaches in this doc.
64 
65When you call `forward_backward`, you specify a loss function using a string that selects from a predetermined set of options, comprising the most common losses used for language model training.
66- **Input:** `forward_backward` expects a certain set of input tensors, passed in via `datum.loss_fn_inputs`, which is a dict mapping `str` to either a numpy or torch tensor
67- **Output:** `forward_backward` returns a `ForwardBackwardOutput`, which has a set of output tensors in `fwd_bwd_result.loss_fn_outputs`
68 
69For an example of using `forward_backward`, see `rl/train.py` in the Cookbook:
```python
import tinker
import torch
from tinker import TensorData

# Create training data with required inputs
datum = tinker.Datum(
    model_input=input_tokens,
    loss_fn_inputs={
        "target_tokens": TensorData.from_torch(torch.tensor(target_tokens)),
        "logprobs": TensorData.from_torch(torch.tensor(sampling_logprobs)),  # Reference logprobs
        "advantages": TensorData.from_torch(torch.tensor(advantages)),
    }
)

# Option 1: Use importance sampling REINFORCE
fwd_bwd_result = await training_client.forward_backward_async(
    [datum], loss_fn="importance_sampling"
)

# Option 2: Use PPO with clipping
fwd_bwd_result = await training_client.forward_backward_async(
    [datum], loss_fn="ppo"
)
```
95 
96## Basic loss functions
97 
Currently, the Tinker API supports `cross_entropy` for supervised learning, and `importance_sampling`, `ppo`, `cispo`, and `dro` for RL. We denote the training model as $p_{\theta}$, the sampling distribution as $q$, and advantages as $A$. For notational simplicity, we omit the query and denote the full model completion sequence of tokens as $x$.
99 
100All losses are applied at the token level and tensors below have shape `(N,)` where `N` is `model_input.length`. They can be provided as `numpy.ndarray` or `torch.Tensor`, and the return values will use the same tensor type.
101 
102### Supervised learning: `cross_entropy`
103 
104For SL, we implement the standard cross-entropy loss (i.e., negative log-likelihood), which optimizes the policy $p_\theta$ to maximize the log-probability of the tokens $x$:
105 
$$
\mathcal{L}(\theta) = -\mathbb{E}_x[\log p_\theta(x)]
$$

In practice, each token's log-probability is multiplied by a weight. Here `weights` is either 0 or 1, and is typically generated by `renderer.build_supervised_example()`, which returns `(model_input, weights)` (i.e., it specifies the assistant turns to train on).
111 
112This is implemented as:
113 
```python
# Apply weights and compute elementwise loss
elementwise_loss = -target_logprobs * weights
# Apply sum reduction to get the total loss
loss = elementwise_loss.sum()  # scalar
```
120 
121- **Input tensors:**
122  - `target_tokens: array[(N,), int]` - Target token IDs
123  - `weights: array[(N,), float]` - Token-level loss weights (typically from the renderer)
124- **Output tensors:**
125  - `logprobs: array[(N,), float]` - Log probabilities of predicted tokens
126- **Output diagnostics:**
127  - `loss:sum` (scalar) - Sum of weighted cross-entropy losses
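
As a worked illustration of the weighting, the sketch below reproduces the sum-reduced loss in plain Python. The log-probabilities and weights are made-up numbers, and `weighted_cross_entropy` is a hypothetical helper, not part of the Tinker API.

```python
def weighted_cross_entropy(target_logprobs: list[float], weights: list[float]) -> float:
    """Sum of -logprob * weight over tokens (same reduction as the snippet above)."""
    return sum(-lp * w for lp, w in zip(target_logprobs, weights))


# Two prompt tokens (weight 0) followed by two assistant tokens (weight 1).
target_logprobs = [-2.0, -1.5, -0.5, -0.25]
weights = [0.0, 0.0, 1.0, 1.0]

loss = weighted_cross_entropy(target_logprobs, weights)
print(loss)  # 0.75
```

Only the two assistant tokens contribute. The prompt tokens are still part of the model input but carry no loss.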
128 
129### Policy gradient: `importance_sampling`
130 
131For RL, we implement a common variant of the policy gradient objective, used in practical settings where the *learner policy* $p$ may differ from the *sampling policy* $q$, which is common due to, e.g., [non-determinism](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). The issue is that if these policies differ, then the objective:
132 
$$
\mathcal{L}(\theta) = \mathbb{E}_{x\sim p_\theta}\bigl[A(x)\bigr]
$$

is not computed in an unbiased way, because $x \sim q$ (sampler) does not exactly match the desired $x \sim p_\theta$ (learner). To correct the bias, we use a modified "importance sampling" objective:

$$
\mathcal{L}_{\text{IS}}(\theta) = \mathbb{E}_{x\sim q}\Bigl[\frac{p_\theta(x)}{q(x)}A(x)\Bigr],
$$
142 
143which yields the correct expected reward. In the formula above:
144 
145- $\log p_\theta(x)$ – `target_logprobs` is from the learner, on the forward part of the `forward_backward` pass.
146- $\log q(x)$ – `sampling_logprobs` is from the sampler, recorded during sampling as a correction term.
147 
148This is implemented as:
149 
```python
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Compute importance-weighted loss
loss = -(prob_ratio * advantages).sum()
```
156 
157- **Input tensors:**
158  - `target_tokens: array[(N,), int]` - Target token IDs (from the sampler $q$)
159  - `logprobs: array[(N,), float]` - `sampling_logprobs` for the tokens
160  - `advantages: array[(N,), float]` - Advantage values for RL (positive to reinforce, negative to discourage)
161- **Output tensors:**
162  - `logprobs: array[(N,), float]` - `target_logprobs` for the tokens
163- **Output diagnostics:**
164  - `loss:sum` (scalar) - Sum of importance-weighted policy gradient losses $\mathcal L_{\text{IS}}$
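
The effect of the ratio is easiest to see with numbers. The following is a minimal plain-Python sketch of the same computation. The helper name `importance_sampling_loss` and all values are illustrative assumptions, not Tinker code.

```python
import math


def importance_sampling_loss(target_logprobs, sampling_logprobs, advantages):
    """Negative sum of (p_theta / q) * advantage over tokens."""
    total = 0.0
    for lp, lq, adv in zip(target_logprobs, sampling_logprobs, advantages):
        ratio = math.exp(lp - lq)  # p_theta(x) / q(x)
        total += ratio * adv
    return -total


sampling_logprobs = [-1.0, -2.0, -0.5]
advantages = [1.0, 1.0, -0.5]

# Learner identical to sampler: every ratio is 1, so the loss is -sum(advantages).
on_policy = importance_sampling_loss(sampling_logprobs, sampling_logprobs, advantages)

# Learner assigns twice the sampler's probability to the first token,
# so that token's advantage is counted with weight 2 instead of 1.
shifted = [-1.0 + math.log(2.0), -2.0, -0.5]
off_policy = importance_sampling_loss(shifted, sampling_logprobs, advantages)

print(on_policy, off_policy)
```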
165 
166### Proximal Policy Optimization: `ppo`
167 
168PPO ([Schulman et al., 2017](https://arxiv.org/abs/1707.06347)) addresses issues with standard policy gradient methods by introducing a clipping objective that limits policy updates within a close neighborhood of the sampling distribution. This prevents updates that are too large in policy space, especially when taking multiple gradient steps on the same rollout distribution.
169 
The objective clips the importance ratio $\frac{p_\theta(x)}{q(x)}$ to prevent large policy updates, where $p_\theta$ is the learner policy and $q$ is the sampling policy. Note that the PPO clipping and loss computation are applied token-wise, computing the loss for each token independently.
171 
172The PPO clipping objective is:
173 
$$
\mathcal{L}_{\text{CLIP}}(\theta) = -\mathbb{E}_{x \sim q}\left[\text{clip}\left(\frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right) \cdot A(x)\right]
$$

The final PPO loss combines the clipped and unclipped objectives:

$$
\mathcal{L}_{\text{PPO}}(\theta) = -\mathbb{E}_{x \sim q}\left[\min\left(\frac{p_\theta(x)}{q(x)} \cdot A(x), \text{clip}\left(\frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right) \cdot A(x)\right)\right]
$$
183 
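where $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ are hyperparameters. Both are currently set to 0.2 in Tinker unless the clipping thresholds are overridden through `loss_fn_config`, as in the example further down.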
185 
186This is implemented as:
187 
```python
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Apply clipping
clipped_ratio = torch.clamp(prob_ratio, clip_low_threshold, clip_high_threshold)
# Compute both objectives
unclipped_objective = prob_ratio * advantages
clipped_objective = clipped_ratio * advantages
# Take minimum (most conservative)
ppo_objective = torch.min(unclipped_objective, clipped_objective)
# PPO loss is negative of objective
loss = -ppo_objective.sum()
```
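
To see which branch of the `min` is taken in each case, the sketch below evaluates the per-token objective in plain Python for the four combinations of ratio direction and advantage sign. `ppo_token_objective` is a hypothetical helper, and the thresholds 0.8 and 1.2 are assumed here to correspond to $\epsilon = 0.2$.

```python
import math


def ppo_token_objective(target_logprob, sampling_logprob, advantage,
                        clip_low_threshold=0.8, clip_high_threshold=1.2):
    """Per-token min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(target_logprob - sampling_logprob)
    clipped = min(max(ratio, clip_low_threshold), clip_high_threshold)
    return min(ratio * advantage, clipped * advantage)


log_q = -1.0
up = log_q + math.log(1.5)    # ratio 1.5, above the upper threshold
down = log_q + math.log(0.5)  # ratio 0.5, below the lower threshold

# Positive advantage: gains from raising the ratio are capped at 1.2 * A.
print(ppo_token_objective(up, log_q, advantage=1.0))     # ~1.2
# Negative advantage: the unclipped (more pessimistic) term is kept.
print(ppo_token_objective(up, log_q, advantage=-1.0))    # ~-1.5
# Positive advantage, ratio already low: the unclipped term 0.5 * A is kept.
print(ppo_token_objective(down, log_q, advantage=1.0))   # ~0.5
# Negative advantage, ratio already low: capped at 0.8 * A.
print(ppo_token_objective(down, log_q, advantage=-1.0))  # ~-0.8
```

In the two capped cases the objective no longer depends on the learner's log-probability, so those tokens contribute no gradient.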
201 
202 
203**Example with custom clipping thresholds:**
```python
fwd_bwd_result = await training_client.forward_backward_async(
    data=data,
    loss_fn="ppo",
    loss_fn_config={"clip_low_threshold": 0.9, "clip_high_threshold": 1.1}
)
```
211 
**Additional Notes:**
- The loss formulation above is quite general, since the user can organize the data generation and advantage estimation in their own code. For example, the main RL training scripts in the Tinker Cookbook use group-based rollouts with per-group advantage centering similar to GRPO ([Shao et al., 2024](https://arxiv.org/abs/2402.03300)).
- The functional implementations of REINFORCE and PPO do not use an additional KL term like the original GRPO work, which has been noted to be mathematically inconsistent ([Zhang et al., 2025](https://arxiv.org/abs/2505.17508); [Tang et al., 2025](https://arxiv.org/abs/2506.09477)). However, it is possible to include a KL regularization term as part of the reward, which is mathematically correct, and we provide this option in our RL training <CookbookLink path="tinker_cookbook/rl/train.py">code and examples</CookbookLink> (see the `incorporate_kl_penalty` function).
- Notice that for all objectives we sum the token-level losses over the sequence length, unlike some other loss implementations. If you would like to explore different aggregation schemes, you can build them into the advantage tensor computation.
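
One way to read the last note: because the built-in losses sum over tokens, a different reduction can be emulated by rescaling the advantages before they are sent. The sketch below assumes a per-sequence mean is wanted and divides each token's advantage by the number of unmasked tokens. `mean_normalized_advantages` and the `mask` argument are hypothetical, and this is only one of several possible aggregation schemes.

```python
def mean_normalized_advantages(advantages: list[float], mask: list[int]) -> list[float]:
    """Scale token advantages by 1 / (number of unmasked tokens).

    With a sum-reduced loss that is linear in the advantages (such as
    importance_sampling), this turns the sum over tokens into a mean.
    """
    n_active = sum(mask)
    if n_active == 0:
        return [0.0 for _ in advantages]
    return [a * m / n_active for a, m in zip(advantages, mask)]


# A sequence-level advantage of 2.0 broadcast over 4 completion tokens;
# the first two positions are prompt tokens and are masked out.
advantages = [2.0] * 6
mask = [0, 0, 1, 1, 1, 1]
scaled = mean_normalized_advantages(advantages, mask)
print(scaled)  # [0.0, 0.0, 0.5, 0.5, 0.5, 0.5]
```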
216 
217### Clipped Importance Sampling Policy Optimization: `cispo`
218 
CISPO ([Chen et al., 2025](https://arxiv.org/abs/2506.13585); [Khatri et al., 2025](https://arxiv.org/abs/2510.13786)) is a policy gradient method that uses a clipped importance ratio as a coefficient for the policy gradient. Unlike PPO, which clips the objective directly, CISPO clips the ratio and uses it to weight the log probability. The CISPO objective is:
221 
$$
\mathcal{L}_{\text{CISPO}}(\theta) = \mathbb{E}_{x \sim q}\left[\textbf{sg}\left( \text{clip}\left(\frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right) \right) \cdot \log p_\theta(x) \cdot A(x)\right]
$$
225 
226This is implemented as:
227 
```python
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Apply clipping
clipped_ratio = torch.clamp(prob_ratio, clip_low_threshold, clip_high_threshold)
# Compute CISPO objective (detach the clipped ratio)
cispo_objective = clipped_ratio.detach() * target_logprobs * advantages
# CISPO loss is negative of objective
loss = -cispo_objective.sum()
```
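
To see how this differs from PPO in practice, the sketch below compares the per-token gradient of each objective with respect to `target_logprobs`. The gradients are derived by hand from the two objective formulas rather than taken from Tinker's implementation, and the function names and the thresholds 0.8 and 1.2 are illustrative assumptions.

```python
def clip(value, low, high):
    return min(max(value, low), high)


def cispo_grad(ratio, advantage, low=0.8, high=1.2):
    """d/d(log p) of sg(clip(ratio)) * log p * A: the clipped ratio acts as a constant."""
    return clip(ratio, low, high) * advantage


def ppo_grad(ratio, advantage, low=0.8, high=1.2):
    """d/d(log p) of min(ratio * A, clip(ratio) * A), using d(ratio)/d(log p) = ratio."""
    unclipped = ratio * advantage
    clipped = clip(ratio, low, high) * advantage
    if unclipped <= clipped:
        return ratio * advantage  # unclipped branch selected
    return 0.0  # clipped branch selected; it is constant in log p


for ratio in (1.0, 1.5):
    print(ratio, cispo_grad(ratio, 1.0), ppo_grad(ratio, 1.0))
# 1.0 1.0 1.0
# 1.5 1.2 0.0
```

Under these assumptions, a token whose ratio has left the clipping range still contributes a bounded gradient in CISPO, whereas PPO drops it.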
238 
239 
240Similarly to the PPO objective you can pass loss function parameters in the following way:
241 
```python
fwd_bwd_result = await training_client.forward_backward_async(
    data=data,
    loss_fn="cispo",
    loss_fn_config={"clip_low_threshold": 0.8, "clip_high_threshold": 1.2}
)
```
249 
250### Direct Reward Optimization: `dro`
251 
252DRO ([Richemond et al., 2024](https://arxiv.org/abs/2405.19107); [Kimi Team et al., 2025](https://arxiv.org/abs/2501.12599)) is a general off-policy (and even offline) reinforcement learning method that uses a quadratic penalty term to constrain the policy update. Notice that this loss uses a different (soft) formulation of the advantage estimation, which needs to be implemented on the client side.
253The DRO objective is:
254 
$$
\mathcal{L}_{\text{DRO}}(\theta) = \mathbb{E}_{x \sim q}\left[\log p_\theta(x) \cdot A(x) - \frac{1}{2}\beta \left(\log \frac{p_\theta(x)}{q(x)}\right)^2\right]
$$
258 
259 
260This is implemented as:
261 
```python
# Compute quadratic penalty term
quadratic_term = (target_logprobs - sampling_logprobs) ** 2
# Compute DRO objective
dro_objective = target_logprobs * advantages - 0.5 * beta * quadratic_term
# DRO loss is negative of objective
loss = -dro_objective.sum()
```
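
The role of the quadratic term can be checked with a single token. In the plain-Python sketch below, `dro_token_objective` is a hypothetical helper and the numbers (including `beta = 0.5`) are chosen only to make the arithmetic easy to follow.

```python
def dro_token_objective(target_logprob, sampling_logprob, advantage, beta):
    """log p * A - 0.5 * beta * (log p - log q)^2 for a single token."""
    penalty = (target_logprob - sampling_logprob) ** 2
    return target_logprob * advantage - 0.5 * beta * penalty


log_q = -3.0
advantage = 1.0
beta = 0.5

# Learner stays at the sampler's log-probability: no penalty.
stay = dro_token_objective(-3.0, log_q, advantage, beta)
# Learner raises the token's log-probability by 2: penalty is 0.5 * 0.5 * 4 = 1.
move = dro_token_objective(-1.0, log_q, advantage, beta)
# Same move with beta = 0 shows the gain the penalty removed.
move_no_penalty = dro_token_objective(-1.0, log_q, advantage, 0.0)

print(stay, move, move_no_penalty)  # -3.0 -2.0 -1.0
```

The further the learner drifts from the sampler, the more of the advantage-weighted gain the penalty cancels, which is what constrains the update.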
270 
As with the other objectives, you can specify the loss hyperparameter as:
272 
```python
fwd_bwd_result = await training_client.forward_backward_async(
    data=data,
    loss_fn="dro",
    loss_fn_config={"beta": 0.05}
)
```
280 
281## Flexible loss functions: `forward_backward_custom`
282 
283For use cases outside of the above, we've provided the more flexible (but slower) methods `forward_backward_custom` and `forward_backward_custom_async` to compute a more general class of loss functions.
284 
285### Usage
286 
287Here's a simple example of a custom loss function:
288 
```python
import torch
from tinker import Datum

def logprob_squared_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    loss = (torch.cat(logprobs) ** 2).sum()  # logprobs is a list of tensors: concatenate first
    return loss, {"logprob_squared_loss": loss.item()}
```
294 
295You can call this loss function with `forward_backward_custom` like:
296 
```python
loss, metrics = training_client.forward_backward_custom(data, logprob_squared_loss)
```
300 
301You can also define loss functions which operate on multiple sequences at a time. For example, a loss function that computes the variance across the sequences (although practically useless) can be implemented as:
302 
```python
def variance_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    flat_logprobs = torch.cat(logprobs)
    variance = torch.var(flat_logprobs)
    return variance, {"variance_loss": variance.item()}
```
309 
310A more practical use case would be to compute a Bradley-Terry loss on pairwise comparison data -- a classic approach in RL from human feedback, as introduced and popularized by [Learning to Summarize](https://arxiv.org/abs/2009.01325). Similarly, we can also implement [Direct Preference Optimization](https://arxiv.org/abs/2305.18290), which also computes a loss involving pairs of sequences; see the [DPO guide](/preferences/dpo-guide) for more details.
311 
312If you're using a custom loss function that you think is generally useful, please let us know, and we'll add it to the list of built-in loss functions.
313 
We detail the `async` versions of these methods in the [Async and Futures](./async) section of these docs.
315 
316### How `forward_backward_custom` works
317 
318---
319 
320## File: publish-weights.mdx
321 
322# Publishing weights
323 
324If you've trained a model that you'd like to share with the community, you can
325publish any number of checkpoints you've previously saved.
326 
Once published, your checkpoint can be loaded by any Tinker user, who can
train it further or sample from it.
329 
330### Publishing
331 
```bash
tinker checkpoint publish $TINKER_CHECKPOINT_PATH
```
335 
336where `$TINKER_CHECKPOINT_PATH` is a checkpoint path in the form of `tinker://14bdf3a1-0b95-55c7-8659-5edb1bc870af:train:17/weights/checkpoint_id_to_publish`.
337 
338You may confirm your checkpoint is published by dumping the checkpoint info and checking the `Public` property:
339 
340```bash
341tinker checkpoint info tinker://14bdf3a1-0b95-55c7-8659-5edb1bc870af/weights/checkpoint_id_to_publish
342                              Checkpoint: weights/checkpoint_id_to_publish
343┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
344┃ Property        ┃ Value                                                                          ┃
345┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
346│ Checkpoint ID   │ weights/checkpoint_id_to_publish                                               │
347│ Type            │ training                                                                       │
348│ Tinker Path     │ tinker://14bdf3a1-0b95-55c7-8659-5edb1bc870af/weights/checkpoint_id_to_publish │
349│ Size            │ 342.4 MB                                                                       │
350│ Public          │ No                                                                             │
351│ Created         │ 23 minutes ago                                                                 │
352│ Training Run ID │ 14bdf3a1-0b95-55c7-8659-5edb1bc870af                                           │
353└─────────────────┴────────────────────────────────────────────────────────────────────────────────┘
354```
355 
356### Unpublishing
357 
```bash
tinker checkpoint unpublish $TINKER_CHECKPOINT_PATH
```
361 
362### Loading public weights
363 
Loading public weights is exactly the same as loading non-public ones:
365 
```python
ckpt_path = ...
training_client = service_client.create_training_client_from_state(ckpt_path)
```
370 
371 
372---
373 
374## File: supervised-learning.mdx
375 
376import { CookbookLink } from '../components/CookbookLink'
377 
378# Cookbook: Supervised learning
379 
380This section takes you through examples from the Tinker Cookbook that relate to supervised learning.
381 
382In general, supervised learning (SL) means learning an input-output mapping from labeled data. In the context of language model fine-tuning, this means **minimizing a weighted cross-entropy loss** on token sequences---equivalently, maximizing the log-probability of the specified target tokens.
383 
384There are a few ways that SL is commonly used in LLM fine-tuning pipelines:
385 
- *Instruction tuning*: This is the first step in post-training pipelines, applied to the base (raw, pretrained) model. Typically, we do SL on a high-quality dataset that demonstrates the correct format and style, while boosting the model's reasoning and instruction-following.
- *Context distillation* / *prompt distillation*: Let's say we have a generic model that can do chat / instruction following / reasoning, but we want to adjust how it behaves in a certain scenario. We can add some instructions to the system message of our model. However, the system message might grow impractically long, and the model might start ignoring some of its instructions. So it's often better to create a supervised dataset on a narrow prompt distribution, with a shorter set of instructions that are targeted at these prompts.
388 
389We'll cover both of these use cases in this documentation and related Cookbook code.
390 
391The library code implementing supervised learning can be found in the <CookbookLink path="tinker_cookbook/supervised">`supervised`</CookbookLink> directory.
392 
393 
394---
395 
396## File: preferences.mdx
397 
398import { CookbookLink } from '../components/CookbookLink'
399 
400# Preferences
401 
402# Learning from Preferences
403 
In this section, we focus on learning from **pairwise feedback**, where we have preference data indicating which of two completions is better for a given prompt. This kind of feedback is a natural fit for tasks where there's not a simple correctness criterion that can be computed programmatically. These preferences might be collected from human evaluators or generated by a model.
405 
406## Two Approaches to Preference Learning
407 
408When you have pairwise preference data, there are two main approaches:
409 
1. **Direct Preference Optimization (DPO)**: Directly update the policy to prefer chosen responses over rejected ones, without needing a separate reward model. This is simpler and computationally cheaper. See the [DPO Guide](/preferences/dpo-guide) for details.

2. **Reinforcement Learning from Human Feedback (RLHF)**: Train a reward model on preference data, then use reinforcement learning to optimize the policy against this reward model. This two-stage approach provides more flexibility. See the [RLHF example](/preferences/rlhf-example) for details.
413 
414 
415---
416 
417## File: docs-outline.mdx
418 
419# Navigating these docs
420 
421These docs provide guides to both Tinker and the Tinker Cookbook.
422 
423The first half, "Using the Tinker API", walks you through the fundamentals of Tinker:
424 
- [Installation](./install) explains how to install both `tinker` and `tinker-cookbook`, and points you to the Tinker Console for your API key.
- [Training and Sampling](./training-sampling) takes you through your first training run: setting up your training data, performing the run, and sampling from the model to test the run.
- [Loss Functions](./losses) goes into more detail. Tinker supports a variety of built-in loss functions, but also allows you to use arbitrary differentiable loss functions.
- [Saving and Loading](./save-load) explains the checkpoint types available in Tinker, and how to restart a run from a checkpoint.
- [Async and Futures](./async) explains Tinker's `sync` and `async` API variants, and how Futures structure Tinker's requests.
- [Model Lineup](./model-lineup) is regularly updated with the models available to fine-tune in Tinker.
431 
432The second half, "The Tinker Cookbook", provides recipes for how to use the Tinker API for research and applications. You are welcome to adapt these directly for your own use cases.
433 
434- [Rendering](./rendering) explains how we convert from a conversation data structure to a list of tokens.
435- [Supervised Learning](./supervised-learning) explains basic SL and walks you through your first SL training loop. We make some suggestions for hyperparameter selection and detail how you can run your own hyperparameter sweep. We also show you how to perform prompt distillation.
436- [Reinforcement Learning](./rl) explains the basics of RL and walks you through your first RL run. We explain and provide code for creating your own RL environments and training on them. We provide a simple training loop for you to use and adapt, and explain RL hyperparameters and loss functions in detail.
437- [Preferences](./preferences) is a guide to learning from pairwise feedback, where  we have preference data indicating which of two completions is better for a given prompt. We walk you through two approaches to learning from pairwise preference data: direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF).
438- [Evaluations](./evals) explains how you can use Tinker's outputs to run inline and offline evals on your runs.
439- [Completers](./completers) explains how Tinker implements policies, and provides two examples of how to use these in training.
440- [LoRA Primer](./lora-primer) explains the basic background of LoRA, and how to choose hyperparameters.
441 
442 
443---
444 
445## File: lora-primer.mdx
446 
447# LoRA Primer
448 
449Tinker supports [LoRA fine-tuning](https://arxiv.org/abs/2106.09685), which adjusts a small number of parameters, rather than full fine-tuning, which adjusts all of the parameters of the original model.
450 
451Our current understanding is that LoRA has equivalent performance to full fine-tuning when doing RL or doing SL on small datasets, while it has worse performance on larger datasets. In more detail:
452 
453- For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.
- For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor that it can’t go below, LoRA results in worse training efficiency that depends on the relationship between model capacity and dataset size.
455- In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning — it pays a larger penalty in loss as batch size increases beyond some point. This penalty is not mitigated by increasing the LoRA rank; it is a property of the product-of-matrices parametrization, which has different training dynamics than optimizing the original weight matrix.
456- Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.
457- LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.
458 
459See [LoRA Without Regret](https://thinkingmachines.ai/blog/lora) for more details and experimental results.
460 
461## Hyperparameters
462 
463The learning rate (LR) is usually the most important hyperparameter in your ML experiments.
464 
465 
466LoRA requires a much larger LR than full fine-tuning---typically 20-100x larger, depending on model size. People often mistakenly retain their full fine-tuning LR when they port their code to use LoRA, leading them to conclude that LoRA works poorly.
467 
468**Calculate the correct LoRA learning rate:**
469 
470We've provided a utility that calculates the factor you should scale the full fine-tuning LR by to get the equivalent LoRA LR:
471 
```python
from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

model_name = "meta-llama/Llama-3.1-8B"
print(get_lora_lr_over_full_finetune_lr(model_name))
```
478 
479Note that for `Llama-3.2-1B`, the factor is 32, while for `Llama-3.1-70B`, the factor is 128.
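
As a quick arithmetic check, the sketch below applies the two factors quoted above to a full fine-tuning LR. The starting value `1e-5` is a hypothetical number chosen for illustration, and `lora_lr_from_full_ft` is not a Cookbook function.

```python
def lora_lr_from_full_ft(full_ft_lr: float, factor: float) -> float:
    """Scale a full fine-tuning learning rate by the LoRA-over-FullFT factor."""
    return full_ft_lr * factor


full_ft_lr = 1e-5  # hypothetical full fine-tuning LR, for illustration only
factors = {"Llama-3.2-1B": 32, "Llama-3.1-70B": 128}  # values quoted above

for name, factor in factors.items():
    print(name, lora_lr_from_full_ft(full_ft_lr, factor))
```

With these inputs the LoRA learning rates come out at roughly 3.2e-4 and 1.28e-3, which shows how far off a ported full fine-tuning LR would be.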
480 
481## What is LoRA exactly?
482 
483LoRA is short for Low-Rank Adaptation. Given that the original model has a weight matrix $W$, we replace it with a new weight matrix $W'=W + BA$, where $B$ and $A$ are low-rank matrices. If $W$ is an $n \times n$ matrix, then $B$ and $A$ are $n \times r$ and $r \times n$ matrices, respectively, where $r$ is the rank of the low-rank approximation. The default $r$ used by tinker is $32$.
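
To get a feel for the sizes involved, the sketch below counts trainable parameters for a single square weight matrix. The dimension `n = 4096` is an illustrative assumption rather than a property of any particular model, and `lora_vs_full_params` is a hypothetical helper.

```python
def lora_vs_full_params(n: int, r: int) -> tuple[int, int]:
    """Trainable parameters for B (n x r) plus A (r x n), versus the full n x n matrix."""
    lora_params = n * r + r * n
    full_params = n * n
    return lora_params, full_params


lora_params, full_params = lora_vs_full_params(n=4096, r=32)
print(lora_params, full_params, full_params // lora_params)  # 262144 16777216 64
```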
484 
485The fact that LoRA uses a low-rank approximation of weight matrices is not terribly important. We prefer to think of LoRA as just a random projection of the parameter space that happens to be efficient to implement. When training with RL or small SL datasets, we are only learning a small amount of information, and this reduced set of parameters is more than enough.
486 
487 
488## What rank to use?
489 
490The default rank used by tinker is $32$. However, if you're doing SL on a large dataset, you should use a larger rank. For supervised learning, as a very rough approximation, LoRA will give good results as long as the number of LoRA parameters is at least as large as the number of completion tokens (i.e., weight=1 tokens). You can calculate the number of LoRA parameters with the following utility:
491 
```python
from tinker_cookbook.hyperparam_utils import get_lora_param_count

model_name = "meta-llama/Llama-3.1-8B"
print(get_lora_param_count(model_name, lora_rank=32))
```
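
The rule of thumb above can be turned into a rough check. In the sketch below, both helpers are hypothetical and the weights are toy data. In practice `lora_param_count` would come from `get_lora_param_count` and the weights from your rendered dataset.

```python
def count_completion_tokens(weights_per_example: list[list[float]]) -> int:
    """Count tokens with weight 1 across all examples."""
    return sum(1 for weights in weights_per_example for w in weights if w == 1)


def lora_capacity_ok(lora_param_count: int, weights_per_example: list[list[float]]) -> bool:
    """Rough rule of thumb from the text: parameters >= completion tokens."""
    return lora_param_count >= count_completion_tokens(weights_per_example)


dataset_weights = [[0, 0, 1, 1, 1], [0, 1, 1]]  # toy weights; 5 completion tokens
print(count_completion_tokens(dataset_weights))  # 5
print(lora_capacity_ok(1_000, dataset_weights))  # True
print(lora_capacity_ok(4, dataset_weights))      # False
```

This is only a very rough approximation, as the text says. A failed check suggests trying a larger rank, not that training will necessarily fail.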
498 
499For reinforcement learning, we've found that small ranks give equivalent performance to larger ranks and full fine-tuning.
500 
501Note that conveniently, the optimal learning rate does *not* depend on the LoRA rank. In fact, you can verify that if you train with SL on different ranks (but with the same LR), you'll get exactly the same learning curves for the first few steps of training.
502 
503 
504---
505 
506## File: evals.mdx
507 
508import { Callout } from 'nextra/components'
509import { CookbookLink } from '../components/CookbookLink'
510 
511# Evaluations
512 
513Our training scripts will print out training and test loss. Two common workflows for evaluations are to do inline evals during training and to do offline evals on various checkpoints from a run.
514 
515## Inline Evals
516 
517You can add inline evaluations to your training runs by configuring evaluator builders in advance for both supervised fine-tuning and RL training jobs.
518 
519### Supervised Fine-Tuning (`supervised.train`)
520Add one or both of the following to your config:
521 
522- **`evaluator_builders: list[EvaluatorBuilder]`** - Runs evaluations every `eval_every` steps
523- **`infrequent_evaluator_builders: list[EvaluatorBuilder]`** - Runs evaluations every `infrequent_eval_every` steps
524 
525### RL Training (`rl.train`)
526 
527Add the following to your config:
528 
529- **`evaluator_builders: list[SamplingClientEvaluator]`** - Runs evaluations every `eval_every` steps
530 
531For implementation guidance and a detailed example, see <CookbookLink path="tinker_cookbook/eval/evaluators.py">here</CookbookLink> and
532 <CookbookLink path="tinker_cookbook/eval/inspect_evaluators.py">here</CookbookLink> respectively.
533 
534 
535## Offline evals
536 
537We support and recommend several ways for creating and running your offline evaluations on your model checkpoints.
538 
### Running Standard Evaluations with Inspect AI
540 
We support running many commonly cited standard evaluations using the [Inspect AI library](https://github.com/UKGovernmentBEIS/inspect_ai).
542 
543We have provided a <CookbookLink path="tinker_cookbook/eval/run_inspect_evals.py">script</CookbookLink> to evaluate models using Tinker's internal sampling functionality as shown below.
544 
```bash
MODEL_PATH=tinker://FIXME # YOUR MODEL PATH HERE
# Replace MODEL_NAME and RENDERER_NAME with your own values.
# (A comment after a line-continuation backslash would break the command.)
python -m tinker_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=MODEL_NAME \
    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
    renderer_name=RENDERER_NAME
```
553 
554Click [here](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/docs/evals/listing.yml) to view additional supported evaluations.
555 
556### Creating your own Sampling Evaluations
557 
558We recommend two ways to create your own evaluations:
559- creating your own tasks with Inspect AI and running like above
560- creating your own SamplingClientEvaluator
561 
562#### Create tasks with Inspect AI
563 
In addition to passing in standard evaluations, you can create your own tasks using Inspect AI as detailed [here](https://inspect.aisi.org.uk/tasks.html).

Here is a toy example of how to create an evaluation with an LLM-as-a-judge, where we use a model produced by Tinker as the grader.
567 
568```python
569import tinker
570from inspect_ai import Task, task
571from inspect_ai.dataset import MemoryDataset, Sample
572from inspect_ai.model import GenerateConfig as InspectAIGenerateConfig
573from inspect_ai.model import Model as InspectAIModel
574from inspect_ai.scorer import model_graded_qa
575from inspect_ai.solver import generate
576from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling
577 
578QA_DATASET = MemoryDataset(
579    name="qa_dataset",
580    samples=[
581        Sample(
582            input="What is the capital of France?",
583            target="Paris",
584        ),
585        Sample(
586            input="What is the capital of Italy?",
587            target="Rome",
588        ),
589    ],
590)
591 
592service_client = tinker.ServiceClient()
593sampling_client = service_client.create_sampling_client(
594    base_model="meta-llama/Llama-3.1-8B-Instruct"
595)
596 
597api = InspectAPIFromTinkerSampling(
598    renderer_name="llama3",
599    model_name="meta-llama/Llama-3.1-8B-Instruct",
600    sampling_client=sampling_client,
601    verbose=False,
602)
603 
604GRADER_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())
605 
606 
607@task
608def example_lm_as_judge() -> Task:
609    """
610    Example task using LLM-as-a-judge scoring.
611 
612    Note: The grader model defaults to the model being evaluated.
613    To use a different grader model, specify it with --model-grader when using inspect directly.
614    """
615    return Task(
616        name="llm_as_judge",
617        dataset=QA_DATASET,
618        solver=generate(),
619        scorer=model_graded_qa(
620            instructions="Grade strictly against the target text as general answer key and rubric. "
621            "Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
622            partial_credit=False,
623            # model parameter is optional - if not specified, uses the model being evaluated
624            model=GRADER_MODEL,
625        ),
626    )
627```
628 
Inspect also natively supports replacing our `GRADER_MODEL` with any OpenAI chat-completions-style API (e.g. OpenRouter).
630 
631#### Create your own SamplingClientEvaluator
632 
Alternatively, you can create your own SamplingClientEvaluator class instead of using Inspect AI. This is a lower-level
abstraction than the above, with finer-grained control over how your evaluations run.

We expose this interface to give users more control over their datasets and metrics. To illustrate, see this
<CookbookLink path="tinker_cookbook/eval/custom_evaluators.py">custom evaluators</CookbookLink> example of how one might create their own complex SamplingClientEvaluator.

For a simpler toy example, see below.
640 
```python
from typing import Any, Callable

import tinker
from tinker import types

from tinker_cookbook import renderers
# Module path follows tinker_cookbook/eval/evaluators.py, as linked earlier on this page.
from tinker_cookbook.eval.evaluators import SamplingClientEvaluator
from tinker_cookbook.tokenizer_utils import get_tokenizer


class CustomEvaluator(SamplingClientEvaluator):
    """
    A toy SamplingClientEvaluator that runs a custom evaluation and returns its metrics.
    """

    def __init__(
        self,
        dataset: Any,
        grader_fn: Callable[[str, str], bool],
        model_name: str,
        renderer_name: str,
    ):
        """
        Initialize the CustomEvaluator.
        Args:
            dataset: Iterable of examples, each with "input" and "output" keys
            grader_fn: Takes (response, target) and returns True if the response is correct
            model_name: Model name used to look up the tokenizer
            renderer_name: Name of the renderer used to build prompts and parse responses
        """
        self.dataset = dataset
        self.grader_fn = grader_fn

        tokenizer = get_tokenizer(model_name)
        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)

    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
        """
        Run custom evaluation on the given sampling client and return metrics.
        Args:
            sampling_client: The sampling client to evaluate
        Returns:
            Dictionary of metrics (here, just "accuracy")
        """

        metrics = {}

        num_examples = len(self.dataset)
        num_correct = 0

        sampling_params = types.SamplingParams(
            max_tokens=100,
            temperature=0.7,
            top_p=1.0,
            stop=self.renderer.get_stop_sequences(),
        )

        for datum in self.dataset:
            model_input: types.ModelInput = self.renderer.build_generation_prompt(
                [renderers.Message(role="user", content=datum["input"])]
            )
            # Generate response
            r: types.SampleResponse = await sampling_client.sample_async(
                prompt=model_input, num_samples=1, sampling_params=sampling_params
            )
            tokens: list[int] = r.sequences[0].tokens
            response: renderers.Message = self.renderer.parse_response(tokens)[0]
            # Grade the parsed message content against the expected output
            if self.grader_fn(response["content"], datum["output"]):
                num_correct += 1

        metrics["accuracy"] = num_correct / num_examples
        return metrics
```
711 
712Here is an example of how we can use the above CustomEvaluator on a toy dataset and grader.
713 
714 
```python
import asyncio

import tinker

# Uses the CustomEvaluator class defined in the previous example.
QA_DATASET = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is the capital of Germany?", "output": "Berlin"},
    {"input": "What is the capital of Italy?", "output": "Rome"},
]

def grader_fn(response: str, target: str) -> bool:
    # Case-insensitive substring match
    return target.lower() in response.lower()

evaluator = CustomEvaluator(
    dataset=QA_DATASET,
    grader_fn=grader_fn,
    renderer_name="llama3",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
)

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(base_model="meta-llama/Llama-3.1-8B-Instruct")

async def main():
    result = await evaluator(sampling_client)
    print(result)

asyncio.run(main())
```
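
A substring grader is lenient: a target of "Rome" would also match a response that only contains "Romeo". As a sketch of stricter alternatives with the same `(response, target) -> bool` shape, the hypothetical helpers below normalize both strings and then require either an exact match or a whole-word match. These functions are illustrative and are not part of the cookbook.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_grader(response: str, target: str) -> bool:
    """True only if the normalized response equals the normalized target."""
    return normalize(response) == normalize(target)


def whole_word_grader(response: str, target: str) -> bool:
    """True if the normalized target appears in the response as whole words."""
    pattern = rf"\b{re.escape(normalize(target))}\b"
    return re.search(pattern, normalize(response)) is not None


print(whole_word_grader("The capital is Rome.", "Rome"))
print(whole_word_grader("Romeo and Juliet", "Rome"))
```

Either function can be passed as `grader_fn` in place of the substring check.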
742 
743 
744---
745 
746## File: dev-tips.mdx
747 
748# Developer Tips
749 
750## AI-assisted development
751 
752We've provided a single-file version of the documentation that can be fed to LLMs for development: see [llms.txt](/llms.txt) and [llms-full.txt](/llms-full.txt).
753 
754 
755---
756 
757## File: async.mdx
758 
759# Async and Futures
760 
761## Sync and Async APIs
762 
763Every method in the Tinker Python library has both a synchronous (sync) and an asynchronous (async) version. The async variants end with `_async`:
764 
765| **Client** | **Sync method** | **Async method** |
766|---|---|---|
767| `ServiceClient` | `create_lora_training_client()` | `create_lora_training_client_async()` |
768| `TrainingClient` | `forward()` | `forward_async()` |
769| `SamplingClient` | `sample()` | `sample_async()` |
770| `RestClient` | `list_training_run_ids()` | `list_training_run_ids_async()` |
771 
772Tinker's `async` functionality requires an `asyncio` event loop, which you typically run like `asyncio.run(main())`.
773 
774**When to use each:**
775 
776- **Async:** Best for high-performance workflows where you need concurrency, especially when waiting on multiple network calls.
777- **Sync:** Simpler for scripts and learning examples. Easier to reason about but blocks on each operation.
778 
779The Tinker Cookbook generally uses `async` for implementations where performance is critical and sync for pedagogical examples.
780 
781## Understanding Futures
782 
783Most Tinker API methods are **non-blocking**, but may take a little while to run. They return immediately with a `Future` object that acknowledges that your request has been submitted. To get the actual result, you must explicitly wait:
784 
785**Sync Python:**
786```python
787future = client.forward_backward(data, loss_fn)
788result = future.result() # Blocks until complete
789```
790 
791**Async Python (note the double await):**
792```python
793future = await client.forward_backward_async(data, loss_fn)
794result = await future
795```
796 
After the first `await`, you're guaranteed that the request has been submitted, which ensures that it'll be ordered correctly relative to other requests. The second `await` waits for the actual computation to finish and returns the numerical outputs. For operations like `forward_backward`, the second `await` also guarantees that the operation has been applied to the model: the gradients have been accumulated in the model's optimizer state.
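
The two-stage wait can be mimicked with plain `asyncio`, which may help build intuition. The sketch below uses a hypothetical `ToyTrainingClient` (not the Tinker API): its async method records the request as submitted and returns an `asyncio` future that resolves only once the simulated work finishes.

```python
import asyncio


class ToyTrainingClient:
    """Hypothetical stand-in that mimics the submit-then-wait shape."""

    def __init__(self):
        self.log = []
        self._tasks = []  # keep references so background tasks are not garbage collected

    async def forward_backward_async(self, batch):
        # Stage 1: the request is recorded as submitted, in call order.
        self.log.append(f"submitted:{batch}")
        future = asyncio.get_running_loop().create_future()

        async def compute():
            await asyncio.sleep(0.01)  # pretend the server is working
            self.log.append(f"finished:{batch}")
            future.set_result(len(batch))

        self._tasks.append(asyncio.create_task(compute()))
        return future


async def main():
    client = ToyTrainingClient()
    future = await client.forward_backward_async("abc")  # first await: submitted
    log_after_submit = list(client.log)
    result = await future  # second await: computation finished
    return log_after_submit, result, client.log


log_after_submit, result, final_log = asyncio.run(main())
print(log_after_submit, result, final_log)
```

After the first `await` the log contains only the submission; the "finished" entry appears once the returned future has been awaited.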
798 
799## Performance tips: overlap requests
800 
801For best performance, you should aim to submit your next request while the current one is running. Doing so is more important with Tinker than with other training systems because Tinker training runs on discrete [clock cycles](./under-the-hood#clock-cycles) (~10 seconds each). If you don't have a request queued when a cycle starts, you'll miss that cycle entirely.
802 
803**Example pattern for overlapping forward_backward and optim_step:**
804```python
805# Submit forward_backward
806fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)
807 
808# Submit optim_step immediately (don't wait for forward_backward to finish)
809optim_future = await client.optim_step_async(adam_params)
810 
811# Now retrieve results
812fwd_bwd_result = await fwd_bwd_future
813optim_result = await optim_future
814```
815 
816This pattern ensures both operations are queued and can be processed in the same [clock cycle](./under-the-hood#clock-cycles). In contrast, if you waited for `forward_backward` to complete before submitting `optim_step`, you might miss the next [clock cycle](./under-the-hood#clock-cycles).
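
A rough way to see the cost of not overlapping is to count cycles in a toy model. The sketch below is an assumption-laden illustration, not a description of Tinker's scheduler: it assumes a request runs in the first cycle that starts after it was submitted, that a result becomes visible only once that cycle ends, and that each training step waits for its `optim_step` result before starting the next step. `cycles_used` and `CYCLE_SECONDS` are hypothetical names.

```python
CYCLE_SECONDS = 10  # approximate cycle length quoted above


def cycles_used(num_steps: int, overlap: bool) -> int:
    """Count clock cycles consumed by `num_steps` training steps in the toy model."""
    next_cycle = 0  # index of the next cycle that has not started yet
    for _ in range(num_steps):
        fwd_bwd_cycle = next_cycle  # forward_backward is queued before this cycle
        if overlap:
            optim_cycle = fwd_bwd_cycle  # queued together, so it runs in the same cycle
        else:
            optim_cycle = fwd_bwd_cycle + 1  # submitted only after the first result arrived
        next_cycle = optim_cycle + 1  # the next step starts after optim_step's result
    return next_cycle


for overlap in (True, False):
    n = cycles_used(num_steps=100, overlap=overlap)
    print(f"overlap={overlap}: {n} cycles, about {n * CYCLE_SECONDS} seconds")
```

In this simplified model, waiting between the two calls doubles the number of cycles per step.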
817 
818 
819---
820 
821## File: download-weights.mdx
822 
823# Downloading weights
824 
825### CLI
826 
827```bash
828tinker checkpoint download $TINKER_CHECKPOINT_PATH
829```
830 
831See `tinker checkpoint download --help` for more details.
832 
833### SDK
834 
835You can also download checkpoints using the SDK.
836 
837Example:
838 
839```python
840import tinker
841import urllib.request
842 
843sc = tinker.ServiceClient()
844rc = sc.create_rest_client()
845future = rc.get_checkpoint_archive_url_from_tinker_path("tinker://<unique_id>/sampler_weights/final")
846checkpoint_archive_url_response = future.result()
847 
848# `checkpoint_archive_url_response.url` is a signed URL that can be downloaded
849# until checkpoint_archive_url_response.expires
850urllib.request.urlretrieve(checkpoint_archive_url_response.url, "archive.tar")
851```
852 
853Replace `<unique_id>` with your Training Run ID. This will save the LoRA adapter weights and config inside the `archive.tar` file.
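
Once the archive is on disk, Python's standard `tarfile` module can be used to look inside it. The sketch below defines a hypothetical helper, `list_archive_members`, and demonstrates it on a throwaway archive containing a placeholder file, since the actual file names inside a downloaded checkpoint are not specified here.

```python
import io
import os
import tarfile
import tempfile


def list_archive_members(archive_path: str) -> list[str]:
    """Return the sorted names of the regular files stored in a tar archive."""
    with tarfile.open(archive_path) as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Demo on a temporary archive; "placeholder.bin" is a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "archive.tar")
    payload = b"placeholder"
    with tarfile.open(demo_path, "w") as tar:
        info = tarfile.TarInfo(name="placeholder.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    demo_members = list_archive_members(demo_path)

print(demo_members)
```

Calling `list_archive_members("archive.tar")` on the downloaded file would list its contents before you extract anything.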
854 
855 
856---
857 
858## File: overview-building.mdx
859 
860# Overview: Tinker Cookbook
861 
862The next sections provide a variety of guides for how to use the Tinker API for research and applications.
863 
864We expect people to use Tinker in a few different ways:
865 
8661. You want to define datasets and environments and plug them into existing training code from the Tinker Cookbook.
8672. You want to write your own training loops from scratch, starting with the basics.
8683. You want to understand the classes and other concepts in Tinker Cookbook so you can extend them to add new functionality.
869 
870Different parts of the docs will be tailored to these different approaches.
871 
872We'll start with a couple of general pages that'll be relevant to almost all of the use cases:
873 
874- [Rendering to Tokens](./rendering.mdx) -- how we convert from a conversation data structure to a list of tokens (a.k.a. chat templates).
- [LoRA Primer](./lora-primer.mdx) -- basic background of LoRA, and how to choose hyperparameters. For most fine-tuning applications, LoRA will give results that are roughly the same as full fine-tuning; however, you need to use different learning rates.
876 
877 
878---
879 
880## File: save-load.mdx
881 
882# Saving and loading weights and optimizer state
883 
884During training, you'll need to save checkpoints for two main purposes: *sampling* (to test your model) and *resuming training* (to continue from where you left off). The `TrainingClient` provides three methods to handle these cases:
885 
1. `save_weights_for_sampler()`: saves a copy of the model weights that can be used for sampling.
2. `save_state()`: saves the weights and the optimizer state. You can fully resume training from this checkpoint.
3. `load_state()`: loads the weights and the optimizer state from a checkpoint saved with `save_state()`, so training continues where it left off.
889 
890Note that (1) is faster and requires less storage space than (2).
891 
892Both `save_*` functions require a `name` parameter---a string that you can set to identify the checkpoint within the current training run. For example, you can name your checkpoints `"0000"`, `"0001"`, `"step_1000"`, etc.
893 
The return value contains a `path` field: a fully qualified path that looks something like `tinker://<model_id>/<name>`. This path is persistent and can be loaded later by a new `ServiceClient` or `TrainingClient`.
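
If you need to log or index checkpoints, a path of this shape can be split with the standard library. The helper below, `split_tinker_path`, is a hypothetical convenience function that assumes the `tinker://<model_id>/<name>` layout described above; the identifier in the example is made up.

```python
from urllib.parse import urlparse


def split_tinker_path(path: str) -> tuple[str, str]:
    """Split 'tinker://<model_id>/<name>' into (model_id, name)."""
    parsed = urlparse(path)
    if parsed.scheme != "tinker" or not parsed.netloc:
        raise ValueError(f"not a tinker:// path: {path!r}")
    # The name may itself contain slashes, so keep everything after the model id.
    return parsed.netloc, parsed.path.lstrip("/")


model_id, name = split_tinker_path("tinker://example-run/sampler_weights/final")
print(model_id, name)
```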
895 
896### Example: Saving for sampling
897 
```python
# Setup
import tinker
service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B", rank=32
)

# Save a checkpoint that you can use for sampling
sampling_path = training_client.save_weights_for_sampler(name="0000").result().path

# Create a sampling client with that checkpoint
sampling_client = service_client.create_sampling_client(model_path=sampling_path)
```
912 
913**Shortcut:** Combine these steps with:
914 
915```python
916sampling_client = training_client.save_weights_and_get_sampling_client(name="0000")
917```
918 
919### Example: Saving to resume training
920 
921Use `save_state()` and `load_state()` when you need to pause and continue training with full optimizer state preserved:
922 
923```python
924# Save a checkpoint that you can resume from
925resume_path = training_client.save_state(name="0010").result().path
926 
927# Load that checkpoint
928training_client.load_state(resume_path)
929```
930 
931### When to use `save_state()` and `load_state()`:
932 
933 
934- Multi-step training pipelines (e.g. supervised learning followed by reinforcement learning)
935- Adjusting hyperparameters or data mid-run
936- Recovery from interruptions or failures
937- Any scenario where you need to preserve exact optimizer state (momentum, learning rate schedules, etc.)
938 
939 
940---
941 
942## File: training-sampling.mdx
943 
944import { Callout } from 'nextra/components'
945 
946# Getting started with training and sampling
947 
948In this guide, we'll step you through using the Tinker Python library to do the basic operations needed for training and sampling.
949[View the complete Python script →](/quickstart.py.txt)
950 
951## Creating the training client
952 
953The main object we'll be using is the `TrainingClient`, which corresponds to a fine-tuned model that we can train and sample from.
954 
955First, set your Tinker API key environment variable. In the terminal where you'll run Python, or in your `.bashrc`, put `export TINKER_API_KEY=<your key>`.
956 
Then, create a `ServiceClient`. This lets you find out what base models are available to be fine-tuned.
958 
959```python
960import tinker
961service_client = tinker.ServiceClient()
962print("Available models:")
963for item in service_client.get_server_capabilities().supported_models:
964    print("- " + item.model_name)
965```
966You'll see a list of model names:
967```
968- meta-llama/Llama-3.1-70B
969- meta-llama/Llama-3.1-8B
970...
971- Qwen/Qwen3-VL-30B-A3B-Instruct
972- Qwen/Qwen3-VL-235B-A22B-Instruct
973```
974We currently support models from the Qwen3, Qwen3-VL, and Llama3 series. We'll use Qwen3-VL-30B-A3B-Instruct for these examples, as it's a vision-language model that can also handle text-only tasks. See [Available Models in Tinker](/model-lineup) for the full list.
975 
976Now we can create the `TrainingClient`:
977```python
978base_model = "Qwen/Qwen3-VL-30B-A3B-Instruct"
979training_client = service_client.create_lora_training_client(
980    base_model=base_model
981)
982```
As the name suggests, this model was already fine-tuned for chat/instruction-following. You should check the details of the model you're using in its system card.
984 
985## Preparing the training data
986 
987Now we can do training updates on the model. This quickstart example won't show best practices for LLM fine-tuning; it's just an API demo. Check out [Rendering](/rendering), [Supervised Fine-tuning](/supervised-learning) and the other Cookbook examples for guidance on how to use Tinker in real applications.
988 
For this example, we'll train a model that can translate words into Pig Latin. The rules for Pig Latin are simple:
- If a word begins with a consonant, move the leading consonant sound to the end and add "ay"
- If a word begins with a vowel, just add "way" to the end
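
For reference, the two rules can be written as a small function, which is handy for generating more training pairs or checking model outputs. This is a sketch under stated assumptions: it handles lowercase words only, moves the whole leading consonant cluster, treats "y" as a vowel when it is not the first letter, and uses the hyphenated output style of the examples on this page. Under the cluster rule, "split" becomes "it-splay". The function names are hypothetical.

```python
VOWELS = "aeiou"


def pig_latin_word(word: str) -> str:
    """Translate one lowercase word using the hyphenated style shown on this page."""
    # Find the first vowel; "y" counts as a vowel unless it starts the word.
    split_at = None
    for i, ch in enumerate(word):
        if ch in VOWELS or (ch == "y" and i > 0):
            split_at = i
            break
    if split_at is None:
        return f"{word}-ay"  # no vowel at all: leave the word intact
    if split_at == 0:
        return f"{word}-way"  # vowel-initial word
    return f"{word[split_at:]}-{word[:split_at]}ay"


def pig_latin(phrase: str) -> str:
    """Translate a whitespace-separated phrase word by word."""
    return " ".join(pig_latin_word(w) for w in phrase.split())


print(pig_latin("hello world"))
print(pig_latin("space exploration"))
```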
992 
993Here are some example completions we'd like the model to perform, where the prompt is in green and the model's completion is in red:
994 
995<div className="example">
996<span className="prompt">English: hello world<br/>
997Pig Latin: </span><span className="completion">ello-hay orld-way</span>
998</div>
999 
1000Let's create some training examples and convert them to a format expected by Tinker.
1001 
1002```python
1003# Create some training examples
1004examples = [
1005    {
1006        "input": "banana split",
1007        "output": "anana-bay plit-say"
1008    },
1009    {
1010        "input": "quantum physics",
1011        "output": "uantum-qay ysics-phay"
1012    },
1013    {
1014        "input": "donut shop",
1015        "output": "onut-day op-shay"
1016    },
1017    {
1018        "input": "pickle jar",
1019        "output": "ickle-pay ar-jay"
1020    },
1021    {
1022        "input": "space exploration",
1023        "output": "ace-spay exploration-way"
1024    },
1025    {
1026        "input": "rubber duck",
1027        "output": "ubber-ray uck-day"
1028    },
1029    {
1030        "input": "coding wizard",
1031        "output": "oding-cay izard-way"
1032    },
1033]
1034 
1035# Convert examples into the format expected by the training client
1036from tinker import types
1037 
1038# Get the tokenizer from the training client
1039tokenizer = training_client.get_tokenizer()
1040 
1041def process_example(example: dict, tokenizer) -> types.Datum:
1042    # Format the input with Input/Output template
1043    # For most real use cases, you'll want to use a renderer / chat template,
1044    # (see later docs) but here, we'll keep it simple.
1045    prompt = f"English: {example['input']}\nPig Latin:"
1046 
1047    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
1048    prompt_weights = [0] * len(prompt_tokens)
1049    # Add a space before the output string, and finish with double newline
1050    completion_tokens = tokenizer.encode(f" {example['output']}\n\n", add_special_tokens=False)
1051    completion_weights = [1] * len(completion_tokens)
1052 
1053    tokens = prompt_tokens + completion_tokens
1054    weights = prompt_weights + completion_weights
1055 
1056    input_tokens = tokens[:-1]
1057    target_tokens = tokens[1:] # We're predicting the next token, so targets need to be shifted.
1058    weights = weights[1:]
1059 
1060    # A datum is a single training example for the loss function.
1061    # It has model_input, which is the input sequence that'll be passed into the LLM,
1062    # loss_fn_inputs, which is a dictionary of extra inputs used by the loss function.
1063    return types.Datum(
1064        model_input=types.ModelInput.from_ints(tokens=input_tokens),
1065        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens)
1066    )
1067 
1068processed_examples = [process_example(ex, tokenizer) for ex in examples]
1069 
1070# Visualize the first example for debugging purposes
1071datum0 = processed_examples[0]
1072print(f"{'Input':<20} {'Target':<20} {'Weight':<10}")
1073print("-" * 50)
1074for i, (inp, tgt, wgt) in enumerate(zip(datum0.model_input.to_ints(), datum0.loss_fn_inputs['target_tokens'].tolist(), datum0.loss_fn_inputs['weights'].tolist())):
1075    print(f"{repr(tokenizer.decode([inp])):<20} {repr(tokenizer.decode([tgt])):<20} {wgt:<10}")
1076```
1077 
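The visualization looks like the following. (The output shown here was produced for the phrase "I love tinkering" rather than "banana split", so the tokens differ from what you'll see for the first example above.)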
1079 
1080```
1081Input                Target               Weight
1082--------------------------------------------------
1083'English'            ':'                  0.0
1084':'                  ' I'                 0.0
1085' I'                 ' love'              0.0
1086' love'              ' tink'              0.0
1087' tink'              'ering'              0.0
1088'ering'              '\n'                 0.0
1089'\n'                 'P'                  0.0
1090'P'                  'ig'                 0.0
1091'ig'                 ' Latin'             0.0
1092' Latin'             ':'                  0.0
1093':'                  ' I'                 1.0
1094' I'                 '-way'               1.0
1095'-way'               ' o'                 1.0
1096' o'                 've'                 1.0
1097've'                 '-l'                 1.0
1098'-l'                 'ay'                 1.0
1099'ay'                 ' ink'               1.0
1100' ink'               'ering'              1.0
1101'ering'              '-t'                 1.0
1102'-t'                 'ay'                 1.0
1103'ay'                 '<|endoftext|>'      1.0
1104```
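
The shifting step is easy to get wrong, so here is the same logic in isolation with made-up integer token IDs instead of a real tokenizer. `shift_for_next_token` is a hypothetical helper that mirrors the slicing in `process_example` above.

```python
def shift_for_next_token(prompt_tokens: list[int], completion_tokens: list[int]):
    """Build (input_tokens, target_tokens, weights) for next-token prediction."""
    tokens = prompt_tokens + completion_tokens
    weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)
    # Position i of the input predicts token i + 1, so targets and weights
    # drop their first entry and inputs drop their last.
    return tokens[:-1], tokens[1:], weights[1:]


inputs, targets, weights = shift_for_next_token([11, 12, 13], [21, 22])
print(inputs)
print(targets)
print(weights)
```

All three lists have the same length, and the weight is 1 exactly where the target is a completion token, matching the pattern in the table above.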
1105 
1106## Vision inputs
1107 
1108The above example is text-only, but adding vision inputs is also straightforward. The `ModelInput` type takes a list of chunks, which can be either `EncodedTextChunk` or `ImageChunk`. For instance:
1109 
```python
import requests
import tinker
from tinker import types

# `tokenizer` is the tokenizer obtained from the training client above.
image_data = requests.get("https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png").content
model_input = tinker.ModelInput(chunks=[
  types.EncodedTextChunk(tokens=tokenizer.encode("<|im_start|>user\n<|vision_start|>")),
  types.ImageChunk(data=image_data, format="png"),
  types.EncodedTextChunk(tokens=tokenizer.encode("<|vision_end|>What is this?<|im_end|>\n<|im_start|>assistant\n")),
])
```
1118 
1119Note that Qwen3-VL was trained with special tokens like `<|vision_start|>` and `<|vision_end|>`. The cookbook's `Qwen3VLRenderer` handles these automatically—see [Rendering: Vision Inputs](/rendering#vision-inputs) for details and a complete example.
1120 
1121## Performing a training update
1122 
1123Now we can use this data to perform a training update. We'll do 6 updates on the same batch of data. (Note that this is not typically a good way to train!)
1124 
1125```python
1126import numpy as np
1127for _ in range(6):
1128    fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
1129    optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
1130 
1131    # Wait for the results
1132    fwdbwd_result = fwdbwd_future.result()
1133    optim_result = optim_future.result()
1134 
1135    # fwdbwd_result contains the logprobs of all the tokens we put in. Now we can compute the weighted
1136    # average log loss per token.
1137    logprobs = np.concatenate([output['logprobs'].tolist() for output in fwdbwd_result.loss_fn_outputs])
1138    weights = np.concatenate([example.loss_fn_inputs['weights'].tolist() for example in processed_examples])
1139    print(f"Loss per token: {-np.dot(logprobs, weights) / weights.sum():.4f}")
1140```
1141 
1142Note that the `forward_backward` and `optim_step` functions immediately return *futures*, which acknowledge that the task has been queued up by the server. For improved speed, we submitted both operations before waiting for the result by calling `result()` on the futures.
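
The loss printed in the loop is a weighted average of negative log-probabilities. As a dependency-free sketch of that formula, the hypothetical function `weighted_nll` below computes the same quantity on plain Python lists; the numbers are made up.

```python
def weighted_nll(logprobs: list[float], weights: list[float]) -> float:
    """Weighted average negative log-likelihood per token."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all weights are zero, so the loss is undefined")
    # Tokens with weight 0 (the prompt) contribute nothing to the sum.
    return -sum(lp * w for lp, w in zip(logprobs, weights)) / total_weight


loss = weighted_nll([-0.5, -1.0, -3.0], [0, 1, 1])
print(f"Loss per token: {loss:.4f}")
```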
1143 
1144 
1145## Sampling from the model
1146 
1147Now we can test our model by sampling from it. In this case, we'll translate the phrase "coffee break" into Pig Latin.
1148 
1149```python
1150# First, create a sampling client. We need to transfer weights
1151sampling_client = training_client.save_weights_and_get_sampling_client(name='pig-latin-model')
1152 
1153# Now, we can sample from the model.
1154prompt = types.ModelInput.from_ints(tokenizer.encode("English: coffee break\nPig Latin:"))
1155params = types.SamplingParams(max_tokens=20, temperature=0.0, stop=["\n"]) # Greedy sampling
1156future = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=8)
1157result = future.result()
1158print("Responses:")
1159for i, seq in enumerate(result.sequences):
1160    print(f"{i}: {repr(tokenizer.decode(seq.tokens))}")
1161```
1162 
1163Since sampling is nondeterministic (sadly, even with temperature=0.0, [due to batching](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)), the output will be different each time. You should see something like this:
1164 
1165```
1166Responses:
11670: ' offe-bay eak-bay\n\n'
11681: ' offey-coy eak-bray\n\n'
11692: ' offecay eakbray\n\n'
11703: ' offeec-cay eak-brcay\n\n\n'
11714: ' offecay akebay\n\n'
11725: ' offee-Cay ake-bay\n\n\n'
11736: ' offey-pay eak-bray\n\n'
11747: ' offee – cay eak – bray\n\n'
1175```
1176 
1177### Computing logprobs for a sequence
1178 
1179We can use the sampler to compute logprobs for a given sequence as well. This uses the prefill step and is returned as _prompt logprobs_.
1180 
1181```python
1182prompt = types.ModelInput.from_ints(tokenizer.encode("How many r's are in the word strawberry?"))
1183sample_response = sampling_client.sample(
1184    prompt=prompt,
1185    num_samples=1,
1186    sampling_params=tinker.SamplingParams(max_tokens=1),  # Must be at least 1 token, represents prefill step
1187    include_prompt_logprobs=True,
1188).result()
1189 
1190# example: [None, -9.54505, -1.64629, -8.81116, -3.50217, -8.25927, ...]
1191print(sample_response.prompt_logprobs)
1192```
1193 
1194The first logprob is `None` (corresponding to the first token), and subsequent entries are logprobs of each token in the prompt.
1195 
1196The sampling client also has a helper function, which is the same as above:
1197 
1198```python
1199sampling_client.compute_logprobs(prompt).result()
1200```
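
Prompt logprobs are often reduced to a total sequence log-probability or a perplexity. The sketch below shows one way to do that while skipping the leading `None`; `sequence_stats` is a hypothetical helper and the input values are made up.

```python
import math


def sequence_stats(prompt_logprobs: list) -> tuple[float, float]:
    """Return (total logprob, perplexity) over the tokens that have a logprob."""
    scored = [lp for lp in prompt_logprobs if lp is not None]
    if not scored:
        raise ValueError("no scored tokens")
    total = sum(scored)
    # Perplexity is exp of the mean negative log-probability.
    perplexity = math.exp(-total / len(scored))
    return total, perplexity


total, ppl = sequence_stats([None, -1.0, -2.0])
print(total, ppl)
```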
1201 
1202### Top-k logprobs
1203 
For distillation, it may be especially useful to compute _top-k logprobs_ for each token as well, which gives you a sense of what the model "would have said" after each prefix instead of the token that actually appears in the prompt.
1205 
1206```python
1207sample_response = sampling_client.sample(
1208    prompt=prompt,
1209    num_samples=1,
1210    sampling_params=tinker.SamplingParams(max_tokens=1),
1211    include_prompt_logprobs=True,
1212    topk_prompt_logprobs=5,
1213).result()
1214 
1215# example:
1216# [None,
1217#  [(14924, -1.17005), (755, -2.23255), (2, -2.73255), (791, -3.67005), (16309, -4.29505)],
1218#  [(25, -1.64629), (3137, -2.39629), (11630, -2.89629), (21460, -3.83379), (14881, -4.02129)],
1219#  [(41, -3.49866), (42, -3.49866), (49, -4.24866), (38, -4.37366), (54, -4.49866)],
1220#  [(311, -1.00217), (656, -2.25217), (2057, -2.75217), (649, -3.25217), (10470, -3.37717)],
1221#  ...]
1222sample_response.topk_prompt_logprobs
1223```
1224 
For each position in the prompt, this returns a list of `(token_id, logprob)` pairs for the top-k most likely tokens at that position. As with prompt logprobs, the first entry is `None`.
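
As a sketch of how this structure might be consumed, the hypothetical helper below picks the most likely token at each position and converts its logprob to a probability. The input is a shortened copy of the example values shown above.

```python
import math


def most_likely_per_position(topk_prompt_logprobs: list) -> list:
    """For each position, return (token_id, probability) of the top candidate, or None."""
    best = []
    for entry in topk_prompt_logprobs:
        if entry is None:  # the first position has no prediction
            best.append(None)
            continue
        token_id, logprob = max(entry, key=lambda pair: pair[1])
        best.append((token_id, math.exp(logprob)))
    return best


example = [
    None,
    [(14924, -1.17005), (755, -2.23255)],
    [(25, -1.64629), (3137, -2.39629)],
]
best = most_likely_per_position(example)
print(best)
```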
1226 
1227## Putting it together: Sampling from an image
1228 
Here's a complete example that creates a training client, saves weights for sampling, and asks a question about an image. You can copy-paste it into an IPython notebook, where top-level `await` works:
1230 
1231```python
1232import requests
1233import tinker
1234from transformers import AutoTokenizer
1235 
1236model_name = "Qwen/Qwen3-VL-30B-A3B-Instruct"
1237tokenizer = AutoTokenizer.from_pretrained(model_name)
1238 
1239service_client = tinker.ServiceClient()
1240training_client = await service_client.create_lora_training_client_async(base_model=model_name, rank=32)
1241sampling_client = await training_client.save_weights_and_get_sampling_client_async(name="sampler")
1242 
1243# Grab an image and ask a question
1244image_data = requests.get("https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png").content
1245model_input = tinker.ModelInput(chunks=[
1246    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("<|im_start|>user\n<|vision_start|>")),
1247    tinker.types.ImageChunk(data=image_data, format="png"),
1248    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("<|vision_end|>What is this?<|im_end|>\n<|im_start|>assistant\n")),
1249])
1250 
1251result = await sampling_client.sample_async(prompt=model_input, num_samples=1, sampling_params=tinker.types.SamplingParams(max_tokens=100))
1252print(tokenizer.decode(result.sequences[0].tokens))
1253```
1254 
1255For higher-level abstractions that handle special tokens automatically, see [Rendering: Vision Inputs](/rendering#vision-inputs).
1256 
1257 
1258---
1259 
1260## File: rendering.mdx
1261 
1262import { CookbookLink } from '../components/CookbookLink'
1263 
1264 
1265# Rendering to tokens
1266 
1267Rendering converts list-of-message datatypes into their token representations for model training and inference. While similar to [chat templates](https://huggingface.co/docs/transformers/en/chat_templating), Tinker's rendering system is designed for the full training lifecycle--not just inference--supporting supervised learning, reinforcement learning, and deployment.
1268 
1269 
1270## The Renderer class
1271 
1272The Renderer class is the main interface used for rendering. It can be found in <CookbookLink path="tinker_cookbook/renderers.py">`renderers.py`</CookbookLink>.
1273 
1274**Example conversation:**
1275 
1276```python
1277messages =[
1278    {'role': 'system', 'content': 'Answer concisely; at most one sentence per response'},
1279    {'role': 'user', 'content': 'What is the longest-lived rodent species?'},
1280    {'role': 'assistant', 'content': 'The naked mole rat, which can live over 30 years.'},
1281    {'role': 'user', 'content': 'How do they live so long?'},
1282    {'role': 'assistant', 'content': 'They evolved multiple protective mechanisms including special hyaluronic acid that prevents cancer, extremely stable proteins, and efficient DNA repair systems that work together to prevent aging.'}
1283]
1284```
1285 
1286We'll use this conversation throughout the examples below.
1287 
1288## Inference: Generating messages
1289 
1290Our model maps tokens to tokens, but with the renderer, it can map messages to messages. To sample messages from the model, we need to use three methods from the renderer:
1291 
1292- `build_generation_prompt`
1293- `get_stop_sequences`
1294- `parse_response`
1295 
1296 
1297`build_generation_prompt` converts a conversation into a prompt that we can use to sample from the assistant. This is used during reinforcement learning and at deployment time.
1298 
1299 
1300**Example: Generate an alternative assistant response**
1301 
1302Let's remove the last assistant message and call `build_generation_prompt` to get a prompt that we can use to sample an alternative response from the assistant:
1303 
```python
from tinker_cookbook import renderers, tokenizer_utils

tokenizer = tokenizer_utils.get_tokenizer('Qwen/Qwen3-30B-A3B')
renderer = renderers.get_renderer('qwen3', tokenizer)
# Drop the final assistant message so the model can generate a replacement.
prompt = renderer.build_generation_prompt(messages[:-1])
print(prompt)
print('-'*10)
print(tokenizer.decode(prompt.to_ints()))
```
1313 
1314**Output:**
1315```
1316ModelInput(chunks=[EncodedTextChunk(tokens=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 8948, 198, 16141, 3529, 285, 974, 26, 518, 1429, 825, 11652, 817, 2033, 151645, 198, 151644, 872, 198, 3838, 374, 279, 22032, 61854, 20589, 306, 9419, 30, 151645, 198, 151644, 77091, 198, 785, 19020, 34651, 11244, 11, 892, 646, 3887, 916, 220, 18, 15, 1635, 13, 151645, 198, 151644, 872, 198, 10234, 30, 151645, 198, 151644, 77091, 198], type='encoded_text')])
1317----------
1318<|im_start|>system
1319Answer concisely; at most one sentence per response<|im_end|>
1320<|im_start|>user
1321What is the longest-lived rodent species?<|im_end|>
1322<|im_start|>assistant
1323The naked mole rat, which can live over 30 years.<|im_end|>
1324<|im_start|>user
1325How do they live so long?<|im_end|>
1326<|im_start|>assistant
1327 
1328```
1329 
The prompt is a `ModelInput` object, which here holds a list of `EncodedTextChunk` objects (for multimodal data it can contain other chunk types).
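
To make the decoded output above concrete, here is a minimal string-level sketch of the layout it shows. The function `render_chatml_prompt` is hypothetical and is not the cookbook's renderer, which works on tokens and returns a `ModelInput`; it only mirrors the `<|im_start|>` / `<|im_end|>` structure visible in the decoded prompt.

```python
# Hypothetical helper for illustration; not part of tinker_cookbook.
def render_chatml_prompt(messages: list[dict]) -> str:
    """Lay out messages in the ChatML-style format shown in the decoded prompt."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    # Open an assistant turn so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "What is the longest-lived rodent species?"},
]
print(render_chatml_prompt(example))
```

The trailing `<|im_start|>assistant\n` is what turns a rendered conversation into a generation prompt: the model is left at the start of a new assistant turn.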
1331 
1332 
1333**Sampling and parsing the response:**
1334 
1335Given that we're providing messages as input, we probably want a message output, rather than a token output. For that, we can use `parse_response`.
1336 
```python
import tinker
from tinker.types import SamplingParams

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(base_model='Qwen/Qwen3-30B-A3B')

# Stop sampling when the model emits the renderer's end-of-turn token.
stop_sequences = renderer.get_stop_sequences()
print(f"Stop sequences: {stop_sequences}")
sampling_params = SamplingParams(max_tokens=100, temperature=0.5, stop=stop_sequences)
output = sampling_client.sample(prompt, sampling_params=sampling_params, num_samples=1).result()
print(f"Sampled tokens: {output.sequences[0].tokens}")

# Convert the sampled tokens back into a message.
sampled_message, parse_success = renderer.parse_response(output.sequences[0].tokens)
print(f"Sampled message: {sampled_message}")
print(f"Parse success: {parse_success}")
```
1351 
1352**Output:**
1353 
1354```
1355Stop sequences: [151645]
1356Sampled tokens: [45, 7741, 34651, 31410, 614, 4911, 76665, 11, 2670, 264, 7548, 11050, 22077, 1849, 323, 264, 1602, 3347, 40761, 4379, 11, 892, 16792, 311, 862, 57119, 13, 151645]
1357Sampled message: {'role': 'assistant', 'content': 'Naked mole rats have unique adaptations, including a highly efficient immune system and a very low metabolic rate, which contribute to their longevity.'}
1358Parse success: True
1359```
1360 
You can see that there is one stop sequence, `151645`, which you can verify is the `<|im_end|>` token. The output is parsed successfully into a message.
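
As a rough sketch of what parsing involves, the snippet below cuts the sampled tokens at the stop token and decodes the rest into an assistant message. The function `parse_response_sketch` and the toy decoder are hypothetical, and treating "no stop token found" as a parse failure is an assumption about the behavior, not a description of the cookbook's `parse_response`.

```python
# Hypothetical sketch; the cookbook's parse_response may differ in detail.
IM_END = 151645  # stop token ID from the output above


def parse_response_sketch(tokens, decode, stop_token=IM_END):
    """Return (message, success); success means the stop token ended the sample."""
    if stop_token in tokens:
        content_tokens = tokens[: tokens.index(stop_token)]
        success = True
    else:
        # Assumed failure case: sampling hit max_tokens before the turn was closed.
        content_tokens = tokens
        success = False
    return {"role": "assistant", "content": decode(content_tokens)}, success


# Toy decoder standing in for tokenizer.decode.
vocab = {1: "Naked", 2: " mole", 3: " rats"}
decode = lambda ids: "".join(vocab[i] for i in ids)

print(parse_response_sketch([1, 2, 3, IM_END], decode))
print(parse_response_sketch([1, 2], decode))
```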
1362 
1363 
1364## Training: Supervised learning
1365 
1366For supervised learning (and some other algorithms like [DPO](/preferences/dpo-guide)), we need to distinguish between **prompt tokens** (context) and **completion tokens** (what the model should learn to generate). We want to provide a target assistant message, and the renderer needs to tell us which tokens are part of the prompt and completion.
1367 
1368We can use `build_supervised_example` to get a `ModelInput` and per-token loss weights:
1369 
```python
from tinker_cookbook.utils.format_colorized import format_colorized

# weights has one entry per token: 0 for prompt tokens, 1 for completion tokens.
model_input, weights = renderer.build_supervised_example(messages)
print(format_colorized(model_input.to_ints(), weights, tokenizer))
```
1376 
1377We get the following output:
1378 
1379<div className="example">
1380<span className="prompt">&lt;|im_start|&gt;system↵<br />Answer concisely; at most one sentence per response&lt;|im_end|&gt;↵<br />&lt;|im_start|&gt;user↵<br />What is the longest-lived rodent species?&lt;|im_end|&gt;↵<br />&lt;|im_start|&gt;assistant↵<br />The naked mole rat, which can live over 30 years.&lt;|im_end|&gt;↵<br />&lt;|im_start|&gt;user↵<br />How do they live so long?&lt;|im_end|&gt;↵<br />&lt;|im_start|&gt;assistant↵<br /></span>
1381<span className="completion">They evolved multiple protective mechanisms including special hyaluronic acid that prevents cancer, extremely stable proteins, and efficient DNA repair systems that work together to prevent aging.&lt;|im_end|&gt;<br /></span>
1382</div>
The green text is part of the prompt (i.e. with `weight=0`, so no loss is computed on these tokens) and the red text is part of the completion (i.e. with `weight=1`, so the model is trained to predict these tokens). Note that the ↵ symbols have been inserted for clarity to show newlines; they are not actually part of the token sequence.
1384 
1385The key insight here is that only the final assistant message is treated as the completion. All previous context, including the first assistant response, is part of the prompt, so the model learns to continue conversations rather than just answer single questions.
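
The weighting rule can be sketched without any model. The example below uses a whitespace "tokenizer" and a hypothetical `build_supervised_example_sketch`; real renderers operate on token IDs, so this is only an illustration of which positions get weight 0 and which get weight 1.

```python
# Hypothetical sketch using whitespace splitting; real renderers work on token IDs.
def build_supervised_example_sketch(messages):
    """Return (tokens, weights) with weight 1 only on the final assistant message."""
    tokens, weights = [], []
    last = len(messages) - 1
    for i, message in enumerate(messages):
        header = [f"<|im_start|>{message['role']}"]
        body = message["content"].split() + ["<|im_end|>"]
        is_target = i == last and message["role"] == "assistant"
        tokens += header + body
        # The role header is context; only the body of the target message is trained on.
        weights += [0] * len(header) + [1 if is_target else 0] * len(body)
    return tokens, weights


conversation = [
    {"role": "user", "content": "How do they live so long?"},
    {"role": "assistant", "content": "Efficient DNA repair."},
]
tokens, weights = build_supervised_example_sketch(conversation)
for token, weight in zip(tokens, weights):
    print(weight, token)
```

As in the colorized output, the `<|im_start|>assistant` header counts as prompt, while the assistant text and its closing `<|im_end|>` count as completion.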
1386 
1387## Vision Inputs
1388 
1389Tinker supports vision-language models (VLMs) like `Qwen/Qwen3-VL-30B-A3B-Instruct` and `Qwen/Qwen3-VL-235B-A22B-Instruct`. For low-level `ImageChunk` usage, see [Vision inputs](/training-sampling#vision-inputs) in the Training and Sampling guide. This section covers the higher-level message abstractions.
1390 
1391### Multimodal messages
1392 
1393For VLMs, message content can be either a string or a list of content parts:
1394 
```python
from tinker_cookbook.renderers import Message, TextPart, ImagePart

# Text-only message (standard)
text_message = Message(role='user', content='What is this?')

# Multimodal message with image
multimodal_message = Message(
    role='user',
    content=[
        ImagePart(type='image', image='https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png'),
        TextPart(type='text', text='What is in this image?'),
    ]
)
```
1410 
1411For lower-level control using `ImageChunk` directly, see [Vision inputs](/training-sampling#vision-inputs) in the Training and Sampling guide.
1412 
1413### Using Qwen3VLRenderer
1414 
1415The `Qwen3VLRenderer` handles Qwen's vision special tokens (`<|vision_start|>`, `<|vision_end|>`) automatically:
1416 
```python
from tinker_cookbook import renderers, tokenizer_utils
from tinker_cookbook.image_processing_utils import get_image_processor

model_name = "Qwen/Qwen3-VL-235B-A22B-Instruct"
tokenizer = tokenizer_utils.get_tokenizer(model_name)
image_processor = get_image_processor(model_name)

renderer = renderers.Qwen3VLRenderer(tokenizer, image_processor)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'image', 'image': 'https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png'},
            {'type': 'text', 'text': 'What is in this image?'},
        ]
    }
]

prompt = renderer.build_generation_prompt(messages)
```
1439 
1440For a complete example of training a VLM image classifier, see the <CookbookLink path="tinker_cookbook/recipes/vlm_classifier">VLM Classifier recipe</CookbookLink> in the cookbook.
1441 
1442## Multi-turn RL and the Extension Property
1443 
1444When using renderers in multi-turn RL, an important consideration is whether consecutive timesteps satisfy the **extension property**—where each observation is a prefix extension of the previous observation plus action. This affects compute efficiency (O(T) vs O(T^2)) and KV-cache reuse.
1445 
1446Some renderers, like `Qwen3Renderer`, have options that affect this property. For example, `strip_thinking_from_history` controls whether `<think>` blocks are preserved in conversation history.
1447 
1448See the [Sequence Extension](/rl/sequence-extension) documentation for details on how this works and the tradeoffs involved.
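
The property and its cost implication can be sketched over plain token-ID lists. Both functions below are hypothetical illustrations, and the cost model is idealized: it assumes that when the property holds each token is processed once, and that otherwise every turn re-processes the full history.

```python
# Hypothetical sketch over token-ID lists; names are illustrative.
def satisfies_extension(prev_observation, action, next_observation):
    """True if next_observation starts with prev_observation + action."""
    prefix = prev_observation + action
    return next_observation[: len(prefix)] == prefix


def tokens_processed(turn_lengths, extension_holds):
    """Count tokens processed over an episode under the idealized cost model."""
    if extension_holds:
        # Each new token is processed once; earlier ones are reused.
        return sum(turn_lengths)
    # Otherwise every turn re-processes the full history so far.
    total, history = 0, 0
    for length in turn_lengths:
        history += length
        total += history
    return total


print(satisfies_extension([1, 2], [3], [1, 2, 3, 4]))
print(satisfies_extension([1, 2], [3], [1, 2, 4]))
print(tokens_processed([100] * 10, extension_holds=True))
print(tokens_processed([100] * 10, extension_holds=False))
```

Rewriting history between turns, for example by stripping earlier `<think>` blocks, is the kind of change that makes the second check fail.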
1449 
1450## Appendix: Why not Jinja templates?
1451 
In our experience, Jinja2 templates are harder to write than Python code, especially when we need to get the whitespace exactly right. They are also unwieldy for supervised learning, where you need to put different labels on different tokens.
1453 
1454 
1455---
1456 
1457## File: completers.mdx
1458 
1459import { CookbookLink } from '../components/CookbookLink'
1460 
1461# Completers
1462 
1463The concept of policies is crucial to the RL training process. In the Tinker Cookbook, policies are implemented as `Completers`. Completers are abstractions that represent models or policies that can be sampled from, providing different levels of structure depending on your use case.
1464 
1465## Overview of Completer Types
1466 
1467The Tinker Cookbook provides two main types of completers, each designed for different use cases:
1468 
1. **TokenCompleter**: Operates on tokens and is used by RL algorithms
2. **MessageCompleter**: Operates on messages and needs to be used with a renderer
1471 
1472The choice between these depends on whether you're working at the token level for RL training or at the message level for interacting with and evaluating the model.
1473 
1474### TokenCompleter
1475 
1476The `TokenCompleter` is the foundational interface used by RL algorithms because they work directly with tokens.
1477 
```python
class TokenCompleter:
    async def __call__(
        self, model_input: types.ModelInput, stop: StopCondition
    ) -> TokensWithLogprobs:
        ...  # interface only; concrete completers implement sampling here
```
1484 
1485This interface takes:
1486- `model_input`: The input to the model (of type `types.ModelInput`)
1487- `stop`: Stop conditions, either a list of strings or token IDs (combined into a `StopCondition` class). When training with reinforcement learning, this should be defined by the `initial_observation` function of the environment.
1488 
1489It returns a `TokensWithLogprobs` object containing:
1490- `tokens`: The generated token sequence
1491- `maybe_logprobs`: Optional log probabilities for each token
1492 
1493### MessageCompleter
1494 
1495The `MessageCompleter` operates at a higher level with structured messages, similarly to standard chat APIs. It takes a list of messages and returns a single assistant message response.
1496 
```python
class MessageCompleter:
    async def __call__(self, messages: list[renderers.Message]) -> renderers.Message:
        ...  # interface only; concrete completers implement sampling here
```
1501 
For RL training we use the `TokenCompleter`, because the update step needs to optimize the same set of tokens that the model output during rollout. The `MessageCompleter` is useful for sampling where we need to use the model output for semantic purposes, such as judge models or multi-agent environments.

The Tinker Cookbook uses two concrete implementations of these interfaces, <CookbookLink path="tinker_cookbook/completers.py">`TinkerTokenCompleter`</CookbookLink> and <CookbookLink path="tinker_cookbook/completers.py">`TinkerMessageCompleter`</CookbookLink>, which are both wrappers around a `tinker.SamplingClient`. While the `TinkerTokenCompleter` operates directly on tokens, the `TinkerMessageCompleter` needs to be instantiated with a renderer to make it compatible with the inputs expected by the sampling client.
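
The relationship between the two levels can be sketched with toy stand-ins. Everything below is hypothetical: `EchoTokenCompleter`, `ToyRenderer`, and `SketchMessageCompleter` are not cookbook classes, and the real `TinkerMessageCompleter` wraps a sampling client rather than another completer. The sketch only shows why a message-level completer needs a renderer on both sides of the sampling call.

```python
import asyncio


# Toy stand-ins; the real interfaces live in tinker_cookbook/completers.py.
class EchoTokenCompleter:
    """Token-level policy that returns fixed tokens regardless of input."""

    async def __call__(self, model_input, stop):
        return [7, 8, 9]


class ToyRenderer:
    """Maps messages to tokens and tokens back to a message."""

    def build_generation_prompt(self, messages):
        return [len(m["content"]) for m in messages]

    def get_stop_sequences(self):
        return [0]

    def parse_response(self, tokens):
        return {"role": "assistant", "content": " ".join(map(str, tokens))}, True


class SketchMessageCompleter:
    """Message-level policy built from a token-level policy plus a renderer."""

    def __init__(self, token_completer, renderer):
        self.token_completer = token_completer
        self.renderer = renderer

    async def __call__(self, messages):
        prompt = self.renderer.build_generation_prompt(messages)  # messages -> tokens
        stop = self.renderer.get_stop_sequences()
        tokens = await self.token_completer(prompt, stop)
        message, _ok = self.renderer.parse_response(tokens)  # tokens -> message
        return message


completer = SketchMessageCompleter(EchoTokenCompleter(), ToyRenderer())
reply = asyncio.run(completer([{"role": "user", "content": "hi"}]))
print(reply)
```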
1505 
1506 
1507---
1508 
1509## File: install.mdx
1510 
1511# Installing Tinker
1512 
1513Install the Tinker SDK with:
1514 
```bash
pip install tinker
```
1518 
1519Installation makes two components available: the python SDK and the tinker CLI.
1520 
1521#### Python SDK
1522 
1523The python SDK provides low-level operations like `forward_backward`, `sample`, `optim_step`, and `save_state`.
1524 
1525#### Tinker CLI
1526 
1527The tinker CLI is available as `tinker` or through `python -m tinker`. The CLI provides management functionality similar to that of the web console.
1528 
1529Run `tinker --help` to see which functionality is available.
1530 
1531## Tinker Cookbook
1532 
1533We also release [tinker-cookbook](https://github.com/thinking-machines-lab/tinker-cookbook), which is a collection of training code and experiment tools built on top of Tinker.
1534For the Cookbook, we'd recommend doing a local editable install, as you'll probably want to browse and edit the code:
1535 
```bash
git clone https://github.com/thinking-machines-lab/tinker-cookbook.git
cd tinker-cookbook
# Switch to your virtual environment
pip install -e .
```
1542 
1543## Getting an API key
1544 
1545Create an API key from the [console](https://tinker-console.thinkingmachines.ai). You'll then want to set the `TINKER_API_KEY` environment variable to your newly generated API key.
1546 
1547 
1548---
1549 
1550## File: rl.mdx
1551 
1552import { CookbookLink } from '../components/CookbookLink'
1553 
1554# Reinforcement learning
1555 
1556Reinforcement learning (RL) means learning from trial and error. Whereas in supervised learning, we're given input-output pairs, in RL, we're given inputs (prompts) and reward functions (i.e., a function for scoring candidate outputs). RL algorithms need to discover what good outputs look like.
1557 
1558Here are a few different types of RL training that we support in the Tinker Cookbook:
1559 
- *RL with Verifiable Rewards*: this is when we do RL on a reward function that checks model outputs using a program. Typically, the reward function checks the candidate answer against a reference answer, or, in coding cases, it may check if the candidate solution passes some unit tests. RLVR is especially suitable for teaching models to do reasoning (with chain-of-thought) and multi-step tool use (e.g., debugging and iterative modification of programs).
1561- *RL on Human Feedback*: here, we assume we have an objective that can't be calculated by a simple program, and it requires some human judgement. For example, we typically want to optimize our models for helpfulness, which includes being clear, informative, and interesting. For RLHF, we train a *preference model* using supervised learning to match human judgement, scoring or ranking candidate outputs. Then we do RL on the preference model's scores. See the [Preferences](/preferences) section for more details.
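
The two kinds of verifiable reward mentioned above can be sketched as plain functions. Both are hypothetical examples rather than cookbook code; in particular, scoring by the fraction of passed test cases is one possible design choice, and an all-or-nothing reward would be equally valid.

```python
def exact_match_reward(candidate: str, reference: str) -> float:
    """Return 1.0 if the normalized candidate equals the reference, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(candidate) == normalize(reference) else 0.0


def unit_test_reward(solution, test_cases) -> float:
    """Fraction of (args, expected) cases that a candidate function passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            passed += solution(*args) == expected
        except Exception:
            pass  # a crashing solution simply fails that case
    return passed / len(test_cases)


print(exact_match_reward(" 42 ", "42"))
print(unit_test_reward(lambda a, b: a + b, [((1, 2), 3), ((2, 2), 5)]))
```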
1562 
1563We'll first show how to do small RL runs in the RLVR setting, then we'll show you how to define your own RL environments and train on them, then we'll provide examples for larger-scale or more complicated training setups.
1564 
1565 
1566We anticipate that people will want to use Tinker for RL in a few different ways:
1567 
1568- Creating a specialist model that's SoTA at a specific skill, which existing models haven't been trained on. In this case, you'll want to start with a post-trained model that's already strong, and then do RL on an environment you've defined. See [RL Environments](/rl/rl-envs).
- Doing research on post-training pipelines. In this case, you'll probably want to chain together SL and RL runs with different data mixes, environments, and reward functions. See our [RLHF example](/preferences/rlhf-example).
1570- Doing research on RL algorithms. Here, you'll probably want to find some existing environments to use as benchmarks, and either modify our provided training code (<CookbookLink path="tinker_cookbook/rl/train.py">rl/train.py</CookbookLink>) or write your own minimal training loop. We've provided a [minimal training loop](/rl/rl-loops) that you can use as a starting point.
1571 
1572 
1573---
1574 
1575## File: under-the-hood.mdx
1576 
1577# Under the Hood
1578 
1579This page explains some implementation details of Tinker, which are important for understanding how to speed up your code.
1580 
1581## Clock Cycles
1582 
In Tinker, after you call `ServiceClient.create_lora_training_client`, your training job gets assigned to a pool of machines that work together -- a *worker pool* -- doing forward-backward operations repeatedly in lock-step.
1584Each of these steps of the worker pool is called a *clock cycle*.
1585In each clock cycle, we do forward-backward and an optimizer step operation, each of which may involve multiple LoRA models that are being trained by this pool.
1586You can think of this pool as a single large training run that is time-shared between multiple different LoRA models, often from different users.
1587 
1588With multi-tenancy -- sharing the same worker pool between multiple models -- we can run the training system efficiently even if users are training with small batch sizes, or if they have other delays in their training loops that would otherwise leave the worker pool idle. Small batch sizes can often give better *sample efficiency*, so this setup lets us achieve both high compute efficiency and high sample efficiency.
1589 
1590The downside is that it can sometimes lead to worse *latency*: even if training with a small batch, you'll still see the same step time as a large batch. (Still, note that we'll only charge you for the compute you use.) Also, if your training loop is implemented naively, you might have to wait multiple clock cycles per batch, because you might miss a clock cycle between operations.
1591 
1592### Overlapping `forward_backward` and `optim_step` Requests
1593 
1594As mentioned in the [Async and Futures](/async) section, you should submit your `forward_backward` and `optim_step` requests together before waiting for either of them. This way, they'll end up on the same clock cycle. If you write the code naively, you'll end up using *three* clock cycles per training step. Here's a recap of the example from the [Async and Futures](/async) section:
1595 
1596**❌ Naive implementation (uses 3 clock cycles):**
```python
# Submit forward_backward, gets queued for clock cycle N
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)

# Wait for it to complete, and for client to receive the result
# Due to communication latency, this happens a little after cycle N+1 started
fwd_bwd_result = await fwd_bwd_future

# Submit optim_step, gets queued for clock cycle N+2
optim_future = await client.optim_step_async(adam_params)

# Wait for it to complete, and for client to receive the result
# This happens a little after cycle N+2 finishes
optim_result = await optim_future

# Total: forward_backward on cycle N, optim_step on cycle N+2
# This takes 3 clock cycles (plus the time we waited before cycle N started)
```
1615 
1616**✓ Better implementation (uses 1 clock cycle):**
```python
# Submit both requests immediately. They'll both be slotted into the same clock cycle N
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)
optim_future = await client.optim_step_async(adam_params)

# Now wait for results - both operations happen on cycle N
fwd_bwd_result = await fwd_bwd_future
optim_result = await optim_future

# Total: both operations on cycle N
# This takes 1 clock cycle
```
1629 
1630### Pipelining to Maximize Clock Cycle Efficiency
1631 
1632To maximize efficiency and avoid missing clock cycles, you should **pipeline your training loop**: submit the next batch before waiting for the current batch to complete. This ensures there's always a request queued when a new clock cycle starts.
1633 
1634We've created a demonstration script that shows the difference between pipelined and non-pipelined training:
1635 
1636[View the clock cycles demonstration script →](/clock_cycles.py.txt)
1637 
1638The script includes two versions:
1639 
1640- **Non-pipelined**: Submits a batch, waits for it to complete, then submits the next. This approach typically wastes clock cycles because there's a gap between when one batch finishes and the next is submitted, often using 2 clock cycles per training step.
1641 
- **Pipelined**: Submits the next batch *before* waiting for the previous batch to complete. This approach often uses exactly 1 clock cycle per step, achieving maximum efficiency. It might sometimes take more than 1 clock cycle per step if the server is heavily loaded, or due to subtleties of our current implementation. (For example, if there are no other users, we might start the clock cycle after receiving the first `forward_backward` but before receiving the `optim_step`. Then we'll do `optim_step` on the next cycle. This causes an extra clock cycle but doesn't cause a slowdown.)
1643 
1644Running the script will show you the performance comparison, including total time and clock cycles used. The pipelined version typically saves both time and clock cycles.
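
The shape of a pipelined loop can be illustrated without a Tinker connection. The sketch below is not the demonstration script. `fake_train_step` is a hypothetical stand-in for submitting `forward_backward` and `optim_step` for one batch, and an `asyncio.Lock` plays the role of a server that works on one step at a time.

```python
import asyncio

events = []  # order of submissions and completions


async def fake_train_step(step, server):
    """Stand-in for submitting forward_backward and optim_step for one batch."""
    events.append(f"submit {step}")
    async with server:  # the toy server works on one step at a time
        await asyncio.sleep(0.01)
    events.append(f"done {step}")


async def pipelined(num_steps):
    server = asyncio.Lock()
    pending = asyncio.create_task(fake_train_step(0, server))
    for step in range(1, num_steps):
        # Queue the next step before waiting on the current one.
        next_task = asyncio.create_task(fake_train_step(step, server))
        await asyncio.sleep(0)  # yield so the new task records its submission
        await pending
        pending = next_task
    await pending


asyncio.run(pipelined(3))
print(events)
```

In the recorded order, each step is submitted before the previous one finishes, so the toy server always has a request waiting when it becomes free.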
1645 
1646 
1647---
1648 
1649## File: model-lineup.mdx
1650 
1651# Available Models in Tinker
1652 
1653The table below shows the models that are currently available in Tinker. We plan to update this list as new models are released.
1654 
1655## What model should I use?
1656 
1657- In general, use MoE models, which are more cost effective than the dense models.
- Use Base models only if you're doing research or are running the full post-training pipeline yourself.
- If you want to create a model that is good at a specific task or domain, use an existing post-trained model, and fine-tune it on your own data or environment.
1660    - If you care about latency, use one of the Instruction models, which will start outputting tokens without a chain-of-thought.
1661    - If you care about intelligence and robustness, use one of the Hybrid or Reasoning models, which can use long chain-of-thought.
1662 
1663## Full Listing
1664 
| Model Name | Training Type | Architecture | Size |
| --- | --- | --- | --- |
| [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) | Vision | MoE | Large |
| [Qwen/Qwen3-VL-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) | Vision | MoE | Medium |
| [Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) | Instruction | MoE | Large |
| [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) | Instruction | MoE | Medium |
| [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | Hybrid | MoE | Medium |
| [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) | Base | MoE | Medium |
| [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | Hybrid | Dense | Medium |
| [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | Hybrid | Dense | Small |
| [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) | Base | Dense | Small |
| [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | Instruction | Dense | Compact |
| [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | Reasoning | MoE | Medium |
| [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) | Reasoning | MoE | Small |
| [deepseek-ai/DeepSeek-V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | Hybrid | MoE | Large |
| [deepseek-ai/DeepSeek-V3.1-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base) | Base | MoE | Large |
| [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) | Base | Dense | Large |
| [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | Instruction | Dense | Large |
| [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | Base | Dense | Small |
| [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | Instruction | Dense | Small |
| [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) | Base | Dense | Compact |
| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | Base | Dense | Compact |
| [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) | Reasoning | MoE | Large |
1688 
1689## Legend
1690 
1691### Training Types
1692- **Base**: Foundation models trained on raw text data, suitable for post-training research and custom fine-tuning.
1693- **Instruction**: Models fine-tuned for following instructions and chat, optimized for fast inference.
1694- **Reasoning**: Models that always use chain-of-thought reasoning before their "visible" output that responds to the prompt.
1695- **Hybrid**: Models that can operate in both thinking and non-thinking modes, where the non-thinking mode requires using a special renderer or argument that disables chain-of-thought.
1696- **Vision**: Vision-language models (VLMs) that can process images alongside text. See [Vision Inputs](/rendering#vision-inputs) for usage.
1697 
1698### Architecture
1699- **Dense**: Standard transformer architecture with all parameters active
1700- **MoE**: Mixture of Experts architecture with sparse activation
1701 
1702### Model Sizes
1703 
1704- **Compact**: 1B-4B parameters
1705- **Small**: 8B parameters
1706- **Medium**: 30B-32B parameters
1707- **Large**: 70B+ parameters
1708 
1709Note that the MoE models are much more cost effective than the dense models as their cost is proportional to the number of active parameters and not the total number of parameters.
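
As a back-of-the-envelope illustration of that proportionality, compare a Medium MoE model with a Medium dense model. The parameter counts below are rounded from the model names (the `A3B` suffix denotes roughly 3B active parameters), and the strictly linear cost model is a simplifying assumption, not actual pricing.

```python
# Illustrative arithmetic only; real pricing is not given here.
def relative_cost(active_params_b, baseline_active_params_b):
    """Cost ratio under the assumption that cost scales with active parameters."""
    return active_params_b / baseline_active_params_b


# Qwen3-30B-A3B: ~30B total parameters, ~3B active per token.
# Qwen3-32B: dense, so all ~32B parameters are active.
ratio = relative_cost(active_params_b=3, baseline_active_params_b=32)
print(f"MoE cost relative to dense: {ratio:.2f}x")
```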
1710 
1711 
1712---
1713 
1714## File: preferences/dpo-guide.mdx
1715 
1716import { Callout } from 'nextra/components'
1717import { CookbookLink } from '../../components/CookbookLink'
1718 
1719# Direct Preference Optimization (DPO)
1720 
1721Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model. Instead of using reinforcement learning with human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.
1722 
1723## DPO Algorithm Details
1724 
1725The core DPO loss is computed as:
1726 
$$
\mathcal{L}_{\theta} = -\mathbb{E}_{x, y_\text{chosen}, y_\text{rejected} \sim \mathcal{D}}\left[\log\sigma\left(\beta\log \frac{\pi_{\theta}(y_\text{chosen}|x)}{\pi_{\text{ref}}(y_\text{chosen}|x)} - \beta\log \frac{\pi_{\theta}(y_\text{rejected}|x)}{\pi_{\text{ref}}(y_\text{rejected}|x)}\right)\right]
$$
1730 
1731Where:
1732- $\pi_{\theta}$ is the current policy
1733- $\pi_{\text{ref}}$ is the reference model (typically the initial model before DPO training)
1734- $\beta$ is the DPO beta parameter
- $\mathcal{D}$ is a dataset of prompts $x$, each with a chosen response $y_{\text{chosen}}$ and a rejected response $y_{\text{rejected}}$

This optimizes the classical constrained RLHF objective, where the reference model constrains deviation from the initial distribution.
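
For a single preference pair, the loss above can be evaluated directly from sequence log-probabilities. The function below is a minimal pure-Python sketch of the formula, not the code in `train_dpo.py`; the argument names are illustrative, and the log-probability values in the example are made up.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probabilities."""
    chosen_term = beta * (logp_chosen - ref_logp_chosen)
    rejected_term = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_term - rejected_term
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# When the policy equals the reference, the margin is 0 and the loss is ln 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen log-prob relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

Because the loss starts at $\ln 2 \approx 0.693$ when the policy matches the reference, values just below that early in training indicate the policy has only begun to separate chosen from rejected responses.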
1738 
1739<Callout type="info">
1740**DPO vs RLHF**: DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.
1741</Callout>
1742 
1743 
1744## Running DPO Training
1745 
1746The implementation is in <CookbookLink path="tinker_cookbook/preference/train_dpo.py">train_dpo.py</CookbookLink> with a CLI interface in <CookbookLink path="tinker_cookbook/recipes/preference/dpo/train.py">train.py</CookbookLink>. You can run it from the command line:
1747 
```bash
python -m tinker_cookbook.recipes.preference.dpo.train \
    log_path=/tmp/dpo-hhh-experiment \
    model_name=meta-llama/Llama-3.2-1B \
    dataset=hhh \
    renderer_name=role_colon \
    learning_rate=1e-5 \
    dpo_beta=0.1
```
1757 
1758### Key Parameters
1759 
- `log_path`: Directory where results and checkpoints are saved
1761- `model_name`: Base model used as initialization and for the reference policy
1762- `dataset`: Dataset name (`hhh`, `helpsteer3`, `ultrafeedback`)
1763- `renderer_name`: How conversations are formatted (see [Rendering](../rendering.mdx))
1764- `learning_rate`: Learning rate for optimization
1765- `dpo_beta`: DPO beta parameter (controls the strength of preference learning)
1766 
1767### Available Datasets
1768 
1769There are several pre-defined datasets:
1770 
1771- **`hhh`**: Anthropic's Helpful-Harmless-Honest dataset
1772- **`helpsteer3`**: NVIDIA's HelpSteer3 preference dataset
1773- **`ultrafeedback`**: UltraFeedback binarized preferences dataset
1774 
1775These are implemented as `DPODatasetBuilder` classes and you can implement a custom dataset builder following the `tinker_cookbook.preference.preference_datasets` interface.
1776 
1777## Training Process
1778 
1779During training, you'll see output like this showing the DPO metrics:
1780 
```
                   Step 50
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric                         ┃ Value     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy                       │ 0.568627  │
│ batch_time                     │ 27.953704 │
│ chosen_reward                  │ 0.053621  │
│ dpo_loss                       │ 0.683825  │
│ learning_rate                  │ 0.000009  │
│ margin                         │ 0.002147  │
│ num_pairs                      │ 255       │
│ num_tokens                     │ 112638    │
│ progress                       │ 0.081210  │
│ rejected_reward                │ 0.032152  │
│ test/nll                       │ 1.871778  │
└────────────────────────────────┴───────────┘
```
1799 
1800The key metrics are:
1801- **`dpo_loss`**: The DPO classification loss
1802- **`accuracy`**: Accuracy of the implicit reward model evaluated on the preference dataset
1803- **`margin`**: Average difference between chosen and rejected rewards
1804- **`chosen_reward`/`rejected_reward`**: Average rewards for chosen/rejected responses
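
To make these definitions concrete, the sketch below computes the four metrics from per-pair rewards. It follows the descriptions in the list above and assumes the usual DPO implicit reward, $\beta$ times the log-ratio between policy and reference, as input; the exact aggregation in the cookbook may differ, and the reward values are made up.

```python
def dpo_metrics(chosen_rewards, rejected_rewards):
    """Aggregate per-pair implicit rewards into batch-level DPO metrics."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    n = len(pairs)
    return {
        # Fraction of pairs where the chosen response gets the higher reward.
        "accuracy": sum(c > r for c, r in pairs) / n,
        # Average gap between chosen and rejected rewards.
        "margin": sum(c - r for c, r in pairs) / n,
        "chosen_reward": sum(chosen_rewards) / n,
        "rejected_reward": sum(rejected_rewards) / n,
    }


metrics = dpo_metrics(
    chosen_rewards=[0.2, 0.1, -0.1, 0.4],
    rejected_rewards=[0.1, 0.3, -0.2, 0.0],
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```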
1805 
1806## Evaluating DPO Models
1807 
1808After training, you can evaluate your DPO model using the inspect evaluation framework:
1809 
```bash
MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
python -m tinker_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=meta-llama/Llama-3.2-1B \
    tasks=inspect_evals/ifeval \
    renderer_name=role_colon
```
1818 
1819This will evaluate the model on various benchmarks to measure the impact of preference optimization.
1820 
1821## Tips for DPO Training
1822 
18231. **Beta Parameter**: Start with `dpo_beta=0.1` and adjust based on your dataset.
1824 
18252. **Learning Rate**: Use a lower learning rate than supervised fine-tuning (typically 1e-5 to 1e-6).
1826 
3. **Base Model**: The base model should already be in-distribution with the preference data. Either start with a light SFT phase or collect on-policy preferences. Training will still run otherwise, but a sharp distribution mismatch can produce strange model behaviors.
1828 
1829 
1830---
1831 
1832## File: preferences/rlhf-example.mdx
1833 
1834import { CookbookLink } from '../../components/CookbookLink'
1835 
1836# Reinforcement Learning from Human Feedback
1837 
1838We've provided a script that shows how to run a standard pipeline for reinforcement learning from human feedback (RLHF) in <CookbookLink path="tinker_cookbook/recipes/preference/rlhf/rlhf_pipeline.py">rlhf_pipeline.py</CookbookLink>.
1839 
```bash
python -m tinker_cookbook.recipes.preference.rlhf.rlhf_pipeline
```
1843 
1844## Training the initial policy via supervised learning
1845 
First, we train the policy on the [no_robots dataset](https://huggingface.co/datasets/HuggingFaceH4/no_robots) from Hugging Face, a basic instruction-following dataset with human-written answers that was designed to match the methodology from [InstructGPT](https://arxiv.org/abs/2203.02155).
1847 
1848 
1849## Training the preference model via supervised learning
1850 
1851We train the preference model on the [HHH dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) from Anthropic, which is a dataset of pairwise comparisons of completions. We train a model that sees a pair of completions, A and B, and outputs which one is preferred.
1852 
1853## Training the policy via reinforcement learning
1854 
1855Taking the initial policy, and the preference model we just trained, we can now train the policy via reinforcement learning. This RL is a form of self-play, where we use the preference model to grade match-ups between the policy and itself. In particular, for each prompt, we sample multiple completions, and use the preference model to grade all pairs of completions. We then give the policy a reward based on the win fraction.
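
To make the win-fraction reward concrete, here is a minimal sketch. The function `win_fraction_rewards` and the toy `prefers_first` callable are hypothetical stand-ins for the trained preference model, and each pair is graded once in a single order, which may differ from what the actual pipeline does.

```python
from itertools import combinations
from typing import Callable, Sequence


def win_fraction_rewards(
    completions: Sequence[str],
    prefers_first: Callable[[str, str], bool],
) -> list[float]:
    """Reward each completion by the fraction of its match-ups that it wins."""
    wins = [0] * len(completions)
    for i, j in combinations(range(len(completions)), 2):
        if prefers_first(completions[i], completions[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    num_matchups = len(completions) - 1  # each completion meets every other one once
    return [w / num_matchups for w in wins]


# Toy stand-in for the preference model: prefer the longer completion.
def toy_preference(a: str, b: str) -> bool:
    return len(a) > len(b)


rewards = win_fraction_rewards(["ok", "a longer answer", "medium one"], toy_preference)
print(rewards)
```

Because the rewards are relative within a group of samples for the same prompt, a completion is only rewarded for beating the policy's other attempts, not for an absolute score.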
1856 
1857 
1858---
1859 
1860## File: rl/rl-basic.mdx
1861 
1862import { CookbookLink } from '../../components/CookbookLink'
1863 
1864# Your First RL Run
1865 
1866We've provided a minimal script that runs RL on the [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k): <CookbookLink path="tinker_cookbook/recipes/rl_basic.py">rl_basic.py</CookbookLink>. You can run the minimal RL script from the command line as follows:
1867 
```bash
python -m tinker_cookbook.recipes.rl_basic
```
1871 
1872This script will fine-tune the Llama-3.1-8B base (pretrained) model on this dataset with the following reward function:
1873 
1874$$
18751[\text{answer is correct}] + 0.1 \times (1[\text{answer is formatted correctly}] - 1)
1876$$
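
Written out as code, the reward above looks roughly like the following sketch. The function name `gsm8k_reward` and its boolean arguments are illustrative assumptions, not the recipe's actual code.

```python
def gsm8k_reward(answer_correct: bool, format_correct: bool) -> float:
    """Correctness earns 1; a badly formatted answer costs 0.1."""
    return float(answer_correct) + 0.1 * (float(format_correct) - 1.0)


for correct in (True, False):
    for formatted in (True, False):
        print(correct, formatted, gsm8k_reward(correct, formatted))
```

The format term is a penalty rather than a bonus: a correctly formatted answer contributes 0, and a badly formatted one contributes -0.1.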
1877 
1878The training should take about 1 minute per iteration and climb to about 63% accuracy after 15 iterations (`env/all/correct`). You can look at the printouts for some other metrics of interest:
1879 
- `ac_tokens_per_turn`: the number of tokens in each generated completion
1881- `env/all/format`: the fraction of completions that are formatted correctly
1882- `env/all/reward/total`: mean total reward (combining format and correctness as defined above)
1883- `entropy`: per-token entropy (mean negative log-probability of sampled tokens)
1884- `kl_sample_train_{v1,v2}`: two different approximations/estimators of KL divergence between the sampler's and learner's probability distribution (contributed to by numerical differences and rounding noise)
1885- `progress/done_frac`: what fraction of the total number of iterations we've completed so far
1886- `time/...`: time for different parts of the training loop
1887 
1888You can also look at the `log_path` directory for more detailed metrics. There are several files of interest, which are mostly the same as in the [Supervised Learning](/supervised-learning/sl-basic) case.
1889 
1890 
1891---
1892 
1893## File: rl/sequence-extension.mdx
1894 
1895import { CookbookLink } from '../../components/CookbookLink'
1896 
1897# Sequence Extension Property in Multi-Turn RL
1898 
1899When running reinforcement learning with multi-turn conversations, the way you render observations at each timestep has important implications for compute efficiency. This document explains the **extension property** and how it affects training and sampling.
1900 
1901## What is the Extension Property?
1902 
1903A sequence of observations has the **extension property** if each successive observation contains all previous observations and actions as a prefix. In other words, the context grows monotonically by appending new tokens to the end.
1904 
When this property holds, multiple timesteps can be merged into a single training datum, the KV-cache can be reused during sampling, and compute scales as O(T) rather than O(T²) for a trajectory of length T.
1906 
1907## Example 1: Qwen3 with Thinking Visible (Extension Holds)
1908 
1909When using `Qwen3Renderer` with `strip_thinking_from_history=False`, the full conversation history (including `<think>` blocks) is preserved at each timestep. Consider a two-turn math conversation:
1910 
1911**Timestep 1:**
1912<div className="example">
1913<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 4<br/><br/>User:</span>
1914</div>
1915 
1916**Timestep 2:**
1917<div className="example">
1918<span className="prompt">User: What is 2+2?<br/><br/>Assistant: &lt;think&gt;Let me calculate...&lt;/think&gt; 4<br/><br/>User: What is 3+3?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 6<br/><br/>User:</span>
1919</div>
1920 
1921Notice that the observation (green) at timestep 2 contains the entire timestep 1 sequence as a prefix. The new observation just appends `What is 3+3?\n\nAssistant: ` to the end. This is the **extension property**.
1922 
1923Because extension holds, the RL code can merge both timesteps into a **single Datum**:
1924 
1925<div className="example">
1926<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 4<br/><br/>User:</span><span className="prompt"> What is 3+3?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 6<br/><br/>User:</span>
1927</div>
1928 
1929Green = observation tokens (loss weight = 0). Red = action tokens (loss weight > 0).
1930 
1931## Example 2: Qwen3 with Thinking Hidden (Extension Breaks)
1932 
1933When using `Qwen3Renderer` with the default `strip_thinking_from_history=True`, the `<think>...</think>` blocks are stripped from previous assistant messages. This matches how Qwen3 models were post-trained by the Qwen team.
1934 
1935**Timestep 1:**
1936<div className="example">
1937<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 4<br/><br/>User:</span>
1938</div>
1939 
1940**Timestep 2:**
1941<div className="example">
1942<span className="prompt">User: What is 2+2?<br/><br/>Assistant: 4<br/><br/>User: What is 3+3?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 6<br/><br/>User:</span>
1943</div>
1944 
1945The observation at timestep 2 is **not** an extension of timestep 1's full sequence. The `<think>Let me calculate...</think>` portion was stripped, so the prefix doesn't match. The RL code must create **two separate Datums**:
1946 
1947**Datum 1:**
1948<div className="example">
1949<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 4<br/><br/>User:</span>
1950</div>
1951 
1952**Datum 2:**
1953<div className="example">
1954<span className="prompt">User: What is 2+2?<br/><br/>Assistant: 4<br/><br/>User: What is 3+3?<br/><br/>Assistant: </span><span className="completion">&lt;think&gt;Let me calculate...&lt;/think&gt; 6<br/><br/>User:</span>
1955</div>
1956 
1957This results in more compute during training (two forward/backward passes instead of one) and prevents KV-cache reuse during sampling. For a trajectory of T timesteps, compute scales as O(T²) instead of O(T).
1958 
1959## The Tradeoff
1960 
1961**Keeping thinking visible** (`strip_thinking_from_history=False`) gives you O(T) compute scaling, allows packing sequences together in training batches, and enables KV-cache reuse during sampling. The downside is that context grows faster since all thinking tokens are retained, so you may hit context length limits sooner.
1962 
1963**Stripping thinking** (`strip_thinking_from_history=True`, the default) keeps context smaller but breaks the extension property, leading to O(T²) compute scaling.
1964 
1965Note that while stripping thinking matches Qwen3's original post-training distribution, with RL fine-tuning the model should quickly adapt to the new situation where thinking is preserved. So "distribution match" might not be a major concern in practice.
1966 
1967## How the RL Code Handles This
1968 
1969The RL training code in <CookbookLink path="tinker_cookbook/rl/data_processing.py">`data_processing.py`</CookbookLink> automatically detects whether consecutive timesteps satisfy the extension property. The key function is `trajectory_to_data`:
1970 
```python
def trajectory_to_data(traj: Trajectory, traj_advantage: float) -> list[tinker.Datum]:
    """
    Return one or more Datum objects corresponding to the trajectory.
    If the sequence grows by appending, i.e., each successive observation contains
    the previous observation+action as a prefix, then we can return a single Datum.
    However, if we get a sequence that's not an extension of the previous sequence,
    then that results in a new Datum.
    """
```
1981 
1982When rendering your conversations, be aware of whether your renderer has the extension property. For `Qwen3Renderer`:
1983- `strip_thinking_from_history=False` → Extension holds
1984- `strip_thinking_from_history=True` (default) → Extension breaks
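
A minimal sketch of the merging rule described in that docstring, using plain Python lists of string "tokens". The function `merge_by_extension` and its tuple-based input are hypothetical stand-ins for illustration, not the cookbook's `trajectory_to_data`.

```python
def merge_by_extension(timesteps):
    """Group (observation, action) token lists into merged sequences.

    Hypothetical illustration: returns a list of (tokens, weights) pairs,
    where weight 0 marks observation tokens and weight 1 marks action tokens.
    """
    data = []
    tokens, weights = [], []
    for observation, action in timesteps:
        if tokens and observation[: len(tokens)] == tokens:
            # Extension holds: only the newly appended observation suffix is added.
            new_observation = observation[len(tokens):]
        else:
            # Extension breaks (or first timestep): close the current datum, start a new one.
            if tokens:
                data.append((tokens, weights))
            tokens, weights = [], []
            new_observation = observation
        tokens = tokens + new_observation + action
        weights = weights + [0] * len(new_observation) + [1] * len(action)
    if tokens:
        data.append((tokens, weights))
    return data


step1 = (["User: 2+2?", "Assistant:"], ["<think>", "4", "User:"])
visible = (step1[0] + step1[1] + ["3+3?", "Assistant:"], ["<think>", "6", "User:"])
stripped = (
    ["User: 2+2?", "Assistant:", "4", "User:", "3+3?", "Assistant:"],
    ["<think>", "6", "User:"],
)

print(len(merge_by_extension([step1, visible])))   # thinking kept in history
print(len(merge_by_extension([step1, stripped])))  # thinking stripped from history
```

With thinking kept, the two timesteps collapse into one sequence; with thinking stripped, the prefix check fails and a second sequence is started.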
1985 
1986**Note on sampling:** The training code automatically merges timesteps when possible. Sampling infrastructure doesn't yet adjust billing based on KV-cache hits, but this is planned for a future release.
1987 
1988## Advanced: Periodic Compaction
1989 
1990A hybrid approach is to use **periodic compaction**: keep thinking visible most of the time (preserving extension), but periodically clear old thinking blocks from the context.
1991 
1992**How it works:**
1993- For turns 1-10, keep all thinking visible (extension holds, single datum)
1994- At turn 11, strip thinking from turns 1-10 (extension breaks once, new datum starts)
1995- For turns 11-20, keep thinking visible again (extension holds)
1996- Repeat every N turns
1997 
1998Here's what the datums look like with compaction every 3 turns:
1999 
2000**Datum 1 (turns 1-3):**
2001<div className="example">
2002<span className="prompt">User: Q1<br/>Assistant: </span><span className="completion">&lt;think&gt;...&lt;/think&gt; A1<br/>User:</span><span className="prompt"> Q2<br/>Assistant: </span><span className="completion">&lt;think&gt;...&lt;/think&gt; A2<br/>User:</span><span className="prompt"> Q3<br/>Assistant: </span><span className="completion">&lt;think&gt;...&lt;/think&gt; A3<br/>User:</span>
2003</div>
2004 
2005**Datum 2 (turns 4-6, thinking from turns 1-3 stripped):**
2006<div className="example">
2007<span className="prompt">User: Q1<br/>Assistant: A1<br/>User: Q2<br/>Assistant: A2<br/>User: Q3<br/>Assistant: A3<br/>User: Q4<br/>Assistant: </span><span className="completion">&lt;think&gt;...&lt;/think&gt; A4<br/>User:</span><span className="prompt"> Q5<br/>Assistant: </span><span className="completion">&lt;think&gt;...&lt;/think&gt; A5<br/>User:</span><span className="prompt"> Q6<br/>Assistant: </span><span className="completion">&lt;think&gt;...&lt;/think&gt; A6<br/>User:</span>
2008</div>
2009 
2010This approach breaks extension only every N timesteps instead of every timestep, keeps context size bounded (old thinking doesn't accumulate forever), and amortizes the recomputation cost over N turns.
2011 
2012To implement this, you would modify your environment or renderer to periodically transform the conversation history, stripping `<think>` blocks from messages older than N turns.
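
One way this could look is sketched below, assuming the history is a list of `{"role", "content"}` dicts and that turns are numbered from 1. The helper `maybe_compact`, the message format, and the regex are illustrative assumptions rather than part of the cookbook.

```python
import re

# Matches a <think>...</think> block plus any whitespace that follows it.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def maybe_compact(history, next_turn, compact_every):
    """Strip thinking from past assistant messages at every compaction boundary.

    With compact_every=3, compaction happens just before turns 4, 7, 10, ...
    The caller should store the returned history so later turns extend it.
    """
    if next_turn > 1 and (next_turn - 1) % compact_every == 0:
        return [
            {**m, "content": THINK_RE.sub("", m["content"])}
            if m["role"] == "assistant"
            else m
            for m in history
        ]
    return history


history = [
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "<think>...</think> A1"},
]
print(maybe_compact(history, next_turn=3, compact_every=3)[1]["content"])
print(maybe_compact(history, next_turn=4, compact_every=3)[1]["content"])
```

Between boundaries the history is returned unchanged, so consecutive observations keep the extension property and can still be merged.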
2013 
2014## Summary
2015 
2016For `Qwen3Renderer`:
2017- `strip_thinking_from_history=False` → Extension holds → Use for long trajectories where compute efficiency matters
2018- `strip_thinking_from_history=True` (default) → Extension breaks → Use for short trajectories, or when you want minimal changes from base model behavior
2019- Periodic compaction → Best of both worlds when you need efficiency with bounded context
2020 
2021When designing your RL environment, consider how many turns you expect and whether the O(T) vs O(T²) difference will be significant for your use case.
2022 
2023 
2024---
2025 
2026## File: rl/rl-hyperparams.mdx
2027 
2028# RL Hyperparameters
2029 
2030This guide covers the key hyperparameters for reinforcement learning training, from core settings to advanced configurations.
2031 
2032## Core Hyperparameters
2033 
2034### Learning Rate
2035 
2036Similar to the [supervised learning setting](../supervised-learning/sl-hyperparams), the learning rate is the most critical hyperparameter choice. We recommend using the guidance presented there as a starting point for RL experiments as well.
2037 
2038 
2039### Batch and Group Sizes
2040 
2041As described in our [RL environments](../rl/rl-envs.mdx) documentation, we use two key parameters:
2042 
2043- **`batch_size`**: The number of unique environments or problems used for training
2044- **`group_size`**: The number of rollouts performed per unique environment
2045 
If you have limited environments or problems available for training, increase the `group_size` to generate more training data. While the total number of rollouts depends on both parameters, we recommend scaling the learning rate as $\text{LR} \propto \sqrt{\text{batch\_size}}$.
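
As a worked example of the square-root rule, the sketch below rescales a learning rate that was tuned at one batch size for use at another. The helper `scale_lr` and the numbers are illustrative assumptions.

```python
import math


def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Rescale a learning rate tuned at base_batch_size using LR ∝ sqrt(batch_size)."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


# Quadrupling the batch size doubles the learning rate.
print(scale_lr(base_lr=1e-5, base_batch_size=64, new_batch_size=256))
```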
2047 
2048## Multiple Updates per Sampling Iteration
2049 
2050The `num_substeps` parameter controls how many policy weight updates are performed on data sampled from the last policy iteration, similar to PPO and GRPO.
2051 
2052### How it works:
2053 
2054- **`num_substeps = 1` (default)**: Each batch of collected trajectories is used for exactly one optimizer update
2055- **`num_substeps > 1`**: The batch of unique environments is split into `num_substeps` mini-batches, where each environment/problem has `group_size` rollouts (we pack all rollouts for a particular environment/problem in the same minibatch). We do a single update step on each mini-batch. Note that our implementation still takes only a single epoch through the data.
2056 
2057### Usage Guidelines:
2058 
2059- The batch size must be divisible by `num_substeps`
2060- Our experiments show that `num_substeps = 1` already gives decent performance, but if you would like to experiment with this parameter, we recommend starting with a low value of 2-4 and using the PPO objective.
2061- Higher values can lead to update steps that are too out-of-distribution for the policy. Consider limiting the number of updates or decreasing the learning rate when using multiple update steps.
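
The splitting described above can be sketched as follows, assuming the batch is a list with one entry per environment, each entry holding that environment's `group_size` rollouts. The helper `split_into_substeps` is hypothetical and only illustrates the partitioning, not the training code itself.

```python
def split_into_substeps(env_groups, num_substeps):
    """Split a batch of rollout groups into `num_substeps` equal minibatches."""
    if len(env_groups) % num_substeps != 0:
        raise ValueError("batch size must be divisible by num_substeps")
    size = len(env_groups) // num_substeps
    # Groups are never split, so all rollouts of one environment stay together.
    return [env_groups[i * size:(i + 1) * size] for i in range(num_substeps)]


# 4 environments (batch_size=4), each with group_size=2 rollouts.
batch = [[f"env{e}_rollout{r}" for r in range(2)] for e in range(4)]
for minibatch in split_into_substeps(batch, num_substeps=2):
    print(minibatch)  # one optimizer update per minibatch
```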
2062 
2063## Advanced Training Configurations
2064 
2065⚠️ **Note**: These features are experimental and may be subject to instabilities. They are currently disabled by default.
2066 
2067### Streaming Minibatch Training
2068 
2069Enable streaming minibatch training by specifying the `StreamMinibatchConfig`. This approach overlaps trajectory sampling and model training, improving overall throughput by submitting training requests as soon as enough rollouts complete, without waiting for all sampling jobs to finish.
2070 
2071**Configuration Parameters:**
2072 
2073- **`groups_per_batch`**: Same as batch size
- **`num_minibatches`**: Number of minibatches per substep. This controls how the work is split, that is, how many individual forward-backward requests we submit.
2075 
2076 
2077**Important**: This remains on-policy training and is strictly a pipeline efficiency improvement.
2078 
2079### Async Off-Policy Training
2080 
2081Async training allows the model to train on trajectories generated with slightly older model versions, enabling higher throughput at the cost of some off-policy bias. While Tinker doesn't currently support in-flight weight changes, it supports the "off-by-K" async RL approach where multiple model iterations generate data simultaneously. Configure this by setting the `AsyncConfig` object.
2082 
2083**Configuration Parameters:**
2084 
2085- **`max_steps_off_policy`**: Maximum age (in training steps) of trajectories before they're discarded. Essentially, trajectories from policy iterations older than `max_steps_off_policy` steps will not be used.
2086- **`groups_per_batch`**: Number of new trajectory groups to accumulate (with a `group_size` number of rollouts each) before updating the current iteration of the model. Note: This is separate from the batch size used for dataset construction.
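
The staleness rule can be pictured with the sketch below. The `TrajectoryGroup` dataclass, the `filter_stale` helper, and the choice to keep groups that are exactly `max_steps_off_policy` steps old are assumptions for illustration, not the actual `AsyncConfig` machinery.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryGroup:
    """Hypothetical container: rollouts plus the training step of the sampling policy."""
    sampled_at_step: int
    rollouts: list = field(default_factory=list)


def filter_stale(groups, current_step, max_steps_off_policy):
    """Keep only groups whose sampling policy is recent enough to train on."""
    return [
        g for g in groups
        if current_step - g.sampled_at_step <= max_steps_off_policy
    ]


groups = [TrajectoryGroup(sampled_at_step=s) for s in (5, 8, 10)]
fresh = filter_stale(groups, current_step=10, max_steps_off_policy=2)
print([g.sampled_at_step for g in fresh])
```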
2087 
2088**Usage Guidelines:**
2089 
2090- Async RL is appropriate for applications with long and heterogeneous rollouts, such as very long CoT models, multi-hop tool use, or agentic workflows
2091- Start with a small value for `max_steps_off_policy` (less than 5)
2092 
2093 
2094 
2095## Monitoring and Run Health
2096 
2097Using policy-gradient algorithms with off-policy data can significantly degrade performance or even crash the policy, making monitoring essential during training.
2098 
2099### KL Divergence Monitoring
2100 
2101The current implementation logs the KL divergence between the data generation policy and the current learner: $\mathbb{D}_{KL}[\pi_{\text{sampler}}(\cdot|x)||\pi_{\theta}(\cdot|x)]$ using two separate estimators ([Schulman 2020](http://joschu.net/blog/kl-approx.html)):
2102 
2103- `kl_sample_train_v1`
2104- `kl_sample_train_v2`
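
For intuition, the sketch below computes two of the estimators from the linked blog post using per-token log-probabilities of the sampled tokens. Which estimator each of `kl_sample_train_v1` and `kl_sample_train_v2` corresponds to is an assumption here, and `kl_estimates` is a hypothetical helper.

```python
def kl_estimates(sampler_logprobs, learner_logprobs):
    """Estimate KL[sampler || learner] from tokens drawn from the sampler.

    Returns (k1, k2): k1 is the mean log-ratio (unbiased, higher variance);
    k2 is half the mean squared log-ratio (biased, lower variance, never negative).
    """
    diffs = [s - l for s, l in zip(sampler_logprobs, learner_logprobs)]
    n = len(diffs)
    k1 = sum(diffs) / n
    k2 = sum(0.5 * d * d for d in diffs) / n
    return k1, k2


k1, k2 = kl_estimates([-1.0, -2.0], [-1.1, -1.9])
print(k1, k2)
```

Note that `k1` can be negative on a small sample even though the true KL divergence never is, which is one reason to log more than one estimator.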
2105 
2106 
2107A few important notes to keep in mind:
2108- Even with full on-policy training, the divergence between sampling and learning policies will not be exactly zero ([He 2025](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)) due to implementation details
2109- In our experience training is stable with KL divergence below 0.01
- If KL divergence rises above this threshold, it indicates a numerical instability or another potential issue with the training run
2111 
2112 
2113---
2114 
2115## File: rl/rl-loops.mdx
2116 
2117import { CookbookLink } from '../../components/CookbookLink'
2118 
2119# Reinforcement Learning Training Loop
2120 
2121We've provided a simple RL training loop in <CookbookLink path="tinker_cookbook/recipes/rl_loop.py">rl_loop.py</CookbookLink>, which avoids using our environment classes and instead defines the data loading and rollouts in a more self-contained way. This is for people who like to write their own training loops or learn about how things work under the hood. Our more performant implementation in <CookbookLink path="tinker_cookbook/rl/train.py">rl/train.py</CookbookLink> does basically the same thing, but with some performance optimizations, and with some additional features like periodic evals.
2122 
2123You can run the RL training loop using:
```bash
python -m tinker_cookbook.recipes.rl_loop
```
2127 
2128The default config should write the results to `/tmp/tinker-examples/rl-loop`. The experiment should be completed after 57 steps of training. You can plot the reward curve as follows:
```python
import pandas
import matplotlib.pyplot as plt

metrics_path = "/tmp/tinker-examples/rl-loop/metrics.jsonl"
df = pandas.read_json(metrics_path, lines=True)
plt.plot(df["reward/total"], label="reward/total")
plt.legend()
plt.show()
```
2139 
2140You should see a plot like this:
2141![Reward as a function of steps](./images/rl_loop_reward.png)
2142 
2143 
2144---
2145 
2146## File: rl/rl-envs.mdx
2147 
2148import { CookbookLink } from '../../components/CookbookLink'
2149 
2150# RL Environments
2151 
Here, we'll explain how to create your own RL environments and train on them. First, let's look at the basic classes, which can be found in <CookbookLink path="tinker_cookbook/rl/types.py">`tinker_cookbook.rl.types`</CookbookLink>. As you can see, there's an `Env` interface, corresponding to an RL environment. To write an environment, you need to implement two methods: `initial_observation` and `step`.
2153 
```python
class Env:
    """
    Stateful environment that a single agent interacts with.
    Discard after running for one episode.
    """

    async def initial_observation(self) -> tuple[Observation, StopCondition]:
        raise NotImplementedError

    async def step(self, action: Action) -> StepResult:
        raise NotImplementedError
```
2167 
2168Note that this `Env` operates on *tokens*, rather than strings or messages. Why define it this way, when it's usually more natural to define the logic in terms of strings or messages? We've defined `Env` this way because this interface is what's needed by the *training* code, which needs to know the exact tokens that were sampled, and their logprobs.
2169 
2170We need to write two more small classes to use this environment in the RL training code. First, since the environment is discarded after a single episode, we need to be able to instantiate new environments in the training loop. We actually build a *group* of environments at a time, which enables multi-agent training or objectives that compare multiple samples (for example, a reward model that acts on a pair of samples).
2171 
```python
class EnvGroupBuilder:
    """
    Builds a group of environments.
    """

    async def make_envs(self) -> Sequence[Env]:
        raise NotImplementedError
```
2181 
2182This object creates a group of environments. Often it does the trivial thing of returning a list of copies of the same environment.
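
As an illustration of that trivial case, the sketch below returns independent copies of one template environment. The `Env` and `EnvGroupBuilder` classes here are local stand-ins that mirror the interfaces above so the snippet runs on its own, and `CopiesGroupBuilder` is a hypothetical name.

```python
import asyncio
import copy
from dataclasses import dataclass
from typing import Sequence


class Env:
    """Local stand-in for the Env interface (methods omitted for brevity)."""


class EnvGroupBuilder:
    """Local stand-in mirroring the builder interface."""

    async def make_envs(self) -> Sequence[Env]:
        raise NotImplementedError


@dataclass
class CopiesGroupBuilder(EnvGroupBuilder):
    """Hypothetical builder that returns `num_envs` copies of one environment."""

    template_env: Env
    num_envs: int

    async def make_envs(self) -> Sequence[Env]:
        # Deep-copy so each rollout gets its own independent, stateful environment.
        return [copy.deepcopy(self.template_env) for _ in range(self.num_envs)]


envs = asyncio.run(CopiesGroupBuilder(template_env=Env(), num_envs=4).make_envs())
print(len(envs))
```

Copying matters because each `Env` is stateful and discarded after one episode, so the rollouts in a group should not share an instance.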
2183 
2184Finally, we need a dataset of these EnvGroupBuilders.
2185 
```python
class RLDataset:
    """
    Dataset of EnvGroupBuilders.
    """

    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
        raise NotImplementedError
```
2195 
2196 
2197That's a lot of classes! But their combination gives us a lot of flexibility. In previous implementations (like OpenAI Gym), the dataset is implicitly part of the environment; this structure is more modular and gives us more control over the data loading.
2198 
2199## Building a simple example
2200 
2201You can find an example of writing a new RL environment in the <CookbookLink path="tinker_cookbook/recipes/multiplayer_rl/twenty_questions">Twenty Questions</CookbookLink> directory.
2202Here, we define a multi-step environment, where we're training a question-asking agent, which asks questions to another agent to guess a hidden word.
2203In this case, the answerer model is fixed and is Llama-3.1-8B-Instruct.
2204The player model (which we fine-tune) is also based on that same model.
2205 
2206You can run the training script as follows:
2207 
```bash
python -m tinker_cookbook.recipes.multiplayer_rl.twenty_questions.train
```
2211 
2212 
2213---
2214 
2215## File: supervised-learning/sl-hyperparams.mdx
2216 
2217# Supervised Learning Hyperparameters
2218 
Successful LLM fine-tuning requires careful hyperparameter tuning. The most accurate approach is to sweep over a range for each hyperparameter and select the values that minimize loss or maximize eval performance, but this is often time-consuming and expensive. This guide provides starting recommendations for the most important hyperparameters.
2220 
2221 
2222## Learning rate
2223 
2224The most important hyperparameter is generally the learning rate (LR). Our current best estimate of optimal LR for a model $m$ is the following:
2225 
$$ LR(m) = lr_{base} \cdot M_{LoRA} \cdot \Big(\frac{2000}{H_m}\Big)^{P_m} $$
2227 
2228where $lr_{base}$ is a constant base LR, $M_{LoRA}$ is a multiplier applied when using LoRA (1 if using full-finetuning), $H_m$ is the hidden size of the model $m$, and $P_m$ is a model-specific exponent adjustment. Importantly, this function is independent of the LoRA rank.
2229 
Our current best estimates are the following: $lr_{base} = 5 \times 10^{-5}$,
$M_{LoRA} = 10$, $P_m = 0.0775$ for Qwen models and $P_m = 0.781$ for Llama models.
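
To see the formula in action, here is a small sketch that evaluates it directly. The function `estimate_lr` is a hypothetical helper, and the hidden size of 2048 used in the example call is an assumed value for illustration; check your model's config for the real one.

```python
def estimate_lr(
    hidden_size: int,
    exponent: float,
    use_lora: bool = True,
    lr_base: float = 5e-5,
    lora_multiplier: float = 10.0,
) -> float:
    """Evaluate LR(m) = lr_base * M_LoRA * (2000 / H_m) ** P_m."""
    multiplier = lora_multiplier if use_lora else 1.0
    return lr_base * multiplier * (2000 / hidden_size) ** exponent


# Llama-style exponent with an assumed hidden size of 2048.
print(estimate_lr(hidden_size=2048, exponent=0.781))
```

A model with hidden size exactly 2000 gets `lr_base * M_LoRA`; larger hidden sizes shrink the recommended LR, more sharply for the larger exponent.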
2232 
2233### Getting the recommended learning rate
2234You can use the following function to get the recommended LR for any Llama or Qwen model:
```python
from tinker_cookbook.hyperparam_utils import get_lr
model_name = "meta-llama/Llama-3.2-1B"
recommended_lr = get_lr(model_name)
print(f"Recommended LR: {recommended_lr}")
```
2241### Validation
We validated this formula across diverse supervised fine-tuning experiments, varying datasets, dataset sizes, batch sizes, and LoRA ranks.

Using our LR estimates resulted in \<0.5% regret compared to exhaustive hyperparameter sweeps, where the regret of using a learning rate $lr'$ is defined as:

$$regret(lr') = \frac{loss(lr') - \min_{lr} loss(lr)}{\min_{lr} loss(lr)}$$
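
As a worked example of this definition, the sketch below computes regret against a small sweep. The helper `regret` and all loss values are made up for illustration.

```python
def regret(loss_at_lr: float, sweep_losses: list[float]) -> float:
    """Relative gap between the loss at a chosen LR and the best loss in the sweep."""
    best = min(sweep_losses)
    return (loss_at_lr - best) / best


# Hypothetical final losses from a sweep over three learning rates.
sweep = [1.25, 1.20, 1.30]
print(regret(1.203, sweep))  # loss obtained with the recommended LR (made up)
```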
2248 
2249 
2250## Batch size
2251 
2252Batch size is the second-most important hyperparameter; it significantly affects both training efficiency and final performance.
2253 
For small batch sizes, there's a phenomenon of *perfect scaling*, where the LR and batch size should be varied together as $LR \propto \sqrt{B}$, and the learning curve depends only on $\frac{LR}{\sqrt{B}}$. See [Shallue et al. (2018)](https://arxiv.org/abs/1811.03600) for an example in the training-from-scratch setting.
2255 
2256When fine-tuning LLMs, we're often in a regime where smaller batch sizes give better performance, at the cost of longer training time; moreover, the $LR \propto \sqrt{B}$ scaling doesn't always hold. When doing SL fine-tuning, we recommend using smaller batch sizes like 128, depending on your tolerance for longer training time.
2257 
For best results, aim for at least 100 training steps; results are usually best with 1000 or more.
2259 
2260⚠️ Note: Our batch size recommendations are based on preliminary findings and ongoing research. We're not confident about them!
2261 
2262 
2263---
2264 
2265## File: supervised-learning/sl-basic.mdx
2266 
2267import { CookbookLink } from '../../components/CookbookLink'
2268 
2269# Basic Supervised Learning
2270 
2271This guide walks you through running your first supervised learning experiment using Tinker's built-in training loop.
2272 
2273## Quick start
2274 
2275We've provided an implementation of supervised learning in <CookbookLink path="tinker_cookbook/supervised/train.py">train_cli.py</CookbookLink>. To use this training loop, you'll need to create a `Config` object with the data and parameters.
2276 
2277We've provided a ready-to-run example that fine-tunes Llama-3.1-8B on a small instruction-following dataset in <CookbookLink path="tinker_cookbook/recipes/sl_basic.py">sl_basic.py</CookbookLink>. You can run it from the command line as follows:
2278 
2279```bash
2280python -m tinker_cookbook.recipes.sl_basic
2281```
2282 
2283This script fine-tunes the base (pretrained) model on a small dataset called [NoRobots](https://huggingface.co/datasets/HuggingFaceH4/no_robots), created by Hugging Face.
2284 
2285### What you'll see during training
2286 
2287- Each step you should see a printout of the train and test loss, along with other stats like timing.
2288- The training script will also print out what the data looks like, with predicted tokens (weight=1) in green and context tokens (weight=0) in yellow.
2289- The training script will write various logs and checkpoint info to the `log_path` directory, which is set to `/tmp/tinker-examples/sl_basic` in the example script.
2290 
2291### Understanding the output files
2292Looking at the `log_path` directory, you will find several files of interest:
- `metrics.jsonl`: the training metrics that were also printed to the console. You can load and plot them like this:

    ```python
    import pandas
    import matplotlib.pyplot as plt
    df = pandas.read_json("/tmp/tinker-examples/sl_basic/metrics.jsonl", lines=True)
    plt.plot(df['train_mean_nll'], label='train_loss')
    plt.plot(df['test/nll'].dropna(), label='test_loss')
    plt.legend()
    plt.show()
    ```
2304You should see a plot like this:
2305![Train and test loss as a function of steps](./images/train_test_loss.png)
2306 
2307 
2308- `checkpoints.jsonl`: the checkpoints that were saved during training. Recall from [Saving and Loading](/save-load) that there are (currently) two kinds of checkpoints: one that has "/sampler_weights/" in the path (used for sampling), and the other that has "/weights/" in the path (includes full optimizer state, used for resuming training). If you interrupt the training script, then run it again, it will ask you if you want to resume training. If you choose to do so, it'll load the last (full state) checkpoint from this file.
2309- `config.json`: the configuration that you used for training.
2310 
2311In the `sl_basic` script, you'll see that there's also some disabled code (under `if 0:`) that shows how to use your own dataset, specified as a JSONL file, provided in the format of <CookbookLink path="example-data/conversations.jsonl">conversations.jsonl</CookbookLink>.
2312 
2313 
2314---
2315 
2316## File: supervised-learning/prompt-distillation.mdx
2317 
2318import { CookbookLink } from '../../components/CookbookLink'
2319 
2320# Prompt Distillation
2321 
2322Prompt distillation is a training technique in which a model is optimized to behave as though it had been provided with a long and complex prompt, without requiring access to that prompt during inference.
2323 
2324At a high level, this procedure involves two main steps:
2325- **Creation of distillation data**: A teacher prompt, which is typically lengthy and highly detailed, provides explicit, step-by-step instructions. A teacher model uses this prompt to generate responses for a set of queries.
2326- **Training the student model**: A student model is then trained (or fine-tuned) on the distilled dataset, thereby learning to reproduce the essential behaviors and reasoning encoded in the teacher’s instructions.
2327 
2328---
2329 
2330## Overview
2331 
2332Let $f_T$ and $f_S$ denote the teacher and student models, respectively. Given an instruction prompt $P$ and a query $q_i$, the teacher model generates a response $r_i$:
2333 
2334$$
2335r_i = f_T([P, q_i])
2336$$
2337 
2338Here, the prompt $P$ and the query $q_i$ are concatenated to form the input to the teacher model $f_T$. For a dataset of queries $Q = \{q_i \mid 1 \leq i \leq D\}$, we obtain a corresponding set of teacher responses $R = \{r_i \mid 1 \leq i \leq D\}$.
2339 
2340The distillation training dataset is defined as the set of query–response pairs (excluding the original prompt):
2341 
2342$$
2343T = \{(q_i, r_i) \mid 1 \leq i \leq D\}.
2344$$
2345 
2346The student model $f_S$ is then trained to minimize the cross-entropy loss:
2347 
2348$$
2349\ell(f_S(q_i), r_i) = \ell(f_S(q_i), f_T([P, q_i])).
2350$$
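
The data-creation step can be sketched in a few lines. Here `teacher` is any callable that maps the concatenated input $[P, q_i]$ to a response; the helper `build_distillation_dataset`, the newline-based concatenation, and the toy teacher are all illustrative assumptions rather than the recipe's code.

```python
from typing import Callable


def build_distillation_dataset(
    teacher: Callable[[str], str],
    prompt: str,
    queries: list[str],
) -> list[dict]:
    """Build T = {(q_i, r_i)}: the teacher sees [P, q_i], the dataset keeps only q_i and r_i."""
    dataset = []
    for query in queries:
        response = teacher(prompt + "\n\n" + query)  # r_i = f_T([P, q_i])
        dataset.append({"query": query, "response": response})  # P is dropped
    return dataset


# Toy stand-in for the teacher model.
def toy_teacher(full_input: str) -> str:
    return "en" if "hello" in full_input.lower() else "ot"


pairs = build_distillation_dataset(toy_teacher, "Classify the language.", ["Hello there", "12345"])
print(pairs)
```

The student is then trained with ordinary supervised learning on these pairs, so it never sees the long prompt $P$.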
2351 
2352---
2353 
2354## Example
2355 
2356The Tinker Cookbook provides a prompt distillation recipe tailored for a language classification task. The objective is straightforward: given a text query, the model should predict a two-character code corresponding to the language of the input. The set of possible labels is:
2357```
2358ar (Arabic), de (German), el (Greek), en (English), es (Spanish), fr (French), hi (Hindi), ru (Russian), tr (Turkish), ur (Urdu), vi (Vietnamese), zh (Chinese - Simplified), ot (Other/Unknown).
2359```
2360 
2361The recipe in <CookbookLink path="tinker_cookbook/recipes/prompt_distillation/create_data.py">recipes/prompt_distillation/create_data.py</CookbookLink> also includes handling strategies for inputs containing code, numerical content, or multiple languages.
2362 
2363In the example below, the same model (`Qwen/Qwen3-30B-A3B`) is used as both teacher and student, though in general they need not be identical.
2364 
2365---
2366 
2367### Step 1: Generate Training Data
2368 
Generate the prompt distillation data with the teacher model using <CookbookLink path="tinker_cookbook/recipes/prompt_distillation/create_data.py">recipes/prompt_distillation/create_data.py</CookbookLink>:
2370 
```bash
python -m tinker_cookbook.recipes.prompt_distillation.create_data \
  output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
```
2375 
2376This command will:
2377- Use the configured teacher model to generate language classification examples
2378- Save the distilled dataset to the specified output file
2379- Create diverse training examples suitable for student model fine-tuning
2380 
2381### Step 2: Train the Student Model
2382 
2383Fine-tune a student model on the distillation data using <CookbookLink path="tinker_cookbook/recipes/prompt_distillation/train.py">recipes/prompt_distillation/train.py</CookbookLink>:
2384 
```bash
python -m tinker_cookbook.recipes.prompt_distillation.train
```
2388 
2389The training script will:
2390- Load the generated distillation dataset
2391- Apply optimized training configurations
2392- Fine-tune the student model for language classification
2393 
2394### Step 3: Test Your Model
2395 
2396Once training is complete, you can test your distilled model by sampling from the trained model to verify its performance on language classification tasks.
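
One way to quantify this check is to compare sampled labels against known answers. The sketch below assumes you have already collected the model's raw completions as strings; `evaluate` is a hypothetical helper, and the predictions in the usage line are made-up placeholders, not results from the recipe.

```python
from typing import Dict, List

# Label set taken from the recipe description above.
VALID_LABELS = {"ar", "de", "el", "en", "es", "fr", "hi", "ru", "tr", "ur", "vi", "zh", "ot"}


def evaluate(predictions: List[str], gold_labels: List[str]) -> Dict[str, float]:
    """Compute exact-match accuracy and the share of outputs outside the label set."""
    if not gold_labels or len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must be non-empty and equally long")
    # Normalize whitespace and case before comparing.
    cleaned = [prediction.strip().lower() for prediction in predictions]
    correct = sum(pred == gold for pred, gold in zip(cleaned, gold_labels))
    invalid = sum(pred not in VALID_LABELS for pred in cleaned)
    total = len(gold_labels)
    return {"accuracy": correct / total, "invalid_rate": invalid / total}


# Placeholder completions for illustration only.
result = evaluate(["fr", " EN\n", "xx", "zh"], ["fr", "en", "de", "zh"])
print(result)
```

Tracking the invalid rate separately helps distinguish a student that picks the wrong language from one that has not learned the two-character output format.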
2397 
2398## Advanced Configuration
2399 
2400The prompt distillation recipe can be customized for different scenarios:
2401 
2402- **Teacher model selection**: Choose different base models based on your requirements
2403- **Sampling strategies**: Adjust temperature and other generation parameters
2404- **Data volume**: Scale the number of generated examples based on your needs
2405- **Training hyperparameters**: Fine-tune learning rates and other training settings
2406 
2407 
2408---
2409 
2410## File: supervised-learning/sweep-case-study.mdx
2411 
2412import { CookbookLink } from '../../components/CookbookLink'
2413 
2414# Sweep case study
2415 
2416In [Supervised Learning Hyperparameters](./sl-hyperparams), we introduced default hyperparameters as a starting point. While defaults are useful, optimal values are often task-specific. A hyperparameter sweep---systematically testing values across a range---is a more reliable way to identify the best settings for your use case.
2417 
2418This guide demonstrates how to sweep over the **learning rate (LR)** to find an optimal value.
2419 
2420## Why sweep the learning rate?
2421 
2422The learning rate is typically the most impactful hyperparameter. While our default recommendations perform well (usually \<0.5% regret), you can often achieve even better results by sweeping to find the task-specific optimum.
2423 
2424 
2425## Setup
2426 
2427We use the simple supervised learning training loop in
2428<CookbookLink path="tinker_cookbook/recipes/sl_loop.py">sl_loop.py</CookbookLink>, which trains a Llama-3.1-8B model.
2429 
To retrieve the model’s default learning rate recommendation:
```python
from tinker_cookbook.hyperparam_utils import get_lr
print(get_lr("meta-llama/Llama-3.1-8B"))
```
This should output:
```
0.0002856415043086949  # ≈ 2.8e-4
```
This default value provides a baseline. A common best practice is to sweep roughly one order of magnitude above and below the default. For this case, we sweep over: $LR \in \{1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3\}$
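
The sweep values follow a regular pattern: mantissas 1 and 3 at each power of ten, which gives roughly half-decade spacing on a log scale. A small sketch of generating such a grid is shown here; `lr_grid` is a hypothetical helper, not part of the cookbook.

```python
from typing import List, Sequence


def lr_grid(min_exponent: int, max_exponent: int, mantissas: Sequence[int] = (1, 3)) -> List[float]:
    """Return mantissa * 10**exponent for every exponent in the inclusive range."""
    grid = []
    for exponent in range(min_exponent, max_exponent + 1):
        for mantissa in mantissas:
            # Parsing the scientific-notation string avoids rounding noise
            # that mantissa * 10.0 ** exponent can introduce.
            grid.append(float(f"{mantissa}e{exponent}"))
    return grid


learning_rates = lr_grid(-5, -3)
print(learning_rates)
```

Widening the exponent range extends the sweep if the loss curve has not yet turned upward at either end.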
2440 
2441 
2442 
2443## Running the sweep
2444Launch experiments in parallel, using separate terminal windows for each LR value. For example:
```bash
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sft-lr-sweep/lr-0.003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sft-lr-sweep/lr-0.001
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sft-lr-sweep/lr-0.0003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sft-lr-sweep/lr-0.0001
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sft-lr-sweep/lr-0.00003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sft-lr-sweep/lr-0.00001
```
2453You can also automate this process by writing a script that spawns multiple tmux windows and launches experiments programmatically. This is especially useful for larger sweeps.
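
As one possible form of such automation, the sketch below builds the same six commands programmatically. It swaps tmux for Python's `subprocess` module to stay self-contained, and `format_lr`, `build_commands`, and `launch` are hypothetical helpers. By default it only prints the commands (a dry run); pass `dry_run=False` to actually start the processes.

```python
import shlex
import subprocess
from decimal import Decimal
from typing import List

SWEEP_ROOT = "/tmp/sft-lr-sweep"  # same log root as the commands above


def format_lr(lr: float) -> str:
    """Render a float in plain decimal notation, e.g. 3e-05 -> '0.00003'."""
    return format(Decimal(str(lr)), "f")


def build_commands(learning_rates: List[float]) -> List[List[str]]:
    """Build one sl_loop command per learning rate, each with its own log path."""
    commands = []
    for lr in learning_rates:
        lr_text = format_lr(lr)
        commands.append([
            "python", "-m", "tinker_cookbook.recipes.sl_loop",
            f"learning_rate={lr_text}",
            f"log_path={SWEEP_ROOT}/lr-{lr_text}",
        ])
    return commands


def launch(commands: List[List[str]], dry_run: bool = True) -> List[subprocess.Popen]:
    """Print each command; start it as a background process unless dry_run is set."""
    processes = []
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            processes.append(subprocess.Popen(command))
    return processes


commands = build_commands([3e-3, 1e-3, 3e-4, 1e-4, 3e-5, 1e-5])
launch(commands, dry_run=True)
```

When launching for real, keep the returned `Popen` objects and call `wait()` on each so the script does not exit before the runs finish.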
2454 
2455 
2456## Collecting Results
2457After the experiments are complete, you can read the `metrics.jsonl` files:
```python
from glob import glob
import json

import pandas

data = []
for fname in sorted(glob("/tmp/sft-lr-sweep/*/metrics.jsonl")):
    df = pandas.read_json(fname, lines=True)
    # Skip experiments that have not finished yet.
    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
        continue
    # Each run stores its hyperparameters next to its metrics.
    config_fname = fname.replace("metrics.jsonl", "config.json")
    with open(config_fname, "rb") as f:
        metadata = json.load(f)
    data.append({
        "fname": fname,
        "learning_rate": metadata["learning_rate"],
        "final_loss": df["train_mean_nll"].iloc[-1].item()
    })

print(f"Read metrics for {len(data)} experiments")
```
2481If all the experiments are completed, the above code should print:
2482```
2483Read metrics for 6 experiments
2484```
2485 
2486## Visualizing the Sweep
2487Plot the `final_loss` as a function of `learning_rate`:
```python
import matplotlib.pyplot as plt
import pandas

# `data` is the list of per-run records collected in the previous step.
df = pandas.DataFrame(data)
plt.plot(df["learning_rate"], df["final_loss"], marker='o')
plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--")
plt.ylim(1.65, 1.8)
plt.xscale("log")
plt.xlabel("Learning Rate (log scale)")
plt.ylabel("Final Loss")
plt.title("Final Loss vs Learning Rate")
plt.show()
```
2500You should see a U-shaped curve, similar to this:
2501![final_loss_vs_lr](./images/lr_sweep.png)
2502 
2503If the full U-curve is not visible in your setting, expand the sweep range by adding more LR values.
2504 
2505 
2506## Determining the Optimal LR
The optimal learning rate is the one that minimizes the loss. The plot above shows that the optimal LR is `3e-4`, which you can also compute by finding the row with the minimum loss:
```python
optimal_lr = df.loc[df["final_loss"].idxmin(), "learning_rate"]
print(f"The optimal LR is {optimal_lr:.2e}")
```
Expected output:
```
The optimal LR is 3.00e-04
```
2516 
2517Note that the optimal LR in our sweep (`3e-4`) is very close to the default LR (`2.8e-4`). However, task-specific sweeps can still provide marginal improvements and greater confidence in your hyperparameter choices.
2518 
2519## Next steps
2520Now that you've identified the optimal learning rate:
1. Retrain with the optimal LR for your production run
2. Consider sweeping other hyperparameters like batch size, warmup steps, or weight decay
3. Use the optimal LR as a baseline for future experiments on similar tasks
2524 
2525 
2526---
2527 
2528## File: supervised-learning/sl-loop.mdx
2529 
2530import { CookbookLink } from '../../components/CookbookLink'
2531 
2532# Supervised Learning Training Loop
2533 
2534We've provided a simple SL training loop in <CookbookLink path="tinker_cookbook/recipes/sl_loop.py">sl_loop.py</CookbookLink>, which avoids using our dataset classes and instead defines the data loading in a more self-contained way. This is for people who like to write their own training loops or learn about how things work under the hood. Our more performant implementation in <CookbookLink path="tinker_cookbook/supervised/train.py">supervised/train.py</CookbookLink> does basically the same thing, but with some performance optimizations, and with some additional features like periodic evals.
2535 
2536 
2537---
2538 
2539## File: compatible-apis/openai.mdx
2540 
2541# OpenAI API Compatible Inference (in beta)
2542 
2543OpenAI-compatible inference lets you interact with any model checkpoint in Tinker, using an endpoint compatible with the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/chat). It’s designed to let you easily “poke at” your model while you're training it.
2544 
2545For inference within your training runs (e.g. RL), we recommend using Tinker’s standard [sampling client](/training-sampling).
2546 
2547Currently, OpenAI-compatible inference is meant for testing and internal use with low internal traffic, rather than large, high-throughput, user-facing deployments. Latency and throughput may vary by model and may change without notice during the beta. If you need higher or more stable throughput, contact the Tinker team in [our Discord](https://discord.gg/KqqEZNX88c) for guidance on larger-scale setups.
2548 
2549## Use Cases
2550 
2551OpenAI-compatible inference is designed for
2552- **Fast feedback while training**: Start sampling very quickly from any sampler checkpoint obtained during training.
2553- **Sampling while training continues**: Sample even while the training job is still running on that experiment.
- **Developer & internal workflows**: Intended for testing, evaluation, and internal tools.
2555 
2556We will release production-grade inference soon and will update our users then.
2557 
## Using OpenAI-compatible inference from an OpenAI client
2559 
2560The new interface exposes an OpenAI-compatible HTTP API. You can use any OpenAI SDK or HTTP client that lets you override the base URL.
2561 
1\. Set the base URL of your OpenAI-compatible client to:

```
https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1
```

2\. Use a Tinker sampler weight path as the model name. For example:

```
tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080
```

Any valid Tinker sampler checkpoint path works here. You can keep training and sample from the same checkpoint simultaneously.

3\. Authenticate with your Tinker API key by passing the same key used for Tinker as the API key to the OpenAI client.
2577 
2578**Note:** We support both `/completions` and `/chat/completions` endpoints. Chat requests are rendered with the model’s default Hugging Face chat template; if your checkpoint expects a different renderer, render the prompt yourself (see [Rendering](/rendering)) and use `/completions`.
2579 
2580## Code Example
2581 
```py
from os import getenv
from openai import OpenAI

BASE_URL = "https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1"
MODEL_PATH = "tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080"

# The same key used for Tinker doubles as the OpenAI client API key.
api_key = getenv("TINKER_API_KEY")

client = OpenAI(
    base_url=BASE_URL,
    api_key=api_key,
)

response = client.completions.create(
    model=MODEL_PATH,
    prompt="The capital of France is",
    max_tokens=50,
    temperature=0.7,
    top_p=0.9,
)

print(response.choices[0].text)
```
2606 
2607Notes:
2608 
2609* `BASE_URL` points to the OpenAI compatible inference endpoint.
2610* `MODEL_PATH` is a sampler checkpoint path from Tinker (`tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080`).
2611* The rest of the arguments (`prompt`, `max_tokens`, `temperature`, `top_p`) behave like they do in the OpenAI Completions API.
2612* You can swap `MODEL_PATH` to any other sampler checkpoint to compare runs quickly in your evals or notebooks.
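
The `/chat/completions` endpoint mentioned in the note above can be exercised without any SDK. The sketch below uses only the Python standard library and makes two assumptions that are not stated in this page: that the key is sent as an `Authorization: Bearer` header (as OpenAI clients do), and that the response follows the OpenAI chat format. `build_chat_request` is a hypothetical helper, and the request is only sent when `TINKER_API_KEY` is set.

```python
import json
import os
import urllib.request

BASE_URL = "https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1"
MODEL_PATH = "tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080"


def build_chat_request(api_key: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the chat completions endpoint."""
    payload = {
        "model": MODEL_PATH,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 50,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    os.getenv("TINKER_API_KEY", "placeholder-key"),
    "What is the capital of France?",
)
print(request.full_url)

# Only contact the service when a real key is configured.
if os.getenv("TINKER_API_KEY"):
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed OpenAI-style response shape.
    print(body["choices"][0]["message"]["content"])
```

Because chat requests are rendered with the model's default chat template, this path is mainly useful for checkpoints trained with that template.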
2613 
2614## Related docs
2615 
2616* [Getting a `TINKER_API_KEY`](/install)
2617 
2618* [Security and Privacy](https://thinkingmachines.ai/legal/terms/)
2619 
2620* [Training and Sampling](/training-sampling)
2621 
2622 
2623---
2624 
2625# PART 2: TYPE DEFINITIONS
2626 
2627Total types collected: 30
2628 
2629## Type: AdamParams
2630 
```python
class AdamParams(StrictBase):
    learning_rate: float = 0.0001
    """Learning rate for the optimizer"""

    beta1: float = 0.9
    """Coefficient used for computing running averages of gradient"""

    beta2: float = 0.95
    """Coefficient used for computing running averages of gradient square"""

    eps: float = 1e-12
    """Term added to the denominator to improve numerical stability"""
```
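
To make the role of each field concrete, here is a sketch of the textbook Adam update for a single scalar parameter, using the defaults above. It illustrates the standard algorithm with bias correction; it is an assumption that the Tinker backend follows this formulation, and `AdamState` and `adam_step` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class AdamState:
    """Running moment estimates for a single scalar parameter."""
    m: float = 0.0  # running average of gradients (controlled by beta1)
    v: float = 0.0  # running average of squared gradients (controlled by beta2)
    step: int = 0


def adam_step(param: float, grad: float, state: AdamState,
              learning_rate: float = 0.0001, beta1: float = 0.9,
              beta2: float = 0.95, eps: float = 1e-12) -> float:
    """Apply one standard Adam update and return the new parameter value."""
    state.step += 1
    state.m = beta1 * state.m + (1 - beta1) * grad
    state.v = beta2 * state.v + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized averages.
    m_hat = state.m / (1 - beta1 ** state.step)
    v_hat = state.v / (1 - beta2 ** state.step)
    # eps keeps the division stable when v_hat is close to zero.
    return param - learning_rate * m_hat / (math.sqrt(v_hat) + eps)


state = AdamState()
new_param = adam_step(param=1.0, grad=0.5, state=state)
print(new_param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by approximately `learning_rate` regardless of the gradient's magnitude.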
2645 
2646## Type: CreateModelResponse
2647 
2648```python
2649class CreateModelResponse(BaseModel):
2650    model_id: ModelID
2651 
2652    type: Literal["create_model"] = "create_model"
2653```
2654 
2655## Type: Datum
2656 
```python
class Datum(StrictBase):
    loss_fn_inputs: LossFnInputs
    """Dictionary mapping field names to tensor data"""

    model_input: ModelInput

    @model_validator(mode="before")
    @classmethod
    def convert_tensors(cls, data: Any) -> Any:
        """Convert torch.Tensor and numpy arrays to TensorData in loss_fn_inputs during construction."""
        if isinstance(data, dict) and "loss_fn_inputs" in data:
            loss_fn_inputs = data["loss_fn_inputs"]
            if isinstance(loss_fn_inputs, dict):
                converted_inputs = {}
                for key, value in loss_fn_inputs.items():
                    converted_inputs[key] = cls._maybe_convert_array(key, value)
                data = dict(data)  # Make a copy
                data["loss_fn_inputs"] = converted_inputs
        return data

    @classmethod
    def _maybe_convert_array(cls, key: str, value: Any) -> Any:
        """Convert torch.Tensor, numpy array, or 1-D list to TensorData if needed."""
        if _HAVE_TORCH and isinstance(value, torch.Tensor):
            return TensorData.from_torch(value)
        elif isinstance(value, np.ndarray):
            return TensorData.from_numpy(value)
        elif isinstance(value, list):
            # assume it's 1d and infer the dtype from the key
            return TensorData(data=value, dtype=_key_to_type[key], shape=[len(value)])
        else:
            return value


_key_to_type = {
    "target_tokens": "int64",
    "weights": "float32",
    "advantages": "float32",
    "logprobs": "float32",
    "clip_low_threshold": "float32",
    "clip_high_threshold": "float32",
}
```
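
The list branch of `_maybe_convert_array` infers the dtype purely from the field name. The standalone sketch below mimics that branch with plain dictionaries so it runs without the SDK; `list_to_tensor_dict` is a hypothetical helper and the returned dict is only a stand-in for `TensorData`.

```python
from typing import Any, Dict, List

# Copied from the definition above: the dtype assigned to each known key.
KEY_TO_TYPE = {
    "target_tokens": "int64",
    "weights": "float32",
    "advantages": "float32",
    "logprobs": "float32",
    "clip_low_threshold": "float32",
    "clip_high_threshold": "float32",
}


def list_to_tensor_dict(key: str, values: List[Any]) -> Dict[str, Any]:
    """Plain-dict mimic of how a 1-D list is wrapped, with the dtype chosen by key."""
    if key not in KEY_TO_TYPE:
        # The lookup in the class above would raise KeyError here as well.
        raise KeyError(f"No dtype is registered for key {key!r}")
    return {"data": values, "dtype": KEY_TO_TYPE[key], "shape": [len(values)]}


raw_inputs = {"target_tokens": [11, 42, 7], "weights": [0.0, 1.0, 1.0]}
converted = {key: list_to_tensor_dict(key, values) for key, values in raw_inputs.items()}
print(converted["target_tokens"])
print(converted["weights"])
```

A practical consequence is that plain lists only work for the keys listed in `_key_to_type`; the array and tensor branches take the dtype from the value itself instead.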
2701 
2702## Type: EncodedTextChunk
2703 
2704```python
2705class EncodedTextChunk(StrictBase):
2706    tokens: Sequence[int]
2707    """Array of token IDs"""
2708 
2709    type: Literal["encoded_text"] = "encoded_text"
2710 
2711    @property
2712    def length(self) -> int:
2713        return len(self.tokens)
2714```
2715 
2716## Type: ForwardBackwardInput
2717 
2718```python
2719class ForwardBackwardInput(StrictBase):
2720    data: List[Datum]
2721    """Array of input data for the forward/backward pass"""
2722 
2723    loss_fn: LossFnType
2724    """Fully qualified function path for the loss function"""
2725 
2726    loss_fn_config: Optional[Dict[str, float]] = None
2727    """Optional configuration parameters for the loss function (e.g., PPO clip thresholds, DPO beta)"""
2728```
2729 
2730## Type: ForwardBackwardOutput
2731 
2732```python
2733class ForwardBackwardOutput(BaseModel):
2734    loss_fn_output_type: str
2735    """The type of the ForwardBackward output. Can be one of [...] TODO"""
2736 
2737    loss_fn_outputs: List[LossFnOutput]
2738    """Dictionary mapping field names to tensor data"""
2739 
2740    metrics: Dict[str, float]
2741    """Training metrics as key-value pairs"""
2742```
2743 
2744## Type: GetInfoResponse
2745 
2746```python
2747class GetInfoResponse(BaseModel):
2748    type: Optional[Literal["get_info"]] = None
2749 
2750    model_data: ModelData
2751 
2752    model_id: ModelID
2753 
2754    is_lora: Optional[bool] = None
2755 
2756    lora_rank: Optional[int] = None
2757 
2758    model_name: Optional[str] = None
2759 
2760    if PYDANTIC_V2:
2761        # allow fields with a `model_` prefix
2762        model_config = ConfigDict(protected_namespaces=tuple())
2763```
2764 
2765## Type: GetServerCapabilitiesResponse
2766 
2767```python
2768class GetServerCapabilitiesResponse(BaseModel):
2769    supported_models: List[SupportedModel]
2770```
2771 
2772## Type: ImageAssetPointerChunk
2773 
2774```python
2775class ImageAssetPointerChunk(StrictBase):
2776    format: Literal["png", "jpeg"]
2777    """Image format"""
2778 
2779    location: str
2780    """Path or URL to the image asset"""
2781 
2782    expected_tokens: int | None = None
2783    """Expected number of tokens this image represents.
2784    This is only advisory: the tinker backend will compute the number of tokens
2785    from the image, and we can fail requests quickly if the tokens does not
2786    match expected_tokens."""
2787 
2788    type: Literal["image_asset_pointer"] = "image_asset_pointer"
2789 
2790    @property
2791    def length(self) -> int:
2792        if self.expected_tokens is None:
2793            raise ValueError("ImageAssetPointerChunk expected_tokens needs to be set in order to compute the length")
2794        return self.expected_tokens
2795```
2796 
2797## Type: ImageChunk
2798 
```python
class ImageChunk(StrictBase):
    data: bytes
    """Image data as bytes"""

    format: Literal["png", "jpeg"]
    """Image format"""

    expected_tokens: int | None = None
    """Expected number of tokens this image represents.
    This is only advisory: the tinker backend will compute the number of tokens
    from the image, and we can fail requests quickly if the tokens does not
    match expected_tokens."""

    type: Literal["image"] = "image"

    @field_validator("data", mode="before")
    @classmethod
    def validate_data(cls, value: Union[bytes, str]) -> bytes:
        """Deserialize base64 string to bytes if needed."""
        if isinstance(value, str):
            return base64.b64decode(value)
        return value

    @field_serializer("data")
    def serialize_data(self, value: bytes) -> str:
        """Serialize bytes to base64 string for JSON."""
        return base64.b64encode(value).decode("utf-8")

    @property
    def length(self) -> int:
        if self.expected_tokens is None:
            raise ValueError("ImageChunk expected_tokens needs to be set in order to compute the length")
        return self.expected_tokens
```
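
The validator and serializer above form a base64 round trip so that raw bytes can travel inside JSON. The sketch below reproduces that round trip with the standard library only. The JSON layout is an assumption derived from the field names, and the 8-byte PNG signature is used as stand-in image data.

```python
import base64
import json

# The 8-byte PNG file signature, used here as placeholder image data.
raw_bytes = b"\x89PNG\r\n\x1a\n"

# Serialize: bytes -> base64 text, which is safe to embed in JSON.
encoded = base64.b64encode(raw_bytes).decode("utf-8")
wire_payload = json.dumps({"type": "image", "format": "png", "data": encoded})

# Deserialize: base64 text -> the original bytes.
decoded = base64.b64decode(json.loads(wire_payload)["data"])
print(decoded == raw_bytes)
```

Base64 expands the payload by roughly a third, which is worth keeping in mind when sending many large images.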
2834 
2835## Type: LoadWeightsResponse
2836 
2837```python
2838class LoadWeightsResponse(BaseModel):
2839    path: Optional[str] = None
2840    """A tinker URI for model weights at a specific step"""
2841 
2842    type: Optional[Literal["load_weights"]] = None
2843```
2844 
2845## Type: LoraConfig
2846 
2847```python
2848class LoraConfig(StrictBase):
2849    rank: int
2850    """LoRA rank (dimension of low-rank matrices)"""
2851 
2852    seed: Optional[int] = None
2853    """Seed used for initialization of LoRA weights.
2854 
2855    Useful if you need deterministic or reproducible initialization of weights.
2856    """
2857 
2858    train_unembed: bool = True
2859    """Whether to add lora to the unembedding layer"""
2860 
2861    train_mlp: bool = True
2862    """Whether to add loras to the MLP layers (including MoE layers)"""
2863 
2864    train_attn: bool = True
2865    """Whether to add loras to the attention layers"""
2866```
2867 
2868## Type: LossFnInputs
2869 
2870```python
2871LossFnInputs: TypeAlias = Dict[str, TensorData]
2872```
2873 
2874## Type: LossFnOutput
2875 
2876```python
2877LossFnOutput: TypeAlias = Dict[str, TensorData]
2878```
2879 
2880## Type: LossFnType
2881 
2882```python
2883LossFnType: TypeAlias = Literal["cross_entropy", "importance_sampling", "ppo", "cispo", "dro"]
2884```
2885 
2886## Type: ModelData
2887 
2888```python
2889class ModelData(BaseModel):
2890    arch: Optional[str] = None
2891 
2892    model_name: Optional[str] = None
2893 
2894    tokenizer_id: Optional[str] = None
2895```
2896 
2897## Type: ModelID
2898 
2899```python
2900ModelID: TypeAlias = str
2901```
2902 
2903## Type: ModelInput
2904 
2905```python
2906class ModelInput(StrictBase):
2907    chunks: List[ModelInputChunk]
2908    """Sequence of input chunks (formerly TokenSequence)"""
2909 
2910 
2911    @classmethod
2912    def from_ints(cls, tokens: List[int]) -> "ModelInput":
2913        """
2914        Create a ModelInput from a list of ints (tokens).
2915        """
2916        return cls(chunks=[EncodedTextChunk(tokens=tokens)])
2917 
2918    def to_ints(self) -> List[int]:
2919        """
2920        Convert the ModelInput to a list of ints (tokens)
2921        Throws exception if there are any non-token chunks
2922        """
2923        if not all(isinstance(chunk, EncodedTextChunk) for chunk in self.chunks):
2924            raise ValueError(f"to_ints only supported for ModelInput with EncodedTextChunks, got {[type(chunk) for chunk in self.chunks]}")
2925        return [token for chunk in self.chunks for token in chunk.tokens]
2926 
2927    @property
2928    def length(self) -> int:
2929        """
2930        Return the total context length used by this ModelInput.
2931        """
2932        return sum(chunk.length for chunk in self.chunks)
2933 
2934    @classmethod
2935    def empty(cls) -> "ModelInput":
2936        """
2937        Create an empty ModelInput.
2938        """
2939        return cls(chunks=[])
2940 
2941    def append(self, chunk: ModelInputChunk) -> "ModelInput":
2942        """
2943        Add a new chunk, return a new ModelInput.
2944        """
2945        return ModelInput(chunks=self.chunks + [chunk])
2946 
2947    def append_int(self, token: int) -> "ModelInput":
2948        """
2949        Add a new token, return a new ModelInput.
2950        """
2951        return self.append(EncodedTextChunk(tokens=[token]))
2952```
2953 
2954## Type: ModelInputChunk
2955 
2956```python
2957ModelInputChunk: TypeAlias = Annotated[
2958    Union[EncodedTextChunk, ImageAssetPointerChunk, ImageChunk], PropertyInfo(discriminator="type")
2959]
2960```
2961 
2962## Type: OptimStepResponse
2963 
2964```python
2965class OptimStepResponse(BaseModel):
2966    metrics: Optional[Dict[str, float]] = None
2967    """Optimization step metrics as key-value pairs"""
2968```
2969 
2970## Type: SampleResponse
2971 
2972```python
2973class SampleResponse(BaseModel):
2974    sequences: Sequence[SampledSequence]
2975 
2976    type: Literal["sample"] = "sample"
2977 
2978    prompt_logprobs: Optional[List[Optional[float]]] = None
2979    """
2980    If prompt_logprobs was set to true in the request, logprobs are computed for
2981    every token in the prompt. The `prompt_logprobs` response contains a float32
2982    value for every token in the prompt.
2983    """
2984 
2985    topk_prompt_logprobs: Optional[list[Optional[list[tuple[int, float]]]]] = None
2986    """
2987    If topk_prompt_logprobs was set to a positive integer k in the request,
2988    the top-k logprobs are computed for every token in the prompt. The
2989    `topk_prompt_logprobs` response contains, for every token in the prompt,
2990    a list of up to k (token_id, logprob) tuples.
2991    """
2992```
2993 
2994## Type: SampledSequence
2995 
2996```python
2997class SampledSequence(BaseModel):
2998    stop_reason: StopReason
2999    """Reason why sampling stopped"""
3000 
3001    tokens: List[int]
3002    """List of generated token IDs"""
3003 
3004    logprobs: Optional[List[float]] = None
3005    """Log probabilities for each token (optional)"""
3006```
3007 
3008## Type: SamplingParams
3009 
```python
class SamplingParams(BaseModel):
    max_tokens: Optional[int] = None
    """Maximum number of tokens to generate"""

    seed: Optional[int] = None
    """Random seed for reproducible generation"""

    stop: Union[str, Sequence[str], Sequence[int], None] = None
    """Stop sequences for generation"""

    temperature: float = 1
    """Sampling temperature"""

    top_k: int = -1
    """Top-k sampling parameter (-1 for no limit)"""

    top_p: float = 1
    """Nucleus sampling probability"""
```
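
To illustrate what `top_k` and `top_p` control, the sketch below applies the standard top-k and nucleus filters to a toy next-token distribution. It follows the common textbook definitions; how the Tinker backend orders or implements these filters is not documented here, so treat `filter_distribution` as a hypothetical illustration.

```python
from typing import Dict


def filter_distribution(probs: Dict[str, float], top_k: int = -1, top_p: float = 1.0) -> Dict[str, float]:
    """Apply top-k, then nucleus (top-p) filtering, and renormalize what remains."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:
        # Keep only the k most likely tokens; -1 means no limit.
        ranked = ranked[:top_k]
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        # Stop once the kept tokens cover at least top_p of the probability mass.
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy_probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
unfiltered = filter_distribution(toy_probs)
nucleus = filter_distribution(toy_probs, top_p=0.75)
top_one = filter_distribution(toy_probs, top_k=1)
print(unfiltered, nucleus, top_one)
```

With the defaults (`top_k=-1`, `top_p=1`) no token is removed, so sampling draws from the full distribution.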
3030 
3031## Type: SaveWeightsForSamplerResponse
3032 
3033```python
3034class SaveWeightsForSamplerResponse(BaseModel):
3035    path: str
3036    """A tinker URI for model weights for sampling at a specific step"""
3037 
3038    type: Optional[Literal["save_weights_for_sampler"]] = None
3039```
3040 
3041## Type: SaveWeightsResponse
3042 
3043```python
3044class SaveWeightsResponse(BaseModel):
3045    path: str
3046    """A tinker URI for model weights at a specific step"""
3047 
3048    type: Optional[Literal["save_weights"]] = None
3049```
3050 
3051## Type: StopReason
3052 
3053```python
3054StopReason: TypeAlias = Literal["length", "stop"]
3055```
3056 
3057## Type: SupportedModel
3058 
3059```python
3060class SupportedModel(BaseModel):
3061    model_name: Optional[str] = None
3062```
3063 
3064## Type: TensorData
3065 
3066```python
3067class TensorData(StrictBase):
3068    data: Union[List[int], List[float]]
3069    """Flattened tensor data as array of numbers."""
3070 
3071    dtype: TensorDtype
3072 
3073    shape: Optional[List[int]] = None
3074    """Optional.
3075 
3076    The shape of the tensor (see PyTorch tensor.shape). The shape of a
3077    one-dimensional list of length N is `(N,)`. Can usually be inferred if not
3078    provided, and is generally inferred as a 1D tensor.
3079    """
3080 
3081    @classmethod
3082    def from_numpy(cls, array: npt.NDArray[Any]) -> "TensorData":
3083        return cls(
3084            data=array.flatten().tolist(),
3085            dtype=_convert_numpy_dtype_to_tensor(array.dtype),
3086            shape=list(array.shape),
3087        )
3088 
3089    @classmethod
3090    def from_torch(cls, tensor: "torch.Tensor") -> "TensorData":
3091        return cls(
3092            data=tensor.flatten().tolist(),
3093            dtype=_convert_torch_dtype_to_tensor(tensor.dtype),
3094            shape=list(tensor.shape),
3095        )
3096 
3097    def to_numpy(self) -> npt.NDArray[Any]:
3098        """Convert TensorData to numpy array."""
3099        numpy_dtype = _convert_tensor_dtype_to_numpy(self.dtype)
3100        arr = np.array(self.data, dtype=numpy_dtype)
3101        if self.shape is not None:
3102            arr = arr.reshape(self.shape)
3103        return arr
3104 
3105    def to_torch(self) -> "torch.Tensor":
3106        """Convert TensorData to torch tensor."""
3107        if not _HAVE_TORCH:
3108            raise ImportError("PyTorch is not installed. Cannot convert to torch tensor.")
3109 
3110        torch_dtype = _convert_tensor_dtype_to_torch(self.dtype)
3111        tensor = torch.tensor(self.data, dtype=torch_dtype)
3112        if self.shape is not None:
3113            tensor = tensor.reshape(self.shape)
3114        return tensor
3115 
3116    def tolist(self) -> List[Any]:
3117        return self.to_numpy().tolist()
3118 
3119 
3120def _convert_tensor_dtype_to_numpy(dtype: TensorDtype) -> npt.DTypeLike:
3121    """Convert TensorDtype to numpy dtype-like."""
3122    if dtype == "float32":
3123        return np.float32
3124    elif dtype == "int64":
3125        return np.int64
3126    else:
3127        raise ValueError(f"Unsupported TensorDtype: {dtype}")
3128 
3129 
3130def _convert_tensor_dtype_to_torch(dtype: TensorDtype) -> "torch.dtype":
3131    """Convert TensorDtype to torch dtype."""
3132    if not _HAVE_TORCH:
3133        raise ImportError("PyTorch is not installed. Cannot convert to torch dtype.")
3134    import torch
3135 
3136    if dtype == "float32":
3137        return torch.float32
3138    elif dtype == "int64":
3139        return torch.int64
3140    else:
3141        raise ValueError(f"Unsupported TensorDtype: {dtype}")
3142 
3143 
3144def _convert_numpy_dtype_to_tensor(dtype: np.dtype[Any]) -> TensorDtype:
3145    """Convert numpy dtype to TensorDtype."""
3146    if dtype.kind == "f":
3147        return "float32"
3148    elif dtype.kind == "i":
3149        return "int64"
3150    else:
3151        raise ValueError(f"Unsupported numpy dtype: {dtype}")
3152 
3153 
3154def _convert_torch_dtype_to_tensor(dtype: "torch.dtype") -> TensorDtype:
3155    """Convert torch dtype to TensorDtype."""
3156    # torch.dtype objects have .is_floating_point
3157    if getattr(dtype, "is_floating_point", False):
3158        return "float32"
3159    else:
3160        return "int64"
3161```
3162 
3163## Type: TensorDtype
3164 
3165```python
3166TensorDtype: TypeAlias = Literal["int64", "float32"]
3167```
3168 
3169## Type: UnloadModelResponse
3170 
3171```python
3172class UnloadModelResponse(BaseModel):
3173    model_id: ModelID
3174 
3175    type: Optional[Literal["unload_model"]] = None
3176```
3177