Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
examples/book-sft-pipeline/references/tinker.txt
# TINKER DOCUMENTATION
This file contains the complete Tinker documentation and SDK reference.

## Table of Contents

1. Documentation (MDX files)
2. Type Definitions (from tinker.types)

---

# PART 1: DOCUMENTATION

## File: index.mdx

# Tinker: a training API for researchers and developers

Tinker lets you focus on what matters in LLM fine-tuning – your data and algorithms – while we handle the heavy lifting of distributed training.

You write a simple loop that runs on your CPU-only machine, including the data or environment and the loss function. We figure out how to make the training work on a bunch of GPUs, doing the exact computation you specified, efficiently. To change the model you're working with, you only need to change a single string in your code.

Tinker gives you full control over the training loop and all the algorithmic details. It's not a magic black box that makes fine-tuning "easy". It's a clean abstraction that shields you from the complexity of distributed training while preserving your control.

Here's how the division of responsibilities works in practice:

| **You focus on** | **You write** | **We handle** |
|---|---|---|
| **Datasets and RL environments**<br />Your custom training data | **Simple Python script**<br />Runs on your CPU | **Efficient distributed training of large models**<br />Llama 70B, Qwen 235B |
| **Training logic**<br />Your loss functions, training loop, and evals | **API calls**<br />`forward_backward()`<br />`optim_step()`<br />`sample()`<br />`save_state()` | **Reliability**<br />Hardware failures handled transparently |

## Features

What the Tinker service currently supports:

- Tinker lets you fine-tune open-weight models like the Qwen and Llama series, including large mixture-of-experts models like Qwen3-235B-A22B.
- Tinker supports vision-language models (VLMs) like Qwen3-VL for image understanding tasks.
  See [Vision Inputs](/rendering#vision-inputs) for details.
- Tinker implements low-rank adaptation (LoRA) fine-tuning, not full fine-tuning. However, we believe that LoRA gives the same performance as full fine-tuning for many important use cases, especially in RL (see [LoRA Without Regret](https://thinkingmachines.ai/blog/lora/)).
- You can download the weights of your trained model to use outside of Tinker, for example with your inference provider of choice.

## A quick look at functionality

Tinker's main functionality is contained in a few key functions:

- `forward_backward`: feed in your data and loss function, and we'll compute and accumulate the gradients for you.
- `optim_step`: update your model using the accumulated gradients
- `sample`: Generate outputs from your trained model
- other functions for saving and loading weights and optimizer state

## What's next?

Some features we expect to support in the future:

- Full fine-tuning


---

## File: losses.mdx

import { CookbookLink } from '../components/CookbookLink'

# Loss functions in Tinker

For most use cases, you can use the Tinker API's built-in loss functions by passing in a string identifier to `forward_backward`, which supports cross-entropy and policy gradient objectives.
When you need more control, `forward_backward_custom` enables arbitrary differentiable loss functions at the cost of an additional forward pass; we explain both approaches in this doc.

When you call `forward_backward`, you specify a loss function using a string that selects from a predetermined set of options, comprising the most common losses used for language model training.
- **Input:** `forward_backward` expects a certain set of input tensors, passed in via `datum.loss_fn_inputs`, which is a dict mapping `str` to either a numpy or torch tensor
- **Output:** `forward_backward` returns a `ForwardBackwardOutput`, which has a set of output tensors in `fwd_bwd_result.loss_fn_outputs`

For an example of using `forward_backward`, see `rl/train.py` in the Cookbook:
```python
import tinker
import torch
from tinker import TensorData

# Create training data with required inputs
datum = tinker.Datum(
    model_input=input_tokens,
    loss_fn_inputs={
        "target_tokens": TensorData.from_torch(torch.tensor(target_tokens)),
        "logprobs": TensorData.from_torch(torch.tensor(sampling_logprobs)),  # Reference logprobs
        "advantages": TensorData.from_torch(torch.tensor(advantages)),
    }
)

# Option 1: Use importance sampling REINFORCE
fwd_bwd_result = await training_client.forward_backward_async(
    [datum], loss_fn="importance_sampling"
)

# Option 2: Use PPO with clipping
fwd_bwd_result = await training_client.forward_backward_async(
    [datum], loss_fn="ppo"
)
```

## Basic loss functions

Currently, the Tinker API supports `cross_entropy` for supervised learning, and `importance_sampling`, `ppo`, `cispo`, and `dro` for RL. We denote the training model as $p_{\theta}$, the sampling distribution as $q$, and advantages as $A$. For notational simplicity, we also omit the query and denote the full model completion sequence of tokens as $x$.

All losses are applied at the token level and tensors below have shape `(N,)` where `N` is `model_input.length`.
They can be provided as `numpy.ndarray` or `torch.Tensor`, and the return values will use the same tensor type.

### Supervised learning: `cross_entropy`

For SL, we implement the standard cross-entropy loss (i.e., negative log-likelihood), which optimizes the policy $p_\theta$ to maximize the log-probability of the tokens $x$:

$$
\mathcal{L}(\theta) = -\mathbb{E}_x[\log p_\theta(x)]
$$

Each token's loss term is multiplied by `weights`, which is either 0 or 1 and is typically generated from `renderer.build_supervised_example()`, which returns `(model_input, weights)` (i.e., to specify the desired assistant turns to train on).

This is implemented as:

```python
# Apply weights and compute elementwise loss
elementwise_loss = -target_logprobs * weights
# Apply sum reduction to get the total loss
loss = elementwise_loss.sum()  # scalar
```

- **Input tensors:**
  - `target_tokens: array[(N,), int]` - Target token IDs
  - `weights: array[(N,), float]` - Token-level loss weights (typically from the renderer)
- **Output tensors:**
  - `logprobs: array[(N,), float]` - Log probabilities of predicted tokens
- **Output diagnostics:**
  - `loss:sum` (scalar) - Sum of weighted cross-entropy losses

### Policy gradient: `importance_sampling`

For RL, we implement a common variant of the policy gradient objective, used in practical settings where the *learner policy* $p$ may differ from the *sampling policy* $q$, which is common due to, e.g., [non-determinism](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). The issue is that if these policies differ, then the objective:

$$
\mathcal{L}(\theta) = \mathbb{E}_{x\sim p_\theta}\bigl[A(x)\bigr]
$$

is not computed in an unbiased way due to $x \sim q$ (sampler) not exactly matching the desired $x \sim p_\theta$ (learner).
To correct the bias, we use a modified "importance sampling" objective:

$$
\mathcal{L}_{\text{IS}}(\theta) = \mathbb{E}_{x\sim q}\Bigl[\frac{p_\theta(x)}{q(x)}A(x)\Bigr],
$$

which yields the correct expected reward. In the formula above:

- $\log p_\theta(x)$ – `target_logprobs` is from the learner, on the forward part of the `forward_backward` pass.
- $\log q(x)$ – `sampling_logprobs` is from the sampler, recorded during sampling as a correction term.

This is implemented as:

```python
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Compute importance-weighted loss
loss = -(prob_ratio * advantages).sum()
```

- **Input tensors:**
  - `target_tokens: array[(N,), int]` - Target token IDs (from the sampler $q$)
  - `logprobs: array[(N,), float]` - `sampling_logprobs` for the tokens
  - `advantages: array[(N,), float]` - Advantage values for RL (positive to reinforce, negative to discourage)
- **Output tensors:**
  - `logprobs: array[(N,), float]` - `target_logprobs` for the tokens
- **Output diagnostics:**
  - `loss:sum` (scalar) - Sum of importance-weighted policy gradient losses $\mathcal L_{\text{IS}}$

### Proximal Policy Optimization: `ppo`

PPO ([Schulman et al., 2017](https://arxiv.org/abs/1707.06347)) addresses issues with standard policy gradient methods by introducing a clipping objective that limits policy updates within a close neighborhood of the sampling distribution. This prevents updates that are too large in policy space, especially when taking multiple gradient steps on the same rollout distribution.

The objective clips the importance ratio $\frac{p_\theta(x)}{q(x)}$ to prevent large policy updates, where $p_\theta$ is the learner policy and $q$ is the sampling policy.
Note that the PPO clipping and loss computation are applied token-wise, computing the loss for each token independently.

The PPO clipping objective is:

$$
\mathcal{L}_{\text{CLIP}}(\theta) = -\mathbb{E}_{x \sim q}\left[\text{clip}\left(\frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right) \cdot A(x)\right]
$$

The final PPO loss combines the clipped and unclipped objectives:

$$
\mathcal{L}_{\text{PPO}}(\theta) = -\mathbb{E}_{x \sim q}\left[\min\left(\frac{p_\theta(x)}{q(x)} \cdot A(x), \text{clip}\left(\frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right) \cdot A(x)\right)\right]
$$

where $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ are hyperparameters (currently fixed to 0.2 in Tinker).

This is implemented as:

```python
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Apply clipping
clipped_ratio = torch.clamp(prob_ratio, clip_low_threshold, clip_high_threshold)
# Compute both objectives
unclipped_objective = prob_ratio * advantages
clipped_objective = clipped_ratio * advantages
# Take minimum (most conservative)
ppo_objective = torch.min(unclipped_objective, clipped_objective)
# PPO loss is negative of objective
loss = -ppo_objective.sum()
```


**Example with custom clipping thresholds:**
```python
fwd_bwd_result = await training_client.forward_backward_async(
    data=data,
    loss_fn="ppo",
    loss_fn_config={"clip_low_threshold": 0.9, "clip_high_threshold": 1.1}
)
```

**Additional Notes:**
- The loss formulation above is quite general, since the user can organize the data generation and advantage estimation in their own code.
  For example, the main RL training scripts in the Tinker Cookbook use group-based rollouts with per-group advantage centering similar to GRPO ([Shao et al., 2024](https://arxiv.org/abs/2402.03300)).
- The functional implementations of REINFORCE and PPO do not use an additional KL term like the original GRPO work, which has been noted to be mathematically inconsistent ([Zhang et al., 2025](https://arxiv.org/abs/2505.17508); [Tang et al., 2025](https://arxiv.org/abs/2506.09477)). However, it is possible to include a KL regularization term as part of the reward, which is mathematically correct, and we provide this option in our RL training <CookbookLink path="tinker_cookbook/rl/train.py">code and examples</CookbookLink> (see the `incorporate_kl_penalty` function).
- Notice that for all objectives we sum the token-level losses over the sequence length, unlike some other loss implementations. If you would like to explore different aggregation schemes, you can include that in the advantage tensor computation.

### Clipped Importance Sampling Policy Optimization: `cispo`

CISPO ([Chen et al., 2024](https://arxiv.org/abs/2506.13585); [Khatri et al., 2024](https://arxiv.org/abs/2510.13786)) is a policy gradient method that uses a clipped importance ratio as a coefficient for the policy gradient. Unlike PPO, which clips the objective directly, CISPO clips the ratio and uses it to weight the log probability.
Mathematically, the CISPO objective is:

$$
\mathcal{L}_{\text{CISPO}}(\theta) = \mathbb{E}_{x \sim q}\left[\textbf{sg}\left( \text{clip}\left(\frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\right) \right) \cdot \log p_\theta(x) \cdot A(x)\right]
$$

where $\textbf{sg}(\cdot)$ denotes the stop-gradient operator (the clipped ratio is detached in the implementation).

This is implemented as:

```python
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Apply clipping
clipped_ratio = torch.clamp(prob_ratio, clip_low_threshold, clip_high_threshold)
# Compute CISPO objective (detach the clipped ratio)
cispo_objective = clipped_ratio.detach() * target_logprobs * advantages
# CISPO loss is negative of objective
loss = -cispo_objective.sum()
```


Similarly to the PPO objective, you can pass loss function parameters in the following way:

```python
fwd_bwd_result = await training_client.forward_backward_async(
    data=data,
    loss_fn="cispo",
    loss_fn_config={"clip_low_threshold": 0.8, "clip_high_threshold": 1.2}
)
```

### Direct Reward Optimization: `dro`

DRO ([Richemond et al., 2024](https://arxiv.org/abs/2405.19107); [Kimi Team et al., 2025](https://arxiv.org/abs/2501.12599)) is a general off-policy (and even offline) reinforcement learning method that uses a quadratic penalty term to constrain the policy update.
Notice that this loss uses a different (soft) formulation of the advantage estimation, which needs to be implemented on the client side.
The DRO objective is:

$$
\mathcal{L}_{\text{DRO}}(\theta) = \mathbb{E}_{x \sim q}\left[\log p_\theta(x) \cdot A(x) - \frac{1}{2}\beta \left(\log \frac{p_\theta(x)}{q(x)}\right)^2\right]
$$


This is implemented as:

```python
# Compute quadratic penalty term
quadratic_term = (target_logprobs - sampling_logprobs) ** 2
# Compute DRO objective
dro_objective = target_logprobs * advantages - 0.5 * beta * quadratic_term
# DRO loss is negative of objective
loss = -dro_objective.sum()
```

As with the other objectives, you can specify the loss hyperparameter as:

```python
fwd_bwd_result = await training_client.forward_backward_async(
    data=data,
    loss_fn="dro",
    loss_fn_config={"beta": 0.05}
)
```

## Flexible loss functions: `forward_backward_custom`

For use cases outside of the above, we've provided the more flexible (but slower) methods `forward_backward_custom` and `forward_backward_custom_async` to compute a more general class of loss functions.

### Usage

Here's a simple example of a custom loss function:

```python
import torch
from tinker import Datum

def logprob_squared_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    # `logprobs` is a list of per-sequence tensors, so flatten it before squaring
    loss = (torch.cat(logprobs) ** 2).sum()
    return loss, {"logprob_squared_loss": loss.item()}
```

You can call this loss function with `forward_backward_custom` like:

```python
loss, metrics = training_client.forward_backward_custom(data, logprob_squared_loss)
```

You can also define loss functions which operate on multiple sequences at a time.
For example, a loss function that computes the variance across the sequences (although practically useless) can be implemented as:

```python
def variance_loss(data: list[Datum], logprobs: list[torch.Tensor]) -> tuple[torch.Tensor, dict[str, float]]:
    flat_logprobs = torch.cat(logprobs)
    variance = torch.var(flat_logprobs)
    return variance, {"variance_loss": variance.item()}
```

A more practical use case would be to compute a Bradley-Terry loss on pairwise comparison data -- a classic approach in RL from human feedback, as introduced and popularized by [Learning to Summarize](https://arxiv.org/abs/2009.01325). Similarly, we can also implement [Direct Preference Optimization](https://arxiv.org/abs/2305.18290), which also computes a loss involving pairs of sequences; see the [DPO guide](/preferences/dpo-guide) for more details.

If you're using a custom loss function that you think is generally useful, please let us know, and we'll add it to the list of built-in loss functions.

We detail the `async` version of methods in the [Async and Futures](./async) section of these docs.

### How `forward_backward_custom` works

---

## File: publish-weights.mdx

# Publishing weights

If you've trained a model that you'd like to share with the community, you can
publish any number of checkpoints you've previously saved.

Once published, your checkpoint can be loaded by any tinker user and used to
further train a new model or be sampled against.

### Publishing

```bash
tinker checkpoint publish $TINKER_CHECKPOINT_PATH
```

where `$TINKER_CHECKPOINT_PATH` is a checkpoint path in the form of `tinker://14bdf3a1-0b95-55c7-8659-5edb1bc870af:train:17/weights/checkpoint_id_to_publish`.

You may confirm your checkpoint is published by dumping the checkpoint info and checking the `Public` property:

```bash
tinker checkpoint info \
tinker://14bdf3a1-0b95-55c7-8659-5edb1bc870af/weights/checkpoint_id_to_publish
Checkpoint: weights/checkpoint_id_to_publish
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Property        ┃ Value                                                                          ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Checkpoint ID   │ weights/checkpoint_id_to_publish                                               │
│ Type            │ training                                                                       │
│ Tinker Path     │ tinker://14bdf3a1-0b95-55c7-8659-5edb1bc870af/weights/checkpoint_id_to_publish │
│ Size            │ 342.4 MB                                                                       │
│ Public          │ No                                                                             │
│ Created         │ 23 minutes ago                                                                 │
│ Training Run ID │ 14bdf3a1-0b95-55c7-8659-5edb1bc870af                                           │
└─────────────────┴────────────────────────────────────────────────────────────────────────────────┘
```

### Unpublishing

```bash
tinker checkpoint unpublish $TINKER_CHECKPOINT_PATH
```

### Loading public weights

Loading public weights is exactly the same as loading a non-public one:

```python
ckpt_path = ...
training_client = service_client.create_training_client_from_state(ckpt_path)
```


---

## File: supervised-learning.mdx

import { CookbookLink } from '../components/CookbookLink'

# Cookbook: Supervised learning

This section takes you through examples from the Tinker Cookbook that relate to supervised learning.

In general, supervised learning (SL) means learning an input-output mapping from labeled data. In the context of language model fine-tuning, this means **minimizing a weighted cross-entropy loss** on token sequences---equivalently, maximizing the log-probability of the specified target tokens.

There are a few ways that SL is commonly used in LLM fine-tuning pipelines:

- *Instruction tuning*: This is the first step in post-training pipelines, applied to the base (raw, pretrained) model.
  Typically, we do SL on a high-quality dataset that demonstrates the correct format and style, while boosting the model's reasoning and instruction-following.
- *Context distillation* / *prompt distillation*: let's say we have a generic model that can do chat / instruction following / reasoning, but we want to adjust how it behaves in a certain scenario. We can add some instructions to the system message of our model. However, the system message might grow impractically long, and the model might start ignoring some of its instructions. So it's often better to create a supervised dataset on a narrow prompt distribution, with a shorter set of instructions that are targeted at these prompts.

We'll cover both of these use cases in this documentation and related Cookbook code.

The library code implementing supervised learning can be found in the <CookbookLink path="tinker_cookbook/supervised">`supervised`</CookbookLink> directory.


---

## File: preferences.mdx

import { CookbookLink } from '../components/CookbookLink'

# Preferences

# Learning from Preferences

In this section, we focus on learning from **pairwise feedback**, where we have preference data indicating which of two completions is better for a given prompt. This kind of feedback is a natural fit for tasks where there's not a simple correctness criterion that can be computed programmatically. These preferences might be collected from human evaluators or generated by a model.

## Two Approaches to Preference Learning

When you have pairwise preference data, there are two main approaches:

1. **Direct Preference Optimization (DPO)**: Directly update the policy to prefer chosen responses over rejected ones, without needing a separate reward model. This is simpler and computationally cheaper. See the [DPO Guide](/preferences/dpo-guide) for details.

2. 
   **Reinforcement Learning from Human Feedback (RLHF)**: Train a reward model on preference data, then use reinforcement learning to optimize the policy against this reward model. This two-stage approach provides more flexibility. See the [RLHF example](/preferences/rlhf-example) for details.


---

## File: docs-outline.mdx

# Navigating these docs

These docs provide guides to both Tinker and the Tinker Cookbook.

The first half, "Using the Tinker API", walks you through the fundamentals of Tinker:

- [Installation](./install) explains how to install both `tinker` and `tinker-cookbook`, and points you to the Tinker Console for your API key.
- [Training and Sampling](./training-sampling) takes you through your first training run: setting up your training data, performing the run, and sampling from the model to test the run.
- [Loss Functions](./losses) starts to get into the detail. Tinker supports a variety of built-in loss functions, but also allows you to use arbitrary differentiable loss functions.
- [Saving and Loading](./save-load) explains the checkpoint types available in Tinker, and how to restart a run from a checkpoint.
- [Async and Futures](./async) explains Tinker's `sync` and `async` API variants, and how Futures serve as the structure for Tinker's requests.
- [Model Lineup](./model-lineup) is regularly updated with the models available to fine-tune in Tinker.

The second half, "The Tinker Cookbook", provides recipes for how to use the Tinker API for research and applications. You are welcome to adapt these directly for your own use cases.

- [Rendering](./rendering) explains how we convert from a conversation data structure to a list of tokens.
- [Supervised Learning](./supervised-learning) explains basic SL and walks you through your first SL training loop. We make some suggestions for hyperparameter selection and detail how you can run your own hyperparameter sweep.
  We also show you how to perform prompt distillation.
- [Reinforcement Learning](./rl) explains the basics of RL and walks you through your first RL run. We explain and provide code for creating your own RL environments and training on them. We provide a simple training loop for you to use and adapt, and explain RL hyperparameters and loss functions in detail.
- [Preferences](./preferences) is a guide to learning from pairwise feedback, where we have preference data indicating which of two completions is better for a given prompt. We walk you through two approaches to learning from pairwise preference data: direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF).
- [Evaluations](./evals) explains how you can use Tinker's outputs to run inline and offline evals on your runs.
- [Completers](./completers) explains how Tinker implements policies, and provides two examples of how to use these in training.
- [LoRA Primer](./lora-primer) explains the basic background of LoRA, and how to choose hyperparameters.


---

## File: lora-primer.mdx

# LoRA Primer

Tinker supports [LoRA fine-tuning](https://arxiv.org/abs/2106.09685), which adjusts a small number of parameters, rather than full fine-tuning, which adjusts all of the parameters of the original model.

Our current understanding is that LoRA has equivalent performance to full fine-tuning when doing RL or doing SL on small datasets, while it has worse performance on larger datasets. In more detail:

- For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.
- For datasets that exceed LoRA capacity, LoRA underperforms FullFT.
  Rather than the loss reaching a distinct floor that it can’t go below, LoRA results in worse training efficiency that depends on the relationship between model capacity and dataset size.
- In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning: it pays a larger penalty in loss as batch size increases beyond some point. This penalty is not mitigated by increasing the LoRA rank; it is a property of the product-of-matrices parametrization, which has different training dynamics than optimizing the original weight matrix.
- Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.
- LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.

See [LoRA Without Regret](https://thinkingmachines.ai/blog/lora) for more details and experimental results.

## Hyperparameters

The learning rate (LR) is usually the most important hyperparameter in your ML experiments.


LoRA requires a much larger LR than full fine-tuning---typically 20-100x larger, depending on model size.
People often mistakenly retain their full fine-tuning LR when they port their code to use LoRA, leading them to conclude that LoRA works poorly.

**Calculate the correct LoRA learning rate:**

We've provided a utility that calculates the factor you should scale the full fine-tuning LR by to get the equivalent LoRA LR:

```python
from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

model_name = "meta-llama/Llama-3.1-8B"
print(get_lora_lr_over_full_finetune_lr(model_name))
```

Note that for `Llama-3.2-1B`, the factor is 32, while for `Llama-3.1-70B`, the factor is 128.

## What is LoRA exactly?

LoRA is short for Low-Rank Adaptation. Given that the original model has a weight matrix $W$, we replace it with a new weight matrix $W'=W + BA$, where $B$ and $A$ are low-rank matrices. If $W$ is an $n \times n$ matrix, then $B$ and $A$ are $n \times r$ and $r \times n$ matrices, respectively, where $r$ is the rank of the low-rank approximation. The default $r$ used by tinker is $32$.

The fact that LoRA uses a low-rank approximation of weight matrices is not terribly important. We prefer to think of LoRA as just a random projection of the parameter space that happens to be efficient to implement. When training with RL or small SL datasets, we are only learning a small amount of information, and this reduced set of parameters is more than enough.


## What rank to use?

The default rank used by tinker is $32$. However, if you're doing SL on a large dataset, you should use a larger rank. For supervised learning, as a very rough approximation, LoRA will give good results as long as the number of LoRA parameters is at least as large as the number of completion tokens (i.e., weight=1 tokens).
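
As a rough illustration of this rule of thumb, the sketch below counts the adapter parameters for a single weight matrix and compares a total against a completion-token count. It assumes the $W' = W + BA$ parametrization described above, with $B$ of shape $d_\text{out} \times r$ and $A$ of shape $r \times d_\text{in}$. The helper names, the matrix shapes, and the token count are hypothetical, and this is not how the Cookbook's `get_lora_param_count` is implemented.

```python
def lora_params_for_matrix(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


def has_enough_capacity(num_lora_params: int, num_completion_tokens: int) -> bool:
    """Very rough rule of thumb from the text: LoRA params >= completion (weight=1) tokens."""
    return num_lora_params >= num_completion_tokens


# Hypothetical example: 100 adapted square matrices of size 4096 x 4096 at rank 32.
per_matrix = lora_params_for_matrix(4096, 4096, rank=32)
total = 100 * per_matrix
print(per_matrix, total, has_enough_capacity(total, num_completion_tokens=10_000_000))
```
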
You can calculate the number of LoRA parameters with the following utility:

```python
from tinker_cookbook.hyperparam_utils import get_lora_param_count

model_name = "meta-llama/Llama-3.1-8B"
print(get_lora_param_count(model_name, lora_rank=32))
```

For reinforcement learning, we've found that small ranks give equivalent performance to larger ranks and full fine-tuning.

Note that conveniently, the optimal learning rate does *not* depend on the LoRA rank. In fact, you can verify that if you train with SL on different ranks (but with the same LR), you'll get exactly the same learning curves for the first few steps of training.


---

## File: evals.mdx

import { Callout } from 'nextra/components'
import { CookbookLink } from '../components/CookbookLink'

# Evaluations

Our training scripts will print out training and test loss. Two common workflows for evaluations are to do inline evals during training and to do offline evals on various checkpoints from a run.

## Inline Evals

You can add inline evaluations to your training runs by configuring evaluator builders in advance for both supervised fine-tuning and RL training jobs.

### Supervised Fine-Tuning (`supervised.train`)
Add one or both of the following to your config:

- **`evaluator_builders: list[EvaluatorBuilder]`** - Runs evaluations every `eval_every` steps
- **`infrequent_evaluator_builders: list[EvaluatorBuilder]`** - Runs evaluations every `infrequent_eval_every` steps

### RL Training (`rl.train`)

Add the following to your config:

- **`evaluator_builders: list[SamplingClientEvaluator]`** - Runs evaluations every `eval_every` steps

For implementation guidance and a detailed example, see <CookbookLink path="tinker_cookbook/eval/evaluators.py">here</CookbookLink> and
<CookbookLink path="tinker_cookbook/eval/inspect_evaluators.py">here</CookbookLink> respectively.


## Offline evals

We support and recommend
several ways for creating and running your offline evaluations on your model checkpoints.

### Running Standard Evaluations with Inspect AI

We support running many of the standard cited evaluations using the [Inspect AI library](https://github.com/UKGovernmentBEIS/inspect_ai).

We have provided a <CookbookLink path="tinker_cookbook/eval/run_inspect_evals.py">script</CookbookLink> to evaluate models using Tinker's internal sampling functionality as shown below.

```bash
# Replace MODEL_PATH, MODEL_NAME, and RENDERER_NAME with your own values.
MODEL_PATH=tinker://FIXME # YOUR MODEL PATH HERE
python -m tinker_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=MODEL_NAME \
    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
    renderer_name=RENDERER_NAME
```

Click [here](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/docs/evals/listing.yml) to view additional supported evaluations.

### Creating your own Sampling Evaluations

We recommend two ways to create your own evaluations:
- creating your own tasks with Inspect AI and running like above
- creating your own SamplingClientEvaluator

#### Create tasks with Inspect AI

In addition to passing in standard evaluations, you can create your own tasks using Inspect AI as detailed [here](https://inspect.aisi.org.uk/tasks.html).

Here is a toy example of how to create an evaluation with an LLM-as-a-judge where we use a model produced by tinker as a grader.

```python
import tinker
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.model import GenerateConfig as InspectAIGenerateConfig
from inspect_ai.model import Model as InspectAIModel
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate
from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling

QA_DATASET = MemoryDataset(
    name="qa_dataset",
    samples=[
        Sample(
            input="What is the "
capital of France?",583target="Paris",584),585Sample(586input="What is the capital of Italy?",587target="Rome",588),589],590)591592service_client = tinker.ServiceClient()593sampling_client = service_client.create_sampling_client(594base_model="meta-llama/Llama-3.1-8B-Instruct"595)596597api = InspectAPIFromTinkerSampling(598renderer_name="llama3",599model_name="meta-llama/Llama-3.1-8B-Instruct",600sampling_client=sampling_client,601verbose=False,602)603604GRADER_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())605606607@task608def example_lm_as_judge() -> Task:609"""610Example task using LLM-as-a-judge scoring.611612Note: The grader model defaults to the model being evaluated.613To use a different grader model, specify it with --model-grader when using inspect directly.614"""615return Task(616name="llm_as_judge",617dataset=QA_DATASET,618solver=generate(),619scorer=model_graded_qa(620instructions="Grade strictly against the target text as general answer key and rubric. "621"Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",622partial_credit=False,623# model parameter is optional - if not specified, uses the model being evaluated624model=GRADER_MODEL,625),626)627```628629Inspect also natively supports replacing our `GRADER_MODEL` with any openai-chat-completion style api (e.g. openrouter).630631#### Create your own SamplingClientEvaluator632633Alternatively, you can create your own SamplingClientEvaluator class instead of using Inspect AI. This is a lower634level abstraction than the above with finer-grain control over running your evaluations.635636We expose this to interface to allow users more control over their datasets and metrics. 
To illustrate, see this <CookbookLink path="tinker_cookbook/eval/custom_evaluators.py">custom evaluators</CookbookLink> example of how one might create their own complex SamplingClientEvaluator.

For a simpler toy example, see below.

```python
from typing import Any, Callable

import tinker
from tinker import types

from tinker_cookbook import renderers
from tinker_cookbook.evaluators import SamplingClientEvaluator
from tinker_cookbook.tokenizer_utils import get_tokenizer

class CustomEvaluator(SamplingClientEvaluator):
    """
    A toy SamplingClientEvaluator that runs a custom evaluation and returns its metrics.
    """

    def __init__(
        self,
        dataset: Any,
        grader_fn: Callable[[str, str], bool],
        model_name: str,
        renderer_name: str,
    ):
        """
        Initialize the CustomEvaluator.
        Args:
            dataset: Iterable of dicts with "input" and "output" keys
            grader_fn: Returns True if a response matches its target
            model_name: Model whose tokenizer is used for rendering
            renderer_name: Name of the renderer used to build prompts
        """
        self.dataset = dataset
        self.grader_fn = grader_fn

        tokenizer = get_tokenizer(model_name)
        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)

    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
        """
        Run custom evaluation on the given sampling client and return metrics.
        Args:
            sampling_client: The sampling client to evaluate
        Returns:
            Dictionary of metrics from the custom evaluation
        """

        metrics = {}

        num_examples = len(self.dataset)
        num_correct = 0

        sampling_params = types.SamplingParams(
            max_tokens=100,
            temperature=0.7,
            top_p=1.0,
            stop=self.renderer.get_stop_sequences(),
        )

        for datum in self.dataset:
            model_input: types.ModelInput = self.renderer.build_generation_prompt(
                [renderers.Message(role="user", content=datum["input"])]
            )
            # Generate response
            r: types.SampleResponse = await sampling_client.sample_async(
                prompt=model_input, num_samples=1, sampling_params=sampling_params
            )
            tokens: list[int] = r.sequences[0].tokens
            response: renderers.Message = self.renderer.parse_response(tokens)[0]
            if self.grader_fn(response["content"], datum["output"]):
                num_correct += 1

        metrics["accuracy"] = num_correct / num_examples
        return metrics
```

Here is an example of how we can use the above CustomEvaluator on a toy dataset and grader.

```python
import asyncio

import tinker

QA_DATASET = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is the capital of Germany?", "output": "Berlin"},
    {"input": "What is the capital of Italy?", "output": "Rome"},
]

def grader_fn(response: str, target: str) -> bool:
    # Case-insensitive substring match against the target answer
    return target.lower() in response.lower()

evaluator = CustomEvaluator(
    dataset=QA_DATASET,
    grader_fn=grader_fn,
    renderer_name="llama3",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
)

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(base_model="meta-llama/Llama-3.1-8B-Instruct")

async def main():
    result = await evaluator(sampling_client)
    print(result)

asyncio.run(main())
```


---

## File: dev-tips.mdx

# Developer Tips

## AI-assisted development

We've provided single-file versions of the documentation that can be fed to LLMs for development: see [llms.txt](/llms.txt) and [llms-full.txt](/llms-full.txt).


---

## File: async.mdx

# Async and Futures

## Sync and Async APIs

Every method in the Tinker Python library has both a synchronous (sync) and an asynchronous (async) version.
The async variants end with `_async`:

| **Client** | **Sync method** | **Async method** |
|---|---|---|
| `ServiceClient` | `create_lora_training_client()` | `create_lora_training_client_async()` |
| `TrainingClient` | `forward()` | `forward_async()` |
| `SamplingClient` | `sample()` | `sample_async()` |
| `RestClient` | `list_training_run_ids()` | `list_training_run_ids_async()` |

Tinker's `async` functionality requires an `asyncio` event loop, which you typically run like `asyncio.run(main())`.

**When to use each:**

- **Async:** Best for high-performance workflows where you need concurrency, especially when waiting on multiple network calls.
- **Sync:** Simpler for scripts and learning examples. Easier to reason about but blocks on each operation.

The Tinker Cookbook generally uses `async` for implementations where performance is critical and sync for pedagogical examples.

## Understanding Futures

Most Tinker API methods are **non-blocking**, but may take a little while to run. They return immediately with a `Future` object that acknowledges that your request has been submitted. To get the actual result, you must explicitly wait:

**Sync Python:**
```python
future = client.forward_backward(data, loss_fn)
result = future.result() # Blocks until complete
```

**Async Python (note the double await):**
```python
future = await client.forward_backward_async(data, loss_fn)
result = await future
```

After the first `await`, you're guaranteed that the request has been submitted, which ensures that it'll be ordered correctly relative to other requests. The second `await` waits for the actual computation to finish and returns the numerical outputs.
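
The two-stage wait can be mimicked without any Tinker calls. The sketch below is a stand-in, not the Tinker implementation: `FakeClient` and its `forward_backward_async` method are hypothetical names backed by plain `asyncio` tasks, used only to show that the first `await` fixes the submission order while the second `await` yields the result.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for a remote client; not part of the Tinker SDK."""

    def __init__(self) -> None:
        self.submitted: list[str] = []  # order in which requests were accepted

    async def forward_backward_async(self, name: str, delay: float) -> asyncio.Task:
        # First stage: record the submission and start the work in the background.
        self.submitted.append(name)
        return asyncio.create_task(self._compute(name, delay))

    async def _compute(self, name: str, delay: float) -> str:
        # Second stage: simulate the remote computation finishing later.
        await asyncio.sleep(delay)
        return f"{name}: done"


async def demo() -> tuple[list[str], list[str]]:
    client = FakeClient()
    # First await: both requests are submitted, in this order.
    slow = await client.forward_backward_async("slow", 0.02)
    fast = await client.forward_backward_async("fast", 0.0)
    # Second await: wait until each computation has actually finished.
    results = [await slow, await fast]
    return client.submitted, results


submitted, results = asyncio.run(demo())
print(submitted)
print(results)
```

Even though the "fast" request finishes first, the submission order is determined by the first `await` on each call, which is the property the ordering guarantee relies on.
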
For operations like `forward_backward`, the second `await` also guarantees that the operation has been applied to the model; for `forward_backward`, this means that the gradients have been accumulated in the model's optimizer state.

## Performance tips: overlap requests

For best performance, you should aim to submit your next request while the current one is running. Doing so is more important with Tinker than with other training systems because Tinker training runs on discrete [clock cycles](./under-the-hood#clock-cycles) (~10 seconds each). If you don't have a request queued when a cycle starts, you'll miss that cycle entirely.

**Example pattern for overlapping forward_backward and optim_step:**
```python
# Submit forward_backward
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)

# Submit optim_step immediately (don't wait for forward_backward to finish)
optim_future = await client.optim_step_async(adam_params)

# Now retrieve results
fwd_bwd_result = await fwd_bwd_future
optim_result = await optim_future
```

This pattern ensures both operations are queued and can be processed in the same [clock cycle](./under-the-hood#clock-cycles).
In contrast, if you waited for `forward_backward` to complete before submitting `optim_step`, you might miss the next [clock cycle](./under-the-hood#clock-cycles).


---

## File: download-weights.mdx

# Downloading weights

### CLI

```bash
tinker checkpoint download $TINKER_CHECKPOINT_PATH
```

See `tinker checkpoint download --help` for more details.

### SDK

You can also download checkpoints using the SDK.

Example:

```python
import tinker
import urllib.request

sc = tinker.ServiceClient()
rc = sc.create_rest_client()
future = rc.get_checkpoint_archive_url_from_tinker_path("tinker://<unique_id>/sampler_weights/final")
checkpoint_archive_url_response = future.result()

# `checkpoint_archive_url_response.url` is a signed URL that can be downloaded
# until checkpoint_archive_url_response.expires
urllib.request.urlretrieve(checkpoint_archive_url_response.url, "archive.tar")
```

Replace `<unique_id>` with your Training Run ID. This will save the LoRA adapter weights and config inside the `archive.tar` file.


---

## File: overview-building.mdx

# Overview: Tinker Cookbook

The next sections provide a variety of guides for how to use the Tinker API for research and applications.

We expect people to use Tinker in a few different ways:

1. You want to define datasets and environments and plug them into existing training code from the Tinker Cookbook.
2. You want to write your own training loops from scratch, starting with the basics.
3. You want to understand the classes and other concepts in Tinker Cookbook so you can extend them to add new functionality.

Different parts of the docs will be tailored to these different approaches.

We'll start with a couple of general pages that'll be relevant to almost all of the use cases:

- [Rendering to Tokens](./rendering.mdx) -- how we convert from a conversation data structure to a list of tokens (a.k.a. chat templates).
- [LoRA Primer](./lora-primer.mdx) -- basic background of LoRA, and how to choose hyperparameters. For most fine-tuning applications, LoRA will give results that are roughly the same as full fine-tuning; however, you need to use different learning rates.


---

## File: save-load.mdx

# Saving and loading weights and optimizer state

During training, you'll need to save checkpoints for two main purposes: *sampling* (to test your model) and *resuming training* (to continue from where you left off). The `TrainingClient` provides three methods to handle these cases:

1. `save_weights_for_sampler()`: saves a copy of the model weights that can be used for sampling.
2. `save_state()`: saves the weights and the optimizer state. You can fully resume training from this checkpoint.
3. `load_state()`: loads the weights and the optimizer state, so you can fully resume training from that checkpoint.

Note that (1) is faster and requires less storage space than (2).

Both `save_*` functions require a `name` parameter: a string that you can set to identify the checkpoint within the current training run. For example, you can name your checkpoints `"0000"`, `"0001"`, `"step_1000"`, etc.

The return value contains a `path` field, a fully qualified path that looks something like `tinker://<model_id>/<name>`.
This path is persistent and can be loaded later by a new `ServiceClient` or `TrainingClient`.

### Example: Saving for sampling

```python
# Setup
import tinker
service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.2-1B", rank=32
)

# Save a checkpoint that you can use for sampling
sampling_path = training_client.save_weights_for_sampler(name="0000").result().path

# Create a sampling client with that checkpoint
sampling_client = service_client.create_sampling_client(model_path=sampling_path)
```

**Shortcut:** Combine these steps with:

```python
sampling_client = training_client.save_weights_and_get_sampling_client(name="0000")
```

### Example: Saving to resume training

Use `save_state()` and `load_state()` when you need to pause and continue training with full optimizer state preserved:

```python
# Save a checkpoint that you can resume from
resume_path = training_client.save_state(name="0010").result().path

# Load that checkpoint
training_client.load_state(resume_path)
```

### When to use `save_state()` and `load_state()`:

- Multi-step training pipelines (e.g. supervised learning followed by reinforcement learning)
- Adjusting hyperparameters or data mid-run
- Recovery from interruptions or failures
- Any scenario where you need to preserve exact optimizer state (momentum, learning rate schedules, etc.)


---

## File: training-sampling.mdx

import { Callout } from 'nextra/components'

# Getting started with training and sampling

In this guide, we'll walk you through using the Tinker Python library to do the basic operations needed for training and sampling.
[View the complete Python script →](/quickstart.py.txt)

## Creating the training client

The main object we'll be using is the `TrainingClient`, which corresponds to a fine-tuned model that we can train and sample from.

First, set your Tinker API key environment variable. In the terminal where you'll run Python, or in your `.bashrc`, put `export TINKER_API_KEY=<your key>`.

Then, create a `ServiceClient`. This lets you find out what base models are available to be fine-tuned.

```python
import tinker
service_client = tinker.ServiceClient()
print("Available models:")
for item in service_client.get_server_capabilities().supported_models:
    print("- " + item.model_name)
```
You'll see a list of model names:
```
- meta-llama/Llama-3.1-70B
- meta-llama/Llama-3.1-8B
...
- Qwen/Qwen3-VL-30B-A3B-Instruct
- Qwen/Qwen3-VL-235B-A22B-Instruct
```
We currently support models from the Qwen3, Qwen3-VL, and Llama3 series. We'll use Qwen3-VL-30B-A3B-Instruct for these examples, as it's a vision-language model that can also handle text-only tasks. See [Available Models in Tinker](/model-lineup) for the full list.

Now we can create the `TrainingClient`:
```python
base_model = "Qwen/Qwen3-VL-30B-A3B-Instruct"
training_client = service_client.create_lora_training_client(
    base_model=base_model
)
```
As the name suggests, this model was already finetuned for chat/instruction-following.
You should check the details of the model you're using in its system card.

## Preparing the training data

Now we can do training updates on the model. This quickstart example won't show best practices for LLM fine-tuning; it's just an API demo. Check out [Rendering](/rendering), [Supervised Fine-tuning](/supervised-learning) and the other Cookbook examples for guidance on how to use Tinker in real applications.

For this example, we'll train a model that can translate words into Pig Latin. The rules for Pig Latin are simple:
- If a word begins with a consonant, move the leading consonant to the end and add "ay"
- If a word begins with a vowel, just add "way" to the end

Here are some example completions we'd like the model to perform, where the prompt is in green and the model's completion is in red:

<div className="example">
<span className="prompt">English: hello world<br/>
Pig Latin: </span><span className="completion">ello-hay orld-way</span>
</div>

Let's create some training examples and convert them to a format expected by Tinker.

```python
# Create some training examples
examples = [
    {
        "input": "banana split",
        "output": "anana-bay plit-say"
    },
    {
        "input": "quantum physics",
        "output": "uantum-qay ysics-phay"
    },
    {
        "input": "donut shop",
        "output": "onut-day op-shay"
    },
    {
        "input": "pickle jar",
        "output": "ickle-pay ar-jay"
    },
    {
        "input": "space exploration",
        "output": "ace-spay exploration-way"
    },
    {
        "input": "rubber duck",
        "output": "ubber-ray uck-day"
    },
    {
        "input": "coding wizard",
        "output": "oding-cay izard-way"
    },
]

# Convert examples into the format expected by the training client
from tinker import types

# Get the tokenizer from the training client
tokenizer = training_client.get_tokenizer()

def process_example(example: dict, tokenizer) -> types.Datum:
    # Format the input with Input/Output template
    # For most real use cases, you'll want to use a renderer / chat template,
    # (see later docs) but here, we'll keep it simple.
    prompt = f"English: {example['input']}\nPig Latin:"

    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    prompt_weights = [0] * len(prompt_tokens)
    # Add a space before the output string, and finish with double newline
    completion_tokens = tokenizer.encode(f" {example['output']}\n\n", add_special_tokens=False)
    completion_weights = [1] * len(completion_tokens)

    tokens = prompt_tokens + completion_tokens
    weights = prompt_weights + completion_weights

    input_tokens = tokens[:-1]
    target_tokens = tokens[1:] # We're predicting the next token, so targets need to be shifted.
    weights = weights[1:]

    # A datum is a single training example for the loss function.
    # It has model_input, which is the input sequence that'll be passed into the LLM,
    # and loss_fn_inputs, which is a dictionary of extra inputs used by the loss function.
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens)
    )

processed_examples = [process_example(ex, tokenizer) for ex in examples]

# Visualize the first example for debugging purposes
datum0 = processed_examples[0]
print(f"{'Input':<20} {'Target':<20} {'Weight':<10}")
print("-" * 50)
for i, (inp, tgt, wgt) in enumerate(zip(datum0.model_input.to_ints(), datum0.loss_fn_inputs['target_tokens'].tolist(), datum0.loss_fn_inputs['weights'].tolist())):
    print(f"{repr(tokenizer.decode([inp])):<20} {repr(tokenizer.decode([tgt])):<20} {wgt:<10}")
```

The visualization looks like the following (this sample output was generated from the phrase "I love tinkering"):

```
Input                Target               Weight
--------------------------------------------------
'English'            ':'                  0.0
':'                  ' I'                 0.0
' I'                 ' love'              0.0
' love'              ' tink'              0.0
' tink'              'ering'              0.0
'ering'              '\n'                 0.0
'\n'                 'P'                  0.0
'P'                  'ig'                 0.0
'ig'                 ' Latin'             0.0
' Latin'             ':'                  0.0
':'                  ' I'                 1.0
' I'                 '-way'               1.0
'-way'               ' o'                 1.0
' o'                 've'                 1.0
've'                 '-l'                 1.0
'-l'                 'ay'                 1.0
'ay'                 ' ink'               1.0
' ink'               'ering'              1.0
'ering'              '-t'                 1.0
'-t'                 'ay'                 1.0
'ay'                 '<|endoftext|>'      1.0
```

## Vision inputs

The above example is text-only, but adding vision inputs is also straightforward. The `ModelInput` type takes a list of chunks, which can be either `EncodedTextChunk` or `ImageChunk`. For instance:

```python
import requests

image_data = requests.get("https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png").content
model_input = tinker.ModelInput(chunks=[
    types.EncodedTextChunk(tokens=tokenizer.encode("<|im_start|>user\n<|vision_start|>")),
    types.ImageChunk(data=image_data, format="png"),
    types.EncodedTextChunk(tokens=tokenizer.encode("<|vision_end|>What is this?<|im_end|>\n<|im_start|>assistant\n")),
])
```

Note that Qwen3-VL was trained with special tokens like `<|vision_start|>` and `<|vision_end|>`. The cookbook's `Qwen3VLRenderer` handles these automatically; see [Rendering: Vision Inputs](/rendering#vision-inputs) for details and a complete example.

## Performing a training update

Now we can use this data to perform a training update. We'll do 6 updates on the same batch of data. (Note that this is not typically a good way to train!)

```python
import numpy as np
for _ in range(6):
    fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))

    # Wait for the results
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()

    # fwdbwd_result contains the logprobs of all the tokens we put in. Now we can compute the weighted
    # average log loss per token.
    logprobs = np.concatenate([output['logprobs'].tolist() for output in fwdbwd_result.loss_fn_outputs])
    weights = np.concatenate([example.loss_fn_inputs['weights'].tolist() for example in processed_examples])
    print(f"Loss per token: {-np.dot(logprobs, weights) / weights.sum():.4f}")
```

Note that the `forward_backward` and `optim_step` functions immediately return *futures*, which acknowledge that the task has been queued up by the server. For improved speed, we submitted both operations before waiting for the result by calling `result()` on the futures.


## Sampling from the model

Now we can test our model by sampling from it. In this case, we'll translate the phrase "coffee break" into Pig Latin.

```python
# First, create a sampling client. We need to transfer weights
sampling_client = training_client.save_weights_and_get_sampling_client(name='pig-latin-model')

# Now, we can sample from the model.
prompt = types.ModelInput.from_ints(tokenizer.encode("English: coffee break\nPig Latin:"))
params = types.SamplingParams(max_tokens=20, temperature=0.0, stop=["\n"]) # Greedy sampling
future = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=8)
result = future.result()
print("Responses:")
for i, seq in enumerate(result.sequences):
    print(f"{i}: {repr(tokenizer.decode(seq.tokens))}")
```

Since sampling is nondeterministic (sadly, even with temperature=0.0, [due to batching](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)), the output will be different each time.
You should see something like this:

```
Responses:
0: ' offe-bay eak-bay\n\n'
1: ' offey-coy eak-bray\n\n'
2: ' offecay eakbray\n\n'
3: ' offeec-cay eak-brcay\n\n\n'
4: ' offecay akebay\n\n'
5: ' offee-Cay ake-bay\n\n\n'
6: ' offey-pay eak-bray\n\n'
7: ' offee – cay eak – bray\n\n'
```

### Computing logprobs for a sequence

We can use the sampler to compute logprobs for a given sequence as well. These are computed in the prefill step and returned as _prompt logprobs_.

```python
prompt = types.ModelInput.from_ints(tokenizer.encode("How many r's are in the word strawberry?"))
sample_response = sampling_client.sample(
    prompt=prompt,
    num_samples=1,
    sampling_params=tinker.SamplingParams(max_tokens=1), # Must be at least 1 token, represents prefill step
    include_prompt_logprobs=True,
).result()

# example: [None, -9.54505, -1.64629, -8.81116, -3.50217, -8.25927, ...]
print(sample_response.prompt_logprobs)
```

The first logprob is `None` (corresponding to the first token), and subsequent entries are logprobs of each token in the prompt.

The sampling client also has a helper function that does the same thing:

```python
sampling_client.compute_logprobs(prompt).result()
```

### Top-k logprobs

For distillation, it may be especially useful to compute _top-k logprobs_ for each token as well, which can give you a sense of what the model "would have said" after each prefix instead of the actual prompt.

```python
sample_response = sampling_client.sample(
    prompt=prompt,
    num_samples=1,
    sampling_params=tinker.SamplingParams(max_tokens=1),
    include_prompt_logprobs=True,
    topk_prompt_logprobs=5,
).result()

# example:
# [None,
# [(14924, -1.17005), (755, -2.23255), (2, -2.73255), (791, -3.67005), (16309, -4.29505)],
# [(25, -1.64629), (3137, -2.39629), (11630, -2.89629), (21460, -3.83379), (14881, -4.02129)],
# [(41, -3.49866), (42, -3.49866), (49, -4.24866), (38, -4.37366), (54, -4.49866)],
# [(311, -1.00217), (656, -2.25217), (2057, -2.75217), (649, -3.25217), (10470, -3.37717)],
# ...]
sample_response.topk_prompt_logprobs
```

For each position in the prompt, this returns a list of `(token_id, logprob)` pairs for the top-k most likely tokens at that position.

## Putting it together: Sampling from an image

Here's a complete example that creates a training client, saves weights for sampling, and asks a question about an image. You can copy-paste it into an IPython notebook:

```python
import requests
import tinker
from transformers import AutoTokenizer

model_name = "Qwen/Qwen3-VL-30B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

service_client = tinker.ServiceClient()
training_client = await service_client.create_lora_training_client_async(base_model=model_name, rank=32)
sampling_client = await training_client.save_weights_and_get_sampling_client_async(name="sampler")

# Grab an image and ask a question
image_data = requests.get("https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png").content
model_input = tinker.ModelInput(chunks=[
    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("<|im_start|>user\n<|vision_start|>")),
    tinker.types.ImageChunk(data=image_data, format="png"),
    tinker.types.EncodedTextChunk(tokens=tokenizer.encode("<|vision_end|>What is this?<|im_end|>\n<|im_start|>assistant\n")),
])

result = await sampling_client.sample_async(prompt=model_input, num_samples=1, sampling_params=tinker.types.SamplingParams(max_tokens=100))
print(tokenizer.decode(result.sequences[0].tokens))
```

For higher-level abstractions that handle special tokens automatically, see [Rendering: Vision Inputs](/rendering#vision-inputs).


---

## File: rendering.mdx

import { CookbookLink } from '../components/CookbookLink'


# Rendering to tokens

Rendering converts list-of-message datatypes into their token representations for model training and inference. While similar to [chat templates](https://huggingface.co/docs/transformers/en/chat_templating), Tinker's rendering system is designed for the full training lifecycle--not just inference--supporting supervised learning, reinforcement learning, and deployment.


## The Renderer class

The Renderer class is the main interface used for rendering. It can be found in <CookbookLink path="tinker_cookbook/renderers.py">`renderers.py`</CookbookLink>.

**Example conversation:**

```python
messages = [
    {'role': 'system', 'content': 'Answer concisely; at most one sentence per response'},
    {'role': 'user', 'content': 'What is the longest-lived rodent species?'},
    {'role': 'assistant', 'content': 'The naked mole rat, which can live over 30 years.'},
    {'role': 'user', 'content': 'How do they live so long?'},
    {'role': 'assistant', 'content': 'They evolved multiple protective mechanisms including special hyaluronic acid that prevents cancer, extremely stable proteins, and efficient DNA repair systems that work together to prevent aging.'}
]
```

We'll use this conversation throughout the examples below.

## Inference: Generating messages

Our model maps tokens to tokens, but with the renderer, it can map messages to messages. To sample messages from the model, we need to use three methods from the renderer:

- `build_generation_prompt`
- `get_stop_sequences`
- `parse_response`


`build_generation_prompt` converts a conversation into a prompt that we can use to sample from the assistant.
This is used during reinforcement learning and at deployment time.


**Example: Generate an alternative assistant response**

Let's remove the last assistant message and call `build_generation_prompt` to get a prompt that we can use to sample an alternative response from the assistant:

```python
from tinker_cookbook import renderers, tokenizer_utils
tokenizer = tokenizer_utils.get_tokenizer('Qwen/Qwen3-30B-A3B')
renderer = renderers.get_renderer('qwen3', tokenizer)
prompt = renderer.build_generation_prompt(messages[:-1])
print(prompt)
print('-'*10)
print(tokenizer.decode(prompt.to_ints()))
```

**Output:**
```
ModelInput(chunks=[EncodedTextChunk(tokens=[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 8948, 198, 16141, 3529, 285, 974, 26, 518, 1429, 825, 11652, 817, 2033, 151645, 198, 151644, 872, 198, 3838, 374, 279, 22032, 61854, 20589, 306, 9419, 30, 151645, 198, 151644, 77091, 198, 785, 19020, 34651, 11244, 11, 892, 646, 3887, 916, 220, 18, 15, 1635, 13, 151645, 198, 151644, 872, 198, 10234, 30, 151645, 198, 151644, 77091, 198], type='encoded_text')])
----------
<|im_start|>system
Answer concisely; at most one sentence per response<|im_end|>
<|im_start|>user
What is the longest-lived rodent species?<|im_end|>
<|im_start|>assistant
The naked mole rat, which can live over 30 years.<|im_end|>
<|im_start|>user
How do they live so long?<|im_end|>
<|im_start|>assistant

```

You can see that the prompt is a `ModelInput` object, which holds a list of `EncodedTextChunk` objects (multimodal inputs contain other chunk types as well).


**Sampling and parsing the response:**

Given that we're providing messages as input, we probably want a message output, rather than a token output.
For that, we can use `parse_response`.

```python
import tinker
from tinker.types import SamplingParams
service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(base_model='Qwen/Qwen3-30B-A3B')
stop_sequences = renderer.get_stop_sequences()
print(f"Stop sequences: {stop_sequences}")
sampling_params = SamplingParams(max_tokens=100, temperature=0.5, stop=stop_sequences)
output = sampling_client.sample(prompt, sampling_params=sampling_params, num_samples=1).result()
print(f"Sampled tokens: {output.sequences[0].tokens}")
sampled_message, parse_success = renderer.parse_response(output.sequences[0].tokens)
print(f"Sampled message: {sampled_message}")
print(f"Parse success: {parse_success}")
```

**Output:**

```
Stop sequences: [151645]
Sampled tokens: [45, 7741, 34651, 31410, 614, 4911, 76665, 11, 2670, 264, 7548, 11050, 22077, 1849, 323, 264, 1602, 3347, 40761, 4379, 11, 892, 16792, 311, 862, 57119, 13, 151645]
Sampled message: {'role': 'assistant', 'content': 'Naked mole rats have unique adaptations, including a highly efficient immune system and a very low metabolic rate, which contribute to their longevity.'}
Parse success: True
```

You can see that there is one stop sequence, `151645`, which you can verify is the `<|im_end|>` token. The output is parsed successfully into a message.


## Training: Supervised learning

For supervised learning (and some other algorithms like [DPO](/preferences/dpo-guide)), we need to distinguish between **prompt tokens** (context) and **completion tokens** (what the model should learn to generate).
We want to provide a target assistant message, and the renderer needs to tell us which tokens belong to the prompt and which to the completion.

We can use `build_supervised_example` to get a `ModelInput` and per-token loss weights:

```python
model_input, weights = renderer.build_supervised_example(messages)

from tinker_cookbook.utils.format_colorized import format_colorized
print(format_colorized(model_input.to_ints(), weights, tokenizer))
```

We get the following output:

<div className="example">
<span className="prompt"><|im_start|>system↵<br />Answer concisely; at most one sentence per response<|im_end|>↵<br /><|im_start|>user↵<br />What is the longest-lived rodent species?<|im_end|>↵<br /><|im_start|>assistant↵<br />The naked mole rat, which can live over 30 years.<|im_end|>↵<br /><|im_start|>user↵<br />How do they live so long?<|im_end|>↵<br /><|im_start|>assistant↵<br /></span>
<span className="completion">They evolved multiple protective mechanisms including special hyaluronic acid that prevents cancer, extremely stable proteins, and efficient DNA repair systems that work together to prevent aging.<|im_end|><br /></span>
</div>
The green text is part of the prompt (i.e. with `weight=0`, so no loss is computed on these) and red is part of the completion (i.e. with `weight=1`, so the model is trained to predict these). Note that the ↵ symbols have been inserted for clarity to show newlines; these are not actually part of the token sequence.

The key insight here is that only the final assistant message is treated as the completion. All previous context, including the first assistant response, is part of the prompt, so the model learns to continue conversations rather than just answer single questions.

## Vision Inputs

Tinker supports vision-language models (VLMs) like `Qwen/Qwen3-VL-30B-A3B-Instruct` and `Qwen/Qwen3-VL-235B-A22B-Instruct`.
For low-level `ImageChunk` usage, see [Vision inputs](/training-sampling#vision-inputs) in the Training and Sampling guide. This section covers the higher-level message abstractions.

### Multimodal messages

For VLMs, message content can be either a string or a list of content parts:

```python
from tinker_cookbook.renderers import Message, TextPart, ImagePart

# Text-only message (standard)
text_message = Message(role='user', content='What is this?')

# Multimodal message with image
multimodal_message = Message(
    role='user',
    content=[
        ImagePart(type='image', image='https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png'),
        TextPart(type='text', text='What is in this image?'),
    ]
)
```

For lower-level control using `ImageChunk` directly, see [Vision inputs](/training-sampling#vision-inputs) in the Training and Sampling guide.

### Using Qwen3VLRenderer

The `Qwen3VLRenderer` handles Qwen's vision special tokens (`<|vision_start|>`, `<|vision_end|>`) automatically:

```python
from tinker_cookbook import renderers, tokenizer_utils
from tinker_cookbook.image_processing_utils import get_image_processor

model_name = "Qwen/Qwen3-VL-235B-A22B-Instruct"
tokenizer = tokenizer_utils.get_tokenizer(model_name)
image_processor = get_image_processor(model_name)

renderer = renderers.Qwen3VLRenderer(tokenizer, image_processor)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'image', 'image': 'https://thinkingmachines.ai/blog/on-policy-distillation/images/chess.png'},
            {'type': 'text', 'text': 'What is in this image?'},
        ]
    }
]

prompt = renderer.build_generation_prompt(messages)
```

For a complete example of training a VLM image classifier, see the <CookbookLink path="tinker_cookbook/recipes/vlm_classifier">VLM Classifier recipe</CookbookLink> in the cookbook.

## Multi-turn RL and the Extension Property

When using renderers in multi-turn RL, an important consideration is whether consecutive timesteps satisfy the **extension property**: each observation is a prefix extension of the previous observation plus action. This affects compute efficiency (O(T) vs O(T²)) and KV-cache reuse.

Some renderers, like `Qwen3Renderer`, have options that affect this property. For example, `strip_thinking_from_history` controls whether `<think>` blocks are preserved in conversation history.

See the [Sequence Extension](/rl/sequence-extension) documentation for details on how this works and the tradeoffs involved.

## Appendix: Why not Jinja templates?

In our experience, Jinja2 templates are harder to write than Python code, especially when we need to get the whitespace exactly right. They are also unwieldy for supervised learning, where you need to put different labels on different tokens.

---

## File: completers.mdx

import { CookbookLink } from '../components/CookbookLink'

# Completers

The concept of policies is crucial to the RL training process. In the Tinker Cookbook, policies are implemented as `Completers`. Completers are abstractions that represent models or policies that can be sampled from, providing different levels of structure depending on your use case.

## Overview of Completer Types

The Tinker Cookbook provides two main types of completers, each designed for different use cases:

1. **TokenCompleter**: Operates on tokens and is used by RL algorithms
2. **MessageCompleter**: Operates on messages and needs to be used with a renderer

The choice between these depends on whether you're working at the token level for RL training or at the message level for interacting with and evaluating the model.

### TokenCompleter

The `TokenCompleter` is the foundational interface used by RL algorithms because they work directly with tokens.

```python
class TokenCompleter:
    async def __call__(
        self, model_input: types.ModelInput, stop: StopCondition
    ) -> TokensWithLogprobs:
        ...
```

This interface takes:
- `model_input`: The input to the model (of type `types.ModelInput`)
- `stop`: Stop conditions, either a list of strings or token IDs (combined into a `StopCondition` class). When training with reinforcement learning, this should be defined by the `initial_observation` function of the environment.

It returns a `TokensWithLogprobs` object containing:
- `tokens`: The generated token sequence
- `maybe_logprobs`: Optional log probabilities for each token

### MessageCompleter

The `MessageCompleter` operates at a higher level with structured messages, similarly to standard chat APIs. It takes a list of messages and returns a single assistant message response.

```python
class MessageCompleter:
    async def __call__(self, messages: list[renderers.Message]) -> renderers.Message:
        ...
```

For RL training we use the `TokenCompleter`, because the update step must optimize exactly the same tokens that the model produced during rollout.
The `MessageCompleter` is useful for sampling where we need to use the model output for semantic purposes, such as judge models or multi-agent environments.

The Tinker Cookbook uses two concrete implementations of these interfaces: <CookbookLink path="tinker_cookbook/completers.py">`TinkerTokenCompleter`</CookbookLink> and <CookbookLink path="tinker_cookbook/completers.py">`TinkerMessageCompleter`</CookbookLink>, both of which are wrappers around a `tinker.SamplingClient`. While the `TinkerTokenCompleter` operates directly on tokens, the `TinkerMessageCompleter` needs to be instantiated with a renderer to make it compatible with the inputs expected by the sampling client.

---

## File: install.mdx

# Installing Tinker

Install the Tinker SDK with:

```bash
pip install tinker
```

Installation makes two components available: the Python SDK and the tinker CLI.

#### Python SDK

The Python SDK provides low-level operations like `forward_backward`, `sample`, `optim_step`, and `save_state`.

#### Tinker CLI

The tinker CLI is available as `tinker` or through `python -m tinker`. The CLI provides management functionality similar to that of the web console.

Run `tinker --help` to see which functionality is available.

## Tinker Cookbook

We also release [tinker-cookbook](https://github.com/thinking-machines-lab/tinker-cookbook), which is a collection of training code and experiment tools built on top of Tinker.
For the Cookbook, we'd recommend doing a local editable install, as you'll probably want to browse and edit the code:

```bash
git clone https://github.com/thinking-machines-lab/tinker-cookbook.git
cd tinker-cookbook
# Switch to your virtual environment
pip install -e .
```

## Getting an API key

Create an API key from the [console](https://tinker-console.thinkingmachines.ai).
You'll then want to set the `TINKER_API_KEY` environment variable to your newly generated API key.

---

## File: rl.mdx

import { CookbookLink } from '../components/CookbookLink'

# Reinforcement learning

Reinforcement learning (RL) means learning from trial and error. Whereas in supervised learning we're given input-output pairs, in RL we're given inputs (prompts) and reward functions (i.e., functions for scoring candidate outputs). RL algorithms need to discover what good outputs look like.

Here are a few different types of RL training that we support in the Tinker Cookbook:

- *RL with Verifiable Rewards* (RLVR): this is when we do RL on a reward function that checks model outputs using a program. Typically, the reward function checks the candidate answer against a reference answer, or, in coding cases, it may check if the candidate solution passes some unit tests. RLVR is especially suitable for teaching models to do reasoning (with chain-of-thought) and multi-step tool use (e.g., debugging and iterative modification of programs).
- *RL on Human Feedback* (RLHF): here, we assume we have an objective that can't be calculated by a simple program, and it requires some human judgement. For example, we typically want to optimize our models for helpfulness, which includes being clear, informative, and interesting. For RLHF, we train a *preference model* using supervised learning to match human judgement, scoring or ranking candidate outputs. Then we do RL on the preference model's scores. See the [Preferences](/preferences) section for more details.

We'll first show how to do small RL runs in the RLVR setting, then we'll show you how to define your own RL environments and train on them, then we'll provide examples for larger-scale or more complicated training setups.

We anticipate that people will want to use Tinker for RL in a few different ways:

- Creating a specialist model that's SoTA at a specific skill, which existing models haven't been trained on. In this case, you'll want to start with a post-trained model that's already strong, and then do RL on an environment you've defined. See [RL Environments](/rl/rl-envs).
- Doing research on post-training pipelines. In this case, you'll probably want to chain together SL and RL runs with different data mixes, environments, and reward functions. See our [RLHF example](/preferences/rlhf-example).
- Doing research on RL algorithms. Here, you'll probably want to find some existing environments to use as benchmarks, and either modify our provided training code (<CookbookLink path="tinker_cookbook/rl/train.py">rl/train.py</CookbookLink>) or write your own minimal training loop. We've provided a [minimal training loop](/rl/rl-loops) that you can use as a starting point.

---

## File: under-the-hood.mdx

# Under the Hood

This page explains some implementation details of Tinker, which are important for understanding how to speed up your code.

## Clock Cycles

In Tinker, after you call `ServiceClient.create_lora_training_client`, your training job gets assigned to a pool of machines that work together (a *worker pool*), performing forward-backward operations repeatedly in lock-step.
Each of these steps of the worker pool is called a *clock cycle*.
In each clock cycle, we run a forward-backward operation and an optimizer step, each of which may involve multiple LoRA models that are being trained by this pool.
You can think of this pool as a single large training run that is time-shared between multiple different LoRA models, often from different users.

With multi-tenancy (sharing the same worker pool between multiple models), we can run the training system efficiently even if users are training with small batch sizes, or if they have other delays in their training loops that would otherwise leave the worker pool idle. Small batch sizes can often give better *sample efficiency*, so this setup lets us achieve both high compute efficiency and high sample efficiency.

The downside is that it can sometimes lead to worse *latency*: even if training with a small batch, you'll still see the same step time as a large batch. (Still, note that we'll only charge you for the compute you use.)
Also, if your training loop is implemented naively, you might have to wait multiple clock cycles per batch, because you might miss a clock cycle between operations.

### Overlapping `forward_backward` and `optim_step` Requests

As mentioned in the [Async and Futures](/async) section, you should submit your `forward_backward` and `optim_step` requests together before waiting for either of them. This way, they'll end up on the same clock cycle. If you write the code naively, you'll end up using *three* clock cycles per training step. Here's a recap of the example from the [Async and Futures](/async) section:

**❌ Naive implementation (uses 3 clock cycles):**
```python
# Submit forward_backward, gets queued for clock cycle N
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)

# Wait for it to complete, and for client to receive the result
# Due to communication latency, this happens a little after cycle N+1 started
fwd_bwd_result = await fwd_bwd_future

# Submit optim_step, gets queued for clock cycle N+2
optim_future = await client.optim_step_async(adam_params)

# Wait for it to complete, and for client to receive the result
# This happens a little after cycle N+2 finishes
optim_result = await optim_future

# Total: forward_backward on cycle N, optim_step on cycle N+2
# This takes 3 clock cycles (plus the time we waited before cycle N started)
```

**✓ Better implementation (uses 1 clock cycle):**
```python
# Submit both requests immediately. They'll both be slotted into the same clock cycle N
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)
optim_future = await client.optim_step_async(adam_params)

# Now wait for results - both operations happen on cycle N
fwd_bwd_result = await fwd_bwd_future
optim_result = await optim_future

# Total: both operations on cycle N
# This takes 1 clock cycle
```

### Pipelining to Maximize Clock Cycle Efficiency

To maximize efficiency and avoid missing clock cycles, you should **pipeline your training loop**: submit the next batch before waiting for the current batch to complete. This ensures there's always a request queued when a new clock cycle starts.

We've created a demonstration script that shows the difference between pipelined and non-pipelined training:

[View the clock cycles demonstration script →](/clock_cycles.py.txt)

The script includes two versions:

- **Non-pipelined**: Submits a batch, waits for it to complete, then submits the next. This approach typically wastes clock cycles because there's a gap between when one batch finishes and the next is submitted, often using 2 clock cycles per training step.

- **Pipelined**: Submits the next batch *before* waiting for the previous batch to complete. This approach often uses exactly 1 clock cycle per step, achieving maximum efficiency, though it might sometimes take more than 1 clock cycle per step if the server is heavily loaded, or due to subtleties of our current implementation. (For example, if there are no other users, we might start the clock cycle after receiving the first `forward_backward` but before receiving the `optim_step`. Then we'll do `optim_step` on the next cycle. This causes an extra clock cycle but doesn't cause a slowdown.)

Running the script will show you the performance comparison, including total time and clock cycles used.
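
To make the cycle accounting concrete, here is a toy discrete-time model. It is not the demonstration script and not Tinker's actual scheduler. It assumes that each step's `forward_backward` and `optim_step` are submitted together, that a request submitted during one cycle executes in the next cycle, and that the client only sees a result during the cycle after it executed. The function name `cycles_used` and its parameters are made up for illustration.

```python
def cycles_used(num_steps: int, pipelined: bool) -> int:
    """Count clock cycles spanned by num_steps training steps in a toy model.

    Assumptions (illustrative only, not Tinker's real scheduler):
    - a request submitted during cycle c executes in cycle c + 1,
    - the client receives a result during the cycle after it executed.
    """
    if num_steps <= 0:
        return 0
    first_exec = 1  # step 0 is submitted during cycle 0 and runs in cycle 1
    exec_cycle = first_exec
    for _ in range(1, num_steps):
        if pipelined:
            # The next step was queued before the previous result arrived,
            # so it runs in the very next cycle.
            exec_cycle += 1
        else:
            # The result arrives during exec_cycle + 1; only then is the next
            # step submitted, so it runs in exec_cycle + 2.
            exec_cycle += 2
    return exec_cycle - first_exec + 1


print("non-pipelined:", cycles_used(10, pipelined=False))
print("pipelined:", cycles_used(10, pipelined=True))
```

Under these assumptions, 10 steps span 19 cycles without pipelining and 10 cycles with it, which matches the rough "2 cycles per step" versus "1 cycle per step" figures above. Real timings depend on server load and network latency.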
The pipelined version typically saves both time and clock cycles.

---

## File: model-lineup.mdx

# Available Models in Tinker

The table below shows the models that are currently available in Tinker. We plan to update this list as new models are released.

## What model should I use?

- In general, use MoE models, which are more cost effective than the dense models.
- Use Base models only if you're doing research or are running the full post-training pipeline yourself.
- If you want to create a model that is good at a specific task or domain, use an existing post-trained model, and fine-tune it on your own data or environment.
- If you care about latency, use one of the Instruction models, which will start outputting tokens without a chain-of-thought.
- If you care about intelligence and robustness, use one of the Hybrid or Reasoning models, which can use long chain-of-thought.

## Full Listing

| Model Name | Training Type | Architecture | Size |
| --- | --- | --- | --- |
| [Qwen/Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) | Vision | MoE | Large |
| [Qwen/Qwen3-VL-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) | Vision | MoE | Medium |
| [Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) | Instruction | MoE | Large |
| [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) | Instruction | MoE | Medium |
| [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | Hybrid | MoE | Medium |
| [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) | Base | MoE | Medium |
| [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | Hybrid | Dense | Medium |
| [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | Hybrid | Dense | Small |
| [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) | Base | Dense | Small |
| [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | Instruction | Dense | Compact |
| [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | Reasoning | MoE | Medium |
| [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) | Reasoning | MoE | Small |
| [deepseek-ai/DeepSeek-V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | Hybrid | MoE | Large |
| [deepseek-ai/DeepSeek-V3.1-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base) | Base | MoE | Large |
| [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) | Base | Dense | Large |
| [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | Instruction | Dense | Large |
| [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | Base | Dense | Small |
| [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | Instruction | Dense | Small |
| [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) | Base | Dense | Compact |
| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | Base | Dense | Compact |
| [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) | Reasoning | MoE | Large |

## Legend

### Training Types
- **Base**: Foundation models trained on raw text data, suitable for post-training research and custom fine-tuning.
- **Instruction**: Models fine-tuned for following instructions and chat, optimized for fast inference.
- **Reasoning**: Models that always use chain-of-thought reasoning before their "visible" output that responds to the prompt.
- **Hybrid**: Models that can operate in both thinking and non-thinking modes, where the non-thinking mode requires using a special renderer or argument that disables chain-of-thought.
- **Vision**: Vision-language models (VLMs) that can process images alongside text. See [Vision Inputs](/rendering#vision-inputs) for usage.

### Architecture
- **Dense**: Standard transformer architecture with all parameters active
- **MoE**: Mixture of Experts architecture with sparse activation

### Model Sizes

- **Compact**: 1B-4B parameters
- **Small**: 8B parameters
- **Medium**: 30B-32B parameters
- **Large**: 70B+ parameters

Note that the MoE models are much more cost effective than the dense models, as their cost is proportional to the number of active parameters and not the total number of parameters.

---

## File: preferences/dpo-guide.mdx

import { Callout } from 'nextra/components'
import { CookbookLink } from '../../components/CookbookLink'

# Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model.
Instead of using reinforcement learning from human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.

## DPO Algorithm Details

The core DPO loss is computed as:

$$
\mathcal{L}_{\theta} = -\mathbb{E}_{x, y_\text{chosen}, y_\text{rejected} \sim \mathcal{D}}\left[\log\sigma\left(\beta\log \frac{\pi_{\theta}(y_\text{chosen}|x)}{\pi_{\text{ref}}(y_\text{chosen}|x)} - \beta\log \frac{\pi_{\theta}(y_\text{rejected}|x)}{\pi_{\text{ref}}(y_\text{rejected}|x)}\right)\right]
$$

Where:
- $\pi_{\theta}$ is the current policy
- $\pi_{\text{ref}}$ is the reference model (typically the initial model before DPO training)
- $\beta$ is the DPO beta parameter
- $\mathcal{D}$ is a dataset of prompts $x$, each with a chosen response $y_{\text{chosen}}$ and a rejected response $y_{\text{rejected}}$

This optimizes the classical constrained RLHF objective, where the reference model constrains deviation from the initial distribution.

<Callout type="info">
**DPO vs RLHF**: DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.
</Callout>

## Running DPO Training

The implementation is in <CookbookLink path="tinker_cookbook/preference/train_dpo.py">train_dpo.py</CookbookLink> with a CLI in <CookbookLink path="tinker_cookbook/recipes/preference/dpo/train.py">train.py</CookbookLink>.
You can run it from the command line:

```bash
python -m tinker_cookbook.recipes.preference.dpo.train \
    log_path=/tmp/dpo-hhh-experiment \
    model_name=meta-llama/Llama-3.2-1B \
    dataset=hhh \
    renderer_name=role_colon \
    learning_rate=1e-5 \
    dpo_beta=0.1
```

### Key Parameters

- `log_path`: Directory where results and checkpoints are saved
- `model_name`: Base model used as initialization and for the reference policy
- `dataset`: Dataset name (`hhh`, `helpsteer3`, `ultrafeedback`)
- `renderer_name`: How conversations are formatted (see [Rendering](../rendering.mdx))
- `learning_rate`: Learning rate for optimization
- `dpo_beta`: DPO beta parameter (controls the strength of preference learning)

### Available Datasets

There are several pre-defined datasets:

- **`hhh`**: Anthropic's Helpful-Harmless-Honest dataset
- **`helpsteer3`**: NVIDIA's HelpSteer3 preference dataset
- **`ultrafeedback`**: UltraFeedback binarized preferences dataset

These are implemented as `DPODatasetBuilder` classes and you can implement a custom dataset builder following the `tinker_cookbook.preference.preference_datasets` interface.

## Training Process

During training, you'll see output like this showing the DPO metrics:

```
Step 50
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric                         ┃ Value     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy                       │ 0.568627  │
│ batch_time                     │ 27.953704 │
│ chosen_reward                  │ 0.053621  │
│ dpo_loss                       │ 0.683825  │
│ learning_rate                  │ 0.000009  │
│ margin                         │ 0.002147  │
│ num_pairs                      │ 255       │
│ num_tokens                     │ 112638    │
│ progress                       │ 0.081210  │
│ rejected_reward                │ 0.032152  │
│ test/nll                       │ 1.871778  │
└────────────────────────────────┴───────────┘
```

The key metrics are:
- **`dpo_loss`**: The DPO classification loss
- **`accuracy`**: Accuracy of the implicit reward model evaluated on the preference dataset
- **`margin`**: Average difference between chosen and rejected rewards
- **`chosen_reward`/`rejected_reward`**: Average rewards for chosen/rejected responses

## Evaluating DPO Models

After training, you can evaluate your DPO model using the inspect evaluation framework:

```bash
MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
python -m tinker_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=meta-llama/Llama-3.2-1B \
    tasks=inspect_evals/ifeval \
    renderer_name=role_colon
```

This will evaluate the model on various benchmarks to measure the impact of preference optimization.

## Tips for DPO Training

1. **Beta Parameter**: Start with `dpo_beta=0.1` and adjust based on your dataset.

2. **Learning Rate**: Use a lower learning rate than supervised fine-tuning (typically 1e-5 to 1e-6).

3. **Base Model**: The base model should already be in-distribution with the preference data. Either start with a light SFT phase or collect on-policy preferences. Training will still run otherwise, but a sharp distribution mismatch can produce strange model behaviors.

---

## File: preferences/rlhf-example.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Reinforcement Learning from Human Feedback

We've provided a script that shows how to run a standard pipeline for reinforcement learning from human feedback (RLHF) in <CookbookLink path="tinker_cookbook/recipes/preference/rlhf/rlhf_pipeline.py">rlhf_pipeline.py</CookbookLink>.

```bash
python -m tinker_cookbook.recipes.preference.rlhf.rlhf_pipeline
```

## Training the initial policy via supervised learning

First, we train the policy on the [no_robots dataset](https://huggingface.co/datasets/HuggingFaceH4/no_robots) from Hugging Face, a basic instruction-following dataset with human-written answers that was designed to match the methodology from [InstructGPT](https://arxiv.org/abs/2203.02155).

## Training the preference model via supervised learning

We train the preference model on the [HHH dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) from Anthropic, which is a dataset of pairwise comparisons of completions. We train a model that sees a pair of completions, A and B, and outputs which one is preferred.

## Training the policy via reinforcement learning

Taking the initial policy and the preference model we just trained, we can now train the policy via reinforcement learning. This RL is a form of self-play, where we use the preference model to grade match-ups between the policy and itself. In particular, for each prompt, we sample multiple completions, and use the preference model to grade all pairs of completions.
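
One way to turn pairwise grades into a per-completion score is the fraction of match-ups each completion wins. The sketch below is illustrative only and is not taken from `rlhf_pipeline.py`: `prefers(a, b)` is a hypothetical stand-in for the preference model, and the real pipeline may grade ordered pairs, use soft scores, or normalize rewards differently.

```python
from itertools import combinations
from typing import Callable, Sequence


def win_fractions(
    completions: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> list[float]:
    """Return, for each completion, the fraction of its match-ups it wins.

    `prefers(a, b)` is a hypothetical judge that returns True when
    completion `a` is preferred over completion `b`.
    """
    n = len(completions)
    if n < 2:
        return [0.0] * n
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefers(completions[i], completions[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Every completion takes part in n - 1 match-ups.
    return [w / (n - 1) for w in wins]


# Toy judge for demonstration only: prefer the shorter completion.
samples = ["a long rambling answer", "short", "medium answer"]
rewards = win_fractions(samples, lambda a, b: len(a) < len(b))
print(rewards)
```

With this toy judge the shortest completion wins both of its match-ups and the longest wins none.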
We then give the policy a reward based on the win fraction.

---

## File: rl/rl-basic.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Your First RL Run

We've provided a minimal script that runs RL on the [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k): <CookbookLink path="tinker_cookbook/recipes/rl_basic.py">rl_basic.py</CookbookLink>. You can run the minimal RL script from the command line as follows:

```bash
python -m tinker_cookbook.recipes.rl_basic
```

This script will fine-tune the Llama-3.1-8B base (pretrained) model on this dataset with the following reward function:

$$
1[\text{answer is correct}] + 0.1 \times (1[\text{answer is formatted correctly}] - 1)
$$

The training should take about 1 minute per iteration and climb to about 63% accuracy after 15 iterations (`env/all/correct`). You can look at the printouts for some other metrics of interest:

- `ac_tokens_per_turn`: the number of tokens in each generated completion
- `env/all/format`: the fraction of completions that are formatted correctly
- `env/all/reward/total`: mean total reward (combining format and correctness as defined above)
- `entropy`: per-token entropy (mean negative log-probability of sampled tokens)
- `kl_sample_train_{v1,v2}`: two different estimators of the KL divergence between the sampler's and learner's probability distributions (which arises from numerical differences and rounding noise)
- `progress/done_frac`: what fraction of the total number of iterations we've completed so far
- `time/...`: time for different parts of the training loop

You can also look at the `log_path` directory for more detailed metrics.
There are several files of interest, which are mostly the same as in the [Supervised Learning](/supervised-learning/sl-basic) case.

---

## File: rl/sequence-extension.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Sequence Extension Property in Multi-Turn RL

When running reinforcement learning with multi-turn conversations, the way you render observations at each timestep has important implications for compute efficiency. This document explains the **extension property** and how it affects training and sampling.

## What is the Extension Property?

A sequence of observations has the **extension property** if each successive observation contains all previous observations and actions as a prefix. In other words, the context grows monotonically by appending new tokens to the end.

When this property holds, multiple timesteps can be merged into a single training datum, the KV-cache can be reused during sampling, and compute scales as O(T) rather than O(T²) for a trajectory of length T.

## Example 1: Qwen3 with Thinking Visible (Extension Holds)

When using `Qwen3Renderer` with `strip_thinking_from_history=False`, the full conversation history (including `<think>` blocks) is preserved at each timestep. Consider a two-turn math conversation:

**Timestep 1:**
<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 4<br/><br/>User:</span>
</div>

**Timestep 2:**
<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: <think>Let me calculate...</think> 4<br/><br/>User: What is 3+3?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 6<br/><br/>User:</span>
</div>

Notice that the observation (green) at timestep 2 contains the entire timestep 1 sequence as a prefix.
The new observation just appends `What is 3+3?\n\nAssistant: ` to the end. This is the **extension property**.

Because extension holds, the RL code can merge both timesteps into a **single Datum**:

<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 4<br/><br/>User:</span><span className="prompt"> What is 3+3?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 6<br/><br/>User:</span>
</div>

Green = observation tokens (loss weight = 0). Red = action tokens (loss weight > 0).

## Example 2: Qwen3 with Thinking Hidden (Extension Breaks)

When using `Qwen3Renderer` with the default `strip_thinking_from_history=True`, the `<think>...</think>` blocks are stripped from previous assistant messages. This matches how Qwen3 models were post-trained by the Qwen team.

**Timestep 1:**
<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 4<br/><br/>User:</span>
</div>

**Timestep 2:**
<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: 4<br/><br/>User: What is 3+3?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 6<br/><br/>User:</span>
</div>

The observation at timestep 2 is **not** an extension of timestep 1's full sequence. The `<think>Let me calculate...</think>` portion was stripped, so the prefix doesn't match.
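
The difference between the two examples can be checked mechanically with a prefix comparison on token IDs. This is a minimal sketch with hypothetical helper names (`is_extension`, `has_extension_property`) and placeholder token IDs; it is not the cookbook's implementation.

```python
def is_extension(prev_obs: list[int], prev_action: list[int], next_obs: list[int]) -> bool:
    """True if next_obs starts with prev_obs followed by prev_action."""
    prefix = prev_obs + prev_action
    return next_obs[: len(prefix)] == prefix


def has_extension_property(steps: list[tuple[list[int], list[int]]]) -> bool:
    """steps holds one (observation_tokens, action_tokens) pair per timestep."""
    return all(
        is_extension(obs, act, next_obs)
        for (obs, act), (next_obs, _) in zip(steps, steps[1:])
    )


# Token IDs below are arbitrary placeholders.
kept = [([1, 2], [3, 4]), ([1, 2, 3, 4, 5], [6])]   # full history preserved
stripped = [([1, 2], [3, 4]), ([1, 2, 4, 5], [6])]  # token 3 dropped from history
print(has_extension_property(kept), has_extension_property(stripped))
```

The first trajectory could be merged into one datum; the second could not, because the second observation no longer begins with the first observation plus action.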
The RL code must create **two separate Datums**:

**Datum 1:**
<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 4<br/><br/>User:</span>
</div>

**Datum 2:**
<div className="example">
<span className="prompt">User: What is 2+2?<br/><br/>Assistant: 4<br/><br/>User: What is 3+3?<br/><br/>Assistant: </span><span className="completion"><think>Let me calculate...</think> 6<br/><br/>User:</span>
</div>

This results in more compute during training (two forward/backward passes instead of one) and prevents KV-cache reuse during sampling. For a trajectory of T timesteps, compute scales as O(T²) instead of O(T).

## The Tradeoff

**Keeping thinking visible** (`strip_thinking_from_history=False`) gives you O(T) compute scaling, allows packing sequences together in training batches, and enables KV-cache reuse during sampling. The downside is that context grows faster since all thinking tokens are retained, so you may hit context length limits sooner.

**Stripping thinking** (`strip_thinking_from_history=True`, the default) keeps context smaller but breaks the extension property, leading to O(T²) compute scaling.

Note that while stripping thinking matches Qwen3's original post-training distribution, with RL fine-tuning the model should quickly adapt to the new situation where thinking is preserved. So "distribution match" might not be a major concern in practice.

## How the RL Code Handles This

The RL training code in <CookbookLink path="tinker_cookbook/rl/data_processing.py">`data_processing.py`</CookbookLink> automatically detects whether consecutive timesteps satisfy the extension property.
The key function is `trajectory_to_data`:

```python
def trajectory_to_data(traj: Trajectory, traj_advantage: float) -> list[tinker.Datum]:
    """
    Return one or more Datum objects corresponding to the trajectory.
    If the sequence grows by appending, i.e., each successive observation contains
    the previous observation+action as a prefix, then we can return a single Datum.
    However, if we get a sequence that's not an extension of the previous sequence,
    then that results in a new Datum.
    """
```

When rendering your conversations, be aware of whether your renderer has the extension property. For `Qwen3Renderer`:
- `strip_thinking_from_history=False` → Extension holds
- `strip_thinking_from_history=True` (default) → Extension breaks

**Note on sampling:** The training code automatically merges timesteps when possible. Sampling infrastructure doesn't yet adjust billing based on KV-cache hits, but this is planned for a future release.

## Advanced: Periodic Compaction

A hybrid approach is to use **periodic compaction**: keep thinking visible most of the time (preserving extension), but periodically clear old thinking blocks from the context.

**How it works:**
- For turns 1-10, keep all thinking visible (extension holds, single datum)
- At turn 11, strip thinking from turns 1-10 (extension breaks once, new datum starts)
- For turns 11-20, keep thinking visible again (extension holds)
- Repeat every N turns

Here's what the datums look like with compaction every 3 turns:

**Datum 1 (turns 1-3):**
<div className="example">
<span className="prompt">User: Q1<br/>Assistant: </span><span className="completion"><think>...</think> A1<br/>User:</span><span className="prompt"> Q2<br/>Assistant: </span><span className="completion"><think>...</think> A2<br/>User:</span><span className="prompt"> Q3<br/>Assistant: </span><span className="completion"><think>...</think> 
A3<br/>User:</span>
</div>

**Datum 2 (turns 4-6, thinking from turns 1-3 stripped):**
<div className="example">
<span className="prompt">User: Q1<br/>Assistant: A1<br/>User: Q2<br/>Assistant: A2<br/>User: Q3<br/>Assistant: A3<br/>User: Q4<br/>Assistant: </span><span className="completion"><think>...</think> A4<br/>User:</span><span className="prompt"> Q5<br/>Assistant: </span><span className="completion"><think>...</think> A5<br/>User:</span><span className="prompt"> Q6<br/>Assistant: </span><span className="completion"><think>...</think> A6<br/>User:</span>
</div>

This approach breaks extension only every N timesteps instead of every timestep, keeps context size bounded (old thinking doesn't accumulate forever), and amortizes the recomputation cost over N turns.

To implement this, you would modify your environment or renderer to periodically transform the conversation history, stripping `<think>` blocks from messages older than N turns.

## Summary

For `Qwen3Renderer`:
- `strip_thinking_from_history=False` → Extension holds → Use for long trajectories where compute efficiency matters
- `strip_thinking_from_history=True` (default) → Extension breaks → Use for short trajectories, or when you want minimal changes from base model behavior
- Periodic compaction → Best of both worlds when you need efficiency with bounded context

When designing your RL environment, consider how many turns you expect and whether the O(T) vs O(T²) difference will be significant for your use case.

---

## File: rl/rl-hyperparams.mdx

# RL Hyperparameters

This guide covers the key hyperparameters for reinforcement learning training, from core settings to advanced configurations.

## Core Hyperparameters

### Learning Rate

Similar to the [supervised learning setting](../supervised-learning/sl-hyperparams), the learning rate is the most critical hyperparameter choice.
We recommend using the guidance presented there as a starting point for RL experiments as well.

### Batch and Group Sizes

As described in our [RL environments](../rl/rl-envs.mdx) documentation, we use two key parameters:

- **`batch_size`**: The number of unique environments or problems used for training
- **`group_size`**: The number of rollouts performed per unique environment

If you have limited environments or problems available for training, increase the `group_size` to generate more training data. While the total number of rollouts depends on both parameters, we recommend scaling the learning rate as $\text{LR} \propto \sqrt{\text{batch\_size}}$.

## Multiple Updates per Sampling Iteration

The `num_substeps` parameter controls how many policy weight updates are performed on data sampled from the last policy iteration, similar to PPO and GRPO.

### How it works:

- **`num_substeps = 1` (default)**: Each batch of collected trajectories is used for exactly one optimizer update
- **`num_substeps > 1`**: The batch of unique environments is split into `num_substeps` mini-batches, where each environment/problem has `group_size` rollouts (we pack all rollouts for a particular environment/problem in the same minibatch). We do a single update step on each mini-batch. Note that our implementation still takes only a single epoch through the data.

### Usage Guidelines:

- The batch size must be divisible by `num_substeps`
- Our experiments show that `num_substeps = 1` already gives decent performance, but if you would like to experiment with this parameter, we recommend starting with a low value of 2-4 and using the PPO objective.
- Higher values can lead to update steps that are too out-of-distribution for the policy.
Consider limiting the number of updates or decreasing the learning rate when using multiple update steps.

## Advanced Training Configurations

⚠️ **Note**: These features are experimental and may be subject to instabilities. They are currently disabled by default.

### Streaming Minibatch Training

Enable streaming minibatch training by specifying the `StreamMinibatchConfig`. This approach overlaps trajectory sampling and model training, improving overall throughput by submitting training requests as soon as enough rollouts complete, without waiting for all sampling jobs to finish.

**Configuration Parameters:**

- **`groups_per_batch`**: Same as batch size
- **`num_minibatches`**: Number of minibatches per substep. This controls how the work is split, i.e. how many individual forward-backward requests we submit.

**Important**: This remains on-policy training and is strictly a pipeline efficiency improvement.

### Async Off-Policy Training

Async training allows the model to train on trajectories generated with slightly older model versions, enabling higher throughput at the cost of some off-policy bias. While Tinker doesn't currently support in-flight weight changes, it supports the "off-by-K" async RL approach where multiple model iterations generate data simultaneously. Configure this by setting the `AsyncConfig` object.

**Configuration Parameters:**

- **`max_steps_off_policy`**: Maximum age (in training steps) of trajectories before they're discarded. Essentially, trajectories from policy iterations older than `max_steps_off_policy` steps will not be used.
- **`groups_per_batch`**: Number of new trajectory groups to accumulate (with a `group_size` number of rollouts each) before updating the current iteration of the model.
Note: This is separate from the batch size used for dataset construction.

**Usage Guidelines:**

- Async RL is appropriate for applications with long and heterogeneous rollouts, such as very long CoT models, multi-hop tool use, or agentic workflows
- Start with a small value for `max_steps_off_policy` (less than 5)

## Monitoring and Run Health

Using policy-gradient algorithms with off-policy data can significantly degrade performance or even crash the policy, making monitoring essential during training.

### KL Divergence Monitoring

The current implementation logs the KL divergence between the data generation policy and the current learner: $\mathbb{D}_{KL}[\pi_{\text{sampler}}(\cdot|x)||\pi_{\theta}(\cdot|x)]$ using two separate estimators ([Schulman 2020](http://joschu.net/blog/kl-approx.html)):

- `kl_sample_train_v1`
- `kl_sample_train_v2`

A few important notes to keep in mind:
- Even with full on-policy training, the divergence between sampling and learning policies will not be exactly zero ([He 2025](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)) due to implementation details
- In our experience training is stable with KL divergence below 0.01
- If KL divergence crosses a recommended threshold, this indicates a numerical instability or potential issue with the training run

---

## File: rl/rl-loops.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Reinforcement Learning Training Loop

We've provided a simple RL training loop in <CookbookLink path="tinker_cookbook/recipes/rl_loop.py">rl_loop.py</CookbookLink>, which avoids using our environment classes and instead defines the data loading and rollouts in a more self-contained way. This is for people who like to write their own training loops or learn about how things work under the hood.
Our more performant implementation in <CookbookLink path="tinker_cookbook/rl/train.py">rl/train.py</CookbookLink> does basically the same thing, but with some performance optimizations, and with some additional features like periodic evals.

You can run the RL training loop using:
```
python -m tinker_cookbook.recipes.rl_loop
```

The default config should write the results to `/tmp/tinker-examples/rl-loop`. The experiment should be completed after 57 steps of training. You can plot the reward curve as follows:
```python
import pandas
import matplotlib.pyplot as plt

metrics_path = "/tmp/tinker-examples/rl-loop/metrics.jsonl"
df = pandas.read_json(metrics_path, lines=True)
plt.plot(df["reward/total"], label="reward/total")
plt.legend()
plt.show()
```

You should see a plot like this:

---

## File: rl/rl-envs.mdx

import { CookbookLink } from '../../components/CookbookLink'

# RL Environments

Here, we'll explain how to create your own RL environments and train on them. First, let's look at the basic classes, which can be found in <CookbookLink path="tinker_cookbook/rl/types.py">`tinker_cookbook.rl.types`</CookbookLink>. As you can see, there's an `Env` interface, corresponding to an RL environment. To write an environment, you need to implement two methods: `initial_observation` and `step`.

```python
class Env:
    """
    Stateful environment that a single agent interacts with.
    Discard after running for one episode.
    """

    async def initial_observation(self) -> tuple[Observation, StopCondition]:
        raise NotImplementedError

    async def step(self, action: Action) -> StepResult:
        raise NotImplementedError
```

Note that this `Env` operates on *tokens*, rather than strings or messages. Why define it this way, when it's usually more natural to define the logic in terms of strings or messages?
We've defined `Env` this way because this interface is what's needed by the *training* code, which needs to know the exact tokens that were sampled, and their logprobs.

We need to write two more small classes to use this environment in the RL training code. First, since the environment is discarded after a single episode, we need to be able to instantiate new environments in the training loop. We actually build a *group* of environments at a time, which enables multi-agent training or objectives that compare multiple samples (for example, a reward model that acts on a pair of samples).

```python
class EnvGroupBuilder:
    """
    Builds a group of environments.
    """

    async def make_envs(self) -> Sequence[Env]:
        raise NotImplementedError
```

This object creates a group of environments. Often it does the trivial thing of returning a list of copies of the same environment.

Finally, we need a dataset of these EnvGroupBuilders.

```python
class RLDataset:
    """
    Dataset of EnvGroupBuilders.
    """

    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
        raise NotImplementedError
```

That's a lot of classes! But their combination gives us a lot of flexibility.
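
To make the layering concrete, here is a deliberately simplified sketch of how a dataset, a group builder, and environments nest. Everything in it is hypothetical: `ToyEnv`, `ToyEnvGroupBuilder`, `ToyDataset`, and `count_envs` are illustrative stand-ins that work on strings and return bare values, whereas the real interfaces operate on tokens and return `StopCondition` / `StepResult` objects.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ToyEnv:
    """Hypothetical single-episode environment (strings instead of tokens)."""
    question: str
    answer: str

    async def initial_observation(self) -> str:
        return self.question

    async def step(self, action: str) -> float:
        # Toy reward: 1.0 for an exact match, 0.0 otherwise.
        return 1.0 if action.strip() == self.answer else 0.0


@dataclass
class ToyEnvGroupBuilder:
    """Builds `group_size` fresh copies of the same problem."""
    question: str
    answer: str
    group_size: int

    async def make_envs(self) -> Sequence[ToyEnv]:
        return [ToyEnv(self.question, self.answer) for _ in range(self.group_size)]


class ToyDataset:
    """Yields `batch_size` group builders per batch index."""

    def __init__(self, problems: List[Tuple[str, str]], batch_size: int, group_size: int):
        self.problems = problems
        self.batch_size = batch_size
        self.group_size = group_size

    def get_batch(self, index: int) -> List[ToyEnvGroupBuilder]:
        start = index * self.batch_size
        chunk = self.problems[start : start + self.batch_size]
        return [ToyEnvGroupBuilder(q, a, self.group_size) for q, a in chunk]


async def count_envs(dataset: ToyDataset, index: int) -> int:
    """Total environments (i.e. rollouts) created for one batch."""
    groups = [await builder.make_envs() for builder in dataset.get_batch(index)]
    return sum(len(group) for group in groups)


dataset = ToyDataset([("2+2", "4"), ("3+3", "6"), ("1+1", "2")], batch_size=2, group_size=4)
first_batch_rollouts = asyncio.run(count_envs(dataset, 0))
```

With two problems per batch and four environments per group, the first batch produces eight rollouts. The dataset decides which problems are sampled, while each builder decides how a group is constructed.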
In previous implementations (like OpenAI Gym), the dataset is implicitly part of the environment; this structure is more modular and gives us more control over the data loading.

## Building a simple example

You can find an example of writing a new RL environment in the <CookbookLink path="tinker_cookbook/recipes/multiplayer_rl/twenty_questions">Twenty Questions</CookbookLink> directory.
Here, we define a multi-step environment, where we're training a question-asking agent, which asks questions to another agent to guess a hidden word.
In this case, the answerer model is fixed and is Llama-3.1-8B-Instruct.
The player model (which we fine-tune) is also based on that same model.

You can run the training script as follows:

```bash
python -m tinker_cookbook.recipes.twenty_questions.train
```

---

## File: supervised-learning/sl-hyperparams.mdx

# Supervised Learning Hyperparameters

Successful LLM fine-tuning requires careful hyperparameter tuning. While the most accurate approach is to sweep over a range for each hyperparameter and select the value that minimizes loss or maximizes eval performance, this is often time-consuming and expensive. This guide provides some starting recommendations for the most important hyperparameters.

## Learning rate

The most important hyperparameter is generally the learning rate (LR). Our current best estimate of optimal LR for a model $m$ is the following:

$$ LR(m) = lr_{base} \cdot M_{LoRA} \cdot \Big(\frac{2000}{H_m}\Big)^{P_m} $$

where $lr_{base}$ is a constant base LR, $M_{LoRA}$ is a multiplier applied when using LoRA (1 if using full-finetuning), $H_m$ is the hidden size of the model $m$, and $P_m$ is a model-specific exponent adjustment.
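
As a worked instance of the formula, the sketch below plugs in the estimates quoted in this section ($lr_{base} = 5e-5$, $M_{LoRA} = 10$, $P_m = 0.781$ for Llama models). The function name `recommended_lr` is hypothetical and is not the cookbook's `get_lr`. The hidden size of 4096 for Llama-3.1-8B is an assumption taken from the public model config.

```python
def recommended_lr(
    hidden_size: int,
    exponent: float,
    lr_base: float = 5e-5,
    lora_multiplier: float = 10.0,
) -> float:
    """LR(m) = lr_base * M_LoRA * (2000 / H_m) ** P_m."""
    return lr_base * lora_multiplier * (2000.0 / hidden_size) ** exponent


LLAMA_EXPONENT = 0.781   # P_m quoted for Llama models
QWEN_EXPONENT = 0.0775   # P_m quoted for Qwen models

# Assumed hidden size for Llama-3.1-8B: 4096.
lr = recommended_lr(hidden_size=4096, exponent=LLAMA_EXPONENT)
print(f"{lr:.3e}")  # prints 2.856e-04
```

The result agrees with the `get_lr("meta-llama/Llama-3.1-8B")` value of about `2.8e-4` quoted in the sweep case study. This suggests the formula and constants are applied as written, at least for this model.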
Importantly, this function is independent of the LoRA rank.

Our current best estimates are the following: $lr_{base} = 5e-5$,
$M_{LoRA} = 10$, $P_m = 0.0775$ for Qwen models and $P_m = 0.781$ for Llama models.

### Getting the recommended learning rate
You can use the following function to get the recommended LR for any Llama or Qwen model:
```python
from tinker_cookbook.hyperparam_utils import get_lr
model_name = "meta-llama/Llama-3.2-1B"
recommended_lr = get_lr(model_name)
print(f"Recommended LR: {recommended_lr}")
```
### Validation
We validated this formula across diverse supervised fine-tuning experiments, varying datasets, dataset sizes, batch_sizes and lora_ranks.

Using our LR estimates resulted in \<0.5% regret compared to exhaustive hyperparameter sweeps, where the regret of using a learning rate $lr'$ is defined as:
$$regret(lr') = \frac{loss(lr') - \min_{lr} loss(lr)}{\min_{lr} loss(lr)}$$

## Batch size

Batch size is the second-most important hyperparameter; it significantly affects both training efficiency and final performance.

For small batch sizes, there's a phenomenon of *perfect scaling*, where the LR and batch size should be varied together as $LR \propto \sqrt{B}$, and the learning curve only depends on $\frac{LR}{\sqrt{B}}$. See [Shallue et al. (2018)](https://arxiv.org/abs/1811.03600) for an example in the training-from-scratch setting.

When fine-tuning LLMs, we're often in a regime where smaller batch sizes give better performance, at the cost of longer training time; moreover, the $LR \propto \sqrt{B}$ scaling doesn't always hold.
When doing SL fine-tuning, we recommend using smaller batch sizes like 128, depending on your tolerance for longer training time.

For best results, you should aim for at least 100 steps of training, though 1000 or more steps usually give the best results.

⚠️ Note: Our batch size recommendations are based on preliminary findings and ongoing research. We're not confident about them!

---

## File: supervised-learning/sl-basic.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Basic Supervised Learning

This guide walks you through running your first supervised learning experiment using Tinker's built-in training loop.

## Quick start

We've provided an implementation of supervised learning in <CookbookLink path="tinker_cookbook/supervised/train.py">train_cli.py</CookbookLink>. To use this training loop, you'll need to create a `Config` object with the data and parameters.

We've provided a ready-to-run example that fine-tunes Llama-3.1-8B on a small instruction-following dataset in <CookbookLink path="tinker_cookbook/recipes/sl_basic.py">sl_basic.py</CookbookLink>.
You can run it from the command line as follows:

```bash
python -m tinker_cookbook.recipes.sl_basic
```

This script fine-tunes the base (pretrained) model on a small dataset called [NoRobots](https://huggingface.co/datasets/HuggingFaceH4/no_robots), created by Hugging Face.

### What you'll see during training

- Each step you should see a printout of the train and test loss, along with other stats like timing.
- The training script will also print out what the data looks like, with predicted tokens (weight=1) in green and context tokens (weight=0) in yellow.
- The training script will write various logs and checkpoint info to the `log_path` directory, which is set to `/tmp/tinker-examples/sl_basic` in the example script.

### Understanding the output files
Looking at the `log_path` directory, you will find several files of interest:
- `metrics.jsonl`: the training metrics that were also printed to the console. You can load and plot them like this:

```python
import pandas
import matplotlib.pyplot as plt
df = pandas.read_json("/tmp/tinker-examples/sl_basic/metrics.jsonl", lines=True)
plt.plot(df['train_mean_nll'], label='train_loss')
plt.plot(df['test/nll'].dropna(), label='test_loss')
plt.legend()
plt.show()
```
You should see a plot like this:

- `checkpoints.jsonl`: the checkpoints that were saved during training. Recall from [Saving and Loading](/save-load) that there are (currently) two kinds of checkpoints: one that has "/sampler_weights/" in the path (used for sampling), and the other that has "/weights/" in the path (includes full optimizer state, used for resuming training). If you interrupt the training script, then run it again, it will ask you if you want to resume training.
If you choose to do so, it'll load the last (full state) checkpoint from this file.
- `config.json`: the configuration that you used for training.

In the `sl_basic` script, you'll see that there's also some disabled code (under `if 0:`) that shows how to use your own dataset, specified as a JSONL file, provided in the format of <CookbookLink path="example-data/conversations.jsonl">conversations.jsonl</CookbookLink>.

---

## File: supervised-learning/prompt-distillation.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Prompt Distillation

Prompt distillation is a training technique in which a model is optimized to behave as though it had been provided with a long and complex prompt, without requiring access to that prompt during inference.

At a high level, this procedure involves two main steps:
- **Creation of distillation data**: A teacher prompt, which is typically lengthy and highly detailed, provides explicit, step-by-step instructions. A teacher model uses this prompt to generate responses for a set of queries.
- **Training the student model**: A student model is then trained (or fine-tuned) on the distilled dataset, thereby learning to reproduce the essential behaviors and reasoning encoded in the teacher’s instructions.

---

## Overview

Let $f_T$ and $f_S$ denote the teacher and student models, respectively. Given an instruction prompt $P$ and a query $q_i$, the teacher model generates a response $r_i$:

$$
r_i = f_T([P, q_i])
$$

Here, the prompt $P$ and the query $q_i$ are concatenated to form the input to the teacher model $f_T$.
For a dataset of queries $Q = \{q_i \mid 1 \leq i \leq D\}$, we obtain a corresponding set of teacher responses $R = \{r_i \mid 1 \leq i \leq D\}$.

The distillation training dataset is defined as the set of query–response pairs (excluding the original prompt):

$$
T = \{(q_i, r_i) \mid 1 \leq i \leq D\}.
$$

The student model $f_S$ is then trained to minimize the cross-entropy loss:

$$
\ell(f_S(q_i), r_i) = \ell(f_S(q_i), f_T([P, q_i])).
$$

---

## Example

The Tinker Cookbook provides a prompt distillation recipe tailored for a language classification task. The objective is straightforward: given a text query, the model should predict a two-character code corresponding to the language of the input. The set of possible labels is:
```
ar (Arabic), de (German), el (Greek), en (English), es (Spanish), fr (French), hi (Hindi), ru (Russian), tr (Turkish), ur (Urdu), vi (Vietnamese), zh (Chinese - Simplified), ot (Other/Unknown).
```

The recipe in <CookbookLink path="tinker_cookbook/recipes/prompt_distillation/create_data.py">recipes/prompt_distillation/create_data.py</CookbookLink> also includes handling strategies for inputs containing code, numerical content, or multiple languages.

In the example below, the same model (`Qwen/Qwen3-30B-A3B`) is used as both teacher and student, though in general they need not be identical.

---

### Step 1: Generate Training Data

Create prompt distillation data with the teacher model using <CookbookLink path="tinker_cookbook/recipes/prompt_distillation/create_data.py">recipes/prompt_distillation/create_data.py</CookbookLink>:

```bash
python -m tinker_cookbook.recipes.prompt_distillation.create_data \
    output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
```

This command will:
- Use the configured teacher model to generate language classification examples
- Save the distilled dataset to the specified
output file
- Create diverse training examples suitable for student model fine-tuning

### Step 2: Train the Student Model

Fine-tune a student model on the distillation data using <CookbookLink path="tinker_cookbook/recipes/prompt_distillation/train.py">recipes/prompt_distillation/train.py</CookbookLink>:

```bash
python -m tinker_cookbook.recipes.prompt_distillation.train
```

The training script will:
- Load the generated distillation dataset
- Apply optimized training configurations
- Fine-tune the student model for language classification

### Step 3: Test Your Model

Once training is complete, you can test your distilled model by sampling from the trained model to verify its performance on language classification tasks.

## Advanced Configuration

The prompt distillation recipe can be customized for different scenarios:

- **Teacher model selection**: Choose different base models based on your requirements
- **Sampling strategies**: Adjust temperature and other generation parameters
- **Data volume**: Scale the number of generated examples based on your needs
- **Training hyperparameters**: Fine-tune learning rates and other training settings

---

## File: supervised-learning/sweep-case-study.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Sweep case study

In [Supervised Learning Hyperparameters](./sl-hyperparams), we introduced default hyperparameters as a starting point. While defaults are useful, optimal values are often task-specific. A hyperparameter sweep (systematically testing values across a range) is a more reliable way to identify the best settings for your use case.

This guide demonstrates how to sweep over the **learning rate (LR)** to find an optimal value.

## Why sweep the learning rate?

The learning rate is typically the most impactful hyperparameter.
While our default recommendations perform well (usually \<0.5% regret), you can often achieve even better results by sweeping to find the task-specific optimum.

## Setup

We use the simple supervised learning training loop in
<CookbookLink path="tinker_cookbook/recipes/sl_loop.py">sl_loop.py</CookbookLink>, which trains a Llama-3.1-8B model.

To retrieve the model’s default learning rate recommendation:
```python
from tinker_cookbook.hyperparam_utils import get_lr
print(get_lr("meta-llama/Llama-3.1-8B"))
```
This should output
```
0.0002856415043086949 # ≈ 2.8e-4
```
This default value provides a baseline. A common best practice is to sweep one order of magnitude above and below the default. For this case, we sweep over: $LR \in [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]$

## Running the sweep
Launch experiments in parallel, using separate terminal windows for each LR value. For example:
```bash
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sft-lr-sweep/lr-0.003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sft-lr-sweep/lr-0.001
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sft-lr-sweep/lr-0.0003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sft-lr-sweep/lr-0.0001
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sft-lr-sweep/lr-0.00003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sft-lr-sweep/lr-0.00001
```
You can also automate this process by writing a script that spawns multiple tmux windows and launches experiments programmatically.
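
One possible shape for such a launcher is sketched below. It uses plain `subprocess` child processes rather than tmux windows, which is a substitution made to keep the example short. `build_commands` and `launch` are hypothetical helper names, and the sketch assumes the `sl_loop` CLI accepts learning rates in Python's default float formatting (for example `1e-05`).

```python
import subprocess
from typing import List

# Same grid as the manual sweep: one order of magnitude around the default.
LEARNING_RATES = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]


def build_commands(
    learning_rates: List[float], root: str = "/tmp/sft-lr-sweep"
) -> List[List[str]]:
    """Build one sl_loop command line per learning rate."""
    commands = []
    for lr in learning_rates:
        commands.append(
            [
                "python",
                "-m",
                "tinker_cookbook.recipes.sl_loop",
                f"learning_rate={lr}",
                f"log_path={root}/lr-{lr}",
            ]
        )
    return commands


def launch(commands: List[List[str]]) -> List[int]:
    """Start all runs in parallel and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


commands = build_commands(LEARNING_RATES)
```

Calling `launch(commands)` would start all six runs at once. Inspect `commands` first to confirm the arguments look right for your setup.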
This is especially useful for larger sweeps.

## Collecting Results
After the experiments are complete, you can read the `metrics.jsonl` files:
```python
from glob import glob
import pandas
import os
import json

data = []
for fname in sorted(glob(os.path.expanduser("/tmp/sft-lr-sweep/*/metrics.jsonl"))):
    df = pandas.read_json(fname, lines=True)
    # make sure the experiment is completed
    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
        continue
    config_fname = fname.replace("metrics.jsonl", "config.json")
    with open(config_fname, "rb") as f:
        metadata = json.load(f)
    data.append({
        "fname": fname,
        "learning_rate": metadata["learning_rate"],
        "final_loss": df["train_mean_nll"].iloc[-1].item()
    })

print(f"Read metrics for {len(data)} experiments")
```
If all the experiments are completed, the above code should print:
```
Read metrics for 6 experiments
```

## Visualizing the Sweep
Plot the `final_loss` as a function of `learning_rate`:
```python
import matplotlib.pyplot as plt
df = pandas.DataFrame(data)
plt.plot(df["learning_rate"], df["final_loss"], marker='o')
plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--")
plt.ylim(1.65, 1.8)
plt.xscale("log")
plt.xlabel("Learning Rate (log scale)")
plt.ylabel("Final Loss")
plt.title("Final Loss vs Learning Rate")
plt.show()
```
You should see a U-shaped curve, similar to this:

If the full U-curve is not visible in your setting, expand the sweep range by adding more LR values.

## Determining the Optimal LR
The optimal learning rate is the one that minimizes the loss.
The plot above shows that the optimal LR is `3e-4`, which you can also compute by finding the minimum:
```python
optimal_lr = df["learning_rate"][df["final_loss"].idxmin()]
print(f"The optimal LR is {optimal_lr:.2e}")
```
Expected output:
```
The optimal LR is 3.00e-04
```

Note that the optimal LR in our sweep (`3e-4`) is very close to the default LR (`2.8e-4`). However, task-specific sweeps can still provide marginal improvements and greater confidence in your hyperparameter choices.

## Next steps
Now that you've identified the optimal learning rate:
1. Retrain with the optimal LR for your production run
2. Consider sweeping other hyperparameters like batch size, warmup steps, or weight decay
3. Use the optimal LR as a baseline for future experiments on similar tasks

---

## File: supervised-learning/sl-loop.mdx

import { CookbookLink } from '../../components/CookbookLink'

# Supervised Learning Training Loop

We've provided a simple SL training loop in <CookbookLink path="tinker_cookbook/recipes/sl_loop.py">sl_loop.py</CookbookLink>, which avoids using our dataset classes and instead defines the data loading in a more self-contained way. This is for people who like to write their own training loops or learn about how things work under the hood. Our more performant implementation in <CookbookLink path="tinker_cookbook/supervised/train.py">supervised/train.py</CookbookLink> does basically the same thing, but with some performance optimizations, and with some additional features like periodic evals.

---

## File: compatible-apis/openai.mdx

# OpenAI API Compatible Inference (in beta)

OpenAI-compatible inference lets you interact with any model checkpoint in Tinker, using an endpoint compatible with the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/chat).
It’s designed to let you easily “poke at” your model while you're training it.

For inference within your training runs (e.g. RL), we recommend using Tinker’s standard [sampling client](/training-sampling).

Currently, OpenAI-compatible inference is meant for testing and internal use with low internal traffic, rather than large, high-throughput, user-facing deployments. Latency and throughput may vary by model and may change without notice during the beta. If you need higher or more stable throughput, contact the Tinker team in [our Discord](https://discord.gg/KqqEZNX88c) for guidance on larger-scale setups.

## Use Cases

OpenAI-compatible inference is designed for
- **Fast feedback while training**: Start sampling very quickly from any sampler checkpoint obtained during training.
- **Sampling while training continues**: Sample even while the training job is still running on that experiment.
- **Developer & internal workflows**: Intended for testing, evaluation, and internal tools.

We will release production-grade inference soon and will update our users then.

## Using OpenAI compatible inference from an OpenAI client

The new interface exposes an OpenAI-compatible HTTP API. You can use any OpenAI SDK or HTTP client that lets you override the base URL.

1\. Set the base URL of your OpenAI-compatible client to:

```
https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1
```

2\. Use a Tinker sampler weight path as the model name. For example:

```
tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080
```

Any valid Tinker sampler checkpoint path works here. You can keep training and sample from the same checkpoint simultaneously.

3\. Authenticate with your Tinker API key, by passing the same key used for Tinker as the API key to the OpenAI client.

**Note:** We support both `/completions` and `/chat/completions` endpoints.
Chat requests are rendered with the model’s default Hugging Face chat template; if your checkpoint expects a different renderer, render the prompt yourself (see [Rendering](/rendering)) and use `/completions`.

## Code Example

```py
from os import getenv
from openai import OpenAI

BASE_URL = "https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1"
MODEL_PATH = "tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080"

api_key = getenv("TINKER_API_KEY")

client = OpenAI(
    base_url=BASE_URL,
    api_key=api_key,
)

response = client.completions.create(
    model=MODEL_PATH,
    prompt="The capital of France is",
    max_tokens=50,
    temperature=0.7,
    top_p=0.9,
)

print(f"{response.choices[0].text}")
```

Notes:

* `BASE_URL` points to the OpenAI compatible inference endpoint.
* `MODEL_PATH` is a sampler checkpoint path from Tinker (`tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080`).
* The rest of the arguments (`prompt`, `max_tokens`, `temperature`, `top_p`) behave like they do in the OpenAI Completions API.
* You can swap `MODEL_PATH` to any other sampler checkpoint to compare runs quickly in your evals or notebooks.

## Related docs

* [Getting a `TINKER_API_KEY`](/install)

* [Security and Privacy](https://thinkingmachines.ai/legal/terms/)

* [Training and Sampling](/training-sampling)


---

# PART 2: TYPE DEFINITIONS

Total types collected: 30

## Type: AdamParams

```python
class AdamParams(StrictBase):
    learning_rate: float = 0.0001
    """Learning rate for the optimizer"""

    beta1: float = 0.9
    """Coefficient used for computing running averages of gradient"""

    beta2: float = 0.95
    """Coefficient used for computing running averages of gradient square"""

    eps: float = 1e-12
    """Term added to the denominator to improve numerical stability"""
```

## Type: CreateModelResponse

```python
class CreateModelResponse(BaseModel):
    model_id: ModelID

    type: Literal["create_model"] = "create_model"
```

## Type: Datum

```python
class Datum(StrictBase):
    loss_fn_inputs: LossFnInputs
    """Dictionary mapping field names to tensor data"""

    model_input: ModelInput

    @model_validator(mode="before")
    @classmethod
    def convert_tensors(cls, data: Any) -> Any:
        """Convert torch.Tensor and numpy arrays to TensorData in loss_fn_inputs during construction."""
        if isinstance(data, dict) and "loss_fn_inputs" in data:
            loss_fn_inputs = data["loss_fn_inputs"]
            if isinstance(loss_fn_inputs, dict):
                converted_inputs = {}
                for key, value in loss_fn_inputs.items():
                    converted_inputs[key] = cls._maybe_convert_array(key, value)
                data = dict(data)  # Make a copy
                data["loss_fn_inputs"] = converted_inputs
        return data

    @classmethod
    def _maybe_convert_array(cls, key: str, value: Any) -> Any:
        """Convert torch.Tensor, numpy array, or 1-D list to TensorData if needed."""
        if _HAVE_TORCH and isinstance(value, torch.Tensor):
            return TensorData.from_torch(value)
        elif isinstance(value, np.ndarray):
            return TensorData.from_numpy(value)
        elif isinstance(value, list):
            # assume it's 1d and infer the dtype from the key
            return TensorData(data=value, dtype=_key_to_type[key], shape=[len(value)])
        else:
            return value


_key_to_type = {
    "target_tokens": "int64",
    "weights": "float32",
    "advantages": "float32",
    "logprobs": "float32",
    "clip_low_threshold": "float32",
    "clip_high_threshold": "float32",
}
```

## Type: EncodedTextChunk

```python
class EncodedTextChunk(StrictBase):
    tokens: Sequence[int]
    """Array of token IDs"""

    type: Literal["encoded_text"] = "encoded_text"

    @property
    def length(self) -> int:
        return len(self.tokens)
```

## Type: ForwardBackwardInput

```python
class ForwardBackwardInput(StrictBase):
    data: List[Datum]
    """Array of input data for the forward/backward pass"""

    loss_fn: LossFnType
    """Fully qualified function path for the loss function"""

    loss_fn_config: Optional[Dict[str, float]] = None
    """Optional configuration parameters for the loss function (e.g., PPO clip thresholds, DPO beta)"""
```

## Type: ForwardBackwardOutput

```python
class ForwardBackwardOutput(BaseModel):
    loss_fn_output_type: str
    """The type of the ForwardBackward output. Can be one of [...] TODO"""

    loss_fn_outputs: List[LossFnOutput]
    """Dictionary mapping field names to tensor data"""

    metrics: Dict[str, float]
    """Training metrics as key-value pairs"""
```

## Type: GetInfoResponse

```python
class GetInfoResponse(BaseModel):
    type: Optional[Literal["get_info"]] = None

    model_data: ModelData

    model_id: ModelID

    is_lora: Optional[bool] = None

    lora_rank: Optional[int] = None

    model_name: Optional[str] = None

    if PYDANTIC_V2:
        # allow fields with a `model_` prefix
        model_config = ConfigDict(protected_namespaces=tuple())
```

## Type: GetServerCapabilitiesResponse

```python
class GetServerCapabilitiesResponse(BaseModel):
    supported_models: List[SupportedModel]
```

## Type: ImageAssetPointerChunk

```python
class ImageAssetPointerChunk(StrictBase):
    format: Literal["png", "jpeg"]
    """Image format"""

    location: str
    """Path or URL to the image asset"""

    expected_tokens: int | None = None
    """Expected number of tokens this image represents.
    This is only advisory: the tinker backend will compute the number of tokens
    from the image, and we can fail requests quickly if the tokens does not
    match expected_tokens."""

    type: Literal["image_asset_pointer"] = "image_asset_pointer"

    @property
    def length(self) -> int:
        if self.expected_tokens is None:
            raise ValueError("ImageAssetPointerChunk expected_tokens needs to be set in order to compute the length")
        return self.expected_tokens
```

## Type: ImageChunk

```python
class ImageChunk(StrictBase):
    data: bytes
    """Image data as bytes"""

    format: Literal["png", "jpeg"]
    """Image format"""

    expected_tokens: int | None = None
    """Expected number of tokens this image represents.
    This is only advisory: the tinker backend will compute the number of tokens
    from the image, and we can fail requests quickly if the tokens does not
    match expected_tokens."""

    type: Literal["image"] = "image"

    @field_validator("data", mode="before")
    @classmethod
    def validate_data(cls, value: Union[bytes, str]) -> bytes:
        """Deserialize base64 string to bytes if needed."""
        if isinstance(value, str):
            return base64.b64decode(value)
        return value

    @field_serializer("data")
    def serialize_data(self, value: bytes) -> str:
        """Serialize bytes to base64 string for JSON."""
        return base64.b64encode(value).decode("utf-8")

    @property
    def length(self) -> int:
        if self.expected_tokens is None:
            raise ValueError("ImageChunk expected_tokens needs to be set in order to compute the length")
        return self.expected_tokens
```

## Type: LoadWeightsResponse

```python
class LoadWeightsResponse(BaseModel):
    path: Optional[str] = None
    """A tinker URI for model weights at a specific step"""

    type: Optional[Literal["load_weights"]] = None
```

## Type: LoraConfig

```python
class LoraConfig(StrictBase):
    rank: int
    """LoRA rank (dimension of low-rank matrices)"""

    seed: Optional[int] = None
    """Seed used for initialization of LoRA weights.

    Useful if you need deterministic or reproducible initialization of weights.
    """

    train_unembed: bool = True
    """Whether to add lora to the unembedding layer"""

    train_mlp: bool = True
    """Whether to add loras to the MLP layers (including MoE layers)"""

    train_attn: bool = True
    """Whether to add loras to the attention layers"""
```

## Type: LossFnInputs

```python
LossFnInputs: TypeAlias = Dict[str, TensorData]
```

## Type: LossFnOutput

```python
LossFnOutput: TypeAlias = Dict[str, TensorData]
```

## Type: LossFnType

```python
LossFnType: TypeAlias = Literal["cross_entropy", "importance_sampling", "ppo", "cispo", "dro"]
```

## Type: ModelData

```python
class ModelData(BaseModel):
    arch: Optional[str] = None

    model_name: Optional[str] = None

    tokenizer_id: Optional[str] = None
```

## Type: ModelID

```python
ModelID: TypeAlias = str
```

## Type: ModelInput

```python
class ModelInput(StrictBase):
    chunks: List[ModelInputChunk]
    """Sequence of input chunks (formerly TokenSequence)"""

    @classmethod
    def from_ints(cls, tokens: List[int]) -> "ModelInput":
        """
        Create a ModelInput from a list of ints (tokens).
        """
        return cls(chunks=[EncodedTextChunk(tokens=tokens)])

    def to_ints(self) -> List[int]:
        """
        Convert the ModelInput to a list of ints (tokens)
        Throws exception if there are any non-token chunks
        """
        if not all(isinstance(chunk, EncodedTextChunk) for chunk in self.chunks):
            raise ValueError(f"to_ints only supported for ModelInput with EncodedTextChunks, got {[type(chunk) for chunk in self.chunks]}")
        return [token for chunk in self.chunks for token in chunk.tokens]

    @property
    def length(self) -> int:
        """
        Return the total context length used by this ModelInput.
        """
        return sum(chunk.length for chunk in self.chunks)

    @classmethod
    def empty(cls) -> "ModelInput":
        """
        Create an empty ModelInput.
        """
        return cls(chunks=[])

    def append(self, chunk: ModelInputChunk) -> "ModelInput":
        """
        Add a new chunk, return a new ModelInput.
        """
        return ModelInput(chunks=self.chunks + [chunk])

    def append_int(self, token: int) -> "ModelInput":
        """
        Add a new token, return a new ModelInput.
        """
        return self.append(EncodedTextChunk(tokens=[token]))
```

## Type: ModelInputChunk

```python
ModelInputChunk: TypeAlias = Annotated[
    Union[EncodedTextChunk, ImageAssetPointerChunk, ImageChunk], PropertyInfo(discriminator="type")
]
```

## Type: OptimStepResponse

```python
class OptimStepResponse(BaseModel):
    metrics: Optional[Dict[str, float]] = None
    """Optimization step metrics as key-value pairs"""
```

## Type: SampleResponse

```python
class SampleResponse(BaseModel):
    sequences: Sequence[SampledSequence]

    type: Literal["sample"] = "sample"

    prompt_logprobs: Optional[List[Optional[float]]] = None
    """
    If prompt_logprobs was set to true in the request, logprobs are computed for
    every token in the prompt. The `prompt_logprobs` response contains a float32
    value for every token in the prompt.
    """

    topk_prompt_logprobs: Optional[list[Optional[list[tuple[int, float]]]]] = None
    """
    If topk_prompt_logprobs was set to a positive integer k in the request,
    the top-k logprobs are computed for every token in the prompt. The
    `topk_prompt_logprobs` response contains, for every token in the prompt,
    a list of up to k (token_id, logprob) tuples.
    """
```

## Type: SampledSequence

```python
class SampledSequence(BaseModel):
    stop_reason: StopReason
    """Reason why sampling stopped"""

    tokens: List[int]
    """List of generated token IDs"""

    logprobs: Optional[List[float]] = None
    """Log probabilities for each token (optional)"""
```

## Type: SamplingParams

```python
class SamplingParams(BaseModel):
    max_tokens: Optional[int] = None
    """Maximum number of tokens to generate"""

    seed: Optional[int] = None
    """Random seed for reproducible generation"""

    stop: Union[str, Sequence[str], Sequence[int], None] = None
    """Stop sequences for generation"""

    temperature: float = 1
    """Sampling temperature"""

    top_k: int = -1
    """Top-k sampling parameter (-1 for no limit)"""

    top_p: float = 1
    """Nucleus sampling probability"""
```

## Type: SaveWeightsForSamplerResponse

```python
class SaveWeightsForSamplerResponse(BaseModel):
    path: str
    """A tinker URI for model weights for sampling at a specific step"""

    type: Optional[Literal["save_weights_for_sampler"]] = None
```

## Type: SaveWeightsResponse

```python
class SaveWeightsResponse(BaseModel):
    path: str
    """A tinker URI for model weights at a specific step"""

    type: Optional[Literal["save_weights"]] = None
```

## Type: StopReason

```python
StopReason: TypeAlias = Literal["length", "stop"]
```

## Type: SupportedModel

```python
class SupportedModel(BaseModel):
    model_name: Optional[str] = None
```

## Type: TensorData

```python
class TensorData(StrictBase):
    data: Union[List[int], List[float]]
    """Flattened tensor data as array of numbers."""

    dtype: TensorDtype

    shape: Optional[List[int]] = None
    """Optional.

    The shape of the tensor (see PyTorch tensor.shape). The shape of a
    one-dimensional list of length N is `(N,)`. Can usually be inferred if not
    provided, and is generally inferred as a 1D tensor.
    """

    @classmethod
    def from_numpy(cls, array: npt.NDArray[Any]) -> "TensorData":
        return cls(
            data=array.flatten().tolist(),
            dtype=_convert_numpy_dtype_to_tensor(array.dtype),
            shape=list(array.shape),
        )

    @classmethod
    def from_torch(cls, tensor: "torch.Tensor") -> "TensorData":
        return cls(
            data=tensor.flatten().tolist(),
            dtype=_convert_torch_dtype_to_tensor(tensor.dtype),
            shape=list(tensor.shape),
        )

    def to_numpy(self) -> npt.NDArray[Any]:
        """Convert TensorData to numpy array."""
        numpy_dtype = _convert_tensor_dtype_to_numpy(self.dtype)
        arr = np.array(self.data, dtype=numpy_dtype)
        if self.shape is not None:
            arr = arr.reshape(self.shape)
        return arr

    def to_torch(self) -> "torch.Tensor":
        """Convert TensorData to torch tensor."""
        if not _HAVE_TORCH:
            raise ImportError("PyTorch is not installed. Cannot convert to torch tensor.")

        torch_dtype = _convert_tensor_dtype_to_torch(self.dtype)
        tensor = torch.tensor(self.data, dtype=torch_dtype)
        if self.shape is not None:
            tensor = tensor.reshape(self.shape)
        return tensor

    def tolist(self) -> List[Any]:
        return self.to_numpy().tolist()


def _convert_tensor_dtype_to_numpy(dtype: TensorDtype) -> npt.DTypeLike:
    """Convert TensorDtype to numpy dtype-like."""
    if dtype == "float32":
        return np.float32
    elif dtype == "int64":
        return np.int64
    else:
        raise ValueError(f"Unsupported TensorDtype: {dtype}")


def _convert_tensor_dtype_to_torch(dtype: TensorDtype) -> "torch.dtype":
    """Convert TensorDtype to torch dtype."""
    if not _HAVE_TORCH:
        raise ImportError("PyTorch is not installed. Cannot convert to torch dtype.")
    import torch

    if dtype == "float32":
        return torch.float32
    elif dtype == "int64":
        return torch.int64
    else:
        raise ValueError(f"Unsupported TensorDtype: {dtype}")


def _convert_numpy_dtype_to_tensor(dtype: np.dtype[Any]) -> TensorDtype:
    """Convert numpy dtype to TensorDtype."""
    if dtype.kind == "f":
        return "float32"
    elif dtype.kind == "i":
        return "int64"
    else:
        raise ValueError(f"Unsupported numpy dtype: {dtype}")


def _convert_torch_dtype_to_tensor(dtype: "torch.dtype") -> TensorDtype:
    """Convert torch dtype to TensorDtype."""
    # torch.dtype objects have .is_floating_point
    if getattr(dtype, "is_floating_point", False):
        return "float32"
    else:
        return "int64"
```

## Type: TensorDtype

```python
TensorDtype: TypeAlias = Literal["int64", "float32"]
```

## Type: UnloadModelResponse

```python
class UnloadModelResponse(BaseModel):
    model_id: ModelID

    type: Optional[Literal["unload_model"]] = None
```
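
The flat `data` plus `shape` layout described under `TensorData`, and the key-based dtype inference for 1-D lists described under `Datum`, can be illustrated without the SDK. The following is a minimal standard-library sketch under stated assumptions, not the Tinker implementation: `SketchTensor`, `from_list`, and `unflatten` are hypothetical names introduced here, and `KEY_TO_DTYPE` copies only two entries of the `_key_to_type` table listed above.

```python
from dataclasses import dataclass
from math import prod
from typing import Any, List, Optional

# Two entries copied from the `_key_to_type` table in the Datum listing.
KEY_TO_DTYPE = {"target_tokens": "int64", "weights": "float32"}


@dataclass(frozen=True)
class SketchTensor:
    """Hypothetical stand-in for TensorData: flat data, a dtype name, a shape."""

    data: List[Any]
    dtype: str
    shape: Optional[List[int]] = None


def from_list(key: str, values: List[Any]) -> SketchTensor:
    """Treat a plain list as 1-D and pick the dtype from the field name."""
    return SketchTensor(data=list(values), dtype=KEY_TO_DTYPE[key], shape=[len(values)])


def unflatten(data: List[Any], shape: List[int]) -> Any:
    """Rebuild nested lists from flat row-major data and a shape."""
    if prod(shape) != len(data):
        raise ValueError(f"shape {shape} does not match {len(data)} elements")
    if not shape:
        return data[0]  # zero-dimensional: a single scalar
    if len(shape) == 1:
        return list(data)
    if shape[0] == 0:
        return []
    step = len(data) // shape[0]  # number of elements in each leading-axis slice
    return [
        unflatten(data[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


weights = from_list("weights", [0.0, 1.0, 1.0])
print(weights.dtype, weights.shape)
print(unflatten([1, 2, 3, 4, 5, 6], [2, 3]))
```

In this sketch an unknown key raises `KeyError`, which matches what a plain dictionary lookup would do. Passing a NumPy array or torch tensor instead of a list avoids key-based inference entirely in the listings above, since those paths read the dtype from the array itself.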