Implementation:Ggml org Llama cpp Diffusion CLI
| Knowledge Sources | |
|---|---|
| Domains | Diffusion, Text_Generation |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
CLI tool for text generation using Diffusion Language Models (DLLMs) with iterative denoising, supporting multiple diffusion algorithms and scheduling methods.
Description
This program implements diffusion-based text generation with five unmasking algorithms (CONFIDENCE_BASED, ENTROPY_BASED, MARGIN_BASED, RANDOM, ORIGIN) and two transfer schedules (TIMESTEP_BASED for Dream-style models, BLOCK_BASED for LLaDA-style models). Generation starts from masked tokens and unmasks them iteratively over a configurable number of diffusion steps. The diffusion_params struct holds the run settings: step count, temperature, top-k/top-p sampling, Gumbel noise injection, and the classifier-free guidance scale. The calculate_confidence function computes a per-token confidence score, and these scores determine the order in which positions are unmasked.
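To make the role of the confidence score concrete, here is a minimal sketch of how each algorithm could turn one position's candidate probabilities into a score. It is not the real calculate_confidence: the helper name confidence_sketch, the plain std::vector input (assumed sorted in descending order), the negative-entropy convention, and the treatment of ORIGIN as equivalent to CONFIDENCE_BASED are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Mirrors the enum from the Signature section.
enum diffusion_algorithm {
    ORIGIN = 0, ENTROPY_BASED = 1, MARGIN_BASED = 2,
    RANDOM = 3, CONFIDENCE_BASED = 4
};

// Hypothetical helper (not the real calculate_confidence): scores one position
// from its candidate probabilities, which are assumed to be sorted in
// descending order. `selected` is the index of the sampled candidate.
static float confidence_sketch(const std::vector<float> & probs,
                               std::size_t selected,
                               diffusion_algorithm algorithm,
                               std::mt19937 & rng) {
    assert(!probs.empty() && selected < probs.size());
    switch (algorithm) {
        case CONFIDENCE_BASED:
        case ORIGIN:
            // Assumed: the probability of the sampled token is the score.
            return probs[selected];
        case ENTROPY_BASED: {
            // Negative Shannon entropy, so a peaked distribution scores higher.
            float sum = 0.0f;
            for (float p : probs) {
                sum += p * std::log(p + 1e-10f);
            }
            return sum;
        }
        case MARGIN_BASED:
            // Gap between the two most likely candidates.
            return probs.size() > 1 ? probs[0] - probs[1] : probs[0];
        case RANDOM: {
            // A uniform random score gives a random unmasking order.
            std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
            return uniform(rng);
        }
    }
    return 0.0f;
}
```

Under this convention, positions with higher scores are unmasked first in every mode, so the choice of algorithm changes only what counts as "confident".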
Usage
Use this CLI tool to run non-autoregressive diffusion model architectures (Dream, LLaDA, RND1) with llama.cpp, expanding beyond traditional left-to-right token generation. It supports optional live visualization of the generation process.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/diffusion/diffusion-cli.cpp
- Lines: 1-694
Signature
enum diffusion_algorithm {
    ORIGIN = 0, ENTROPY_BASED = 1, MARGIN_BASED = 2,
    RANDOM = 3, CONFIDENCE_BASED = 4
};

enum transfer_schedule {
    TIMESTEP_BASED = 0, // Dream-style
    BLOCK_BASED    = 1, // LLaDA-style
};

typedef bool (*diffusion_step_callback_t)(
    int32_t step, int32_t total_steps,
    const llama_token * tokens, int32_t n_tokens, void * user_data);

struct diffusion_params {
    int32_t             steps;
    float               temperature;
    llama_token         mask_token_id;
    diffusion_algorithm algorithm;
    transfer_schedule   schedule;
    float               top_p;
    int32_t             top_k;
    float               cfg_scale;
    bool                add_gumbel_noise;
    int32_t             max_length;
};

static float calculate_confidence(
    const llama_token_data_array & cur_p,
    diffusion_algorithm algorithm,
    std::mt19937 & rng);
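The diffusion_step_callback_t typedef above describes a hook that receives the current token buffer once per step. As a sketch of how such a hook might be written, the example below counts the positions that still hold the mask token. The names progress_state and report_progress are hypothetical, llama_token is redeclared locally only so the snippet compiles without llama.h, and the meaning of the return value (true to continue) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Stand-in for the typedef in llama.h so the sketch compiles on its own.
typedef int32_t llama_token;

// Same shape as the typedef in the Signature section.
typedef bool (*diffusion_step_callback_t)(
    int32_t step, int32_t total_steps,
    const llama_token * tokens, int32_t n_tokens, void * user_data);

// Hypothetical user data: the mask token id plus the latest masked count.
struct progress_state {
    llama_token mask_token_id;
    int32_t     masked_remaining;
};

// Counts positions that still hold the mask token and prints a progress line.
// Returning true is assumed to mean "keep generating".
static bool report_progress(int32_t step, int32_t total_steps,
                            const llama_token * tokens, int32_t n_tokens,
                            void * user_data) {
    assert(user_data != nullptr);
    progress_state * state = static_cast<progress_state *>(user_data);
    int32_t masked = 0;
    for (int32_t i = 0; i < n_tokens; i++) {
        if (tokens[i] == state->mask_token_id) {
            masked++;
        }
    }
    state->masked_remaining = masked;
    std::fprintf(stderr, "step %d/%d: %d masked tokens left\n",
                 (int) (step + 1), (int) total_steps, (int) masked);
    return true;
}
```

Passing state through the void * user_data pointer keeps the callback a plain function pointer, which is why the sketch uses a small struct rather than a capturing lambda.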
Import
#include "arg.h"
#include "chat.h"
#include "common.h"
#include "llama.h"
#include "log.h"
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | GGUF model file | Yes | Diffusion language model (Dream, LLaDA, or RND1 architecture) |
| prompt | std::string | Yes | Input text prompt to condition the diffusion process |
| steps | int32_t | No | Number of diffusion denoising steps (default: model-dependent) |
| temperature | float | No | Sampling temperature for token selection |
| algorithm | diffusion_algorithm | No | Unmasking strategy (default: CONFIDENCE_BASED) |
| schedule | transfer_schedule | No | Scheduling method (default: TIMESTEP_BASED) |
| top_p / top_k | float / int32_t | No | Nucleus and top-k sampling parameters |
| visual_mode | bool | No | Enable live visualization of the generation process |
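The schedule input decides how many masked positions are revealed at each step. The sketch below shows one plausible way to compute those counts for each style, following the published Dream and LLaDA reference samplers as commonly described. The function names, the eps parameter, and the exact rounding are assumptions, and the code in diffusion-cli.cpp may differ (for example, a block-based run would apply the even split within each block rather than over the whole sequence).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Block-style (LLaDA-like) split: spread `n_masked` positions as evenly as
// possible over `steps`, giving the remainder to the earliest steps.
static std::vector<int32_t> even_transfer_counts(int32_t n_masked, int32_t steps) {
    assert(n_masked >= 0 && steps > 0);
    const int32_t base      = n_masked / steps;
    const int32_t remainder = n_masked % steps;
    std::vector<int32_t> counts(steps);
    for (int32_t i = 0; i < steps; i++) {
        counts[i] = base + (i < remainder ? 1 : 0);
    }
    return counts;
}

// Timestep-style (Dream-like) split: time runs linearly from 1 down to `eps`.
// Step i reveals the fraction (1 - s/t) of what is still masked, where t and s
// are the current and next timestep. The last step reveals whatever is left.
static std::vector<int32_t> timestep_transfer_counts(int32_t n_masked,
                                                     int32_t steps,
                                                     float   eps) {
    assert(n_masked >= 0 && steps > 0 && eps > 0.0f && eps < 1.0f);
    std::vector<int32_t> counts(steps);
    int32_t remaining = n_masked;
    for (int32_t i = 0; i < steps; i++) {
        const float t = 1.0f - (float) i / steps * (1.0f - eps);
        const float s = 1.0f - (float) (i + 1) / steps * (1.0f - eps);
        const int32_t n = (i == steps - 1)
            ? remaining
            : (int32_t) (remaining * (1.0f - s / t));
        counts[i] = n;
        remaining -= n;
    }
    return counts;
}
```

Both helpers return one count per step, and in both cases the counts sum to the number of initially masked positions, so every position is unmasked by the final step.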
Outputs
| Name | Type | Description |
|---|---|---|
| generated_text | std::string | Text generated through iterative diffusion denoising |
| step_callback | callback | Optional per-step callback for monitoring generation progress |
Usage Examples
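// Command-line usage (algorithm 4 = CONFIDENCE_BASED, per the enum above).
// Flag names follow the llama.cpp diffusion example; confirm with --help for your build.
// ./llama-diffusion-cli -m dream-model.gguf -p "Once upon a time" \
//     --diffusion-steps 64 --temp 0.7 --diffusion-algorithm 4
// With visual mode to see token unmasking in real time (LLaDA-style block scheduling):
// ./llama-diffusion-cli -m llada-model.gguf -p "The quick brown fox" \
//     --diffusion-visual --diffusion-block-length 32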