Implementation:Ggml org Llama cpp Diffusion CLI

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Diffusion, Text_Generation
Last Updated	2026-02-15 00:00 GMT

Overview

CLI tool for text generation using Diffusion Language Models (DLLMs) with iterative denoising, supporting multiple diffusion algorithms and scheduling methods.

Description

This program implements diffusion-based text generation with five algorithms (CONFIDENCE_BASED, ENTROPY_BASED, MARGIN_BASED, RANDOM, ORIGIN) and two scheduling methods (TIMESTEP_BASED for Dream-style, BLOCK_BASED for LLaDA-style). It starts with masked tokens and iteratively unmasks them over configurable diffusion steps. The diffusion_params struct controls parameters including steps, temperature, top-k/top-p sampling, Gumbel noise injection, and classifier-free guidance. A calculate_confidence function computes per-token confidence scores to determine unmasking order.
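The unmasking loop described above can be sketched in isolation. The snippet below is a minimal illustration under stated assumptions, not the code in diffusion-cli.cpp: transfer_counts and unmask_top_n are hypothetical helpers, the even split of masked positions across steps follows the LLaDA-style convention and is assumed here, and the candidate tokens and confidence values are supplied by the caller instead of a model.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split n_masked positions across `steps` steps as evenly
// as possible. The first (n_masked % steps) steps unmask one extra token.
std::vector<int32_t> transfer_counts(int32_t n_masked, int32_t steps) {
    std::vector<int32_t> counts;
    if (steps <= 0 || n_masked < 0) {
        return counts;
    }
    const int32_t base = n_masked / steps;
    const int32_t rem  = n_masked % steps;
    counts.reserve(static_cast<size_t>(steps));
    for (int32_t s = 0; s < steps; ++s) {
        counts.push_back(base + (s < rem ? 1 : 0));
    }
    return counts;
}

// Hypothetical helper: among positions that still hold mask_id, commit the
// candidate token at the n positions with the highest confidence.
// Returns the number of positions that were unmasked.
int32_t unmask_top_n(std::vector<int32_t> & tokens,
                     const std::vector<int32_t> & candidates,
                     const std::vector<float> & confidence,
                     int32_t mask_id, int32_t n) {
    // Collect the indices that are still masked.
    std::vector<size_t> masked;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == mask_id) {
            masked.push_back(i);
        }
    }
    const size_t take = std::min(masked.size(), static_cast<size_t>(std::max(n, 0)));
    // Move the `take` most confident masked positions to the front.
    std::partial_sort(masked.begin(), masked.begin() + take, masked.end(),
                      [&](size_t a, size_t b) { return confidence[a] > confidence[b]; });
    for (size_t k = 0; k < take; ++k) {
        tokens[masked[k]] = candidates[masked[k]];
    }
    return static_cast<int32_t>(take);
}
```

With these helpers, transfer_counts(10, 4) yields 3, 3, 2, 2, and a driver loop would call unmask_top_n once per step with that step's count after recomputing candidates and confidences from the model.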

Usage

Use this CLI tool to run non-autoregressive diffusion models (Dream, LLaDA, RND1) with llama.cpp as an alternative to conventional left-to-right token generation. It can optionally visualize the generation process live.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/diffusion/diffusion-cli.cpp
Lines: 1-694

Signature

enum diffusion_algorithm {
    ORIGIN = 0, ENTROPY_BASED = 1, MARGIN_BASED = 2,
    RANDOM = 3, CONFIDENCE_BASED = 4
};

enum transfer_schedule {
    TIMESTEP_BASED = 0,  // Dream-style
    BLOCK_BASED    = 1,  // LLaDA-style
};

typedef bool (*diffusion_step_callback_t)(
    int32_t step, int32_t total_steps,
    const llama_token * tokens, int32_t n_tokens, void * user_data);

struct diffusion_params {
    int32_t steps;
    float temperature;
    llama_token mask_token_id;
    diffusion_algorithm algorithm;
    transfer_schedule schedule;
    float top_p;
    int32_t top_k;
    float cfg_scale;
    bool add_gumbel_noise;
    int32_t max_length;
};

static float calculate_confidence(
    const llama_token_data_array & cur_p,
    diffusion_algorithm algorithm,
    std::mt19937 & rng);
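
To make the role of calculate_confidence concrete, here is a simplified sketch of how each algorithm could map a token distribution to a score. It is an assumption-laden stand-in, not the function from diffusion-cli.cpp: confidence_sketch is a hypothetical name, it takes a plain std::vector<float> of probabilities sorted in descending order instead of llama_token_data_array, it uses negative entropy so that a higher value always means more confident, and it treats ORIGIN like CONFIDENCE_BASED.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

enum diffusion_algorithm {
    ORIGIN = 0, ENTROPY_BASED = 1, MARGIN_BASED = 2,
    RANDOM = 3, CONFIDENCE_BASED = 4
};

// Sketch only. `probs` is assumed to be sorted in descending order and to sum
// to 1; `selected` is the index of the sampled token within `probs`.
float confidence_sketch(const std::vector<float> & probs, size_t selected,
                        diffusion_algorithm algorithm, std::mt19937 & rng) {
    if (probs.empty() || selected >= probs.size()) {
        return 0.0f;
    }
    switch (algorithm) {
        case CONFIDENCE_BASED:
        case ORIGIN:  // assumption: scored like CONFIDENCE_BASED in this sketch
            return probs[selected];  // probability of the sampled token
        case ENTROPY_BASED: {
            // Negative Shannon entropy: always <= 0, and closer to 0 for a
            // peaked (confident) distribution.
            float neg_entropy = 0.0f;
            for (float p : probs) {
                neg_entropy += p * std::log(p + 1e-10f);
            }
            return neg_entropy;
        }
        case MARGIN_BASED:
            // Gap between the two most likely tokens.
            return probs.size() > 1 ? probs[0] - probs[1] : probs[0];
        case RANDOM: {
            // Random score, so the unmasking order ignores the distribution.
            std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
            return uniform(rng);
        }
    }
    return 0.0f;
}
```

Whatever the exact formulas in the real implementation, the design point is the same: every algorithm reduces a per-position distribution to one scalar, so the positions to unmask at each step can be chosen by sorting on that scalar.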

Import

#include "arg.h"
#include "chat.h"
#include "common.h"
#include "llama.h"
#include "log.h"
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

I/O Contract

Inputs

Name	Type	Required	Description
model	GGUF model file	Yes	Diffusion language model (Dream, LLaDA, or RND1 architecture)
prompt	std::string	Yes	Input text prompt to condition the diffusion process
steps	int32_t	No	Number of diffusion denoising steps (default: model-dependent)
temperature	float	No	Sampling temperature for token selection
algorithm	diffusion_algorithm	No	Unmasking strategy (default: CONFIDENCE_BASED)
schedule	transfer_schedule	No	Scheduling method (default: TIMESTEP_BASED)
top_p / top_k	float / int32_t	No	Nucleus and top-k sampling parameters
visual_mode	bool	No	Enable live visualization of the generation process

Outputs

Name	Type	Description
generated_text	std::string	Text generated through iterative diffusion denoising
step_callback	callback	Optional per-step callback for monitoring generation progress
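
As an illustration of the step_callback output, the following sketch shows a callback that matches the diffusion_step_callback_t signature and tracks how many positions are still masked. The progress_state struct and count_masked_callback are hypothetical names, llama_token is declared locally as a stand-in for the typedef in llama.h, and the convention that returning true continues generation is an assumption.

```cpp
#include <cstdint>

using llama_token = int32_t;  // stand-in for the typedef provided by llama.h

typedef bool (*diffusion_step_callback_t)(
    int32_t step, int32_t total_steps,
    const llama_token * tokens, int32_t n_tokens, void * user_data);

// Hypothetical user data passed through the void * parameter.
struct progress_state {
    llama_token mask_token_id;     // token id that marks a masked position
    int32_t     last_step;         // most recent step seen by the callback
    int32_t     masked_remaining;  // masked positions left after that step
};

// Counts the masked positions after each step and records the result.
// Assumption: returning true asks the generator to continue, false to stop.
static bool count_masked_callback(int32_t step, int32_t total_steps,
                                  const llama_token * tokens, int32_t n_tokens,
                                  void * user_data) {
    (void) total_steps;  // unused in this sketch
    auto * state = static_cast<progress_state *>(user_data);
    if (state == nullptr || tokens == nullptr) {
        return false;
    }
    int32_t masked = 0;
    for (int32_t i = 0; i < n_tokens; ++i) {
        if (tokens[i] == state->mask_token_id) {
            ++masked;
        }
    }
    state->last_step        = step;
    state->masked_remaining = masked;
    return true;
}
```

A caller would fill a progress_state with the model's mask token id and pass its address as user_data alongside the callback.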

Usage Examples

// Command-line usage (Dream-style, timestep-based schedule).
// --diffusion-algorithm takes the numeric enum value; 4 = CONFIDENCE_BASED:
// ./llama-diffusion-cli -m dream-model.gguf -p "Once upon a time" \
//     --diffusion-steps 64 --temp 0.7 --diffusion-algorithm 4

// With visual mode to see token unmasking in real time
// (LLaDA-style, block-based schedule):
// ./llama-diffusion-cli -m llada-model.gguf -p "The quick brown fox" \
//     --diffusion-visual --diffusion-block-length 32

Related Pages

Principle:Ggml_org_Llama_cpp_DiffusionGeneration
