Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Dotnet Machinelearning GenAI Causal LM Inference

From Leeroopedia


Knowledge Sources
Domains Generative_AI, LLMs, Text_Generation
Last Updated 2026-02-09 12:00 GMT

Overview

End-to-end process for loading and running causal language models (Phi, LLaMA, Mistral) locally in .NET for text generation using ML.NET's GenAI packages with TorchSharp backend.

Description

This workflow outlines the procedure for performing local large language model (LLM) inference within a .NET application using ML.NET's GenAI packages. It covers downloading pre-trained model weights from Hugging Face, loading the model architecture and tokenizer, constructing a CausalLMPipeline for text generation, and integrating the pipeline with higher-level frameworks like Semantic Kernel (as an IChatCompletionService) or AutoGen.Net (as an agent). The supported model families are Microsoft Phi (Phi-2, Phi-3 Mini/Medium), Meta LLaMA (3.1/3.2 from 1B to 405B), and Mistral (7B). All models run via TorchSharp with support for CPU and CUDA GPU execution, including dynamic layer offloading for memory-constrained environments.

Usage

Execute this workflow when you need to run an LLM locally within a .NET application for text generation, chat, code generation, or content creation without depending on external cloud API services. This is suitable for privacy-sensitive applications, offline scenarios, local development and testing, benchmarking against cloud models, or building multi-agent systems with AutoGen.Net.

Execution Steps

Step 1: Download Model Weights

Obtain pre-trained model weights from Hugging Face for the desired model (e.g., Phi-3-mini-4k-instruct, LLaMA-3.1-8B-Instruct, Mistral-7B-Instruct). The download includes the model configuration JSON, SafeTensors weight files, and tokenizer files.

Key considerations:

  • Use git-lfs to clone the full model repository from Hugging Face
  • Model sizes range from approximately 2GB (Phi-3-mini) to hundreds of GB (LLaMA-405B)
  • The config.json file defines the model architecture parameters (hidden size, number of layers, attention heads)
  • SafeTensors format is used for weight storage with memory-mapped loading support
  • Ensure sufficient disk space for the model weights

Step 2: Load Model Configuration and Architecture

Read the model configuration from config.json and instantiate the appropriate model class (Phi3ForCausalLM, LLaMA3CausalLM, or MistralForCausalLM). The configuration specifies architecture parameters like hidden dimensions, number of transformer layers, vocabulary size, and attention head counts.

Key considerations:

  • Each model family has its own configuration class (Phi3Config, LLaMAConfig, MistralConfig)
  • The model architecture is built in-memory without weights initially
  • Configuration files for supported models are embedded as resources in the GenAI packages
  • Verify that the config matches the downloaded weights version

Step 3: Load Tokenizer

Initialize the tokenizer that converts between text strings and token IDs. Each model family uses a specific tokenizer implementation (LLama2Tokenizer for Phi-3 and LLaMA, SentencePiece-based). Load the tokenizer model file (tokenizer.model) from the downloaded weights folder.

Key considerations:

  • The tokenizer must match the model; using a mismatched tokenizer produces incorrect results
  • Tokenizers handle special tokens (BOS, EOS, user/assistant markers) specific to each model
  • Chat templates differ between models (Phi-3 uses special tags like user and end markers)
  • The tokenizer determines the vocabulary size which must match the model embedding layer

Step 4: Load Model Weights and Configure Device

Load the SafeTensors weight files into the model architecture and configure the execution device (CPU or CUDA GPU). For GPU execution, initialize the CUDA device. For memory-constrained environments, use dynamic loading to distribute layers between GPU and CPU memory based on available resources.

Key considerations:

  • LoadSafeTensors maps weight files into the model layers
  • GPU execution requires initializing DeviceType.CUDA via torch.InitializeDeviceType
  • Phi-3-mini-4k-instruct requires approximately 8GB GPU memory when fully loaded
  • Dynamic loading (ToDynamicLoadingModel) enables inference when GPU memory is insufficient by swapping layers between CPU and GPU
  • InferDeviceMapForEachLayer computes the optimal layer-to-device assignment based on available memory

Step 5: Create CausalLMPipeline

Combine the tokenizer and model into a CausalLMPipeline, which provides a unified interface for text generation with configurable decoding strategies (sampling with temperature and top-p, greedy search, beam search).

Key considerations:

  • CausalLMPipeline is generic over tokenizer and model types for type safety
  • The pipeline handles tokenization, forward pass, and decoding in a single Generate() call
  • Temperature controls randomness (lower = more deterministic, higher = more creative)
  • Top-p (nucleus sampling) limits token selection to the smallest set whose cumulative probability exceeds p
  • Stop token sequences can be specified to control generation termination

Step 6: Generate Text or Integrate with Frameworks

Use the pipeline directly for text generation, or integrate it with higher-level AI frameworks. Three integration paths are available: direct generation via Generate(), Semantic Kernel integration as IChatCompletionService for chat-based applications, and AutoGen.Net integration as an agent for multi-agent orchestration. The pipeline can also be exposed as an OpenAI-compatible REST endpoint for benchmarking.

Key considerations:

  • Direct Generate() accepts a prompt string and returns generated text
  • Semantic Kernel integration uses AddGenAIChatCompletion to register the pipeline as a chat service
  • AutoGen.Net integration wraps the pipeline in a Phi3Agent (or similar) for agent-based interactions
  • OpenAI-compatible endpoint enables evaluation with Python benchmarking frameworks
  • Chat templates must be applied correctly for instruction-following models (handled automatically by framework integrations)

Execution Diagram

GitHub URL

Workflow Repository