Implementation:Tencent Ncnn YOLO World Example

Knowledge Sources	Tencent_Ncnn
Domains	Vision, Open_Vocabulary_Detection
Last Updated	2026-02-09 19:00 GMT

Overview

Concrete tool for open-vocabulary object detection using YOLO-World with ncnn.

Description

This example implements YOLO-World open-vocabulary object detection using the ncnn inference framework. Unlike traditional closed-set detectors, YOLO-World can detect objects based on text prompts, representing a newer paradigm in object detection. The implementation supports both v1 and v2 model variants in multiple sizes (s, m, l, x). The output tensor has shape (84, 8400), which is transposed relative to other YOLOv8 variants: each of the 8400 columns is one candidate box, and the 84 rows hold center-x, center-y, width, and height followed by 80 class scores. Input images are preprocessed with letterbox padding to 640x640 resolution. Bounding boxes are decoded directly from center format rather than through DFL regression, and confidence filtering followed by NMS produces the final detections.

Usage

Use this example when you need flexible object detection that can be adapted to detect novel object categories through text descriptions, or when you want a single model that covers a broad vocabulary of object classes on mobile and edge devices.

Code Reference

Source Location

Repository: Tencent_Ncnn
File: examples/yoloworld.cpp
Lines: 1-393

Signature

struct Object
{
    cv::Rect_<float> rect;
    int label;
    float prob;
};

static int detect_yoloworld(const cv::Mat& bgr, std::vector<Object>& objects);

static void generate_proposals(const ncnn::Mat& pred, float prob_threshold,
                               std::vector<Object>& objects);
static void qsort_descent_inplace(std::vector<Object>& objects);
static void nms_sorted_bboxes(const std::vector<Object>& objects,
                               std::vector<int>& picked, float nms_threshold,
                               bool agnostic = false);
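
The bodies of these helpers are not reproduced on this page. As an illustration of what the nms_sorted_bboxes signature suggests, the following is a minimal sketch of greedy non-maximum suppression over boxes already sorted by descending score. It is written in plain C++ without ncnn or OpenCV; the Box struct, nms_sorted_sketch, and intersection_area are hypothetical names, and the reading of the agnostic flag (when false, only boxes with the same label suppress each other) is an assumption rather than something stated in the source.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Object: top-left corner plus size, label, score.
struct Box
{
    float x, y, width, height;
    int label;
    float prob;
};

// Area of the overlap between two axis-aligned boxes (0 if they are disjoint).
inline float intersection_area(const Box& a, const Box& b)
{
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.width, b.x + b.width);
    const float y2 = std::min(a.y + a.height, b.y + b.height);
    return std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
}

// Greedy NMS. Input must already be sorted by descending prob.
// Returns the indices of the boxes that survive.
inline std::vector<int> nms_sorted_sketch(const std::vector<Box>& boxes,
                                          float nms_threshold,
                                          bool agnostic = false)
{
    std::vector<int> picked;
    for (std::size_t i = 0; i < boxes.size(); i++)
    {
        const Box& a = boxes[i];
        bool keep = true;
        for (int j : picked)
        {
            const Box& b = boxes[j];
            // Assumption: in class-aware mode, different labels never suppress.
            if (!agnostic && a.label != b.label)
                continue;
            const float inter = intersection_area(a, b);
            const float uni = a.width * a.height + b.width * b.height - inter;
            if (uni > 0.f && inter / uni > nms_threshold)
            {
                keep = false;
                break;
            }
        }
        if (keep)
            picked.push_back(static_cast<int>(i));
    }
    return picked;
}
```

With the threshold of 0.45 used elsewhere on this page, two same-label boxes whose IoU is about 0.68 collapse to the higher-scoring one, while an identical box with a different label survives unless agnostic is set to true.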

Import

#include "layer.h"
#include "net.h"

I/O Contract

Inputs

Name	Type	Required	Description
image_path	const char*	Yes	Path to input image file

Outputs

Name	Type	Description
objects	std::vector<Object>	Detected objects with bounding boxes, class labels, and confidence scores

Model Files

File	Description
yolov8s_worldv2.ncnn.param	YOLO-World v2 small model parameter file (default)
yolov8s_worldv2.ncnn.bin	YOLO-World v2 small model weight file (default)

Usage Examples

Running the Example

./yoloworld image.jpg

Key Code Pattern

ncnn::Net yoloworld;
yoloworld.opt.use_vulkan_compute = true;

yoloworld.load_param("yolov8s_worldv2.ncnn.param");
yoloworld.load_model("yolov8s_worldv2.ncnn.bin");

const int target_size = 640;
const float prob_threshold = 0.25f;
const float nms_threshold = 0.45f;

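// Resize to w x h, preserving aspect ratio
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data,
    ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);

// Letterbox pad to 640x640 (fill value 114); even split is assumed here
int wpad = target_size - w, hpad = target_size - h;
ncnn::Mat in_pad;
ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2,
    wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);

const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in_pad.substract_mean_normalize(0, norm_vals);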

ncnn::Extractor ex = yoloworld.create_extractor();
ex.input("in0", in_pad);

ncnn::Mat out;
ex.extract("out0", out);

// Transposed output layout: (84, 8400)
// Rows 0-3: center-x, center-y, width, height
// Rows 4-83: 80 class scores
generate_proposals(out, prob_threshold, objects);

Implementation Details

Preprocessing

Input images are resized while preserving aspect ratio and letterbox padded to 640x640 (a multiple of max_stride=32). Pixel values are converted from BGR to RGB and normalized by dividing by 255. The padding fill value is 114.
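
The resize and padding arithmetic described above can be sketched in plain C++ as follows. This is an illustration only: LetterboxGeometry and compute_letterbox are hypothetical names that do not appear in the example, and splitting the padding evenly between opposite sides is an assumption, since this page does not say where the padding is placed.

```cpp
// Hypothetical helper type, not taken from examples/yoloworld.cpp.
struct LetterboxGeometry
{
    float scale;   // resize factor applied to the original image
    int resized_w; // width after the aspect-preserving resize
    int resized_h; // height after the aspect-preserving resize
    int pad_left;
    int pad_right;
    int pad_top;
    int pad_bottom;
};

// Scale the longer side to target_size, then pad the shorter side up to it.
inline LetterboxGeometry compute_letterbox(int img_w, int img_h,
                                           int target_size = 640)
{
    LetterboxGeometry g{};
    if (img_w > img_h)
    {
        g.scale = static_cast<float>(target_size) / img_w;
        g.resized_w = target_size;
        g.resized_h = static_cast<int>(img_h * g.scale);
    }
    else
    {
        g.scale = static_cast<float>(target_size) / img_h;
        g.resized_h = target_size;
        g.resized_w = static_cast<int>(img_w * g.scale);
    }

    const int wpad = target_size - g.resized_w;
    const int hpad = target_size - g.resized_h;

    // Assumption: padding is split evenly, with any odd pixel going to the
    // right or bottom side.
    g.pad_left = wpad / 2;
    g.pad_right = wpad - wpad / 2;
    g.pad_top = hpad / 2;
    g.pad_bottom = hpad - hpad / 2;
    return g;
}
```

For a 1280x720 input this yields a 640x360 resized image with 140 rows of padding above and below, giving the 640x640 network input.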

Output Tensor Layout

Unlike other YOLOv8 variants that use DFL regression, YOLO-World produces a single output tensor with a transposed layout:

pred (h=84, w=8400): 4 bbox coordinates (center-x, center-y, width, height) + 80 class scores
Bounding boxes are extracted from rows 0-3 using pred.row_range(0, 4)
Class scores are extracted from rows 4-83 using pred.row_range(4, num_class), where the second argument is a row count (num_class = 80) rather than an end index
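
To make the layout concrete, the sketch below indexes a plain row-major float buffer the way the rows are described above: box i is column i, and each quantity lives in its own row. PredView, read_box, and read_score are hypothetical names used for illustration; the actual example operates on an ncnn::Mat rather than a raw pointer.

```cpp
#include <cstddef>

// Hypothetical read-only view over a row-major (rows x cols) float buffer.
// For the model described here, rows = 84 and cols = 8400.
struct PredView
{
    const float* data;
    int rows;
    int cols;

    float at(int row, int col) const
    {
        return data[static_cast<std::size_t>(row) * cols + col];
    }
};

struct BoxCxCyWH
{
    float cx, cy, w, h;
};

// Rows 0-3 hold center-x, center-y, width, height for every candidate box.
inline BoxCxCyWH read_box(const PredView& pred, int box_index)
{
    return BoxCxCyWH{pred.at(0, box_index), pred.at(1, box_index),
                     pred.at(2, box_index), pred.at(3, box_index)};
}

// Rows 4 and onward hold one score per class for every candidate box.
inline float read_score(const PredView& pred, int box_index, int class_index)
{
    return pred.at(4 + class_index, box_index);
}
```

Because each row is contiguous in memory, reading one box touches four widely separated addresses, which is why slicing the rows once up front (as row_range does) is a natural fit for this layout.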

Proposal Generation

The generate_proposals function iterates over all 8400 boxes, finds the class with the highest score, and creates a proposal if that score passes the confidence threshold. Bounding boxes are decoded directly from center format (cx, cy, w, h) to top-left format (x, y, width, height), where x = cx - w / 2 and y = cy - h / 2.
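
The loop described above can be sketched in plain C++ as follows. This is not the code from examples/yoloworld.cpp: Proposal and generate_proposals_sketch are hypothetical names, the prediction is assumed to be a flat row-major buffer of shape (4 + num_class, num_boxes), and treating a score equal to the threshold as passing is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Object, using top-left format.
struct Proposal
{
    float x, y, width, height;
    int label;
    float prob;
};

// pred is row-major with (4 + num_class) rows and num_boxes columns.
inline std::vector<Proposal> generate_proposals_sketch(
    const std::vector<float>& pred, int num_class, int num_boxes,
    float prob_threshold)
{
    std::vector<Proposal> proposals;
    if (num_class <= 0)
        return proposals;

    auto at = [&](int row, int col) {
        return pred[static_cast<std::size_t>(row) * num_boxes + col];
    };

    for (int i = 0; i < num_boxes; i++)
    {
        // Find the best-scoring class for this candidate box.
        int best_label = 0;
        float best_score = at(4, i);
        for (int k = 1; k < num_class; k++)
        {
            const float score = at(4 + k, i);
            if (score > best_score)
            {
                best_score = score;
                best_label = k;
            }
        }

        // Assumption: a score exactly at the threshold is kept.
        if (best_score < prob_threshold)
            continue;

        // Convert center format (cx, cy, w, h) to top-left format.
        const float cx = at(0, i);
        const float cy = at(1, i);
        const float w = at(2, i);
        const float h = at(3, i);
        proposals.push_back(
            Proposal{cx - w * 0.5f, cy - h * 0.5f, w, h, best_label, best_score});
    }
    return proposals;
}
```

The coordinates produced here are still in the 640x640 letterboxed frame; mapping them back to the original image would additionally require undoing the padding offset and the resize scale.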

Available Model Variants

Model	Version	Size
yolov8s-world	v1	Small
yolov8m-world	v1	Medium
yolov8l-world	v1	Large
yolov8x-world	v1	Extra-large
yolov8s-worldv2	v2	Small (default)
yolov8m-worldv2	v2	Medium
yolov8l-worldv2	v2	Large
yolov8x-worldv2	v2	Extra-large

Related Pages

Environment:Tencent_Ncnn_Build_Environment
