Implementation:Tencent Ncnn YOLO World Example

Domains: Vision, Open_Vocabulary_Detection
Last Updated: 2026-02-09 19:00 GMT

Overview

Concrete tool for open-vocabulary object detection using YOLO-World with ncnn.

Description

This example implements YOLO-World open-vocabulary object detection using the ncnn inference framework. Unlike traditional closed-set detectors, YOLO-World detects objects based on text prompts, a newer paradigm in object detection. The implementation supports both v1 and v2 model variants in four sizes (s, m, l, x). Compared with other YOLOv8 variants, the output tensor uses a transposed layout of shape (84, 8400): four rows for center-x, center-y, width, and height, followed by 80 rows of class scores. Input images are preprocessed with letterbox padding to 640x640 resolution. Bounding boxes are decoded directly from center format (no DFL regression), then filtered by confidence and passed through NMS to produce the final detections.

Usage

Use this example when you need flexible object detection that can be adapted to detect novel object categories through text descriptions, or when you want a single model that covers a broad vocabulary of object classes on mobile and edge devices.

Code Reference

Source Location

Signature

struct Object
{
    cv::Rect_<float> rect;
    int label;
    float prob;
};

static int detect_yoloworld(const cv::Mat& bgr, std::vector<Object>& objects);

static void generate_proposals(const ncnn::Mat& pred, float prob_threshold,
                               std::vector<Object>& objects);
static void qsort_descent_inplace(std::vector<Object>& objects);
static void nms_sorted_bboxes(const std::vector<Object>& objects,
                               std::vector<int>& picked, float nms_threshold,
                               bool agnostic = false);

Import

#include "layer.h"
#include "net.h"

I/O Contract

Inputs

Name | Type | Required | Description
image_path | const char* | Yes | Path to input image file

Outputs

Name | Type | Description
objects | std::vector<Object> | Detected objects with bounding boxes, class labels, and confidence scores

Model Files

File | Description
yolov8s_worldv2.ncnn.param | YOLO-World v2 small model parameter file (default)
yolov8s_worldv2.ncnn.bin | YOLO-World v2 small model weight file (default)

Usage Examples

Running the Example

./yoloworld image.jpg

Key Code Pattern

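ncnn::Net yoloworld;
yoloworld.opt.use_vulkan_compute = true;

yoloworld.load_param("yolov8s_worldv2.ncnn.param");
yoloworld.load_model("yolov8s_worldv2.ncnn.bin");

const int target_size = 640;
const int max_stride = 32;
const float prob_threshold = 0.25f;
const float nms_threshold = 0.45f;

// Scale the longer side to target_size, preserving aspect ratio
const int img_w = bgr.cols;
const int img_h = bgr.rows;
int w = img_w;
int h = img_h;
float scale = 1.f;
if (w > h)
{
    scale = (float)target_size / w;
    w = target_size;
    h = (int)(h * scale);
}
else
{
    scale = (float)target_size / h;
    h = target_size;
    w = (int)(w * scale);
}

// Resize and convert BGR to RGB in one step
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data,
    ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);

// Letterbox pad to 640x640 with fill value 114
// (usual ncnn example pattern; the full source may differ in detail)
int wpad = (target_size + max_stride - 1) / max_stride * max_stride - w;
int hpad = (target_size + max_stride - 1) / max_stride * max_stride - h;
ncnn::Mat in_pad;
ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2,
    wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);

// Normalize pixel values to [0, 1]
const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in_pad.substract_mean_normalize(0, norm_vals);

ncnn::Extractor ex = yoloworld.create_extractor();
ex.input("in0", in_pad);

ncnn::Mat out;
ex.extract("out0", out);

// Transposed output layout: (84, 8400)
// Rows 0-3: center-x, center-y, width, height
// Rows 4-83: 80 class scores
generate_proposals(out, prob_threshold, objects);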
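
The Description states that proposals are filtered by NMS after generate_proposals. The sketch below shows one way the sort-then-suppress step could work, written without ncnn or OpenCV so it compiles on its own. The Box struct and the nms_sketch function are hypothetical names for illustration, not the example's qsort_descent_inplace or nms_sorted_bboxes; the class-aware default mirrors the agnostic = false default in the signature above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for Object: top-left corner, size, class, score.
struct Box
{
    float x, y, w, h;
    int label;
    float prob;
};

// Overlap area of two axis-aligned boxes (0 if they do not intersect).
static float intersection_area(const Box& a, const Box& b)
{
    const float x0 = std::max(a.x, b.x);
    const float y0 = std::max(a.y, b.y);
    const float x1 = std::min(a.x + a.w, b.x + b.w);
    const float y1 = std::min(a.y + a.h, b.y + b.h);
    return std::max(0.f, x1 - x0) * std::max(0.f, y1 - y0);
}

// Sorts boxes by descending score in place, then greedily keeps a box
// unless its IoU with an already kept box exceeds nms_threshold.
// Returns indices into the sorted vector.
static std::vector<int> nms_sketch(std::vector<Box>& boxes, float nms_threshold,
                                   bool agnostic = false)
{
    assert(nms_threshold >= 0.f);
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.prob > b.prob; });

    std::vector<int> picked;
    for (int i = 0; i < (int)boxes.size(); i++)
    {
        bool keep = true;
        for (int j : picked)
        {
            // Class-aware mode only compares boxes of the same class.
            if (!agnostic && boxes[i].label != boxes[j].label)
                continue;

            const float inter = intersection_area(boxes[i], boxes[j]);
            const float uni = boxes[i].w * boxes[i].h + boxes[j].w * boxes[j].h - inter;
            if (uni > 0.f && inter / uni > nms_threshold)
            {
                keep = false;
                break;
            }
        }
        if (keep)
            picked.push_back(i);
    }
    return picked;
}
```

With nms_threshold = 0.45f, as in the snippet above, two same-class boxes that overlap by more than 45% IoU collapse to the higher-scoring one, while overlapping boxes of different classes both survive unless agnostic is set.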

Implementation Details

Preprocessing

Input images are resized while preserving aspect ratio and letterbox padded to 640x640 (a multiple of max_stride=32). Pixel values are converted from BGR to RGB and normalized by dividing by 255. The padding fill value is 114.
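
The resize and padding arithmetic described above can be sketched independently of ncnn. This is a minimal illustration under stated assumptions: LetterboxParams and compute_letterbox are hypothetical names, the padding is split evenly between both sides, and the padded size is target_size rounded up to a multiple of max_stride. The actual example may split or round differently.

```cpp
#include <cassert>

// Hypothetical helper struct (not part of the ncnn example).
struct LetterboxParams
{
    float scale;    // factor applied to the source image
    int resized_w;  // width after the aspect-preserving resize
    int resized_h;  // height after the aspect-preserving resize
    int pad_left, pad_right, pad_top, pad_bottom;
};

// Computes the resize factor and per-side padding for a letterbox to a
// square of target_size (rounded up to a multiple of max_stride).
static LetterboxParams compute_letterbox(int img_w, int img_h,
                                         int target_size = 640, int max_stride = 32)
{
    assert(img_w > 0 && img_h > 0);
    assert(target_size > 0 && max_stride > 0);

    LetterboxParams p;
    if (img_w > img_h)
    {
        // Landscape: the width becomes target_size.
        p.scale = (float)target_size / img_w;
        p.resized_w = target_size;
        p.resized_h = (int)(img_h * p.scale);
    }
    else
    {
        // Portrait or square: the height becomes target_size.
        p.scale = (float)target_size / img_h;
        p.resized_h = target_size;
        p.resized_w = (int)(img_w * p.scale);
    }

    // 640 is already a multiple of 32, so padded == 640 with the defaults.
    const int padded = (target_size + max_stride - 1) / max_stride * max_stride;
    const int wpad = padded - p.resized_w;
    const int hpad = padded - p.resized_h;

    p.pad_left = wpad / 2;
    p.pad_right = wpad - wpad / 2;
    p.pad_top = hpad / 2;
    p.pad_bottom = hpad - hpad / 2;
    return p;
}
```

For a 1280x720 input this gives a 640x360 resized image with 140 rows of padding (value 114 in the example) above and below. The same scale and pad offsets are what one would need to map detections back to original image coordinates.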

Output Tensor Layout

Unlike other YOLOv8 variants that use DFL regression, YOLO-World produces a single output tensor with a transposed layout:

  • pred (h=84, w=8400): 4 bbox coordinates (center-x, center-y, width, height) + 80 class scores
  • Bounding boxes are extracted from rows 0-3 using pred.row_range(0, 4)
  • Class scores are extracted from rows 4-83 using pred.row_range(4, num_class); the second argument of row_range is a row count, not an end index
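
To make the transposed layout concrete, the sketch below indexes a plain row-major buffer of shape (4 + num_class, num_boxes), where each row holds one attribute for all boxes. The RawBox struct and the read_box and read_score helpers are hypothetical names for illustration; with an actual ncnn::Mat, the equivalent access would be along the lines of pred.row(r)[i].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the four box rows of one candidate.
struct RawBox
{
    float cx, cy, w, h;
};

// data is row-major with num_boxes values per row:
// row 0 = center-x, row 1 = center-y, row 2 = width, row 3 = height.
static RawBox read_box(const std::vector<float>& data, int num_boxes, int i)
{
    assert(i >= 0 && i < num_boxes);
    assert(data.size() >= static_cast<std::size_t>(4) * num_boxes);

    const std::size_t n = static_cast<std::size_t>(num_boxes);
    RawBox b;
    b.cx = data[0 * n + i];
    b.cy = data[1 * n + i];
    b.w = data[2 * n + i];
    b.h = data[3 * n + i];
    return b;
}

// Score of class k for box i; class rows start at row 4.
static float read_score(const std::vector<float>& data, int num_boxes, int i, int k)
{
    assert(i >= 0 && i < num_boxes && k >= 0);
    const std::size_t idx = static_cast<std::size_t>(4 + k) * num_boxes + i;
    assert(idx < data.size());
    return data[idx];
}
```

Because all 8400 values of one attribute are contiguous, reading one box touches 84 different rows. A non-transposed layout would store the 84 values of each box next to each other instead.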

Proposal Generation

The generate_proposals function iterates over all 8400 boxes, finds the class with the highest score, and creates a proposal if that score exceeds the confidence threshold. Bounding boxes are converted directly from center format (cx, cy, w, h) to top-left format (x, y, width, height), where x = cx - w/2 and y = cy - h/2, matching the cv::Rect_<float> stored in Object.
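
The loop described above can be sketched on a plain row-major buffer, again without ncnn or OpenCV. This is an illustrative approximation, not the example's generate_proposals: the Detection struct and decode_proposals are hypothetical names, and the strict "greater than" comparison follows the wording "exceeds" here (the real code may use greater-or-equal).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Object: top-left corner, size, class, score.
struct Detection
{
    float x, y, w, h;
    int label;
    float prob;
};

// pred is row-major with shape (4 + num_class, num_boxes):
// rows 0-3 hold cx, cy, w, h; the remaining rows hold per-class scores.
static std::vector<Detection> decode_proposals(const std::vector<float>& pred,
                                               int num_boxes, int num_class,
                                               float prob_threshold)
{
    assert(num_boxes > 0 && num_class > 0);
    assert(pred.size() == static_cast<std::size_t>(4 + num_class) * num_boxes);

    auto at = [&](int row, int i) {
        return pred[static_cast<std::size_t>(row) * num_boxes + i];
    };

    std::vector<Detection> out;
    for (int i = 0; i < num_boxes; i++)
    {
        // Find the best-scoring class for this box.
        int label = 0;
        float score = at(4, i);
        for (int k = 1; k < num_class; k++)
        {
            const float s = at(4 + k, i);
            if (s > score)
            {
                score = s;
                label = k;
            }
        }

        if (score <= prob_threshold)
            continue;

        // Center format -> top-left format.
        const float cx = at(0, i);
        const float cy = at(1, i);
        const float w = at(2, i);
        const float h = at(3, i);
        out.push_back({cx - w * 0.5f, cy - h * 0.5f, w, h, label, score});
    }
    return out;
}
```

With num_boxes = 8400 and num_class = 80 this matches the (84, 8400) shape described above. The resulting coordinates are still in the padded 640x640 space.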

Available Model Variants

Model | Version | Size
yolov8s-world | v1 | Small
yolov8m-world | v1 | Medium
yolov8l-world | v1 | Large
yolov8x-world | v1 | Extra-large
yolov8s-worldv2 | v2 | Small (default)
yolov8m-worldv2 | v2 | Medium
yolov8l-worldv2 | v2 | Large
yolov8x-worldv2 | v2 | Extra-large
