Implementation:Tencent Ncnn YOLO World Example
| Knowledge Sources | |
|---|---|
| Domains | Vision, Open_Vocabulary_Detection |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Concrete tool for open-vocabulary object detection using YOLO-World with ncnn.
Description
This example implements YOLO-World open-vocabulary object detection using the ncnn inference framework. Unlike traditional closed-set detectors, YOLO-World can detect objects based on text prompts, representing a newer paradigm in object detection. The implementation supports both v1 and v2 model variants in multiple sizes (s, m, l, x). The output tensor uses a transposed layout (84, 8400) compared to other YOLOv8 variants, with rows for center-x, center-y, width, height, and 80 class scores. Input images are preprocessed with letterbox padding to 640x640 resolution. The model decodes center-format bounding boxes directly (not DFL regression), applies confidence filtering and NMS for final detections.
Usage
Use this example when you need flexible object detection that can be adapted to detect novel object categories through text descriptions, or when you want a single model that covers a broad vocabulary of object classes on mobile and edge devices.
Code Reference
Source Location
- Repository: Tencent_Ncnn
- File: examples/yoloworld.cpp
- Lines: 1-393
Signature
struct Object
{
cv::Rect_<float> rect;
int label;
float prob;
};
static int detect_yoloworld(const cv::Mat& bgr, std::vector<Object>& objects);
static void generate_proposals(const ncnn::Mat& pred, float prob_threshold,
std::vector<Object>& objects);
static void qsort_descent_inplace(std::vector<Object>& objects);
static void nms_sorted_bboxes(const std::vector<Object>& objects,
std::vector<int>& picked, float nms_threshold,
bool agnostic = false);
Import
#include "layer.h"
#include "net.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image_path | const char* | Yes | Path to input image file |
Outputs
| Name | Type | Description |
|---|---|---|
| objects | std::vector<Object> | Detected objects with bounding boxes, class labels, and confidence scores |
Model Files
| File | Description |
|---|---|
| yolov8s_worldv2.ncnn.param | YOLO-World v2 small model parameter file (default) |
| yolov8s_worldv2.ncnn.bin | YOLO-World v2 small model weight file (default) |
Usage Examples
Running the Example
./yoloworld image.jpg
Key Code Pattern
ncnn::Net yoloworld;
yoloworld.opt.use_vulkan_compute = true;
yoloworld.load_param("yolov8s_worldv2.ncnn.param");
yoloworld.load_model("yolov8s_worldv2.ncnn.bin");
const int target_size = 640;
const float prob_threshold = 0.25f;
const float nms_threshold = 0.45f;
// Letterbox pad to 640x640
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data,
ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);
const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
in_pad.substract_mean_normalize(0, norm_vals);
ncnn::Extractor ex = yoloworld.create_extractor();
ex.input("in0", in_pad);
ncnn::Mat out;
ex.extract("out0", out);
// Transposed output layout: (84, 8400)
// Rows 0-3: center-x, center-y, width, height
// Rows 4-83: 80 class scores
generate_proposals(out, prob_threshold, objects);
Implementation Details
Preprocessing
Input images are resized while preserving aspect ratio and letterbox padded to 640x640 (a multiple of max_stride=32). Pixel values are converted from BGR to RGB and normalized by dividing by 255. The padding fill value is 114.
Output Tensor Layout
Unlike other YOLOv8 variants that use DFL regression, YOLO-World produces a single output tensor with a transposed layout:
- pred (h=84, w=8400): 4 bbox coordinates (center-x, center-y, width, height) + 80 class scores
- Bounding boxes are extracted from rows 0-3 using
pred.row_range(0, 4) - Class scores are extracted from rows 4-83 using
pred.row_range(4, num_class)
Proposal Generation
The generate_proposals function iterates over all 8400 boxes, finds the class with the highest score, and creates a proposal if it exceeds the confidence threshold. Bounding boxes are decoded directly from center-format (cx, cy, w, h) to corner-format (x, y, width, height).
Available Model Variants
| Model | Version | Size |
|---|---|---|
| yolov8s-world | v1 | Small |
| yolov8m-world | v1 | Medium |
| yolov8l-world | v1 | Large |
| yolov8x-world | v1 | Extra-large |
| yolov8s-worldv2 | v2 | Small (default) |
| yolov8m-worldv2 | v2 | Medium |
| yolov8l-worldv2 | v2 | Large |
| yolov8x-worldv2 | v2 | Extra-large |