Principle:Cleanlab Cleanlab Span Classification Issue Detection
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Span Classification, Data Quality |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Detecting label issues in span classification by reducing the problem to token-level binary classification and applying confident learning principles to identify tokens whose span annotations are likely incorrect.
Description
Span classification is a natural language processing task where contiguous sequences of tokens (spans) in text are annotated with a class label, such as named entity recognition with a single entity type. Label issues in span classification manifest as tokens that are incorrectly included in or excluded from spans. Detecting these issues requires reasoning about the agreement between model-predicted span probabilities and the given annotations at the token level.
The core insight enabling span classification issue detection is that span classification with a single span class can be reduced to a binary token classification problem: each token is either part of a span (class 1) or not (class 0). This reduction allows the full machinery of cleanlab's token classification pipeline -- which implements confident learning for sequential data -- to be applied directly.
The key transformation is converting span probabilities into token classification probabilities. A span classification model outputs a single probability per token representing the likelihood that the token belongs to a span. Token classification, however, expects a probability distribution over classes for each token. The conversion creates this distribution by stacking the complement probability with the span probability: [1-p, p] for each token probability p.
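The conversion described above can be sketched in a few lines. This is a minimal illustration, not cleanlab's own code: the function name `span_probs_to_token_probs` and the nested-list data layout (one list of per-token probabilities per sentence) are assumptions made for the example, and the probabilities are made up. In practice NumPy arrays would likely be used instead of plain lists.

```python
from typing import List


def span_probs_to_token_probs(
    span_probs: List[List[float]],
) -> List[List[List[float]]]:
    """Expand each per-token span probability p into the pair [1 - p, p].

    Index 0 of each pair is the probability of "not in a span",
    index 1 is the probability of "in a span".
    """
    return [[[1.0 - p, p] for p in sentence] for sentence in span_probs]


# Two sentences with 3 and 2 tokens respectively (made-up probabilities).
span_probs = [[0.9, 0.25, 0.5], [0.75, 0.0]]
token_probs = span_probs_to_token_probs(span_probs)
```

Each sentence keeps its own length; only the innermost value changes from a scalar to a two-element distribution that sums to 1.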
Usage
This approach is the right choice when:
- You have a span classification or NER dataset with a single entity type (binary span annotations).
- Your model provides per-token probabilities of belonging to a span.
- You want to identify specific tokens whose span boundary annotations may be incorrect.
- You want to rank sentences by overall annotation quality to prioritize review efforts.
The approach is currently limited to scenarios with a single span class, because the [1-p, p] conversion assumes exactly two outcomes per token. Multi-class span classification would need a more general approach.
Theoretical Basis
Problem Reduction
The span classification issue detection problem is solved by a two-step reduction:
- Span to Token Reduction: Each token's span annotation becomes a binary classification label. A token labeled as part of a span receives label 1; tokens outside spans receive label 0.
- Probability Format Transformation: The model's single span probability p per token is expanded to a two-class distribution [1-p, p], where column 0 represents the probability of not being in a span and column 1 represents the probability of being in a span.
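The first step of the reduction can be sketched as follows. The source does not say how span annotations are stored, so this example assumes a hypothetical representation: each span is a half-open `(start, end)` pair of token indices. The helper name `spans_to_token_labels` is likewise illustrative.

```python
from typing import List, Tuple


def spans_to_token_labels(
    num_tokens: int, spans: List[Tuple[int, int]]
) -> List[int]:
    """Turn span annotations into one binary label per token.

    Assumes each span is a half-open (start, end) range of token indices.
    Tokens inside any span get label 1; all other tokens get label 0.
    """
    labels = [0] * num_tokens
    for start, end in spans:
        if not 0 <= start <= end <= num_tokens:
            raise ValueError(f"span ({start}, {end}) is out of range")
        for i in range(start, end):
            labels[i] = 1
    return labels


tokens = ["Alice", "Smith", "visited", "Paris", "."]
# "Alice Smith" covers tokens 0-1, "Paris" covers token 3.
labels = spans_to_token_labels(len(tokens), [(0, 2), (3, 4)])
```

If the dataset already stores one 0/1 tag per token, this step is a no-op and only the probability transformation remains.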
Confident Learning at the Token Level
Once the span classification data is in token classification format, the underlying confident learning algorithm identifies label issues by comparing the given label for each token against the model's predicted probabilities. A token is flagged as a label issue when its predicted probability strongly disagrees with its given label.
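To make "strongly disagrees" concrete, here is a simplified sketch of the thresholding idea behind confident learning, specialized to the binary case. It is an illustration under stated assumptions, not cleanlab's actual filtering logic: the function name `flag_token_issues` is hypothetical, tokens from all sentences are assumed to be pooled into one flat list, and the per-class threshold is taken to be the mean predicted probability of a class among tokens carrying that label.

```python
from typing import List


def flag_token_issues(labels: List[int], span_probs: List[float]) -> List[bool]:
    """Flag tokens whose predicted class confidently contradicts the label.

    Simplified binary rule:
      1. threshold[k] = mean predicted probability of class k over the
         tokens whose given label is k.
      2. A token is "confidently class k" if its probability for k reaches
         threshold[k]; if both classes qualify, the more probable one wins.
      3. The token is flagged when its confident class differs from its label.
    """
    probs = [[1.0 - p, p] for p in span_probs]

    thresholds = []
    for k in (0, 1):
        own = [probs[i][k] for i, y in enumerate(labels) if y == k]
        # With no tokens of class k, use 1.0 so the class is rarely confident.
        thresholds.append(sum(own) / len(own) if own else 1.0)

    flags = []
    for y, pr in zip(labels, probs):
        candidates = [k for k in (0, 1) if pr[k] >= thresholds[k]]
        if candidates:
            confident_class = max(candidates, key=lambda k: pr[k])
            flags.append(confident_class != y)
        else:
            flags.append(False)  # no confident class, so no flag
    return flags


labels = [1, 1, 0, 0, 1, 0]
span_probs = [0.9, 0.8, 0.1, 0.2, 0.05, 0.15]  # made-up model outputs
flags = flag_token_issues(labels, span_probs)
```

In this toy data only the fifth token is flagged: it is labeled in-span, yet the model assigns it a span probability of 0.05.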
The quality score for each token quantifies this agreement:
- A token labeled as in-span (1) with high predicted span probability receives a high quality score.
- A token labeled as in-span (1) with low predicted span probability receives a low quality score (likely mislabeled).
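- The same logic applies in reverse to tokens labeled as not-in-span (0): a low predicted span probability yields a high quality score, and a high predicted span probability yields a low quality score (likely mislabeled).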
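One simple score with exactly these properties is self-confidence: the probability the model assigns to the token's given label. The sketch below assumes that choice; the source does not specify the scoring formula, and the function name `token_quality_score` is illustrative.

```python
def token_quality_score(label: int, span_prob: float) -> float:
    """Self-confidence score: probability assigned to the given label.

    label is 1 (in-span) or 0 (not in-span); span_prob is the model's
    probability that the token is in a span.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return span_prob if label == 1 else 1.0 - span_prob


# The four cases discussed above, with made-up probabilities.
in_span_agrees = token_quality_score(1, 0.9)        # high score
in_span_disagrees = token_quality_score(1, 0.25)    # low score
out_span_agrees = token_quality_score(0, 0.25)      # high score
out_span_disagrees = token_quality_score(0, 0.75)   # low score
```

This is equivalent to indexing the [1-p, p] pair with the given label, which is why the two-class format is convenient.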
Sentence-level quality scores are derived by aggregating the quality scores of all tokens within the sentence, allowing practitioners to prioritize which sentences to review first.
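As a rough sketch of the aggregation, the example below takes the minimum token score as the sentence score, so a sentence is only as trustworthy as its worst token. The choice of minimum is an assumption for illustration (a mean or a soft minimum are plausible alternatives), and the function names are hypothetical.

```python
from typing import List


def sentence_quality_score(token_scores: List[float]) -> float:
    """Aggregate token scores into one sentence score using the minimum."""
    if not token_scores:
        raise ValueError("sentence has no tokens")
    return min(token_scores)


def rank_sentences(per_sentence_token_scores: List[List[float]]) -> List[int]:
    """Return sentence indices ordered from lowest to highest quality."""
    scores = [sentence_quality_score(s) for s in per_sentence_token_scores]
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Made-up token quality scores for three sentences.
token_scores = [[0.9, 0.8, 0.95], [0.9, 0.1], [0.6, 0.7]]
sentence_scores = [sentence_quality_score(s) for s in token_scores]
review_order = rank_sentences(token_scores)
```

Reviewing sentences in `review_order` surfaces the sentence containing the single most suspicious token first.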
Adapter Pattern
This approach illustrates the adapter pattern: a thin layer that transforms the input format allows an existing, well-tested pipeline (token classification) to be reused for a related but distinct task (span classification). The adapter introduces no new algorithmic logic -- it only reshapes the data to match the expected interface, so the behavior of the underlying token classification module carries over unchanged.
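The pattern can be sketched generically. Everything here is illustrative: `make_span_adapter` is a hypothetical helper, and `toy_token_pipeline` is a deliberately naive stand-in for a real token-level pipeline, not cleanlab's implementation. The point is only that the adapter reshapes its inputs and delegates.

```python
from typing import Callable, List, Tuple

Labels = List[List[int]]
SpanProbs = List[List[float]]
TokenProbs = List[List[List[float]]]
Issues = List[Tuple[int, int]]


def make_span_adapter(
    token_level_fn: Callable[[Labels, TokenProbs], Issues],
) -> Callable[[Labels, SpanProbs], Issues]:
    """Wrap a token-level function so it accepts span probabilities."""

    def span_level_fn(labels: Labels, span_probs: SpanProbs) -> Issues:
        # The only work done here is the [1 - p, p] reshaping.
        token_probs = [[[1.0 - p, p] for p in sent] for sent in span_probs]
        return token_level_fn(labels, token_probs)

    return span_level_fn


def toy_token_pipeline(labels: Labels, token_probs: TokenProbs) -> Issues:
    """Stand-in pipeline: flag tokens whose most probable class differs
    from the given label, as (sentence_index, token_index) pairs."""
    return [
        (i, j)
        for i, (sent_labels, sent_probs) in enumerate(zip(labels, token_probs))
        for j, (y, pr) in enumerate(zip(sent_labels, sent_probs))
        if (pr[1] > pr[0]) != (y == 1)
    ]


find_span_issues = make_span_adapter(toy_token_pipeline)
issues = find_span_issues(
    [[1, 0, 1], [0, 0]],
    [[0.9, 0.1, 0.2], [0.3, 0.8]],  # made-up span probabilities
)
```

Swapping in a different token-level function requires no change to the adapter, which is the practical benefit of keeping it free of algorithmic logic.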