Implementation:EvolvingLMMs Lab Lmms eval IFEval Instructions Util

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_IFEval_Instructions_Util.md)
Knowledge Sources
Domains Natural_Language_Processing, Text_Processing, Model_Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

Utility functions for text processing and analysis in IFEval instruction-following evaluation.

Description

This module provides utility functions that support IFEval instruction validation: sentence splitting, word and sentence counting, keyword generation, and language code mappings. The implementation uses NLTK for tokenization and includes a list of 1,561 common English words for generating random keywords. It also provides ISO 639-1 language code mappings for 30 languages, plus text processing functions that handle LaTeX-style formatting, abbreviations, and various punctuation patterns.

Usage

Use these utilities when implementing instruction checkers that need to count sentences/words, generate random keywords, split text into sentences, or work with language codes. The functions support the instruction validation logic in the IFEval framework.

Code Reference

Source Location

Signature

def split_into_sentences(text: str) -> list[str]:
    """Split the text into sentences."""
    ...

def count_words(text: str) -> int:
    """Counts the number of words."""
    ...

def count_sentences(text: str) -> int:
    """Count the number of sentences."""
    ...

def generate_keywords(num_keywords: int) -> list[str]:
    """Randomly generates a few keywords."""
    ...

def download_nltk_resources() -> None:
    """Download 'punkt' if not already installed"""
    ...

# Constants
WORD_LIST: list[str]  # 1,561 common English words
LANGUAGE_CODES: immutabledict  # 30 language codes to names

Import

from lmms_eval.tasks.ifeval import instructions_util

# Or import specific functions
from lmms_eval.tasks.ifeval.instructions_util import (
    split_into_sentences,
    count_words,
    count_sentences,
    generate_keywords,
    WORD_LIST,
    LANGUAGE_CODES,
)

I/O Contract

Inputs

Name | Type | Required | Description
text | str | Yes | Text to process (for split_into_sentences, count_words, count_sentences)
num_keywords | int | Yes | Number of random keywords to generate (for generate_keywords)

Outputs

Name | Type | Description
sentences | list[str] | List of sentences (from split_into_sentences)
word_count | int | Number of words in text (from count_words)
sentence_count | int | Number of sentences in text (from count_sentences)
keywords | list[str] | List of randomly sampled keywords (from generate_keywords)

Core Functions

Text Processing

split_into_sentences(text)

  • Splits text into sentences handling complex cases
  • Handles abbreviations (Mr., Dr., Ph.D., etc.)
  • Handles acronyms (e.g., U.S.A.)
  • Handles decimal numbers (e.g., 3.14)
  • Handles websites (e.g., example.com)
  • Handles quotation marks and punctuation
  • Returns list of sentence strings

count_words(text)

  • Counts words using NLTK RegexpTokenizer
  • Pattern: r"\w+" (word characters)
  • Returns integer count

count_sentences(text)

  • Uses NLTK's punkt tokenizer
  • Cached with functools.lru_cache
  • Returns integer count
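
The caching mentioned above can be illustrated with a small sketch. It assumes, without confirmation from the source, that the cached object is the tokenizer loader; `get_tokenizer` and `count_sentences_sketch` are hypothetical names, and a naive regex stands in for the punkt model so the example runs without NLTK.

```python
import functools
import re

load_calls = 0  # counts how often the expensive loader actually runs


@functools.lru_cache(maxsize=None)
def get_tokenizer():
    """Hypothetical loader; stands in for loading a punkt model from disk."""
    global load_calls
    load_calls += 1
    # Naive stand-in: split on whitespace that follows '.', '!' or '?'.
    return re.compile(r"(?<=[.!?])\s+")


def count_sentences_sketch(text: str) -> int:
    """Count sentences using the cached stand-in tokenizer."""
    parts = get_tokenizer().split(text.strip())
    return len([p for p in parts if p])


print(count_sentences_sketch("First sentence. Second sentence! Third sentence?"))
print(count_sentences_sketch("Another call. Same tokenizer."))
print(load_calls)  # the loader ran only once
```

Caching the loader rather than the counting function keeps memory use independent of how many distinct responses are evaluated.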

generate_keywords(num_keywords)

  • Randomly samples from WORD_LIST
  • Returns list of keyword strings
  • Uses random.sample for unique selection
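
A minimal sketch of the sampling behaviour described above. `SAMPLE_WORDS` and `generate_keywords_sketch` are hypothetical stand-ins for WORD_LIST and generate_keywords; only `random.sample` itself is taken from the description.

```python
import random

# Hypothetical stand-in for WORD_LIST; the real list is much longer.
SAMPLE_WORDS = ["mountain", "coffee", "computer", "river", "garden"]


def generate_keywords_sketch(num_keywords: int, words=SAMPLE_WORDS) -> list[str]:
    """Pick num_keywords distinct words from the list."""
    # random.sample draws without replacement and raises ValueError
    # when more items are requested than the list contains.
    return random.sample(words, k=num_keywords)


keywords = generate_keywords_sketch(3)
print(keywords)  # three distinct words, different on each run
```

Because sampling is without replacement, a checker that asks for more keywords than the list holds would fail with ValueError rather than return duplicates.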

Constants

WORD_LIST

  • Contains 1,561 common English words
  • Used for generating random keywords
  • Includes nouns, verbs, adjectives, and common words

LANGUAGE_CODES

  • Immutable dictionary mapping ISO 639-1 codes to language names
  • Supports 30 languages, including:
    • English (en), Spanish (es), French (fr), German (de)
    • Japanese (ja), Portuguese (pt), Arabic (ar), Hindi (hi)
    • 22 additional languages
  • Used by ResponseLanguageChecker
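
To illustrate what an immutable code-to-name mapping offers, the sketch below uses `types.MappingProxyType` from the standard library as a stand-in for immutabledict. The subset of codes and the helper `language_name` are illustrative assumptions, not part of the module.

```python
from types import MappingProxyType

# Illustrative subset; MappingProxyType is a standard-library stand-in
# for the immutabledict type used by the module.
LANGUAGE_CODES_SKETCH = MappingProxyType(
    {"en": "English", "es": "Spanish", "fr": "French", "de": "German"}
)


def language_name(code: str) -> str:
    """Look up a language name, falling back to 'unknown' for missing codes."""
    return LANGUAGE_CODES_SKETCH.get(code, "unknown")


print(language_name("fr"))

try:
    LANGUAGE_CODES_SKETCH["xx"] = "Test"  # read-only view: assignment is rejected
except TypeError as err:
    print("rejected:", err)
```

A read-only mapping guards a shared module-level constant against accidental modification by checker code.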

Usage Examples

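# Examples 1-4 assume the functions were imported as shown in the Import section
from lmms_eval.tasks.ifeval.instructions_util import (
    split_into_sentences, count_words, count_sentences, generate_keywords,
)

# Example 1: Split text into sentences
text = "Hello world. This is Dr. Smith. He works at U.S.A. The value is 3.14."
sentences = split_into_sentences(text)
# Returns a list of sentence strings; the periods in "Dr." and "3.14" do not end a sentence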

# Example 2: Count words
text = "The quick brown fox jumps over the lazy dog."
num_words = count_words(text)
# Returns: 9

# Example 3: Count sentences
text = "First sentence. Second sentence! Third sentence?"
num_sentences = count_sentences(text)
# Returns: 3

# Example 4: Generate random keywords
keywords = generate_keywords(num_keywords=3)
# Returns: ['mountain', 'coffee', 'computer'] (example - random each time)

# Example 5: Access language codes
from lmms_eval.tasks.ifeval.instructions_util import LANGUAGE_CODES

language_name = LANGUAGE_CODES['fr']
# Returns: 'French'

all_languages = list(LANGUAGE_CODES.keys())
# Returns: ['en', 'es', 'pt', 'ar', 'hi', 'fr', ...]

# Example 6: Use in instruction checker
from lmms_eval.tasks.ifeval import instructions_util

class CustomChecker:
    def check_word_count(self, response):
        word_count = instructions_util.count_words(response)
        return word_count >= 100

    def check_sentence_count(self, response):
        sentence_count = instructions_util.count_sentences(response)
        return sentence_count >= 5

Implementation Details

Sentence Splitting Algorithm

The split_into_sentences function works with regex substitutions: it first protects periods that do not end a sentence, then splits on the remaining terminators. The steps are:

1. Preprocessing - Adds spaces and replaces newlines
2. Abbreviation handling - Marks abbreviations with <prd> placeholder
3. Website handling - Protects domain extensions (.com, .org, etc.)
4. Number handling - Protects decimal points in numbers
5. Multiple dots - Handles ellipsis and multiple dots
6. Acronym handling - Protects periods in acronyms
7. Special cases - Handles Ph.D. specially
8. Sentence boundaries - Splits on periods, question marks, exclamation marks
9. Cleanup - Restores protected periods and strips whitespace
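
The placeholder technique can be sketched in a few lines. This is a simplified illustration covering only abbreviations and decimals, not the module's implementation; the abbreviation list, the `<stop>` marker, and the name `split_sentences_sketch` are assumptions made for the example.

```python
import re

ABBREVIATIONS = r"\b(Mr|Mrs|Ms|Dr|Prof)[.]"  # illustrative subset


def split_sentences_sketch(text: str) -> list[str]:
    """Split text into sentences using placeholder protection."""
    text = " " + text.replace("\n", " ") + " "
    # Protect periods that do not end a sentence.
    text = re.sub(ABBREVIATIONS, r"\1<prd>", text)
    text = re.sub(r"([0-9])[.]([0-9])", r"\1<prd>\2", text)
    # Mark the remaining terminators as sentence boundaries.
    for mark in ".?!":
        text = text.replace(mark, mark + "<stop>")
    # Restore the protected periods, then split and tidy up.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(split_sentences_sketch("Hello world. This is Dr. Smith. The value is 3.14."))
# ['Hello world.', 'This is Dr. Smith.', 'The value is 3.14.']
```

Protecting first and splitting last means the boundary step can stay a plain string replacement, at the cost of needing an explicit rule for every kind of non-terminal period.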

Word Counting

Uses NLTK's RegexpTokenizer with pattern r"\w+" which:

  • Matches word characters (letters, digits, underscores)
  • Excludes punctuation and whitespace
  • Splits contractions and hyphenated words at the punctuation, so "don't" and "well-known" each count as two words
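
The effect of the r"\w+" pattern can be checked without NLTK. The sketch below uses `re.findall` as a standard-library stand-in for RegexpTokenizer, on the assumption that both yield the same tokens for this pattern; `count_words_sketch` is a hypothetical name.

```python
import re


def count_words_sketch(text: str) -> int:
    """Count maximal runs of word characters (letters, digits, underscores)."""
    return len(re.findall(r"\w+", text))


print(count_words_sketch("The quick brown fox jumps over the lazy dog."))  # 9
print(count_words_sketch("don't"))          # 2: "don" and "t"
print(count_words_sketch("well-known"))     # 2: "well" and "known"
print(count_words_sketch("snake_case 42"))  # 2: underscores and digits are word characters
```

These counts matter for length-constrained instructions: a response near a word limit may pass or fail depending on how many contractions and hyphenated terms it contains.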

NLTK Resource Management

The module automatically downloads required NLTK resources:

  • Checks for 'punkt' tokenizer availability
  • Downloads on first use if not found
  • Wraps the availability check in try-except so the download runs only when the lookup fails
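
The check-then-download pattern can be sketched independently of NLTK by passing the lookup and download functions in as arguments. `ensure_resource` and the fake functions are hypothetical and exist only to make the example runnable offline. With NLTK installed, the equivalent call would presumably pass `nltk.data.find` and `nltk.download`, the former raising LookupError when the resource is missing.

```python
from typing import Callable


def ensure_resource(
    find: Callable[[str], object],
    download: Callable[[str], object],
    path: str,
    name: str,
) -> bool:
    """Return True if a download was triggered, False if the resource was present."""
    try:
        find(path)
        return False
    except LookupError:
        download(name)
        return True


# Fake lookup and download functions that simulate a local resource store.
installed: set[str] = set()


def fake_find(path: str) -> None:
    if path not in installed:
        raise LookupError(path)


def fake_download(name: str) -> None:
    installed.add("tokenizers/" + name)


print(ensure_resource(fake_find, fake_download, "tokenizers/punkt", "punkt"))  # True
print(ensure_resource(fake_find, fake_download, "tokenizers/punkt", "punkt"))  # False
```

The second call finds the resource and skips the download, which is the behaviour the try-except is meant to provide.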
