Implementation:MarketSquare Robotframework browser Crawling Keywords

Knowledge Sources	MarketSquare_Robotframework_browser
Domains	Crawling, WebAutomation
Last Updated	2026-02-12 05:40 GMT

Overview

Provides the Crawl Site keyword, which discovers and visits the pages reachable by links under the same base URL, up to configurable page and depth limits, and executes a configurable keyword on each page.

Description

The Crawling class extends LibraryComponent and provides a single Robot Framework keyword, Crawl Site, for automated web crawling. The keyword accepts four optional arguments: a starting url (the current page URL is used when it is omitted), a page_crawl_keyword that is executed on every discovered page (default take_screenshot), a max_number_of_page_to_crawl limit (default 1000), and a max_depth_to_crawl limit on consecutive link depth (default 50).

The crawler uses a breadth-first-like approach: it maintains a list of URL/depth tuples to visit, restricts crawling to the same base URL domain, and tracks already-visited URLs to avoid duplicates. The internal _gather_links method extracts all anchor href values from the current page using XPath and filters out download links via JavaScript evaluation. The _build_urls_to_crawl method filters newly found URLs against the base domain, the maximum depth, and the URLs that are already crawled or queued. Each page visit invokes the configured keyword through Robot Framework's BuiltIn().run_keyword mechanism.
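The crawl loop described above can be illustrated with a small, browser-free simulation. This is a minimal sketch and not the library's code: the names crawl, FAKE_SITE, visit, and links_of are hypothetical, the link graph is an in-memory dictionary, a FIFO deque is assumed for the list of URL/depth tuples, and "same base URL" is approximated by comparing host names.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory site: each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",  # different host, should be skipped
    ],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(start_url, visit, links_of, max_pages=1000, max_depth=50):
    """Visit pages breadth-first and return the URLs that were crawled."""
    base_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])  # (url, depth) tuples still to visit
    known = {start_url}              # crawled or queued URLs
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        visit(url)                   # stands in for running the page keyword
        crawled.append(url)
        for link in links_of(url):
            if urlparse(link).netloc != base_host:
                continue             # stay on the starting host
            if depth + 1 > max_depth:
                continue             # link would exceed the depth limit
            if link in known:
                continue             # already crawled or queued
            known.add(link)
            queue.append((link, depth + 1))
    return crawled


def fake_links(url):
    """Look up outgoing links in the hypothetical site."""
    return FAKE_SITE.get(url, [])
```

In this sketch the visit callback plays the role of page_crawl_keyword, and the two limits act independently: max_pages stops the loop early, while max_depth only prevents deeper links from being queued.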

Usage

Use this keyword when you need to perform automated site-wide testing such as taking screenshots of every page, checking for broken pages, or performing accessibility audits across an entire web application domain.

Code Reference

Source Location

Repository: MarketSquare_Robotframework_browser
File: Browser/keywords/crawling.py
Lines: 1-111

Signature

class Crawling(LibraryComponent):

    def crawl_site(
        self,
        url: str | None = None,
        page_crawl_keyword="take_screenshot",
        max_number_of_page_to_crawl: int = 1000,
        max_depth_to_crawl: int = 50,
    ):

Import

from Browser.keywords.crawling import Crawling

I/O Contract

Inputs

Name	Type	Required	Description
url	str or None	No	The starting URL to begin crawling from. If not provided, the current page URL is used.
page_crawl_keyword	str	No	Robot Framework keyword name to execute on every crawled page. Default is take_screenshot.
max_number_of_page_to_crawl	int	No	Upper limit on the number of pages to crawl. Crawling stops when this limit is reached. Default is 1000.
max_depth_to_crawl	int	No	Upper limit on consecutive link depth from the start page. Crawling stops when no links remain under this depth. Default is 50.
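How the depth limit and the same-base-URL restriction might prune candidate links can be sketched as a standalone filter. The function below is an illustration only: build_urls_to_crawl and its parameters are hypothetical names loosely modelled on the description of _build_urls_to_crawl, and the host comparison is an assumption about what "same base URL" means.

```python
from urllib.parse import urlparse


def build_urls_to_crawl(new_hrefs, base_url, depth, max_depth, crawled, queued):
    """Return (url, depth) tuples for links that should be visited next.

    new_hrefs: links gathered from the current page
    depth:     depth that the new links would have
    crawled:   set of URLs already visited
    queued:    list of (url, depth) tuples already waiting to be visited
    """
    if depth > max_depth:
        return []  # nothing at this depth may be crawled
    base_host = urlparse(base_url).netloc
    known = set(crawled) | {url for url, _ in queued}
    result = []
    for href in new_hrefs:
        if urlparse(href).netloc != base_host:
            continue  # outside the base URL
        if href in known:
            continue  # already crawled, queued, or seen earlier in this batch
        known.add(href)
        result.append((href, depth))
    return result
```

With this shape, lowering max_depth_to_crawl shrinks the set of queued links, whereas max_number_of_page_to_crawl would be enforced by the loop that consumes the queue.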

Outputs

Name	Type	Description
crawled_urls	list[str]	List of all URLs that were successfully crawled during the site crawl.

Usage Examples

Robot Framework

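*** Settings ***
Library    Browser
# Collections provides the Log List keyword used below.
Library    Collections

*** Test Cases ***
Crawl Entire Site With Screenshots
    New Browser    chromium    headless=true
    New Page    https://example.com
    ${crawled}=    Crawl Site
    ${count}=    Get Length    ${crawled}
    Log    Crawled ${count} pages

Crawl With Custom Keyword And Limits
    New Browser    chromium    headless=true
    ${crawled}=    Crawl Site
    ...    url=https://example.com
    ...    page_crawl_keyword=My Custom Check
    ...    max_number_of_page_to_crawl=100
    ...    max_depth_to_crawl=5
    Log List    ${crawled}

*** Keywords ***
My Custom Check
    # Hypothetical placeholder: replace with the check to run on each page.
    ${title}=    Get Title
    Should Not Be Empty    ${title}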

Related Pages

Environment:MarketSquare_Robotframework_browser_Python_Runtime
