Implementation:MarketSquare Robotframework browser Crawling Keywords
| Knowledge Sources | |
|---|---|
| Domains | Crawling, WebAutomation |
| Last Updated | 2026-02-12 05:40 GMT |
Overview
Provides the Crawl Site keyword, which automatically discovers and visits pages under the same base URL, up to configurable page-count and link-depth limits, and executes a configurable keyword on each page.
Description
The Crawling class extends LibraryComponent and provides a single Robot Framework keyword, Crawl Site, for automated web crawling. The keyword accepts four optional arguments: a starting url (the current page URL is used when it is omitted), a page_crawl_keyword that is executed on every discovered page (default take_screenshot), a max_number_of_page_to_crawl limit (default 1000), and a max_depth_to_crawl limit on consecutive link depth (default 50).
The crawler uses a breadth-first-like approach. It maintains a list of URL/depth tuples to visit, restricts crawling to the base URL of the starting page, and tracks already-visited URLs to avoid duplicates. The internal _gather_links method extracts the href values of all anchors on the current page using XPath and filters out download links via JavaScript evaluation. The _build_urls_to_crawl method then filters the newly found URLs against the base URL, the maximum depth, and the already-crawled and already-queued URLs. On each page visit, the configured keyword is invoked through Robot Framework's BuiltIn().run_keyword mechanism.
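The crawl loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the names crawl, get_links, visit, links_of, and SITE are hypothetical, the queue is consumed in FIFO order, and "same base URL" is approximated with a simple scheme-and-host prefix check. A dictionary stands in for the browser, so no pages are actually loaded.

```python
from urllib.parse import urlparse


def base_url_of(url):
    """Return scheme://host, used to keep the crawl on one site."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def build_urls_to_crawl(new_hrefs, depth, base_url, crawled, queued, max_depth):
    """Keep links that are on-site, within the depth limit, and not yet seen."""
    next_depth = depth + 1
    if next_depth > max_depth:
        return []
    queued_urls = {url for url, _ in queued}
    selected = []
    for href in new_hrefs:
        if not href.startswith(base_url):  # assumption: prefix check on the base URL
            continue
        if href in crawled or href in queued_urls:
            continue
        queued_urls.add(href)
        selected.append((href, next_depth))
    return selected


def crawl(start_url, get_links, visit, max_pages=1000, max_depth=50):
    """Visit pages reachable from start_url and return them in visit order."""
    base_url = base_url_of(start_url)
    to_visit = [(start_url, 0)]  # list of (URL, depth) tuples
    crawled = []
    while to_visit and len(crawled) < max_pages:
        url, depth = to_visit.pop(0)  # FIFO order is an assumption of this sketch
        visit(url)  # stands in for running the page keyword
        crawled.append(url)
        to_visit.extend(
            build_urls_to_crawl(
                get_links(url), depth, base_url, crawled, to_visit, max_depth
            )
        )
    return crawled


# Hypothetical link graph standing in for a real site.
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.example.org/",
    ],
    "https://example.com/a": ["https://example.com/b", "https://example.com/a/deep"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/a/deep": [],
}


def links_of(url):
    return SITE.get(url, [])


visited = []
result = crawl("https://example.com/", links_of, visited.append, max_depth=1)
print(result)
```

With max_depth=1 the sketch visits the start page and its two on-site links, skips the off-site link, and never queues the page two hops away. Raising max_depth to 2 would add that deeper page, while max_pages caps the total regardless of depth.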
Usage
Use this keyword when you need to perform automated site-wide testing such as taking screenshots of every page, checking for broken pages, or performing accessibility audits across an entire web application domain.
Code Reference
Source Location
- Repository: MarketSquare_Robotframework_browser
- File: Browser/keywords/crawling.py
- Lines: 1-111
Signature
class Crawling(LibraryComponent):
    def crawl_site(
        self,
        url: str | None = None,
        page_crawl_keyword="take_screenshot",
        max_number_of_page_to_crawl: int = 1000,
        max_depth_to_crawl: int = 50,
    ):
Import
from Browser.keywords.crawling import Crawling
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| url | str or None | No | The starting URL to begin crawling from. If not provided, the current page URL is used. |
| page_crawl_keyword | str | No | Robot Framework keyword name to execute on every crawled page. Default is take_screenshot. |
| max_number_of_page_to_crawl | int | No | Upper limit on the number of pages to crawl. Crawling stops when this limit is reached. Default is 1000. |
| max_depth_to_crawl | int | No | Upper limit on consecutive link depth from the start page. Crawling stops when no links remain under this depth. Default is 50. |
Outputs
| Name | Type | Description |
|---|---|---|
| crawled_urls | list[str] | List of all URLs that were successfully crawled during the site crawl. |
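Because the keyword returns a plain list of URL strings, the result can be post-processed in ordinary Python, for example inside a custom keyword library. The sketch below compares the crawled URLs against a list of pages that are expected to be reachable. The helper names normalize and pages_not_reached, the sample URLs, and the normalization rule (dropping fragments and trailing slashes) are assumptions made for illustration and may not suit every site.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Drop the fragment and any trailing slash so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc}{path}{query}"


def pages_not_reached(crawled_urls, expected_urls):
    """Return the expected URLs that do not appear in the crawl result."""
    reached = {normalize(url) for url in crawled_urls}
    return sorted(url for url in expected_urls if normalize(url) not in reached)


# Hypothetical crawl result and expected page list.
crawled = [
    "https://example.com/",
    "https://example.com/docs/",
    "https://example.com/about#team",
]
expected = [
    "https://example.com",
    "https://example.com/docs",
    "https://example.com/about",
    "https://example.com/pricing",
]
missing = pages_not_reached(crawled, expected)
print(missing)
```

A non-empty result points at pages that no crawled page links to, or that were cut off by the page or depth limits.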
Usage Examples
Robot Framework
*** Test Cases ***
Crawl Entire Site With Screenshots
    New Browser    chromium    headless=true
    New Page    https://example.com
    ${crawled}=    Crawl Site
    Log    Crawled ${crawled.__len__()} pages
Crawl With Custom Keyword And Limits
    New Browser    chromium    headless=true
    ${crawled}=    Crawl Site
    ...    url=https://example.com
    ...    page_crawl_keyword=My Custom Check
    ...    max_number_of_page_to_crawl=100
    ...    max_depth_to_crawl=5
    Log List    ${crawled}