🌐 Web ScrapingPro Plan

SCRAPE_BY_REGEX

Scrape content from a webpage using a regular expression pattern.

Formula Signature
=SCRAPE_BY_REGEX(url, regex, [removeHtml], [group], [renderJs])

Returns: string or 2D array (multiple matches returned as separate rows)

Overview

SCRAPE_BY_REGEX extracts content from webpages by applying a regular expression pattern against the page source. Unlike CSS or XPath selectors that navigate the document structure, regex operates directly on the raw HTML or text content, making it uniquely suited for extracting specific patterns like prices, phone numbers, email addresses, dates, product codes, API keys, and any other text that follows a predictable format.

Parameters

ParameterTypeRequiredDescription
urlstringYesThe full URL of the webpage to scrape (must include https:// or http://).
regexstringYesRegular expression pattern to match against the page content. Use standard regex syntax. Backslashes must be doubled in Sheets formulas (e.g., "\\d+" for digits).
removeHtmlbooleanNo (FALSE)Optional. Set to TRUE to strip all HTML tags before applying the regex, leaving only visible text content. Default is FALSE (regex applied to raw HTML source).
groupnumberNo (0)Optional. The capture group to return. 0 returns the full match, 1 returns the first capture group, 2 the second, etc. Default is 0.
renderJsbooleanNoOptional. Set to TRUE to render JavaScript before applying the regex. Required for dynamically loaded content. Slower than standard mode.

Examples

1

Extract all prices from a page

Matches dollar amounts in the format $XX.XX across the entire page. Returns all matches as separate rows, making it easy to see all prices at a glance.

fx
=SCRAPE_BY_REGEX("https://store.example.com/deals", "\$[0-9,]+\.\d{2}")

Output

$49.99
$129.00
$24.95
$89.50
2

Extract email addresses from a contact page

Uses a standard email regex pattern with removeHtml enabled to find all email addresses on the page regardless of how they are wrapped in HTML.

fx
=SCRAPE_BY_REGEX("https://example.com/contact", "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", TRUE)

Output

info@example.com
support@example.com
sales@example.com
3

Extract values using a capture group

Matches the pattern "SKU: ABC-123" and returns only the capture group (the SKU code itself, without the "SKU:" prefix). Group 1 isolates the value inside the parentheses.

fx
=SCRAPE_BY_REGEX("https://store.example.com/product/123", "SKU:\s*([A-Z0-9-]+)", TRUE, 1)

Output

WH-1000XM5
4

Extract phone numbers from business listings

Matches US phone numbers in the format (XXX) XXX-XXXX. The removeHtml parameter ensures phone numbers split across HTML tags are properly captured.

fx
=SCRAPE_BY_REGEX("https://directory.example.com/plumbers", "\(\d{3}\)\s*\d{3}-\d{4}", TRUE)

Output

(555) 123-4567
(555) 987-6543
(555) 246-8135

Use Cases

E-commerce

Price Monitoring Across Multiple Retailers

Track product prices across different retailers using regex patterns that match each site's price format. Works even when HTML structures differ between sites, since the regex targets the price pattern itself rather than the surrounding markup.

Sales

Lead Generation from Business Directories

Extract contact information (emails, phone numbers, addresses) from business directories and yellow pages. Regex patterns reliably match structured data formats like phone numbers and emails regardless of how they are displayed on the page.

Inventory Management

Product Catalog SKU Extraction

Scrape product codes, SKUs, and model numbers from supplier catalogs and manufacturer websites. Use regex to match company-specific code formats (e.g., "XX-####-YY") and build comprehensive product databases.

Legal

Legal Document Reference Extraction

Extract case numbers, statute references, and regulation citations from legal databases and court websites. Regex precisely matches citation formats like "42 U.S.C. 1983" or case docket numbers.

Software Development

Version Number and Changelog Tracking

Monitor software release pages and changelogs to track version numbers, release dates, and update frequencies. Regex patterns like "v\d+\.\d+\.\d+" reliably extract semantic version numbers from any page format.

Marketing Research

Social Media Profile Data Collection

Extract follower counts, engagement metrics, and profile information from public social media pages. Regex handles the various number formats (1.2K, 1,200, 1200) used across different platforms.

Pro Tips

TIP

Always test your regex at regex101.com before using it in the spreadsheet. Paste the HTML source (or text content if using removeHtml) as the test string to verify your pattern matches correctly.

TIP

Use non-greedy quantifiers (*? and +?) instead of greedy ones (* and +) when matching content between delimiters. For example, "<b>.*?</b>" matches individual bold elements, while "<b>.*</b>" matches everything from the first <b> to the last </b>.

TIP

Combine removeHtml with simple text patterns for the most reliable results. Regex on clean text is much easier to write and maintain than regex on raw HTML with all its attributes and nested tags.

TIP

Use capture groups strategically: wrap only the part you want to extract in parentheses and set the group parameter accordingly. This lets you use surrounding text as context anchors without including it in the output.

TIP

For extracting structured data like tables, consider using SCRAPE_BY_CSS_PATH or SCRAPE_BY_XPATH instead. Regex works best for extracting specific data patterns (numbers, codes, emails) rather than navigating document structure.

The function fetches the target URL, optionally strips HTML tags (when removeHtml is TRUE), and then applies your regex pattern. All matches are returned, with multiple matches appearing as separate rows in your spreadsheet. If you use capture groups in your pattern, specify which group to return with the group parameter, where 0 returns the full match and 1 returns the first capture group.

The removeHtml parameter is particularly useful when the data you need spans across HTML tags. For example, a price might be rendered as "<span class="currency">$</span><span class="amount">29.99</span>", which is difficult to match with a single regex on raw HTML. Setting removeHtml to TRUE collapses this into "$29.99", making your regex pattern simpler and more reliable.

For websites that generate content with JavaScript, enable the renderJs parameter to use a headless browser. This ensures the regex is applied against the fully rendered page content rather than just the initial HTML source.

SCRAPE_BY_REGEX excels in scenarios where the target data does not have consistent HTML structure but does have a consistent text format. It is also valuable as a fallback when CSS and XPath selectors fail due to obfuscated or dynamically generated markup. Common applications include price monitoring, contact information extraction, SKU and product code collection, and data validation workflows.

Common Errors

No matches found

Cause: The regex pattern does not match any content on the page. This can happen if the pattern is too specific, the content is rendered via JavaScript, or HTML tags are interfering with text matching.

Fix: Try enabling removeHtml to match against plain text instead of raw HTML. Verify your regex at regex101.com using the page source as the test string. If the content is dynamically loaded, enable renderJs.

Error: URL and regex are required

Cause: One or both required parameters are missing or empty.

Fix: Ensure both the URL (including protocol) and regex pattern are provided as non-empty strings. Double-check that cell references point to cells with values.

Error: Invalid regular expression

Cause: The regex pattern contains a syntax error such as unmatched parentheses, invalid escape sequences, or malformed character classes.

Fix: Validate your regex at regex101.com. Common issues include unescaped special characters (parentheses, brackets, dots need backslashes if used literally), unmatched group delimiters, and invalid quantifier positions.

Frequently Asked Questions

In Google Sheets formulas, strings are enclosed in double quotes. Since regex uses backslashes for special characters (\d for digits, \s for whitespace), and the Sheets formula parser does not interpret backslashes as escape characters inside strings, you write them as single backslashes: "\d+" matches one or more digits. If you are seeing unexpected behavior, try the pattern with removeHtml set to TRUE to simplify the content being matched. Test your regex at regex101.com before using it in the function.

Group 0 (the default) returns the entire matched text. Group 1 returns only the text inside the first set of parentheses in your pattern. For example, with the pattern "Price: (\$[0-9.]+)" applied to the text "Price: $29.99", group 0 returns "Price: $29.99" and group 1 returns "$29.99". This is useful when you need to match a pattern for context but only extract part of it. You can have multiple capture groups (group 2, group 3, etc.) by adding more parenthesized sections to your pattern.

Enable removeHtml (set to TRUE) when: (1) The text you want to match is split across multiple HTML tags, such as a price displayed as "<span>$</span><span>29</span>.<span>99</span>". (2) HTML tags are interfering with your pattern matches. (3) You want to match against the visible text content only, ignoring all markup. (4) Your regex is designed for plain text, not HTML. Leave it as FALSE when you specifically need to match HTML attributes, tag names, or markup structure.

The regex engine supports inline flags using the (?flags) syntax at the beginning of your pattern. Use "(?i)" for case-insensitive matching, "(?s)" for single-line mode (dot matches newlines), and "(?m)" for multiline mode. For example, "(?i)price:\s*\$[0-9.]+" matches "Price:", "PRICE:", and "price:" variants. You can combine flags: "(?im)" enables both case-insensitive and multiline modes.

By default, the regex is applied to the raw HTML source, which includes all tags, attributes, and markup. For example, a pattern like "[A-Z][a-z]+" intended to match capitalized words might also match tag names like "Div" or "Span". Set removeHtml to TRUE to strip all HTML tags first, so your regex only sees the visible text content. Alternatively, make your regex more specific to exclude HTML patterns.

The regex engine supports full PCRE-compatible syntax including lookaheads, lookbehinds, non-capturing groups, lazy quantifiers, and character classes. However, extremely complex patterns with excessive backtracking (such as nested quantifiers like "(a+)+" ) can cause timeout errors. Keep patterns as simple and specific as possible. If you find yourself writing very long regex patterns, consider using SCRAPE_BY_CSS_PATH or SCRAPE_BY_XPATH to narrow down the content first, then use regex on the result.

REGEXEXTRACT is a native Google Sheets function that applies a regex to text already in your spreadsheet. SCRAPE_BY_REGEX fetches a webpage and applies the regex to the page content, combining web fetching and extraction in a single step. You could achieve similar results by using SCRAPE_BY_CSS_PATH to get the page text and then REGEXEXTRACT on the result, but SCRAPE_BY_REGEX is more efficient as it processes everything server-side in one request and can return multiple matches.

By default, the dot (.) in regex does not match newline characters. If you need to match across line breaks, use the inline flag "(?s)" at the start of your pattern to enable single-line mode, where dot matches any character including newlines. For example, "(?s)<div class="product">.*?</div>" matches a div and all its content across multiple lines. The (?m) flag makes ^ and $ match the start/end of each line rather than the entire string.

Related Functions

Start using SCRAPE_BY_REGEX today

Install Unlimited Sheets to get SCRAPE_BY_REGEX and 41 other powerful functions in Google Sheets.