Question 1

How do I write regex patterns in Google Sheets formulas?

Accepted Answer

In Google Sheets formulas, strings are enclosed in double quotes. Since regex uses backslashes for special characters (\d for digits, \s for whitespace), and the Sheets formula parser does not interpret backslashes as escape characters inside strings, you write them as single backslashes: "\d+" matches one or more digits. If you are seeing unexpected behavior, try the pattern with removeHtml set to TRUE to simplify the content being matched. Test your regex at regex101.com before using it in the function.
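The effect of doubling a backslash can be checked outside Sheets. The sketch below uses Python's `re` module as a stand-in for the function's regex engine (an assumption; the two engines may differ in other details) and raw strings to mimic what the engine receives when the formula text is passed through unchanged.

```python
import re

text = "Order 66 shipped"

# What the engine receives for the Sheets string "\d+": backslash, d, plus.
single = r"\d+"
# What it would receive for "\\d+": two backslashes, d, plus,
# i.e. a literal backslash followed by one or more "d" characters.
doubled = r"\\d+"

single_match = re.search(single, text)
doubled_match = re.search(doubled, text)

print(single_match.group(0))  # 66
print(doubled_match)          # None
```

The doubled form only matches text that contains an actual backslash, which is rarely what a digit pattern is meant to do.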

Question 2

What is the difference between group 0 and group 1?

Accepted Answer

Group 0 (the default) returns the entire matched text. Group 1 returns only the text inside the first set of parentheses in your pattern. For example, with the pattern "Price: (\$[0-9.]+)" applied to the text "Price: $29.99", group 0 returns "Price: $29.99" and group 1 returns "$29.99". This is useful when you need to match a pattern for context but only extract part of it. You can have multiple capture groups (group 2, group 3, etc.) by adding more parenthesized sections to your pattern.
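The group numbering can be tried out with the same pattern and text. This is a minimal sketch in Python's `re` module, used here only to illustrate how capture groups are numbered; the second pattern with two groups is a hypothetical variation, not taken from the function's documentation.

```python
import re

text = "Price: $29.99"

# One capture group: the parenthesized part of the pattern.
match = re.search(r"Price: (\$[0-9.]+)", text)
full_match = match.group(0)   # entire matched text
first_group = match.group(1)  # text inside the first parentheses

print(full_match)   # Price: $29.99
print(first_group)  # $29.99

# Two capture groups: currency symbol and amount captured separately.
split = re.search(r"Price: (\$)([0-9.]+)", text)
symbol = split.group(1)
amount = split.group(2)

print(symbol)  # $
print(amount)  # 29.99
```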

Question 3

When should I use removeHtml?

Accepted Answer

Enable removeHtml (set to TRUE) when: (1) The text you want to match is split across multiple HTML tags, such as a price that renders as "$29.99" but whose currency symbol and amount sit in separate elements. (2) HTML tags are interfering with your pattern matches. (3) You want to match against the visible text content only, ignoring all markup. (4) Your regex is designed for plain text, not HTML. Leave it as FALSE when you specifically need to match HTML attributes, tag names, or markup structure.
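Case (1) is easy to reproduce. The sketch below assumes a hypothetical markup snippet and a naive `strip_tags` helper written for this example; it only approximates what removeHtml does, whose exact stripping rules are not documented here.

```python
import re

# Hypothetical markup: the currency symbol and the amount sit in separate tags.
html = '<span class="currency">$</span><span class="amount">29.99</span>'
price_pattern = r"\$[0-9.]+"


def strip_tags(source: str) -> str:
    """Naive tag stripper, a stand-in for removeHtml = TRUE."""
    return re.sub(r"<[^>]+>", "", source)


raw_match = re.search(price_pattern, html)               # regex sees the tags
text_match = re.search(price_pattern, strip_tags(html))  # regex sees "$29.99"

print(raw_match)            # None
print(text_match.group(0))  # $29.99
```

Against the raw source the pattern fails because a tag sits between the "$" and the digits; once the tags are gone the price is contiguous again.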

Question 4

Can I use regex flags like case-insensitive matching?

Accepted Answer

The regex engine supports inline flags using the (?flags) syntax at the beginning of your pattern. Use "(?i)" for case-insensitive matching, "(?s)" for single-line mode (dot matches newlines), and "(?m)" for multiline mode. For example, "(?i)price:\s*\$[0-9.]+" matches "Price:", "PRICE:", and "price:" variants. You can combine flags: "(?im)" enables both case-insensitive and multiline modes.
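The inline flags behave the same way in Python's `re` module, which makes it a convenient place to experiment before pasting a pattern into a formula. The sample strings below are made up for illustration, and the assumption is that the function's engine treats these flags like Python does.

```python
import re

# (?i): one pattern covers every capitalization of "price:".
price_pattern = r"(?i)price:\s*\$[0-9.]+"
samples = ["Price: $5.00", "PRICE:$12", "price:   $0.99"]
matched = [re.search(price_pattern, s).group(0) for s in samples]
print(matched)  # ['Price: $5.00', 'PRICE:$12', 'price:   $0.99']

# (?im): case-insensitive, and ^ matches at the start of every line.
text = "Subtotal: 1\nTOTAL: 2"
line_totals = re.findall(r"(?im)^total", text)
print(line_totals)  # ['TOTAL']
```

In the second example "Subtotal" is skipped because "total" is not at the start of its line.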

Question 5

Why does my regex match HTML tags instead of the text I want?

Accepted Answer

By default, the regex is applied to the raw HTML source, which includes all tags, attributes, and markup. For example, a pattern like "[A-Z][a-z]+" intended to match capitalized words might also match tag names like "Div" or "Span". Set removeHtml to TRUE to strip all HTML tags first, so your regex only sees the visible text content. Alternatively, make your regex more specific to exclude HTML patterns.
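The tag-name problem can be seen on a small, hypothetical snippet. As a sketch, the tag stripping here is a simple regex substitution standing in for removeHtml; the real option may handle edge cases (scripts, entities, comments) differently.

```python
import re

# Hypothetical page fragment with capitalized tag names.
html = "<Div><Span>Hello World</Span></Div>"
word_pattern = r"[A-Z][a-z]+"

# Applied to the raw source, the pattern also picks up tag names.
raw_words = re.findall(word_pattern, html)

# Applied after a naive tag strip, only the visible text remains.
visible_text = re.sub(r"<[^>]+>", "", html)
text_words = re.findall(word_pattern, visible_text)

print(raw_words)   # ['Div', 'Span', 'Hello', 'World', 'Span', 'Div']
print(text_words)  # ['Hello', 'World']
```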

Question 6

Is there a limit to how complex my regex can be?

Accepted Answer

The regex engine supports full PCRE-compatible syntax including lookaheads, lookbehinds, non-capturing groups, lazy quantifiers, and character classes. However, extremely complex patterns with excessive backtracking (such as nested quantifiers like "(a+)+" ) can cause timeout errors. Keep patterns as simple and specific as possible. If you find yourself writing very long regex patterns, consider using SCRAPE_BY_CSS_PATH or SCRAPE_BY_XPATH to narrow down the content first, then use regex on the result.
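A nested quantifier can often be flattened without changing what the pattern accepts. The sketch below compares "(a+)+$" with the simpler "a+$" on a few short, made-up inputs in Python's `re` module; the inputs are deliberately small so the nested form finishes quickly, and timeout behavior in the actual function is not modeled.

```python
import re

nested = re.compile(r"(a+)+$")  # many ways to split a run of "a" between the two quantifiers
flat = re.compile(r"a+$")       # accepts the same strings with a single quantifier

samples = ["aaaa", "baaa", "aaab", "a" * 12 + "!"]

# Each pair is (nested result, flat result) for one sample.
results = [(bool(nested.search(s)), bool(flat.search(s))) for s in samples]
print(results)  # [(True, True), (True, True), (False, False), (False, False)]
```

On the last sample the nested form has to try every way of splitting the run of "a" characters before giving up, and that work grows rapidly as the run gets longer, which is why the flat form is preferable.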

Question 7

How does SCRAPE_BY_REGEX compare to REGEXEXTRACT in Google Sheets?

Accepted Answer

REGEXEXTRACT is a native Google Sheets function that applies a regex to text already in your spreadsheet. SCRAPE_BY_REGEX fetches a webpage and applies the regex to the page content, combining web fetching and extraction in a single step. You could achieve similar results by using SCRAPE_BY_CSS_PATH to get the page text and then REGEXEXTRACT on the result, but SCRAPE_BY_REGEX is more efficient as it processes everything server-side in one request and can return multiple matches.

Question 8

Can the regex pattern match across multiple lines?

Accepted Answer

By default, the dot (.) in regex does not match newline characters. If you need to match across line breaks, use the inline flag "(?s)" at the start of your pattern to enable single-line mode, where dot matches any character including newlines. For example, "(?s)<div>.*?</div>" matches a div and all its content across multiple lines. The (?m) flag makes ^ and $ match the start/end of each line rather than the entire string.

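The difference between the two flags can be checked on a small multi-line snippet. This is a sketch in Python's `re` module with hypothetical markup, assuming the function's engine handles "(?s)" and "(?m)" the same way.

```python
import re

# Hypothetical markup spread over several lines.
html = "<div>\n  <p>First line</p>\n  <p>Second line</p>\n</div>"

# Without (?s) the dot stops at the first newline, so nothing matches.
without_flag = re.search(r"<div>.*?</div>", html)

# With (?s) the dot crosses newlines and the whole div is matched.
with_dotall = re.search(r"(?s)<div>.*?</div>", html)

# With (?m), ^ matches at the start of every line, not just the string.
line_starts = re.findall(r"(?m)^\s*<p>", html)
string_start_only = re.findall(r"^\s*<p>", html)

print(without_flag)                   # None
print(with_dotall.group(0) == html)   # True
print(len(line_starts))               # 2
print(string_start_only)              # []
```
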
Parameter	Type	Required	Description
`url`	string	Yes	The full URL of the webpage to scrape (must include https:// or http://).
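`regex`	string	Yes	Regular expression pattern to match against the page content. Use standard regex syntax. Write backslashes singly in Sheets formulas (e.g., "\d+" for digits); Sheets string literals do not treat the backslash as an escape character, so doubling it would match a literal backslash instead.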
`removeHtml`	boolean	No (FALSE)	Optional. Set to TRUE to strip all HTML tags before applying the regex, leaving only visible text content. Default is FALSE (regex applied to raw HTML source).
`group`	number	No (0)	Optional. The capture group to return. 0 returns the full match, 1 returns the first capture group, 2 the second, etc. Default is 0.
`renderJs`	boolean	No	Optional. Set to TRUE to render JavaScript before applying the regex. Required for dynamically loaded content. Slower than standard mode.
