SCRAPE_BY_REGEX
Scrape content from a webpage using a regular expression pattern.
=SCRAPE_BY_REGEX(url, regex, [removeHtml], [group], [renderJs])Returns: string or 2D array (multiple matches returned as separate rows)
Overview
SCRAPE_BY_REGEX extracts content from webpages by applying a regular expression pattern against the page source. Unlike CSS or XPath selectors that navigate the document structure, regex operates directly on the raw HTML or text content, making it uniquely suited for extracting specific patterns like prices, phone numbers, email addresses, dates, product codes, API keys, and any other text that follows a predictable format.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
url | string | Yes | The full URL of the webpage to scrape (must include https:// or http://). |
regex | string | Yes | Regular expression pattern to match against the page content. Use standard regex syntax. Backslashes must be doubled in Sheets formulas (e.g., "\\d+" for digits). |
removeHtml | boolean | No (FALSE) | Optional. Set to TRUE to strip all HTML tags before applying the regex, leaving only visible text content. Default is FALSE (regex applied to raw HTML source). |
group | number | No (0) | Optional. The capture group to return. 0 returns the full match, 1 returns the first capture group, 2 the second, etc. Default is 0. |
renderJs | boolean | No | Optional. Set to TRUE to render JavaScript before applying the regex. Required for dynamically loaded content. Slower than standard mode. |
Examples
Extract all prices from a page
Matches dollar amounts in the format $XX.XX across the entire page. Returns all matches as separate rows, making it easy to see all prices at a glance.
=SCRAPE_BY_REGEX("https://store.example.com/deals", "\$[0-9,]+\.\d{2}")Output
| $49.99 |
| $129.00 |
| $24.95 |
| $89.50 |
Extract email addresses from a contact page
Uses a standard email regex pattern with removeHtml enabled to find all email addresses on the page regardless of how they are wrapped in HTML.
=SCRAPE_BY_REGEX("https://example.com/contact", "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", TRUE)Output
| info@example.com |
| support@example.com |
| sales@example.com |
Extract values using a capture group
Matches the pattern "SKU: ABC-123" and returns only the capture group (the SKU code itself, without the "SKU:" prefix). Group 1 isolates the value inside the parentheses.
=SCRAPE_BY_REGEX("https://store.example.com/product/123", "SKU:\s*([A-Z0-9-]+)", TRUE, 1)Output
WH-1000XM5Extract phone numbers from business listings
Matches US phone numbers in the format (XXX) XXX-XXXX. The removeHtml parameter ensures phone numbers split across HTML tags are properly captured.
=SCRAPE_BY_REGEX("https://directory.example.com/plumbers", "\(\d{3}\)\s*\d{3}-\d{4}", TRUE)Output
| (555) 123-4567 |
| (555) 987-6543 |
| (555) 246-8135 |
Use Cases
Price Monitoring Across Multiple Retailers
Track product prices across different retailers using regex patterns that match each site's price format. Works even when HTML structures differ between sites, since the regex targets the price pattern itself rather than the surrounding markup.
Lead Generation from Business Directories
Extract contact information (emails, phone numbers, addresses) from business directories and yellow pages. Regex patterns reliably match structured data formats like phone numbers and emails regardless of how they are displayed on the page.
Product Catalog SKU Extraction
Scrape product codes, SKUs, and model numbers from supplier catalogs and manufacturer websites. Use regex to match company-specific code formats (e.g., "XX-####-YY") and build comprehensive product databases.
Legal Document Reference Extraction
Extract case numbers, statute references, and regulation citations from legal databases and court websites. Regex precisely matches citation formats like "42 U.S.C. 1983" or case docket numbers.
Version Number and Changelog Tracking
Monitor software release pages and changelogs to track version numbers, release dates, and update frequencies. Regex patterns like "v\d+\.\d+\.\d+" reliably extract semantic version numbers from any page format.
Social Media Profile Data Collection
Extract follower counts, engagement metrics, and profile information from public social media pages. Regex handles the various number formats (1.2K, 1,200, 1200) used across different platforms.
Pro Tips
Always test your regex at regex101.com before using it in the spreadsheet. Paste the HTML source (or text content if using removeHtml) as the test string to verify your pattern matches correctly.
Use non-greedy quantifiers (*? and +?) instead of greedy ones (* and +) when matching content between delimiters. For example, "<b>.*?</b>" matches individual bold elements, while "<b>.*</b>" matches everything from the first <b> to the last </b>.
Combine removeHtml with simple text patterns for the most reliable results. Regex on clean text is much easier to write and maintain than regex on raw HTML with all its attributes and nested tags.
Use capture groups strategically: wrap only the part you want to extract in parentheses and set the group parameter accordingly. This lets you use surrounding text as context anchors without including it in the output.
For extracting structured data like tables, consider using SCRAPE_BY_CSS_PATH or SCRAPE_BY_XPATH instead. Regex works best for extracting specific data patterns (numbers, codes, emails) rather than navigating document structure.
The function fetches the target URL, optionally strips HTML tags (when removeHtml is TRUE), and then applies your regex pattern. All matches are returned, with multiple matches appearing as separate rows in your spreadsheet. If you use capture groups in your pattern, specify which group to return with the group parameter, where 0 returns the full match and 1 returns the first capture group.
The removeHtml parameter is particularly useful when the data you need spans across HTML tags. For example, a price might be rendered as "<span class="currency">$</span><span class="amount">29.99</span>", which is difficult to match with a single regex on raw HTML. Setting removeHtml to TRUE collapses this into "$29.99", making your regex pattern simpler and more reliable.
For websites that generate content with JavaScript, enable the renderJs parameter to use a headless browser. This ensures the regex is applied against the fully rendered page content rather than just the initial HTML source.
SCRAPE_BY_REGEX excels in scenarios where the target data does not have consistent HTML structure but does have a consistent text format. It is also valuable as a fallback when CSS and XPath selectors fail due to obfuscated or dynamically generated markup. Common applications include price monitoring, contact information extraction, SKU and product code collection, and data validation workflows.
Common Errors
No matches foundCause: The regex pattern does not match any content on the page. This can happen if the pattern is too specific, the content is rendered via JavaScript, or HTML tags are interfering with text matching.
Fix: Try enabling removeHtml to match against plain text instead of raw HTML. Verify your regex at regex101.com using the page source as the test string. If the content is dynamically loaded, enable renderJs.
Error: URL and regex are requiredCause: One or both required parameters are missing or empty.
Fix: Ensure both the URL (including protocol) and regex pattern are provided as non-empty strings. Double-check that cell references point to cells with values.
Error: Invalid regular expressionCause: The regex pattern contains a syntax error such as unmatched parentheses, invalid escape sequences, or malformed character classes.
Fix: Validate your regex at regex101.com. Common issues include unescaped special characters (parentheses, brackets, dots need backslashes if used literally), unmatched group delimiters, and invalid quantifier positions.
Frequently Asked Questions
In Google Sheets formulas, strings are enclosed in double quotes. Since regex uses backslashes for special characters (\d for digits, \s for whitespace), and the Sheets formula parser does not interpret backslashes as escape characters inside strings, you write them as single backslashes: "\d+" matches one or more digits. If you are seeing unexpected behavior, try the pattern with removeHtml set to TRUE to simplify the content being matched. Test your regex at regex101.com before using it in the function.
Group 0 (the default) returns the entire matched text. Group 1 returns only the text inside the first set of parentheses in your pattern. For example, with the pattern "Price: (\$[0-9.]+)" applied to the text "Price: $29.99", group 0 returns "Price: $29.99" and group 1 returns "$29.99". This is useful when you need to match a pattern for context but only extract part of it. You can have multiple capture groups (group 2, group 3, etc.) by adding more parenthesized sections to your pattern.
Enable removeHtml (set to TRUE) when: (1) The text you want to match is split across multiple HTML tags, such as a price displayed as "<span>$</span><span>29</span>.<span>99</span>". (2) HTML tags are interfering with your pattern matches. (3) You want to match against the visible text content only, ignoring all markup. (4) Your regex is designed for plain text, not HTML. Leave it as FALSE when you specifically need to match HTML attributes, tag names, or markup structure.
The regex engine supports inline flags using the (?flags) syntax at the beginning of your pattern. Use "(?i)" for case-insensitive matching, "(?s)" for single-line mode (dot matches newlines), and "(?m)" for multiline mode. For example, "(?i)price:\s*\$[0-9.]+" matches "Price:", "PRICE:", and "price:" variants. You can combine flags: "(?im)" enables both case-insensitive and multiline modes.
By default, the regex is applied to the raw HTML source, which includes all tags, attributes, and markup. For example, a pattern like "[A-Z][a-z]+" intended to match capitalized words might also match tag names like "Div" or "Span". Set removeHtml to TRUE to strip all HTML tags first, so your regex only sees the visible text content. Alternatively, make your regex more specific to exclude HTML patterns.
The regex engine supports full PCRE-compatible syntax including lookaheads, lookbehinds, non-capturing groups, lazy quantifiers, and character classes. However, extremely complex patterns with excessive backtracking (such as nested quantifiers like "(a+)+" ) can cause timeout errors. Keep patterns as simple and specific as possible. If you find yourself writing very long regex patterns, consider using SCRAPE_BY_CSS_PATH or SCRAPE_BY_XPATH to narrow down the content first, then use regex on the result.
REGEXEXTRACT is a native Google Sheets function that applies a regex to text already in your spreadsheet. SCRAPE_BY_REGEX fetches a webpage and applies the regex to the page content, combining web fetching and extraction in a single step. You could achieve similar results by using SCRAPE_BY_CSS_PATH to get the page text and then REGEXEXTRACT on the result, but SCRAPE_BY_REGEX is more efficient as it processes everything server-side in one request and can return multiple matches.
By default, the dot (.) in regex does not match newline characters. If you need to match across line breaks, use the inline flag "(?s)" at the start of your pattern to enable single-line mode, where dot matches any character including newlines. For example, "(?s)<div class="product">.*?</div>" matches a div and all its content across multiple lines. The (?m) flag makes ^ and $ match the start/end of each line rather than the entire string.
Related Functions
Start using SCRAPE_BY_REGEX today
Install Unlimited Sheets to get SCRAPE_BY_REGEX and 41 other powerful functions in Google Sheets.