SCRAPE_BY_XPATH
Scrape content using an XPath expression (uses a headless browser for JS rendering).
=SCRAPE_BY_XPATH(url, xpath)
Returns: string or 2D array (multiple matches returned as separate rows)
Overview
SCRAPE_BY_XPATH extracts content from webpages using XPath expressions, a powerful query language for navigating XML and HTML document structures. XPath provides capabilities beyond CSS selectors, including the ability to traverse up the document tree, select elements by their text content, use conditional logic, and perform calculations within the query itself. Every SCRAPE_BY_XPATH call uses a headless browser for full JavaScript rendering, making it the go-to choice for scraping modern dynamic websites.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | The full URL of the webpage to scrape (must include https:// or http://). |
| xpath | string | Yes | XPath expression targeting the element(s) to extract. Examples: "//h1" (all h1 elements), "//div[@class='content']" (divs with class "content"), "//a/@href" (all link URLs), "//table//tr/td[2]" (second column of all table rows). |
Examples
Extract the main heading from a page
Selects the first h1 element on the page and returns its text content. The double slash "//" means "find anywhere in the document."
=SCRAPE_BY_XPATH("https://example.com", "//h1")Output
Example DomainGet all links containing specific text
Uses the contains() function to find anchor elements whose text includes "Read More" and extracts their href attributes. Demonstrates XPath text-based selection that CSS cannot replicate.
=SCRAPE_BY_XPATH("https://blog.example.com", "//a[contains(text(), 'Read More')]/@href")Output
| /blog/post-1 |
| /blog/post-2 |
| /blog/post-3 |
Extract second column from a data table
Navigates to a specific table by its ID, then selects the second cell (td[2]) from every row. Perfect for extracting a single column of tabular data.
=SCRAPE_BY_XPATH("https://data.example.com/stats", "//table[@id='stats']//tr/td[2]")Output
| 1,250 |
| 3,870 |
| 945 |
| 12,300 |
Scrape product names from elements with specific data attributes
Targets only in-stock products by filtering on a data attribute, then extracts the h3 element within each matching container.
=SCRAPE_BY_XPATH("https://shop.example.com", "//div[@data-available='true']//h3")Output
| Bluetooth Speaker |
| Wireless Mouse |
| Phone Stand |
Get the last paragraph on a page
Uses the XPath last() function to select only the final paragraph element. Demonstrates positional selection capabilities unique to XPath.
=SCRAPE_BY_XPATH("https://example.com/about", "(//p)[last()]")Output
Contact us for more information at info@example.com.Use Cases
Financial Data Extraction
Extract stock prices, financial metrics, and market data from financial portals and investor relations pages. XPath enables precise targeting of specific table cells and dynamically loaded financial widgets.
Government Data Collection
Scrape public records, regulatory filings, and statistical reports from government websites. XPath handles the complex table structures and nested document formats commonly used on government portals.
Content Migration Audits
Audit legacy website content before migration by extracting text, images, metadata, and internal links using XPath. Build comprehensive content inventories that map old URLs to their extracted assets.
Competitor Feature Comparison
Scrape competitor pricing pages and feature lists to build comparison matrices. Use XPath to extract feature names, availability indicators, and pricing tier details into organized spreadsheets.
News and Media Monitoring
Monitor news sites and press release databases for brand mentions, extracting article titles, publication dates, and author names. XPath text-based selection helps filter for articles mentioning specific keywords.
Pro Tips
Use the browser console shortcut $x("//your/xpath") to quickly test XPath expressions before using them in your spreadsheet; it is much quicker than writing out a full document.evaluate() call.
When class names contain spaces (multiple classes), use contains(@class, "target-class") instead of @class="target-class" to match elements that have the target class among several classes.
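For instance, a minimal sketch assuming a hypothetical shop page whose product cards carry several classes such as "card product featured":
=SCRAPE_BY_XPATH("https://shop.example.com", "//div[contains(@class, 'product')]//h3")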
Use the pipe operator | to combine multiple XPath expressions in a single query: "//h1 | //h2 | //h3" selects all top-level headings. Results are returned in document order.
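As an illustration (the docs URL is a placeholder), one call can collect every heading level at once:
=SCRAPE_BY_XPATH("https://example.com/docs", "//h1 | //h2 | //h3")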
For tables, use "//table[1]//tr/td[1]" to get the first column and "//table[1]//tr/td[2]" to get the second column. Place these in adjacent spreadsheet columns to reconstruct the table structure.
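A minimal sketch, reusing the stats table from the earlier example: place these formulas in two adjacent cells (for instance A1 and B1) and the rows line up to recreate the original two-column layout.
=SCRAPE_BY_XPATH("https://data.example.com/stats", "//table[@id='stats']//tr/td[1]")
=SCRAPE_BY_XPATH("https://data.example.com/stats", "//table[@id='stats']//tr/td[2]")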
XPath expressions allow you to target elements with precision that CSS selectors cannot match. You can select elements based on their text content (//a[contains(text(), "Buy Now")]), navigate to parent or sibling elements (//span[@class="price"]/parent::div), combine multiple conditions (//div[@class="product" and @data-available="true"]), and even use functions like position(), last(), and string-length() within your queries.
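For example, a hedged sketch assuming a hypothetical listing where each price span sits inside its product card: the expression climbs from the price to the parent div, keeps only in-stock cards, and returns their h3 titles.
=SCRAPE_BY_XPATH("https://shop.example.com", "//span[@class='price']/parent::div[@data-available='true']//h3")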
Because SCRAPE_BY_XPATH always renders JavaScript, it reliably handles single-page applications, dynamically loaded content, infinite scroll sections, and client-side rendered frameworks like React, Vue, and Angular. This makes it slightly slower than SCRAPE_BY_CSS_PATH in standard mode but ensures you always get the fully rendered page content.
The function returns a single string when one element matches, or a 2D array with one value per row when multiple elements match. This output format integrates naturally with Google Sheets, allowing you to use the results with FILTER, SORT, UNIQUE, and other array functions for further data processing.
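For example, a sketch that deduplicates scraped link targets by wrapping the call in UNIQUE (the blog URL is a placeholder):
=UNIQUE(SCRAPE_BY_XPATH("https://blog.example.com", "//a/@href"))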
SCRAPE_BY_XPATH is particularly valuable for data analysts, SEO professionals, and researchers who need precise control over which elements they extract and are comfortable with XPath syntax.
Common Errors
No matches found
Cause: The XPath expression does not match any elements on the rendered page. This can happen if the expression has a syntax error, the target element is inside an iframe or Shadow DOM, or the class names are dynamically generated.
Fix: Test the XPath in your browser console using $x("your-xpath-here"). Check for typos in element names and attribute values. If the element is inside an iframe, try scraping the iframe URL directly.
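A sketch of that iframe workaround, assuming a hypothetical report page that embeds its data in a single frame: the first formula pulls the frame's address into a cell (here A1), and the second formula scrapes that URL directly.
=SCRAPE_BY_XPATH("https://example.com/report", "//iframe/@src")
=SCRAPE_BY_XPATH(A1, "//table//tr/td[1]")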
Error: URL and XPath are required
Cause: One or both required parameters are empty or missing.
Fix: Ensure both the URL (with protocol) and XPath expression are provided as non-empty strings. Check that cell references point to cells with values.
Error: XPath evaluation failed
Cause: The XPath expression contains a syntax error that prevents it from being evaluated. Common issues include unmatched quotes, invalid function names, or malformed predicates.
Fix: Review your XPath expression for syntax errors. Ensure quotes are properly matched (use single quotes inside double quotes or vice versa). Verify function names are spelled correctly. Test the expression in the browser console first.
Frequently Asked Questions
How is XPath different from CSS selectors?
CSS selectors and XPath both target HTML elements, but they differ in capabilities. CSS selectors are simpler and work well for selecting elements by class, ID, or tag name (e.g., ".price", "#header", "h1"). XPath is more powerful and can: traverse up the document tree (select a parent based on a child), filter by text content (//a[contains(text(), "Buy")]), use logical conditions (and/or), select by position (//li[3]), and use functions like string-length() and normalize-space(). Use CSS selectors for simple extraction and XPath when you need advanced querying capabilities.
Does SCRAPE_BY_XPATH always render JavaScript?
Yes. Unlike SCRAPE_BY_CSS_PATH, which offers JavaScript rendering as an optional parameter, SCRAPE_BY_XPATH always uses a headless browser that fully executes JavaScript before evaluating the XPath expression. This means it works reliably on all types of websites including single-page applications, but it is slower than SCRAPE_BY_CSS_PATH in standard (non-JS) mode. If speed is a priority and the target page does not require JavaScript rendering, consider using SCRAPE_BY_CSS_PATH instead.
How do I extract attribute values such as href or src?
Append /@attributeName to your XPath expression. For example, to get all image sources: "//img/@src". To get href attributes from links: "//a/@href". To get the value of a custom data attribute: "//div/@data-product-id". You can also combine attribute extraction with filters: "//a[@class='external']/@href" gets href values only from links with the class "external".
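Put into a complete formula (the gallery URL is hypothetical), extracting every image source looks like this:
=SCRAPE_BY_XPATH("https://example.com/gallery", "//img/@src")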
Can I select elements by their text content?
Yes, this is one of XPath's most powerful features. Use text() to match text content: "//a[text()='Click Here']" matches links with the exact text "Click Here". Use contains() for partial matches: "//p[contains(text(), 'price')]" matches paragraphs containing the word "price". Use starts-with() for prefix matching: "//div[starts-with(@class, 'product-')]" matches divs whose class starts with "product-". These text-based selectors are not available with CSS selectors.
Why does my XPath return no matches even though the element exists on the page?
Common causes include: (1) The element is inside an iframe, which is a separate document that the XPath cannot reach. (2) The element is inside a Shadow DOM component, which creates an encapsulated DOM tree. (3) The XPath syntax has an error, such as incorrect quoting or namespace issues. (4) The page uses dynamic class names that change on each load (common with CSS-in-JS libraries). Test your XPath in the browser console using document.evaluate() or the $x() shortcut: $x("//your/xpath/here") to verify it matches the expected elements.
Do I need to worry about XML namespaces?
Most modern HTML pages do not require namespace handling, and the scraper processes them as standard HTML. However, if you encounter namespace issues (typically with XML or XHTML strict documents), try using the local-name() function in your XPath: "//*[local-name()='div']" instead of "//div". This ignores namespace prefixes and matches elements by their local tag name only.
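A hedged sketch, assuming a hypothetical namespaced XML feed where a plain //item query returns nothing:
=SCRAPE_BY_XPATH("https://example.com/feed.xml", "//*[local-name()='item']/*[local-name()='title']")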
Which XPath functions are supported?
XPath provides many built-in functions: position() returns element index (//li[position()<=3] gets first 3 list items); last() selects the last element ((//p)[last()]); count() counts elements (//ul[count(li)>5] selects lists with more than 5 items); normalize-space() trims whitespace; translate() converts characters; concat() joins strings; and not() negates conditions (//div[not(@class="hidden")] selects visible divs). These functions can be combined for complex queries.
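For instance, a sketch that chains two predicates to return the first three non-hidden items of each list on a placeholder page:
=SCRAPE_BY_XPATH("https://example.com", "//ul/li[not(@class='hidden')][position()<=3]")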
Related Functions
SCRAPE_BY_CSS_PATH: Scrape content using a CSS selector, with JavaScript rendering available as an optional parameter.
Start using SCRAPE_BY_XPATH today
Install Unlimited Sheets to get SCRAPE_BY_XPATH and 41 other powerful functions in Google Sheets.