🌐 Web Scraping · Pro Plan

SCRAPE_BY_XPATH

Scrape content using an XPath expression (uses a headless browser for JS rendering).

Formula Signature
=SCRAPE_BY_XPATH(url, xpath)

Returns: string or 2D array (multiple matches returned as separate rows)

Overview

SCRAPE_BY_XPATH extracts content from webpages using XPath expressions, a powerful query language for navigating XML and HTML document structures. XPath provides capabilities beyond CSS selectors, including the ability to traverse up the document tree, select elements by their text content, use conditional logic, and perform calculations within the query itself. Every SCRAPE_BY_XPATH call uses a headless browser for full JavaScript rendering, making it the go-to choice for scraping modern dynamic websites.
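To get a feel for how an XPath such as "//h1" walks a document, here is a small stand-alone sketch using Python's built-in xml.etree.ElementTree, which supports a limited XPath subset (".//h1" is its spelling of "//h1"). The sample markup is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample page standing in for a scraped document.
html = """
<html>
  <body>
    <h1>Example Domain</h1>
    <div class="content"><p>Hello</p></div>
  </body>
</html>
"""

root = ET.fromstring(html)
# ".//h1" is ElementTree's spelling of the XPath "//h1":
# match h1 elements anywhere beneath the current node.
headings = [el.text for el in root.findall(".//h1")]
print(headings)  # ['Example Domain']
```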

Parameters

Parameter | Type | Required | Description
url | string | Yes | The full URL of the webpage to scrape (must include https:// or http://).
xpath | string | Yes | XPath expression targeting the element(s) to extract. Examples: "//h1" (all h1 elements), "//div[@class='content']" (divs with class "content"), "//a/@href" (all link URLs), "//table//tr/td[2]" (second column of all table rows).

Examples

Example 1: Extract the main heading from a page

Selects the first h1 element on the page and returns its text content. The double slash "//" means "find anywhere in the document."

fx
=SCRAPE_BY_XPATH("https://example.com", "//h1")

Output

Example Domain

Example 2: Get all links containing specific text

Uses the contains() function to find anchor elements whose text includes "Read More" and extracts their href attributes. Demonstrates XPath text-based selection that CSS cannot replicate.

fx
=SCRAPE_BY_XPATH("https://blog.example.com", "//a[contains(text(), 'Read More')]/@href")

Output

/blog/post-1
/blog/post-2
/blog/post-3
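ElementTree's XPath subset has no contains(), but the same selection logic can be emulated in plain Python, which also shows what this formula is doing under the hood (sample markup invented for illustration):

```python
import xml.etree.ElementTree as ET

# Invented sample markup.
html = """
<html><body>
  <a href="/blog/post-1">Read More about X</a>
  <a href="/about">About</a>
  <a href="/blog/post-2">Read More about Y</a>
</body></html>
"""

root = ET.fromstring(html)
# Filter anchors by text content in Python (no contains() available),
# then read the href attribute of each match.
hrefs = [a.get("href") for a in root.findall(".//a")
         if a.text and "Read More" in a.text]
print(hrefs)  # ['/blog/post-1', '/blog/post-2']
```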

Example 3: Extract the second column from a data table

Navigates to a specific table by its ID, then selects the second cell (td[2]) from every row. Perfect for extracting a single column of tabular data.

fx
=SCRAPE_BY_XPATH("https://data.example.com/stats", "//table[@id='stats']//tr/td[2]")

Output

1,250
3,870
945
12,300

Example 4: Scrape product names from elements with specific data attributes

Targets only in-stock products by filtering on a data attribute, then extracts the h3 element within each matching container.

fx
=SCRAPE_BY_XPATH("https://shop.example.com", "//div[@data-available='true']//h3")

Output

Bluetooth Speaker
Wireless Mouse
Phone Stand

Example 5: Get the last paragraph on a page

Uses the XPath last() function to select only the final paragraph element. Demonstrates positional selection capabilities unique to XPath.

fx
=SCRAPE_BY_XPATH("https://example.com/about", "(//p)[last()]")

Output

Contact us for more information at info@example.com.
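The parentheses in "(//p)[last()]" matter: they gather every paragraph in the document first, then take the final one. A quick sketch with Python's stdlib ElementTree (whose XPath subset lacks this grouping) shows the equivalent operation on invented markup:

```python
import xml.etree.ElementTree as ET

# Invented sample markup with paragraphs at different depths.
html = """
<html><body>
  <div><p>First paragraph.</p></div>
  <p>Middle paragraph.</p>
  <div><p>Contact us for more information.</p></div>
</body></html>
"""

root = ET.fromstring(html)
# findall returns matches in document order, so indexing the list
# with [-1] mirrors "(//p)[last()]": the globally last paragraph.
last_p = root.findall(".//p")[-1].text
print(last_p)  # Contact us for more information.
```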

Use Cases

Finance

Financial Data Extraction

Extract stock prices, financial metrics, and market data from financial portals and investor relations pages. XPath enables precise targeting of specific table cells and dynamically loaded financial widgets.

Public Sector

Government Data Collection

Scrape public records, regulatory filings, and statistical reports from government websites. XPath handles the complex table structures and nested document formats commonly used on government portals.

Web Development

Content Migration Audits

Audit legacy website content before migration by extracting text, images, metadata, and internal links using XPath. Build comprehensive content inventories that map old URLs to their extracted assets.

Product Management

Competitor Feature Comparison

Scrape competitor pricing pages and feature lists to build comparison matrices. Use XPath to extract feature names, availability indicators, and pricing tier details into organized spreadsheets.

Public Relations

News and Media Monitoring

Monitor news sites and press release databases for brand mentions, extracting article titles, publication dates, and author names. XPath text-based selection helps filter for articles mentioning specific keywords.

Pro Tips

TIP

Use the browser console shortcut $x("//your/xpath") to quickly test XPath expressions before using them in your spreadsheet. This is faster than the document.evaluate() method.

TIP

When class names contain spaces (multiple classes), use contains(@class, "target-class") instead of @class="target-class" to match elements that have the target class among several classes.
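The difference between an exact @class match and a contains()-style match can be verified with a small stdlib Python sketch (the markup is invented; the .split() token check used here is slightly stricter than XPath's substring-based contains()):

```python
import xml.etree.ElementTree as ET

# Invented markup: the first paragraph carries several classes.
html = ('<div>'
        '<p class="note target-class big">hit</p>'
        '<p class="other">miss</p>'
        '</div>')
root = ET.fromstring(html)

# An exact @class comparison misses multi-class elements entirely:
exact = root.findall('.//p[@class="target-class"]')  # no matches

# contains(@class, ...) semantics, emulated in Python. Splitting on
# whitespace matches whole class tokens only.
loose = [p for p in root.findall(".//p")
         if "target-class" in (p.get("class") or "").split()]
print(len(exact), [p.text for p in loose])  # 0 ['hit']
```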

TIP

Use the pipe operator | to combine multiple XPath expressions in a single query: "//h1 | //h2 | //h3" selects all top-level headings. Results are returned in document order.
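ElementTree has no pipe operator, but iterating the tree in document order reproduces the same combined, document-ordered result (illustrative sketch on invented markup):

```python
import xml.etree.ElementTree as ET

# Invented markup with mixed heading levels.
html = """
<body>
  <h1>Title</h1>
  <p>Intro</p>
  <h2>Section</h2>
  <h3>Subsection</h3>
</body>
"""
root = ET.fromstring(html)
# iter() walks the tree in document order, so filtering by tag mirrors
# the combined, document-ordered result of "//h1 | //h2 | //h3".
headings = [el.text for el in root.iter() if el.tag in ("h1", "h2", "h3")]
print(headings)  # ['Title', 'Section', 'Subsection']
```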

TIP

For tables, use "//table[1]//tr/td[1]" to get the first column and "//table[1]//tr/td[2]" to get the second column. Place these in adjacent spreadsheet columns to reconstruct the table structure.
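The two-column technique above can be sketched with stdlib Python: pull each column with a positional predicate, then zip the columns back into rows (markup invented for illustration):

```python
import xml.etree.ElementTree as ET

# Invented two-column table.
html = """
<table>
  <tr><td>Alice</td><td>42</td></tr>
  <tr><td>Bob</td><td>17</td></tr>
</table>
"""
root = ET.fromstring(html)
# Positional predicates pull one column at a time (XPath is 1-indexed)...
col1 = [td.text for td in root.findall(".//tr/td[1]")]
col2 = [td.text for td in root.findall(".//tr/td[2]")]
# ...and zipping the columns reconstructs the rows.
rows = list(zip(col1, col2))
print(rows)  # [('Alice', '42'), ('Bob', '17')]
```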

XPath expressions allow you to target elements with precision that CSS selectors cannot match. You can select elements based on their text content (//a[contains(text(), "Buy Now")]), navigate to parent or sibling elements (//span[@class="price"]/parent::div), combine multiple conditions (//div[@class="product" and @data-available="true"]), and even use functions like position(), last(), and string-length() within your queries.

Because SCRAPE_BY_XPATH always renders JavaScript, it reliably handles single-page applications, dynamically loaded content, infinite scroll sections, and client-side rendered frameworks like React, Vue, and Angular. This makes it somewhat slower than SCRAPE_BY_CSS_PATH running in its standard (non-JS) mode, but it ensures you always get the fully rendered page content.

The function returns a single string when one element matches, or a 2D array with one value per row when multiple elements match. This output format integrates naturally with Google Sheets, allowing you to use the results with FILTER, SORT, UNIQUE, and other array functions for further data processing.
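A minimal sketch of the documented return-shape rule (this mirrors the description above, not the add-on's actual implementation; the helper name is hypothetical):

```python
def to_sheet_value(matches):
    """Mimic the documented return shape. Hypothetical helper based on
    the description above, not the add-on's actual source."""
    if len(matches) == 1:
        return matches[0]          # one match -> a single string
    return [[m] for m in matches]  # many matches -> one value per row

single = to_sheet_value(["Example Domain"])
multi = to_sheet_value(["/blog/post-1", "/blog/post-2"])
print(single, multi)  # Example Domain [['/blog/post-1'], ['/blog/post-2']]
```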

SCRAPE_BY_XPATH is particularly valuable for data analysts, SEO professionals, and researchers who need precise control over which elements they extract and are comfortable with XPath syntax.

Common Errors

No matches found

Cause: The XPath expression does not match any elements on the rendered page. This can happen if the expression has a syntax error, the target element is inside an iframe or Shadow DOM, or the class names are dynamically generated.

Fix: Test the XPath in your browser console using $x("your-xpath-here"). Check for typos in element names and attribute values. If the element is inside an iframe, try scraping the iframe URL directly.

Error: URL and XPath are required

Cause: One or both required parameters are empty or missing.

Fix: Ensure both the URL (with protocol) and XPath expression are provided as non-empty strings. Check that cell references point to cells with values.

Error: XPath evaluation failed

Cause: The XPath expression contains a syntax error that prevents it from being evaluated. Common issues include unmatched quotes, invalid function names, or malformed predicates.

Fix: Review your XPath expression for syntax errors. Ensure quotes are properly matched (use single quotes inside double quotes or vice versa). Verify function names are spelled correctly. Test the expression in the browser console first.

Frequently Asked Questions

How does XPath differ from CSS selectors?

CSS selectors and XPath both target HTML elements, but they differ in capabilities. CSS selectors are simpler and work well for selecting elements by class, ID, or tag name (e.g., ".price", "#header", "h1"). XPath is more powerful and can: traverse up the document tree (select a parent based on a child), filter by text content (//a[contains(text(), "Buy")]), use logical conditions (and/or), select by position (//li[3]), and use functions like string-length() and normalize-space(). Use CSS selectors for simple extraction and XPath when you need advanced querying capabilities.

Does SCRAPE_BY_XPATH always render JavaScript?

Yes. Unlike SCRAPE_BY_CSS_PATH, which offers JavaScript rendering as an optional parameter, SCRAPE_BY_XPATH always uses a headless browser that fully executes JavaScript before evaluating the XPath expression. This means it works reliably on all types of websites including single-page applications, but it is slower than SCRAPE_BY_CSS_PATH in standard (non-JS) mode. If speed is a priority and the target page does not require JavaScript rendering, consider using SCRAPE_BY_CSS_PATH instead.

How do I extract attribute values instead of text?

Append /@attributeName to your XPath expression. For example, to get all image sources: "//img/@src". To get href attributes from links: "//a/@href". To get the value of a custom data attribute: "//div/@data-product-id". You can also combine attribute extraction with filters: "//a[@class='external']/@href" gets href values only from links with the class "external".

Can I select elements by their text content?

Yes, this is one of XPath's most powerful features. Use text() to match text content: "//a[text()='Click Here']" matches links with the exact text "Click Here". Use contains() for partial matches: "//p[contains(text(), 'price')]" matches paragraphs containing the word "price". Use starts-with() for prefix matching: "//div[starts-with(@class, 'product-')]" matches divs whose class starts with "product-". These text-based selectors are not available with CSS selectors.

Why does my XPath return no matches even though the element exists on the page?

Common causes include: (1) The element is inside an iframe, which is a separate document that the XPath cannot reach. (2) The element is inside a Shadow DOM component, which creates an encapsulated DOM tree. (3) The XPath syntax has an error, such as incorrect quoting or namespace issues. (4) The page uses dynamic class names that change on each load (common with CSS-in-JS libraries). Test your XPath in the browser console using document.evaluate() or the $x() shortcut: $x("//your/xpath/here") to verify it matches the expected elements.

Do I need to handle XML namespaces?

Most modern HTML pages do not require namespace handling, and the scraper processes them as standard HTML. However, if you encounter namespace issues (typically with XML or XHTML strict documents), try using the local-name() function in your XPath: "//*[local-name()='div']" instead of "//div". This ignores namespace prefixes and matches elements by their local tag name only.

What XPath functions can I use in my expressions?

XPath provides many built-in functions: position() returns element index (//li[position()<=3] gets first 3 list items); last() selects the last element ((//p)[last()]); count() counts elements (//ul[count(li)>5] selects lists with more than 5 items); normalize-space() trims whitespace; translate() converts characters; concat() joins strings; and not() negates conditions (//div[not(@class="hidden")] selects visible divs). These functions can be combined for complex queries.

Related Functions

SCRAPE_BY_CSS_PATH — scrape content using CSS selectors, with JavaScript rendering as an optional parameter.

Start using SCRAPE_BY_XPATH today

Install Unlimited Sheets to get SCRAPE_BY_XPATH and 41 other powerful functions in Google Sheets.