  1. HTML Agility Pack Screen Scraping XPATH isn't returning data
    <p>I'm attempting to write a screen scraper for Digikey that will let our company keep accurate track of pricing, part availability, and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath that I see in Chrome DevTools and in Firebug on Firefox, and what my C# program sees.</p>
    <p>The page that I'm currently scraping is <a href="http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&amp;name=296-12602-1-ND" rel="nofollow noreferrer">http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&amp;name=296-12602-1-ND</a></p>
    <p>The code I'm currently using is pretty quick and dirty...</p>
    <pre><code>//This function retrieves data from the Digikey page
private static List&lt;string&gt; ExtractProductInfo(HtmlDocument doc)
{
    List&lt;HtmlNode&gt; m_unparsedProductInfoNodes = new List&lt;HtmlNode&gt;();
    List&lt;string&gt; m_unparsedProductInfo = new List&lt;string&gt;();

    //Base Node for part info
    string m_baseNode = @"//html[1]/body[1]/div[2]";

    //Write part info to list
    m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
    //More lines of similar form will go here for more info

    //this retrieves digikey PN
    foreach (HtmlNode node in m_unparsedProductInfoNodes)
    {
        m_unparsedProductInfo.Add(node.InnerText);
    }

    return m_unparsedProductInfo;
}
</code></pre>
    <p>Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes".</p>
    <p>Any idea what's going on here? I'll also add that if I do a "SelectNodes" on the baseNode, it only returns a div whose only significant child is "cs=####", which seems to vary with the browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit, insisting that my expression doesn't evaluate to a node set, but leaving it out still leaves the problem that all data past div[2] is returned as NULL.</p>
    1. A link to the HTML you're trying to scrape would help a lot. If it's sensitive data or not easily linkable, then an anonymized example would be useful. Anyway, your `m_baseNode` is rightfully only returning a `div` element, because that's what the XPath expression selects: the second `div` element in the first `body` element of any `html` element. There should only be one `html` element, so there should only be one `div` returned. What do you think this expression does?
    2. http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND is an example part page. After inspecting what it's returning, the DIV looks correct; what's strange is that I can't reference anything past it without getting NULL. I've taken the HTML stored through LoadHtml and saved it. After inspecting that with Firebug, everything looks like it's where it should be, so I'm going to rule out UserAgent issues for the time being. For example, when I want to find the DK part number I use the path: //html[1]/body[1]/div[2]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]. The return value is NULL.
    3. Also, when looking at the m_base div it returns, the only child of significance is CS=#### (it seems to change with browser user-agent settings), yet if I include that at all in my path I get the "Expression must evaluate to a node-set." error message.
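
The XPath behaviour discussed in these comments can be reproduced in isolation. The following is a minimal sketch, not a fix for the Digikey page: it uses `System.Xml` from the .NET base class library instead of HTML Agility Pack so that it runs without extra packages, and the markup, the `XPathDemo` class, and the `Describe` helper are all hypothetical names made up for illustration. It shows three things: a positional path such as `div[2]` selects the second `div` child, a path that does not match the parsed tree yields null instead of an error, and appending `/cs=0` turns the whole expression into an equality comparison, which is a boolean and not a node-set.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class XPathDemo
{
    // Hypothetical, simplified markup; this is NOT the real Digikey page.
    public const string Markup =
        "<html><body>" +
        "<div>header</div>" +
        "<div><table><tr><td>296-12602-1-ND</td></tr></table></div>" +
        "</body></html>";

    // Evaluates an XPath against the sample markup and reports what happened.
    public static string Describe(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        try
        {
            XmlNode node = doc.SelectSingleNode(xpath);
            // A path that matches nothing returns null; it does not throw.
            return node == null ? "no match (null)" : node.Name + ": " + node.InnerText;
        }
        catch (XPathException ex)
        {
            // Raised when the expression is valid XPath but is not a node-set.
            return "XPathException: " + ex.Message;
        }
    }

    public static void Main()
    {
        // Positional predicates are 1-based: div[2] is the second div child of body.
        Console.WriteLine(Describe("/html[1]/body[1]/div[2]"));

        // Matches the sample tree as it was actually parsed.
        Console.WriteLine(Describe("/html[1]/body[1]/div[2]/table[1]/tr[1]/td[1]"));

        // One nesting level too deep for the sample tree, so the result is null.
        Console.WriteLine(Describe("/html[1]/body[1]/div[2]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));

        // "cs=0" is parsed as the comparison (.../div[2]/cs) = 0, a boolean.
        Console.WriteLine(Describe("/html[1]/body[1]/div[2]/cs=0"));
    }
}
```

Because a non-matching path returns null, a loop that reads `InnerText` from every collected node would need a null check first. It can also help to dump the markup as the parser sees it and build the path against that, since it may differ from the tree a browser's developer tools display.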