StackOverflow2013

plurals

PO HTML Agility Pack Screen Scraping XPATH isn't returning data
primarykey
Id
2500016
data
AcceptedAnswerId
0
AnswerCount
2
ClosedDate
CommentCount
5
CommunityOwnedDate
CreationDate
2010-03-23T13:00:03.673
FavoriteCount
0
LastActivityDate
2012-04-24T09:12:24.783
LastEditDate
2010-03-23T15:37:04.907
LastEditorUserId
299912
OwnerUserId
299912
ParentId
0
PostTypeId
1
Score
2
ViewCount
3450
LastEditorDisplayName
text
Body
I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPATH that I see in Chrome DevTools and in Firebug on Firefox, and what my C# program sees.
The page that I'm currently scraping is <a href="http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND" rel="nofollow noreferrer">http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND</a>.
The code I'm currently using is pretty quick and dirty...
<pre><code>//This function retrieves data from the digikey
private static List<string> ExtractProductInfo(HtmlDocument doc)
{
    List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
    List<string> m_unparsedProductInfo = new List<string>();
    //Base Node for part info
    string m_baseNode = @"//html[1]/body[1]/div[2]";
    //Write part info to list
    m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
    //More lines of similar form will go here for more info
    //this retrieves digikey PN
    foreach(HtmlNode node in m_unparsedProductInfoNodes)
    {
        m_unparsedProductInfo.Add(node.InnerText);
    }
    return m_unparsedProductInfo;
}
</code></pre>
Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes". Any idea what's going on here?
I'll also add that if I do a "SelectNodes" on the baseNode, it only returns a div whose only significant child is "cs=####", which seems to vary with the browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit, insisting that my expression doesn't evaluate to a node set. Leaving it out still leaves the problem that all data past div[2] is returned as NULL.
Tags
<c#><screen-scraping><html-agility-pack><web-scraping>
Title
HTML Agility Pack Screen Scraping XPATH isn't returning data
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PT Question
UserLastEditorUserId
1. US Matthias
UserOwnerUserId
1. US Matthias
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PT Answer
2. PO
 singulars
 PostTypePostTypeId
 PT Answer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO HTML Agility Pack Screen Scraping XPATH isn't returning data
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
2. VO
 singulars
 PostPostId
 PO HTML Agility Pack Screen Scraping XPATH isn't returning data
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
CommentsPostId
