StackOverflow2013

plurals

PO Using Html Agility Pack to grab text content
primarykey
Id
6824204
data
AcceptedAnswerId
0
AnswerCount
1
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2011-07-26T01:06:12.753
FavoriteCount
0
LastActivityDate
2012-07-26T09:07:58.337
LastEditDate
2017-05-23T10:34:20.993
LastEditorUserId
-1
OwnerUserId
860134
ParentId
0
PostTypeId
1
Score
1
ViewCount
1508
LastEditorDisplayName
text
Body
I will try my best to be specific. I am working on a crawler in VB.NET and am mainly interested in extracting the text content of a page. My current application loads the page in a WebBrowser control and copies the HTML of the body into a textbox, as follows:
<pre><code>Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim url As String = "<url>"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub
</code></pre>
At this point TextBox2 contains junk HTML (href, img, ads, script, etc.), but I need to get rid of all of that and keep only the plain text. I could use regular expressions to strip it out, but I think HAP (Html Agility Pack) is a more appropriate tool for parsing HTML. Searching here brought me to this question, which discusses the whitelist technique mentioned by 'Meltdown': <a href="https://stackoverflow.com/questions/3107514/html-agility-pack-strip-tags-not-in-whitelist">HTML Agility Pack strip tags NOT IN whitelist</a>. It seems like a great idea, but how do I apply it in VB.NET? Please advise.
EDIT: I found a VB.NET version of the code, shown below, but there is an error at this line:
<pre><code>If i IsNot DeletableNodesXpath.Count - 1 Then </code></pre>
<blockquote> Error: IsNot requires operands that have reference types, but this operand has the value type Integer </blockquote>
Here is the code:
<pre><code>Public NotInheritable Class HtmlSanitizer
    Private Sub New()
    End Sub

    Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
    Private Shared DeletableNodesXpath As New List(Of String)()

    Shared Sub New()
        Whitelist = New Dictionary(Of String, String())() From { _
            {"a", New () {"href"}}, _
            {"strong", Nothing}, _
            {"em", Nothing}, _
            {"blockquote", Nothing}, _
            {"b", Nothing}, _
            {"p", Nothing}, _
            {"ul", Nothing}, _
            {"ol", Nothing}, _
            {"li", Nothing}, _
            {"div", New () {"align"}}, _
            {"strike", Nothing}, _
            {"u", Nothing}, _
            {"sub", Nothing}, _
            {"sup", Nothing}, _
            {"table", Nothing}, _
            {"tr", Nothing}, _
            {"td", Nothing}, _
            {"th", Nothing} _
        }
    End Sub

    Public Shared Function Sanitize(input As String) As String
        If input.Trim().Length < 1 Then
            Return String.Empty
        End If
        Dim htmlDocument = New HtmlDocument()
        htmldocument.LoadHtml(input)
        SanitizeNode(htmldocument.DocumentNode)
        Dim xPath As String = HtmlSanitizer.CreateXPath()
        Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
    End Function

    Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
        For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
            SanitizeNode(parentNode.ChildNodes(i))
        Next
    End Sub

    Private Shared Sub SanitizeNode(node As HtmlNode)
        If node.NodeType = HtmlNodeType.Element Then
            If Not Whitelist.ContainsKey(node.Name) Then
                If Not DeletableNodesXpath.Contains(node.Name) Then
                    'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                    node.Name = "removeableNode"
                    DeletableNodesXpath.Add(node.Name)
                End If
                If node.HasChildNodes Then
                    SanitizeChildren(node)
                End If
                Return
            End If
            If node.HasAttributes Then
                For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                    Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                    Dim allowedAttributes As String() = Whitelist(node.Name)
                    If allowedAttributes IsNot Nothing Then
                        If Not allowedAttributes.Contains(currentAttribute.Name) Then
                            node.Attributes.Remove(currentAttribute)
                        End If
                    Else
                        node.Attributes.Remove(currentAttribute)
                    End If
                Next
            End If
        End If
        If node.HasChildNodes Then
            SanitizeChildren(node)
        End If
    End Sub

    Private Shared Function StripHtml(html As String, xPath As String) As String
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)
        If xPath.Length > 0 Then
            Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
            For Each node As HtmlNode In invalidNodes
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
        Return htmlDoc.DocumentNode.WriteContentTo()
    End Function

    Private Shared Function CreateXPath() As String
        Dim _xPath As String = String.Empty
        For i As Integer = 0 To DeletableNodesXpath.Count - 1
            If i IsNot DeletableNodesXpath.Count - 1 Then
                _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
            Else
                _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
            End If
        Next
        Return _xPath
    End Function
End Class
</code></pre>
Can somebody please help?
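One reading of the compiler error quoted above: in VB.NET, `IsNot` compares object references, while `i` and `DeletableNodesXpath.Count - 1` are both `Integer` values, so the loop seems to need the value inequality operator `<>` instead. The sketch below is only an illustration under that assumption. The module name `XPathBuilder` and the functions `CreateXPathLoop` and `CreateXPath` are hypothetical stand-ins, not part of the posted class or of Html Agility Pack, and they take the list of names as a parameter so the example runs on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical stand-alone module, used only to illustrate the comparison fix.
Public Module XPathBuilder

    ' Same loop shape as the posted CreateXPath, but the index check uses <>
    ' (value inequality) instead of IsNot (reference inequality).
    Public Function CreateXPathLoop(names As IList(Of String)) As String
        Dim xPath As String = String.Empty
        For i As Integer = 0 To names.Count - 1
            If i <> names.Count - 1 Then
                xPath &= String.Format("//{0}|", names(i))
            Else
                xPath &= String.Format("//{0}", names(i))
            End If
        Next
        Return xPath
    End Function

    ' Alternative: String.Join places the separator only between items,
    ' so no "is this the last index?" check is needed at all.
    Public Function CreateXPath(names As IEnumerable(Of String)) As String
        Return String.Join("|", names.Select(Function(n) "//" & n))
    End Function

    Public Sub Main()
        Dim names As New List(Of String) From {"script", "img"}
        Console.WriteLine(CreateXPathLoop(names)) ' prints //script|//img
        Console.WriteLine(CreateXPath(names))     ' prints //script|//img
    End Sub

End Module
```

Both functions build the same `//name|//name` expression. The `String.Join` variant is shorter, while the loop variant stays closest to the posted code.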
Tags
<html><vb.net><html-agility-pack>
Title
Using Html Agility Pack to grab text content
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PT Question
UserLastEditorUserId
1. US Community
UserOwnerUserId
1. US Kevin
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LT Linked
2. PL
 singulars
 LinkTypeLinkTypeId
 LT Linked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PT Answer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO Using Html Agility Pack to grab text content
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VT UpMod
CommentsPostId

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Rows can be related in a to-one or a to-many fashion. The caption of each to-many relation links to a list view that shows the related table filtered to the matching rows.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.