Using Html Agility Pack to grab text content
    <p>I will try my best to be specific. I am working on a crawler in vb.net and am mainly interested in extracting the text content of a page. My current application loads the body of the HTML source into a textbox by using a WebBrowser control as follows:</p>
<pre><code>Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim url As String = "&lt;url&gt;"
    WebBrowser1.Navigate(url)
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub
</code></pre>
<p>From here on, TextBox2 contains junk HTML (href, img, ads, script, etc.), and I need to get rid of all of this and grab just the plain text.</p> <p>I could use regular expressions to get rid of all the anomalies, but I think HAP (Html Agility Pack) is much more appropriate as an HTML parser.</p> <p>Searching on here brought me to this page, which discusses the <strong>Whitelist</strong> technique mentioned by 'Meltdown':</p> <p><a href="https://stackoverflow.com/questions/3107514/html-agility-pack-strip-tags-not-in-whitelist">HTML Agility Pack strip tags NOT IN whitelist</a></p> <p>But how do I apply it in vb.net? It seems like a great idea.
</p> <p>Please advise.</p> <p><strong>EDIT:</strong> I found a vb.net version of the code, shown below, but there seems to be an error at</p>
<pre><code>If i IsNot DeletableNodesXpath.Count - 1 Then
</code></pre>
<blockquote> <p>Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.</p> </blockquote>
<p>Here is the code:</p>
<pre><code>Public NotInheritable Class HtmlSanitizer
    Private Sub New()
    End Sub

    Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
    Private Shared DeletableNodesXpath As New List(Of String)()

    Shared Sub New()
        Whitelist = New Dictionary(Of String, String())() From { _
            {"a", New () {"href"}}, _
            {"strong", Nothing}, _
            {"em", Nothing}, _
            {"blockquote", Nothing}, _
            {"b", Nothing}, _
            {"p", Nothing}, _
            {"ul", Nothing}, _
            {"ol", Nothing}, _
            {"li", Nothing}, _
            {"div", New () {"align"}}, _
            {"strike", Nothing}, _
            {"u", Nothing}, _
            {"sub", Nothing}, _
            {"sup", Nothing}, _
            {"table", Nothing}, _
            {"tr", Nothing}, _
            {"td", Nothing}, _
            {"th", Nothing} _
        }
    End Sub

    Public Shared Function Sanitize(input As String) As String
        If input.Trim().Length &lt; 1 Then
            Return String.Empty
        End If
        Dim htmlDocument = New HtmlDocument()

        htmldocument.LoadHtml(input)
        SanitizeNode(htmldocument.DocumentNode)
        Dim xPath As String = HtmlSanitizer.CreateXPath()

        Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
    End Function

    Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
        For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
            SanitizeNode(parentNode.ChildNodes(i))
        Next
    End Sub

    Private Shared Sub SanitizeNode(node As HtmlNode)
        If node.NodeType = HtmlNodeType.Element Then
            If Not Whitelist.ContainsKey(node.Name) Then
                If Not DeletableNodesXpath.Contains(node.Name) Then
                    'DeletableNodesXpath.Add(node.Name.Replace("?",""));
                    node.Name = "removeableNode"
                    DeletableNodesXpath.Add(node.Name)
                End If
                If node.HasChildNodes Then
                    SanitizeChildren(node)
                End If

                Return
            End If

            If node.HasAttributes Then
                For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                    Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                    Dim allowedAttributes As String() = Whitelist(node.Name)
                    If allowedAttributes IsNot Nothing Then
                        If Not allowedAttributes.Contains(currentAttribute.Name) Then
                            node.Attributes.Remove(currentAttribute)
                        End If
                    Else
                        node.Attributes.Remove(currentAttribute)
                    End If
                Next
            End If
        End If

        If node.HasChildNodes Then
            SanitizeChildren(node)
        End If
    End Sub

    Private Shared Function StripHtml(html As String, xPath As String) As String
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)
        If xPath.Length &gt; 0 Then
            Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
            For Each node As HtmlNode In invalidNodes
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
        Return htmlDoc.DocumentNode.WriteContentTo()
    End Function

    Private Shared Function CreateXPath() As String
        Dim _xPath As String = String.Empty
        For i As Integer = 0 To DeletableNodesXpath.Count - 1
            If i IsNot DeletableNodesXpath.Count - 1 Then
                _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
            Else
                _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
            End If
        Next
        Return _xPath
    End Function
End Class
</code></pre>
<p>Can somebody please help?</p>
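<p>For reference, the quoted error arises because <code>IsNot</code> compares object references, while <code>i</code> and <code>DeletableNodesXpath.Count - 1</code> are Integers (value types), which are compared with <code>&lt;&gt;</code> instead. The following is only a minimal sketch of how the XPath string could be built without the index check at all. <code>XPathSketch</code>, <code>BuildXPath</code> and the sample names are hypothetical and not part of the original class, and it assumes .NET 4 or later for the <code>String.Join</code> overload that accepts a sequence.</p>
<pre><code>Imports System
Imports System.Collections.Generic
Imports System.Linq

Module XPathSketch
    ' Hypothetical stand-in for CreateXPath: turns each name into "//name"
    ' and joins the terms with "|", so no "is this the last item" test is needed.
    Function BuildXPath(names As IEnumerable(Of String)) As String
        Return String.Join("|", names.Select(Function(n) "//" + n))
    End Function

    Sub Main()
        ' Sample names are assumptions for illustration only.
        Dim names As New List(Of String) From {"script", "removeableNode"}

        ' If the original loop is kept, the Integer comparison would be written as:
        '     If i &lt;&gt; names.Count - 1 Then ...
        Console.WriteLine(BuildXPath(names)) ' prints //script|//removeableNode
    End Sub
End Module
</code></pre>
<p>An empty list yields an empty string, which matches what the loop-based version returns when there is nothing to delete.</p>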