<p>I did some HTML parsing with hxt a few weeks ago and thought that <a href="http://www.w3schools.com/xpath/xpath_syntax.asp" rel="nofollow">XPath</a> came in quite handy. Unfortunately, I didn't come up with a perfect solution for your problem, but it might be a start for a new try.</p>
<pre><code>import Text.XML.HXT.Core
import Text.XML.HXT.XPath.Arrows

type XmlTreeValue a = a XmlTree String
type ParsedXmlTree a = a XmlTree XmlTree
type IOXmlTree = IOSArrow XmlTree XmlTree

-- parses a given .html file
parseHtml :: FilePath -&gt; IOStateArrow s b XmlTree
parseHtml path = readDocument [withParseHTML yes, withWarnings no] path

-- "" for stdout
saveHtml :: IOXmlTree
saveHtml = writeDocument [withIndent yes] ""

extract :: IOXmlTree
extract = processChildren (process `when` isElem)

-- main processing function
processHtml :: FilePath -&gt; IO ()
processHtml src =
  runX (parseHtml src &gt;&gt;&gt; extract &gt;&gt;&gt; saveHtml)
    &gt;&gt; return ()

-- process the html structure
process :: ArrowXml cat =&gt; ParsedXmlTree cat
process =
  -- create tag &lt;structure&gt; for the expression given next
  selem "structure"
    -- navigate to &lt;html&gt;&lt;body&gt;&lt;table&gt;&lt;tr&gt;...
    [(getXPathTrees "/html/body/table/tr")
      -- then combine the results
      &gt;&gt;&gt; (getTheName &lt;+&gt; getWhere)]

-- selects text at path &lt;td&gt;&lt;font&gt;&lt;a...&gt; &lt;/a&gt;&lt;/font&gt;&lt;/td&gt; and creates &lt;name&gt;-Tag
-- (// means that all &lt;td&gt;-tags are analysed,
-- but I'm not quite sure why this is relevant here)
getTheName :: ArrowXml cat =&gt; ParsedXmlTree cat
getTheName = selem "name" [getXPathTrees "//td/font/a/text()"]

-- selects text at path &lt;td&gt;&lt;font&gt; &lt;/font&gt;&lt;/td&gt;
-- (where the fourth font-tag is taken) and creates &lt;where&gt;-Tag
getWhere :: ArrowXml cat =&gt; ParsedXmlTree cat
getWhere = selem "where" [getXPathTrees "//td/font[4]/text()"]
</code></pre>
<p>The result looks like this:</p>
<pre><code>*Main&gt; processHtml "test.html"
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;structure&gt;
  &lt;name&gt;ABC&lt;/name&gt;
  &lt;where/&gt;
  &lt;name/&gt;
  &lt;where&gt;Here&lt;/where&gt;
  &lt;name&gt;EFG&lt;/name&gt;
  &lt;where/&gt;
  &lt;name/&gt;
  &lt;where&gt;There&lt;/where&gt;
  &lt;name&gt;HIJ&lt;/name&gt;
  &lt;where/&gt;
  &lt;name/&gt;
  &lt;where&gt;Far away&lt;/where&gt;
&lt;/structure&gt;
</code></pre>
<p>As I said, not quite perfect, but hopefully a start.</p>
<p>EDIT: Maybe this looks more like your approach. Still, instead of dropping the elements you don't care about, we first select all elements that fit and then filter the results. I find it quite fascinating that there's no generic approach for such a problem, because, somehow, the <code>font[4]</code> selection does not work with my other approach. But maybe I'm just not a good XPath user.</p>
<pre><code>processHtml :: FilePath -&gt; IO [(String,String)]
processHtml src = do
  names &lt;- runX (parseHtml src &gt;&gt;&gt; process1)
  fontTags &lt;- runX (parseHtml src &gt;&gt;&gt; process2)
  let wheres = filterAfterWhere fontTags
  let result = zip names wheres
  return result
  where
    -- keep only the text node that follows each "Where:" marker
    filterAfterWhere [] = []
    filterAfterWhere xs =
      case dropWhile (/= "Where:") xs of
        [] -&gt; []
        [x] -&gt; [x]
        _:y:ys -&gt; y : filterAfterWhere ys

process1 :: ArrowXml cat =&gt; XmlTreeValue cat
process1 = textNodeToText getTheName

process2 :: ArrowXml cat =&gt; XmlTreeValue cat
process2 = textNodeToText getWhere

getTheName :: ArrowXml cat =&gt; ParsedXmlTree cat
getTheName = getXPathTrees "//td/font/a/text()"

getWhere :: ArrowXml cat =&gt; ParsedXmlTree cat
getWhere = getXPathTrees "//td/font/text()"

-- neat function to select a value within a XmlTree as String
textNodeToText :: ArrowXml cat =&gt; ParsedXmlTree cat -&gt; XmlTreeValue cat
textNodeToText selector = selector `when` isElem &gt;&gt;&gt; getText
</code></pre>
<p>This way, you get the result you showed in your question:</p>
<pre><code>*Main&gt; processHtml "test.html"
[("ABC","Here"),("EFG","There"),("HIJ","Far away")]
</code></pre>
<p>Edit2:</p>
<p>Fun fact: it seems that the hxt-xpath library does not work quite right for such an index selection. <a href="http://www.freeformatter.com/xpath-tester.html#ad-output" rel="nofollow">An online XPath evaluator</a> shows the right behaviour for <code>//td/font[4]/text()</code>.</p>
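<p>The filtering step from the EDIT can be tried without hxt. The following is only a minimal sketch: the names <code>valuesAfterMarker</code>, <code>pairNamesWithWheres</code> and <code>sampleFontTexts</code> are hypothetical, the sample list is made up rather than taken from the real <code>test.html</code>, and, unlike <code>filterAfterWhere</code> above, a trailing <code>"Where:"</code> with no value after it is dropped instead of returned (an assumption about the intended behaviour).</p>

```haskell
-- Marker that precedes each location value; the answer's code
-- compares the font text nodes against this string.
marker :: String
marker = "Where:"

-- Collect the text node that directly follows each marker.
-- Assumption: a marker at the very end of the list has no value,
-- so it is dropped rather than returned as a result.
valuesAfterMarker :: [String] -> [String]
valuesAfterMarker xs =
  case dropWhile (/= marker) xs of
    (_ : y : rest) -> y : valuesAfterMarker rest
    _              -> []

-- Pair each name with the value found after the corresponding marker.
-- zip truncates to the shorter list, so surplus names are ignored.
pairNamesWithWheres :: [String] -> [String] -> [(String, String)]
pairNamesWithWheres names fontTexts = zip names (valuesAfterMarker fontTexts)

-- Made-up sample of what the extracted font text nodes might look like.
sampleFontTexts :: [String]
sampleFontTexts =
  ["x", "Where:", "Here", "y", "Where:", "There", "Where:", "Far away"]
```

<p>Loaded in GHCi, <code>pairNamesWithWheres ["ABC","EFG","HIJ"] sampleFontTexts</code> evaluates to <code>[("ABC","Here"),("EFG","There"),("HIJ","Far away")]</code>. Keeping this step pure means it can be checked separately from the XPath selection, which is where the hxt-xpath behaviour is in doubt.</p>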