Parsing fixed-format data embedded in HTML in Python
<p>I am using Google's App Engine API</p>
<pre><code>from google.appengine.api import urlfetch
</code></pre>
<p>to fetch a webpage. The result of</p>
<pre><code>result = urlfetch.fetch("http://www.example.com/index.html")
</code></pre>
<p>is a string of the HTML content (in result.content). The problem is that the data I want to parse is not really in HTML form, so I don't think a Python HTML parser will work for me. I need to parse all of the plain text in the body of the HTML document. The only problem is that urlfetch returns a single string of the entire HTML document, removing all newlines and extra spaces.</p>
<p><strong>EDIT:</strong> Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines; it was the original webpage I was trying to parse that served the HTML file that way... <strong>END EDIT</strong></p>
<p>If the document is something like this:</p>
<pre><code>&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A 288 AAA
&lt;/body&gt;&lt;/html&gt;
</code></pre>
<p>result.content will be this, after urlfetch fetches it:</p>
<pre><code>'&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE...A4A 288 AAA&lt;/body&gt;&lt;/html&gt;'
</code></pre>
<p>Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data. But as you can see, the last part of one line gets combined with the first part of the next line, and I don't know how to split it. I tried</p>
<pre><code>result.content.split('\n')
</code></pre>
<p>and</p>
<pre><code>result.content.split('\r')
</code></pre>
<p>but the resulting list had just one element. I don't see any option in Google's urlfetch function to keep the newlines.</p>
<p>Any ideas how I can parse this data? Maybe I need to fetch it differently?</p>
<p>Thanks in advance!</p>
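<p>For illustration only, here is a minimal sketch of one way records like these might be pulled out of the flattened string with a regular expression. It assumes every full record has exactly five fields (a three-character code, two integers, an ISO date, and a three-letter code); the sample string and the names RECORD, extract_body and parse_records are hypothetical, and shorter records such as the last one in the question would need their own pattern.</p>

```python
import re

# Hypothetical sample mirroring the flattened response described above.
content = ('<html><head></head><body>AAA 123 888 2008-10-30 ABC'
           'BBB 987 332 2009-01-02 JSE</body></html>')

# Assumed record layout: 3-char code, two integers, ISO date, 3-letter code.
RECORD = re.compile(
    r'([A-Z0-9]{3}) (\d+) (\d+) (\d{4}-\d{2}-\d{2}) ([A-Z]{3})'
)


def extract_body(html):
    """Return the text between <body> and </body>, or '' if there is none."""
    match = re.search(r'<body[^>]*>(.*?)</body>', html,
                      re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ''


def parse_records(html):
    """Find every fixed-format record, even when records run together."""
    return RECORD.findall(extract_body(html))


records = parse_records(content)
for record in records:
    print(record)
```

<p>Because the pattern fixes the width of the first and last fields, it can find the boundary between two records without relying on a newline. If the real data does not have fixed-width codes, this approach would not apply as written.</p>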
 
