You can use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) for parsing the HTML string.

Here is some code you might try. It uses BeautifulSoup to get the text produced by the HTML, then parses that text to extract the data.

```python
from bs4 import BeautifulSoup as bs

dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

soup = bs(data, "html.parser")

# Get the text of the HTML through BeautifulSoup
text = soup.get_text()

# Parse the text line by line
lines = text.splitlines()
for line in lines:
    # If the line has no ':', move on to the next line
    if line.find(':') == -1:
        continue

    # Split the string at ':'
    parts = line.split(':')
    # You can add more tests here, e.g.
    # if len(parts) != 2:
    #     continue

    # Strip whitespace from both parts
    for i in range(len(parts)):
        parts[i] = parts[i].strip()

    # Add the values to a dictionary
    dic[parts[0]] = parts[1]

    # Print the data after processing
    print('%16s %20s' % (parts[0], parts[1]))
```

A tip: if you are going to parse HTML with BeautifulSoup, give the tags you care about identifying attributes such as `class="input"` or `id="10"`; that is, keep tags of the same kind under the same class or a known id.

---

**Update**

Regarding your comment, see the code below. It applies the tip above, which makes life (and coding) a lot easier.

```python
from bs4 import BeautifulSoup as bs

c_addr = []
id_addr = []

data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka&nbsp;ON&nbsp;&nbsp;K7L LK <br>
"""

soup = bs(data, "html.parser")

for i in soup.find_all('div'):
    # Get the data using the "class" attribute
    addr = ""
    if i.get("class")[0] == 'address':  # "class" is returned as a list of strings
        text = i.get_text()
        for line in text.splitlines():  # process the text line by line
            line = line.strip()         # remove whitespace
            addr += line                # append to the address string
        c_addr.append(addr)

    # Get the data using the "id" attribute
    addr = ""
    if int(i.get("id")) == 10:          # "id" is returned as a string
        text = i.get_text()
        # Same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

print("id_addr")
print(id_addr)
print("c_addr")
print(c_addr)
```
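For the first snippet above (the Name/Surname markup), here is a minimal alternative sketch, not taken from the original answer: instead of splitting `get_text()` on `':'`, it walks the `<strong>` label tags directly and reads the text node that follows each one. The `html` snippet and the `record` dictionary name are illustrative placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical input, trimmed from the example above
html = """
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Gender:</strong> Male <br/>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
record = {}
for label in soup.find_all("strong"):
    # Label text, e.g. "Name:" -> "Name"
    key = label.get_text().strip().rstrip(":")
    # The node that directly follows </strong>; a NavigableString is a str subclass
    value = label.next_sibling
    if isinstance(value, str):
        record[key] = value.strip()

print(record)  # {'Name': 'Pasan', 'Surname': 'Wijesingher', 'Gender': 'Male'}
```

This avoids depending on each label and value landing on the same line of extracted text, at the cost of assuming every value is a plain text node right after its `<strong>` tag.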