  1. BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?
    <p>My source code looks like:</p>
<pre><code>&lt;h3&gt;Header3 (Start here)&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;List items&lt;/li&gt;
  &lt;li&gt;Etc...&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Header 3&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;List items&lt;/li&gt;
  &lt;ul&gt;
    &lt;li&gt;Nested list items&lt;/li&gt;
    &lt;li&gt;Nested list items&lt;/li&gt;&lt;/ul&gt;
  &lt;li&gt;List items&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Header 2 (end here)&lt;/h2&gt;
</code></pre>
<p>I'd like all the "li" tags following the first "h3" tag and stopping at the next "h2" tag, including all nested li tags.</p>
<blockquote> <p>firstH3 = soup.find('h3')</p> </blockquote>
<p>correctly finds the place I'd like to start.</p>
<pre><code>firstH3 = soup.find('h3') # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)
</code></pre>
<p>gives me a list of ULs, each with LI contents that I need.</p>
<p>EXCERPT OF THE "uls" LIST:</p>
<pre><code>&lt;ul&gt; ...
&lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks"&gt;Agent Cody Banks&lt;/a&gt;&lt;/i&gt; (2003)&lt;/li&gt; &lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London"&gt;Agent Cody Banks 2: Destination London&lt;/a&gt;&lt;/i&gt; (2004)&lt;/li&gt; &lt;li&gt;Air Bud series: &lt;ul&gt; &lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup"&gt;Air Bud: World Pup&lt;/a&gt;&lt;/i&gt; (2000)&lt;/li&gt; &lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch"&gt;Air Bud: Seventh Inning Fetch&lt;/a&gt;&lt;/i&gt; (2002)&lt;/li&gt; &lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back"&gt;Air Bud: Spikes Back&lt;/a&gt;&lt;/i&gt; (2003)&lt;/li&gt; &lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Air_Buddies" title="Air Buddies"&gt;Air Buddies&lt;/a&gt;&lt;/i&gt; (2006)&lt;/li&gt; &lt;/ul&gt; &lt;/li&gt; &lt;li&gt;&lt;i&gt;&lt;a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee"&gt;Akeelah and the Bee&lt;/a&gt;&lt;/i&gt; (2006)&lt;/li&gt; ... &lt;/ul&gt; </code></pre> <p>But I'm unsure of where to go from here. 
I'm a newbie programmer trying to jump into Python by building a script that scrapes <a href="http://en.wikipedia.org/wiki/2000s_in_film" rel="noreferrer">http://en.wikipedia.org/wiki/2000s_in_film</a> and extracts a list of "Movie Title (Year)".</p>
<hr>
<p>Update:</p>
<p><strong>Final Code:</strong></p>
<pre><code>lis = []
for ul in uls:
    for li in ul.findAll('li'):
        if li.find('ul'):
            continue  # skip an LI that wraps a nested UL; its nested LIs are visited on their own
        lis.append(li)

for li in lis:
    print li.text.encode("utf-8")
</code></pre>
<p>The <code>if ... continue</code> skips the LIs that contain ULs. findAll('li') already returns the nested LIs separately, so keeping the wrapping LI would duplicate their text. (A <code>break</code> here would also drop every remaining LI in that UL.)</p>
<p>Print output is now:</p>
<blockquote> <ul> <li>102 Dalmatians(2000)</li> <li>10th &amp; Wolf(2006)</li> <li>11:14(2006)</li> <li>12:08 East of Bucharest(2006)</li> <li>13 Going on 30(2004)</li> <li>1408(2007)</li> <li>...</li> </ul> </blockquote>
<p>Thanks</p>
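For readers on the current bs4 package, the approach described in the question might look roughly like the sketch below. It assumes Beautiful Soup 4 is installed, and it runs against a made-up SAMPLE_HTML string rather than the live Wikipedia page. The helper name extract_leaf_items is hypothetical and not part of the original post. It walks the siblings after the first h3, stops at the next h2, and keeps only the li tags that do not wrap a nested ul.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the structure described in the question.
SAMPLE_HTML = """
<h3>Header3 (Start here)</h3>
<ul>
  <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
  <li>Air Bud series:
    <ul>
      <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
      <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
    </ul>
  </li>
  <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
<h3>Header 3</h3>
<ul>
  <li><i>1408</i> (2007)</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul>
  <li>Not wanted (1999)</li>
</ul>
"""


def extract_leaf_items(html):
    """Return the text of each <li> that has no nested <ul>,
    taken from the <ul>s between the first <h3> and the next <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    first_h3 = soup.find("h3")
    items = []
    for sibling in first_h3.find_next_siblings():
        if sibling.name == "h2":
            break  # stop at the next <h2>
        if sibling.name != "ul":
            continue  # ignore headers and anything else in between
        # find_all is recursive, so nested <li>s are visited as well.
        for li in sibling.find_all("li"):
            if li.find("ul"):
                continue  # wrapper <li>: its children are collected on their own
            # Collapse runs of whitespace left over from the markup.
            items.append(" ".join(li.get_text().split()))
    return items


if __name__ == "__main__":
    for item in extract_leaf_items(SAMPLE_HTML):
        print(item)
```

Skipping the wrapper item with continue, rather than leaving the loop, keeps the list items that follow a nested list (here, "Akeelah and the Bee (2006)").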