# Find nearest link with BeautifulSoup (python)
I am doing a small project where I extract occurrences of political leaders in newspapers. Sometimes a politician will be mentioned and there is neither a parent nor a child with a link (due, I guess, to semantically bad markup).

So I want to create a function that can find the nearest link and then extract it. In the case below the search string is `Rasmussen` and the link I want is `/307046`.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mærkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Læs mere</a> |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find))

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child), and only
    X levels up/down. These features are not implemented yet.
    Will then return the link the fewest steps away from the original
    element. Assumes we have already found an element."""

    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if ("http" or "www") not in link:
                link = url + link
            return link

    # But if the link is not readily available, we will go up.
    # This is (I think) where it goes wrong
    # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
    if not element.find_parents('a'):
        element = element.parent
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this finds <a href=/307056>
                       # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element, url)

# run it
if contexts:
    for a in contexts:
        find_nearest(element=a, url="http://information.dk")
```

The direct call below works:

```python
print contexts[0].parent.parent.parent.a['href'].encode('utf-8')
```

For reference the whole sorry code is on bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(p.s. Using BeautifulSoup 4)

---

**EDIT**: SimonSapin asks me to define nearest: by nearest I mean the link that is the fewest nesting levels away from the search term, in either direction. In the text above, the `a href` produced by the Drupal-based newspaper site is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it.

I suspect a "fewest characters away" approach would often work too. In that case a solution could be hacked together with `find` and `rfind` - but I would really like to do this via BS. Since `contexts[0].parent.parent.parent.a['href'].encode('utf-8')` works, it must be possible to generalise that into a script.
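To illustrate the kind of generalisation I have in mind, here is a rough, untested sketch (the helper name `nearest_link` is just a placeholder of mine): climb through `.parents` and return the `href` of the first `<a>` found anywhere inside an ancestor, which should amount to "fewest nesting levels away" in either direction.

```python
def nearest_link(element):
    """Sketch: return the href of the <a> closest to `element`,
    measured in nesting levels. Each ancestor is searched for a
    descendant link before climbing one level further up."""
    for parent in element.parents:
        link = parent.find('a', href=True)
        if link is not None:
            return link['href']
    return None

# With the document above this should give '/307046':
# nearest_link(contexts[0])
```

Because each ancestor's descendants are searched before climbing further, this would also catch links that are "cousins" of the text node, not just direct parents.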
**EDIT**: Maybe I should emphasize that I am looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search, as suggested by @erik85, would quickly become messy, I think.
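One more detail I will have to clean up whatever traversal I end up with: the `("http" or "www") not in link` check in my function only ever tests for `"http"`, because `("http" or "www")` evaluates to `"http"`. If I understand the standard library correctly, `urljoin` handles both relative and absolute hrefs, so something like this sketch (Python 2 import shown; it lives in `urllib.parse` in Python 3) should be safer:

```python
from urlparse import urljoin  # urllib.parse.urljoin in Python 3

base_url = "http://information.dk"

# urljoin resolves relative hrefs against the base
# and leaves absolute URLs untouched
print urljoin(base_url, "/307046")               # http://information.dk/307046
print urljoin(base_url, "http://example.com/x")  # http://example.com/x
```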
 
