BeautifulSoup in Python not parsing right
    <p>I am running Python 2.7.5 and using the built-in html parser for what I am about to describe.</p> <p>The task I am trying to accomplish is to take a chunk of html that is essentially a recipe. Here is an example.</p> <p><code>html_chunk = "&lt;h1&gt;Miniature Potato Knishes&lt;/h1&gt;&lt;p&gt;Posted by bettyboop50 at recipegoldmine.com May 10, 2001&lt;/p&gt;&lt;p&gt;Makes about 42 miniature knishes&lt;/p&gt;&lt;p&gt;These are just yummy for your tummy!&lt;/p&gt;&lt;p&gt;3 cups mashed potatoes (about&lt;br&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2 very large potatoes)&lt;br&gt;2 eggs, slightly beaten&lt;br&gt;1 large onion, diced&lt;br&gt;2 tablespoons margarine&lt;br&gt;1 teaspoon salt (or to taste)&lt;br&gt;1/8 teaspoon black pepper&lt;br&gt;3/8 cup Matzoh meal&lt;br&gt;1 egg yolk, beaten with 1 tablespoon water&lt;/p&gt;&lt;p&gt;Preheat oven to 400 degrees F.&lt;/p&gt;&lt;p&gt;Sauté diced onion in a small amount of butter or margarine until golden brown.&lt;/p&gt;&lt;p&gt;In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.&lt;/p&gt;&lt;p&gt;Form mixture into small balls about the size of a walnut. 
Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned.&lt;/p&gt;"</code></p>
<p>The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.</p>
<p>Here is my code that accomplishes that:</p>
<pre><code>from bs4 import BeautifulSoup


def list_to_string(items):
    # Concatenate the string form of every item (tags included).
    return "".join(str(item) for item in items)


def get_ingredients(soup):
    # The ingredients are the first &lt;p&gt; that contains a &lt;br&gt;.
    # Returns None implicitly when no such paragraph exists.
    for p in soup.find_all('p'):
        if p.find('br'):
            return p


def get_instructions(p_list, ingredient_index):
    # Everything after the ingredients paragraph.
    return p_list[ingredient_index + 1:]


def get_junk(p_list, ingredient_index):
    # Everything before the ingredients paragraph.
    return p_list[:ingredient_index]


def get_serving(p_list):
    # Remove and return the first paragraph that mentions a serving keyword.
    keywords = ("yield", "make", "serve", "serving")
    for item in p_list:
        item_str = str(item).lower()
        # Test each keyword on its own: ("yield" or "make" or ...) evaluates
        # to just "yield", so the other words were never checked.
        if any(keyword in item_str for keyword in keywords):
            yield_index = p_list.index(item)
            del p_list[yield_index]
            return item


def ingredients_count(ingredients):
    # Count the text nodes inside the ingredients paragraph.
    ingredients_list = ingredients.find_all(text=True)
    return len(ingredients_list)


def get_header(soup):
    return soup.find('h1')


def html_chunk_splitter(soup):
    ingredients = get_ingredients(soup)
    if ingredients is None:
        error = 1
        header = ""
        junk_string = ""
        instructions_string = ""
        serving = ""
        count = ""
    else:
        p_list = soup.find_all('p')
        serving = get_serving(p_list)
        ingredient_index = p_list.index(ingredients)
        junk_list = get_junk(p_list, ingredient_index)
        instructions_list = get_instructions(p_list, ingredient_index)
        junk_string = list_to_string(junk_list)
        instructions_string = list_to_string(instructions_list)
        header = get_header(soup)
        error = ""
        count = ingredients_count(ingredients)
    return (header, junk_string, ingredients, instructions_string, serving, count, error)
</code></pre>
<p>It works well except when a chunk contains a string like <code>"Sauté"</code>: <code>soup = BeautifulSoup(html_chunk)</code> turns Sauté into SautÃ©. This is a problem because I 
have a huge CSV file of recipes like the html_chunk, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether Sauté comes out right using this <a href="http://www.play-hookey.com/htmltest/" rel="nofollow">html previewer</a>, and it still comes out as SautÃ©. I don't know what to do about this.</p>
<p>What's stranger is that when I do what BeautifulSoup's documentation shows</p>
<pre><code>BeautifulSoup("Sacr&amp;eacute; bleu!")
# &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;Sacré bleu!&lt;/body&gt;&lt;/html&gt;
</code></pre>
<p>I get</p>
<pre><code># Sacr├⌐ bleu!
</code></pre>
<p>But my colleague tried that on his Mac, running from the terminal, and he got exactly what the documentation shows.</p>
<p>I really appreciate all your help. Thank you.</p>
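<p>For reference, the two garbled forms above are what you would expect if UTF-8 bytes were decoded with a different single-byte encoding. The sketch below reproduces both using only the standard library. It is an illustration under assumptions, not a diagnosis: it assumes the text is stored as UTF-8, that the previewer reads it as Latin-1, and that the console uses code page 437. The helper name <code>misdecode</code> is hypothetical.</p>

```python
# -*- coding: utf-8 -*-
# Sketch: show how UTF-8 bytes look when decoded with the wrong encoding.
# Assumptions (not confirmed by the question): the data is UTF-8, the
# previewer reads Latin-1, and the Windows console uses code page 437.


def misdecode(text, wrong_encoding):
    """Encode text as UTF-8, then decode the bytes with another encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


# "é" is the two bytes 0xC3 0xA9 in UTF-8.
# Latin-1 maps them to "Ã" and "©"; code page 437 maps them to "├" and "⌐".
as_latin1 = misdecode(u"Sauté", "latin-1")
as_cp437 = misdecode(u"Sacré bleu!", "cp437")

# Reversing the wrong step recovers the original text.
repaired = as_latin1.encode("latin-1").decode("utf-8")
```

<p>If the same pattern shows up in your data, the characters themselves are probably intact and only the decoding step (or the console's display encoding) differs between the two machines.</p>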