Strategy for parsing natural language descriptions into structured data
<p>I have a set of requirements and I'm looking for the best <strong>Java-based</strong> strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse out the metadata into a structured format (see the requirements below for what I'm trying to do).</p>
<p>I've looked around here and in other places, but have found nothing that gives high-level advice on which direction to follow. So, I'll put it to the smart people :-):</p>
<p>What's the best / simplest way to solve this problem? Should I use a natural language parser, a DSL, Lucene/Solr, or some other tool/technology? NLP seems like it may work, but it looks really complex. I'd rather not spend a whole lot of time doing a deep dive just to find out that it can't do what I'm looking for or that there is a simpler solution.</p>
<h1>Requirements</h1>
<p>Given these recipe ingredient descriptions....</p>
<ol>
<li>"8 cups of mixed greens (about 5 ounces)"</li>
<li>"Eight skinless chicken thighs (about 1¼ lbs)"</li>
<li>"6.5 tablespoons extra-virgin olive oil"</li>
<li>"approximately 6 oz. thinly sliced smoked salmon, cut into strips"</li>
<li>"2 whole chickens (3 .5 pounds each)"</li>
<li>"20 oz each frozen chopped spinach, thawed"</li>
<li>".5 cup parmesan cheese, grated"</li>
<li>"about .5 cup pecans, toasted and finely ground"</li>
<li>".5 cup Dixie Diner Bread Crumb Mix, plain"</li>
<li>"8 garlic cloves, minced (4 tsp)"</li>
<li>"8 green onions, cut into 2 pieces"</li>
</ol>
<p>I want to turn it into this....</p>
<pre>
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
|     | Measure |             |                         | weight | weight    |                                |             |
| #   | value   | Measure     | ingredient              | value  | measure   | preparation                    | Brand Name  |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
| 1.  | 8       | cups        | mixed greens            | 5      | ounces    | -                              | -           |
| 2.  | 8       | -           | skinless chicken thigh  | 1.25   | pounds    | -                              | -           |
| 3.  | 6.5     | tablespoons | extra-virgin olive oil  | -      | -         | -                              | -           |
| 4.  | 6       | ounces      | smoked salmon           | -      | -         | thinly sliced, cut into strips | -           |
| 5.  | 2       | -           | whole chicken           | 3.5    | pounds    | -                              | -           |
| 6.  | 20      | ounces      | frozen chopped spinach  | -      | -         | thawed                         | -           |
| 7.  | .5      | cup         | parmesan cheese         | -      | -         | grated                         | -           |
| 8.  | .5      | cup         | pecans                  | -      | -         | toasted, finely ground         | -           |
| 9.  | .5      | cup         | Bread Crumb Mix, plain  | -      | -         | -                              | Dixie Diner |
| 10. | 8       | -           | garlic clove            | 4      | teaspoons | minced                         | -           |
| 11. | 8       | -           | green onions            | -      | -         | cut into 2 pieces              | -           |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
</pre>
<p>Note the diversity of the descriptions. Some things are abbreviated, some are not. Some numbers are written as digits, some are spelled out.</p>
<p>I would love something that does a perfect parse/translation, but I would settle for something that does reasonably well to start.</p>
<p>Bonus question: after suggesting a strategy / tool, how would you go about it?</p>
<p>Thanks!</p>
<p>Joe</p>
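<p>To make the target concrete, here is a minimal rule-based sketch in Java that handles only the simplest of the cases above. It is an illustration under stated assumptions, not a proposed solution: the class name <code>IngredientSketch</code>, the <code>UNITS</code> table and the regular expression are all made up for this example. It does not handle spelled-out numbers ("Eight"), parenthetical weights, brand names, singularisation, or preparation words that come before the ingredient ("thinly sliced").</p>

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IngredientSketch {
    // Hypothetical unit table: spellings from the examples mapped to a canonical name.
    static final Map<String, String> UNITS = Map.ofEntries(
        Map.entry("cup", "cup"), Map.entry("cups", "cups"),
        Map.entry("tablespoons", "tablespoons"),
        Map.entry("oz", "ounces"), Map.entry("ounces", "ounces"),
        Map.entry("lbs", "pounds"), Map.entry("pounds", "pounds"),
        Map.entry("tsp", "teaspoons"));

    // Optional hedge word, a numeric quantity, the next token, then the rest of the line.
    static final Pattern LINE = Pattern.compile(
        "^(?:about|approximately)?\\s*(\\d*\\.?\\d+)\\s+(\\S+)\\s*(.*)$",
        Pattern.CASE_INSENSITIVE);

    static final class Parsed {
        final String value, measure, ingredient, preparation;

        Parsed(String value, String measure, String ingredient, String preparation) {
            this.value = value;
            this.measure = measure;
            this.ingredient = ingredient;
            this.preparation = preparation;
        }

        @Override
        public String toString() {
            return value + " | " + measure + " | " + ingredient + " | " + preparation;
        }
    }

    /** Returns null when the line does not start with a numeric quantity. */
    static Parsed parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // Normalise the token after the number ("oz." becomes "oz") and look it up.
        String token = m.group(2).toLowerCase().replaceAll("\\.$", "");
        String measure = UNITS.getOrDefault(token, "-");
        // If the token is not a known unit, it belongs to the ingredient text.
        String rest = measure.equals("-") ? m.group(2) + " " + m.group(3) : m.group(3);
        rest = rest.replaceAll("\\s*\\([^)]*\\)", "") // drop parenthetical notes
                   .replaceFirst("^of\\s+", "")       // "cups of mixed greens"
                   .trim();
        // Everything after the first comma is treated as preparation.
        int comma = rest.indexOf(',');
        String ingredient = comma < 0 ? rest : rest.substring(0, comma).trim();
        String preparation = comma < 0 ? "-" : rest.substring(comma + 1).trim();
        return new Parsed(m.group(1), measure, ingredient, preparation);
    }

    public static void main(String[] args) {
        System.out.println(parse("6.5 tablespoons extra-virgin olive oil"));
        System.out.println(parse(".5 cup parmesan cheese, grated"));
        System.out.println(parse("8 garlic cloves, minced (4 tsp)"));
    }
}
```

<p>Even this small sketch shows where hand-written rules start to strain: every new phrasing needs another special case, which is the reason for asking whether a proper NLP tool or a DSL would be the better direction.</p>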
 
