  1. Parsing text with Python: unstructured but similar information with different formatting
    <p>I'm trying to parse thousands of spec sheet text files containing company, material, chemical properties, etc. (Material Safety Data Sheets, to be specific) with Python. The text files contain similar information in loosely structured formatting such that it's human readable, but unstructured enough that it's not easily parsed (e.g. not XML or CSV). In short, it's just all over the place.</p> <p>Originally, the data is entered by hand by different people working at different companies. Another set of people transcribe the information into these text files (OCR it into a txt file).</p> <p>Is there a parsing library or patterns to extract bits of information of this type? (This seems to be a "common" data entry problem.) Certainly regular expressions will be used a lot. I don't have any experience with natural language processing libraries. Would they even be appropriate for the problem?</p> <p>My initial thought is to try and group the files into different categories, then create a set of parsing functions for each format. Unfortunately this may only work for a small subset of the problem and the different cases could quickly spiral out of control. </p> <p>Since this question is general, I'll provide a bunch of examples illustrating the problem. </p> <p><strong>ADDRESS INFORMATION</strong><br> Each file contains company information such as name and address. The information may or may not have an identifier, it may or may not be on one line, etc. In short, there seems to be every combination. </p> <p>Ex. (w/ field info): </p> <pre><code>MANUFACTURER: Foo Bar Inc. ADDRESS: 123 Foo St. Bar, CA 90012 </code></pre> <p>Ex. (wo/ field info): </p> <pre><code>Foo Bar Inc. 123 Foo St. Bar, CA 90012 </code></pre> <p>Ex. (Sometimes extra lines between information): </p> <pre><code>FOO BAR INC. 123 FOO ST. BAR, CA 90012 </code></pre> <p>Ex. (inconsistent field names): </p> <pre><code>MANUFACTURER'S NAME: FOO BAR INC. CREATIVE DIVISION ADDRESS: 123 FOO ST. 
CITY, STATE &amp; ZIP: BAR, CALIFORNIA 90012 PHONE NUMBER: 310-111-2222 </code></pre> <p><strong>SECTION INFO</strong><br> The spec sheets also have similar sections, but with inconsistent ordering, headings, numeral types and delimiters. </p> <p>Ex:</p> <pre><code>======================================== SECTION 1 -- MATERIALS ======================================== </code></pre> <p>Ex: </p> <pre><code>Section I. Materials ------------------------------------------ </code></pre> <p>Ex: </p> <pre><code>----- Section 3 Materials </code></pre> <p><strong>And sometimes the files had their width changed, so long lines wrap as follows.</strong> </p> <p>Ex:</p> <pre><code>=================================================== 1. Materials =================================================== </code></pre> <p>Becomes:</p> <pre><code>========================================= ========== 1. Materials ========================================= ========== </code></pre> <p><strong>Here is a complete example:</strong><br> Hopefully this will clarify the issues with parsing the files. You'll notice the line wrapping, information split across different lines, etc. Not all files have this exact structure; some are formatted differently, with information in different places. Here is a link to <a href="http://www.crazychameleonbodyartsupply.com/content/msds-page-one-large.htm" rel="nofollow">a paper hard copy</a>. 
</p> <pre><code>MATERIAL SAFETY DATA SHEET ================================================================= ========= SECTION I-PRODUCT AND PREPARATION INFORMATION ================================================================= ========= MANUFACTURER: Some Company Inc EMERGENCY AND INFORMATION TELEPHONE (111)222-3333 ADDRESS: Some Road City, ST 12346 IDENTITY (AS USED ON LABEL AND LIST): Some Identity PREPARATION DATE: Some Date ================================================================= ========= SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION ================================================================= ========= OSHA ACGIH HAZARDOUS COMPONENTS CAS# PEL TWA TLV % (SPECIFIC CHEMICAL IDENTITY; COMMON NAME(S) ----------------------------------------------------------------- --------- Some Chemical 111-22-3 15 10 10 12.34 ================================================================= ========= SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS ================================================================= ========= Boiling Point: N/A Specific Gravity (H20=1): N/A Vapor Pressure (mm Hg): N/A Melting Point: N/A Vapor Density (AIR=1) N/A Evaporation Rate (Butyl Acetate=1) N/A Solubility in Water: None Appearance: Solid, various colors, may have slight odor. N/A = Not applicable ================================================================= ========= SECTION IV-FIRE AND EXPLOSION HAZARD DATA ================================================================= ========= FLASH POINT (METHOD USED): None FLAMMABLE LIMITS: None LEL: N/A UEL: N/A EXTINGUISHING MEDIA: None SPECIAL FIRE FIGHTING PROCEDURES: None required. UNUSUAL FIRE AND EXPLOSION HAZARDS: None. 
================================================================= ========= SECTION V-REACTIVITY DATA ================================================================= ========= STABILITY: Stable CONDITIONS TO AVOID: None INCOMPATIBILITY (MATERIALS TO AVOID): None HAZARDOUS POLYMERIZATION: Will not occur ================================================================= ========= SECTION VI-HEALTH HAZARD DATA ================================================================= ========= ROUTES OF ENTRY: INHALATION: Yes SKIN: Possibly INGESTION: Possibly EYES: Possibly HEALTH HAZARDS (ACUTE AND CHRONIC): Pneumoconiosis, silicosis, emphysema, nose and throat irritation, eye irritation, skin irritation in some. CARCINOGENICITY: No applicable information found. SIGNS AND SYMPTOMS OF EXPOSURE: Coughing, sneezing; irritation of the mucous membranes; eye irritation; skin irritation or rash, dry throat. MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE: Nasal, bronchial or pulmonary conditions which tend to restrict breathing, skin abrasions. EMERGENCY AND FIRST AID PROCEDURES: Remove to fresh air, irrigate eyes, wash with soap and water, contact physician if necessary. ================================================================= ========= SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE ================================================================= ========= STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED: Normal clean-up procedures. WASTE DISPOSAL METHOD: Standard landfill methods consistent with applicable state and federal regulations. PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING: Use caution not to drop, crush, break or chip. OTHER PRECAUTIONS: Do not use at speeds greater than the not-to-exceed speed printed on the hub assembly. 
================================================================= ========= SECTION VIII-CONTROL MEASURES ================================================================= ========= RESPIRATORY PROTECTION (SPECIFY TYPE): OSHA or NIOSH approved respirators may be required. VENTILATION: Local exhaust recommended. Special: N/A. Mechanical: Useful. Other: N/A. PROTECTIVE GLOVES: May be useful. EYE PROTECTION: Recommended. OTHER PROTECTIVE CLOTHING OR EQUIPMENT: Not required. WORK/HYGIENIC PRACTICES: Keep clothing and area clean. Wash to remove </code></pre>
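<p>To make the section-heading variations above concrete, here is a minimal Python sketch of the kind of pattern I have in mind. It assumes each heading sits on its own line once the wrapped delimiter lines are skipped, and it only covers the heading styles shown in the examples. The names <code>RULE</code>, <code>HEADING</code>, <code>roman_to_int</code> and <code>find_sections</code> are hypothetical and not from any library.</p>

```python
import re

# A line made up only of delimiter characters, e.g. "=====" or "-----".
# Wrapped remainders such as a lone "==========" line match as well.
RULE = re.compile(r"^\s*[=\-]{3,}\s*$")

# Heading styles taken from the examples in the question (an assumption,
# not an exhaustive list of what real sheets contain).
HEADING = re.compile(
    r"""^\s*
        (?:
            (?i:section)\s+(?P<num>\d+|[IVXLC]+)\b\s*(?:--|[-.:])?  # Section 3 / SECTION I- / SECTION 1 --
          |
            (?P<bare>\d+)\.(?=\s)                                   # bare numbered heading: 1. Materials
        )
        \s*(?P<title>\S.*?)\s*$
    """,
    re.VERBOSE,
)

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert an uppercase Roman numeral such as 'VIII' to an int."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # Subtract when a smaller numeral precedes a larger one (IV, IX).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def find_sections(text):
    """Return (section_number, title) pairs for every heading line found."""
    sections = []
    for line in text.splitlines():
        if RULE.match(line):
            continue  # skip delimiter lines, including wrapped fragments
        match = HEADING.match(line)
        if match:
            raw = match.group("num") or match.group("bare")
            number = int(raw) if raw.isdigit() else roman_to_int(raw)
            sections.append((number, match.group("title")))
    return sections
```

<p>Normalizing the numeral to an integer lets the differently formatted files be compared by section number, but lowercase Roman numerals and headings without any numeral are not handled here, which is exactly where I worry the special cases start to pile up.</p>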