<p>Assuming a few things (that the header is fixed and that the fields of each line are delimited by double spaces), it's quite easy to implement a parser in Haskell for this file. The end result will probably be longer than your regexp (and there are regexp libraries in Haskell if that fits your desire), but it's <em>far</em> more testable and readable. I'll demonstrate some of that while I outline how to build one for this file format.</p>
<p>I'll use Attoparsec. We'll also need the <code>ByteString</code> data type (and the <code>OverloadedStrings</code> pragma, which lets Haskell interpret string literals as either <code>String</code> or <code>ByteString</code>) and some combinators from <code>Control.Applicative</code> and <code>Control.Monad</code>.</p>
<pre><code>{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8
import Control.Applicative
import Control.Monad
import qualified Data.ByteString.Char8 as S
</code></pre>
<p>First, we'll build a data type representing each record.</p>
<pre><code>data YearMonthDay =
  YearMonthDay { ymdYear  :: Int
               , ymdMonth :: Int
               , ymdDay   :: Int
               }
  deriving ( Show )

data Line =
  Line { agent     :: Int
       , name      :: S.ByteString
       , st        :: Int
       , ud        :: Int
       , targetNum :: Int
       , xyz       :: Int
       , xDate     :: YearMonthDay
       , year      :: Int
       , co        :: S.ByteString
       , encoding  :: S.ByteString
       }
  deriving ( Show )
</code></pre>
<p>You could fill in more descriptive types for each field if desired, but this isn't a bad start. Since each line can be parsed independently, I'll do just that. The first step is to build a parser of type <code>Parser Line</code>; read that as a parser which returns a <code>Line</code> if it succeeds.</p>
<p>To do this, we'll build our <code>Line</code> type "inside of" the Parser using its <code>Applicative</code> interface. That sounds really complex, but it's simple and looks quite pretty.
We'll start with the <code>YearMonthDay</code> type as a warm-up:</p>
<pre><code>parseYMDWrong :: Parser YearMonthDay
parseYMDWrong = YearMonthDay &lt;$&gt; decimal
                             &lt;*&gt; decimal
                             &lt;*&gt; decimal
</code></pre>
<p>Here, <a href="http://hackage.haskell.org/package/attoparsec-0.10.4.0/docs/Data-Attoparsec-ByteString-Char8.html#v%3adecimal" rel="nofollow"><code>decimal</code></a> is a built-in Attoparsec parser which parses an integral type like <code>Int</code>. You can read this parser as nothing more than "parse three decimal numbers and use them to build my <code>YearMonthDay</code> type" and you'd be basically correct. The <code>(&lt;*&gt;)</code> operator (read as "apply") sequences the parsers and collects their results into our <code>YearMonthDay</code> constructor function.</p>
<p>Unfortunately, as I indicated in the name, it's a little bit wrong. Specifically, we're currently ignoring the <code>'/'</code> characters which delimit the numbers inside of our <code>YearMonthDay</code>. We fix this by using the "sequence and throw away" operator <code>(&lt;*)</code>. It's a visual pun on <code>(&lt;*&gt;)</code>, and we use it when we want to perform a parsing action but don't want to keep its result.</p>
<p>We use <code>(&lt;*)</code> to augment the first two <code>decimal</code> parsers with their following <code>'/'</code> characters using the built-in <code>char8</code> parser.</p>
<pre><code>parseYMD :: Parser YearMonthDay
parseYMD = YearMonthDay &lt;$&gt; (decimal &lt;* char8 '/')
                        &lt;*&gt; (decimal &lt;* char8 '/')
                        &lt;*&gt; decimal
</code></pre>
<p>And we can test that this is a valid parser using Attoparsec's <code>parseOnly</code> function:</p>
<pre><code>&gt;&gt;&gt; parseOnly parseYMD "2013/12/12"
Right (YearMonthDay {ymdYear = 2013, ymdMonth = 12, ymdDay = 12})
</code></pre>
<hr>
<p>We'd like to now generalize this technique to the entire <code>Line</code> parser. There's one hitch, however.
We'd like to parse <code>ByteString</code> fields like <code>"SMITH, JOHN"</code> which might contain spaces, while also delimiting each field of our <code>Line</code> by double spaces. This means that we need a special <code>ByteString</code> parser which consumes any character, including single spaces, but quits the moment it sees two spaces in a row.</p>
<p>We can build this using the <code>scan</code> combinator. <code>scan</code> allows us to accumulate a state while consuming characters in our parse and determine when to stop that parse on the fly. We'll keep a boolean state ("was the last character a space?") and stop whenever we see a new space while knowing the previous character was a space too.</p>
<pre><code>parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -&gt; Char -&gt; Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _               = Just False
</code></pre>
<p>We can again test this little piece using <code>parseOnly</code>. Let's try parsing three string fields.</p>
<pre><code>&gt;&gt;&gt; let p = (,,) &lt;$&gt; parseStringField &lt;*&gt; parseStringField &lt;*&gt; parseStringField
&gt;&gt;&gt; parseOnly p "foo  bar  baz"
Right ("foo "," bar "," baz")
&gt;&gt;&gt; parseOnly p "foo bar  baz quux  end"
Right ("foo bar "," baz quux "," end")
&gt;&gt;&gt; parseOnly p "a sentence with no double space delimiters"
Right ("a sentence with no double space delimiters","","")
</code></pre>
<p>Depending on your actual file format, this might be perfect. It's worth noting that it leaves trailing spaces (these could be trimmed if desired) and it allows some space-delimited fields to be empty. It's easy to keep fiddling with this piece to fix these issues, but I'll leave it for now.</p>
<p>We can now build our <code>Line</code> parser. As with <code>parseYMD</code>, we'll follow each field's parser with a delimiting parser, <code>someSpaces</code>, which consumes two or more spaces.
We'll use the <code>MonadPlus</code> interface to <code>Parser</code> to build this atop the built-in parser <code>space</code> by (1) parsing <code>some space</code>s and (2) checking to be sure that we got at least two of them.</p>
<pre><code>someSpaces :: Parser Int
someSpaces = do
  sps &lt;- some space
  let count = length sps
  if count &gt;= 2 then return count else mzero

&gt;&gt;&gt; parseOnly someSpaces "  "
Right 2
&gt;&gt;&gt; parseOnly someSpaces "    "
Right 4
&gt;&gt;&gt; parseOnly someSpaces " "
Left "Failed reading: mzero"
</code></pre>
<p>And now we can build the line parser. Because <code>parseStringField</code> has already consumed the first space of the delimiter, a string field must be followed by at least three spaces for <code>someSpaces</code> to succeed, as in the sample input below.</p>
<pre><code>lineParser :: Parser Line
lineParser = Line &lt;$&gt; (decimal &lt;* someSpaces)
                  &lt;*&gt; (parseStringField &lt;* someSpaces)
                  &lt;*&gt; (decimal &lt;* someSpaces)
                  &lt;*&gt; (decimal &lt;* someSpaces)
                  &lt;*&gt; (decimal &lt;* someSpaces)
                  &lt;*&gt; (decimal &lt;* someSpaces)
                  &lt;*&gt; (parseYMD &lt;* someSpaces)
                  &lt;*&gt; (decimal &lt;* someSpaces)
                  &lt;*&gt; (parseStringField &lt;* someSpaces)
                  &lt;*&gt; (parseStringField &lt;* some space)

&gt;&gt;&gt; parseOnly lineParser "0007  SMITH, JOHN   43  3  1234567  001  12/06/2013  2004  ABC   SIZE XL  "
Right (Line { agent = 7
            , name = "SMITH, JOHN "
            , st = 43
            , ud = 3
            , targetNum = 1234567
            , xyz = 1
            , xDate = YearMonthDay {ymdYear = 12, ymdMonth = 6, ymdDay = 2013}
            , year = 2004
            , co = "ABC "
            , encoding = "SIZE XL "
            })
</code></pre>
<p>Note that the sample date <code>12/06/2013</code> puts the year last, so the positional <code>parseYMD</code> mislabels its parts (<code>ymdYear = 12</code>, <code>ymdDay = 2013</code>); reorder the fields if your dates are laid out that way.</p>
<p>And then we can just cut off the header and parse each line.</p>
<pre><code>parseFile :: S.ByteString -&gt; [Either String Line]
parseFile = map (parseOnly lineParser) . drop 14 . S.lines
</code></pre>
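<p>The trailing spaces that <code>parseStringField</code> leaves behind can be stripped after the fact. The following is only a minimal sketch: <code>trimmedField</code> is a hypothetical helper name (it is not part of Attoparsec or of the code above), and it assumes that plain <code>' '</code> characters are the only padding you want removed. It repeats the scanner so that it compiles on its own.</p>
<pre><code>{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as S

-- The same scanner as above: stop at the second of two consecutive spaces.
parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -&gt; Char -&gt; Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _               = Just False

-- Hypothetical helper: run the scanner, then drop any trailing spaces
-- from the result. The amount of input consumed is unchanged.
trimmedField :: Parser S.ByteString
trimmedField = fst . S.spanEnd (== ' ') &lt;$&gt; parseStringField

main :: IO ()
main = print (parseOnly trimmedField "SMITH, JOHN  43")
-- prints: Right "SMITH, JOHN"
</code></pre>
<p>Since <code>trimmedField</code> consumes exactly the same input as <code>parseStringField</code>, it should be usable as a drop-in replacement inside <code>lineParser</code> without changing how the delimiters are handled.</p>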