StackOverflow2013


plurals

PO
primarykey
Id
20551832
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2013-12-12T18:52:39.610
FavoriteCount
0
LastActivityDate
2013-12-12T19:00:10.017
LastEditDate
2013-12-12T19:00:10.017
LastEditorUserId
476408
OwnerUserId
476408
ParentId
20548518
PostTypeId
2
Score
1
ViewCount
0
LastEditorDisplayName
text
Body
Assuming a few things (that the header is fixed and that the fields of each line are delimited by double spaces), it's quite easy to implement a parser in Haskell for this file. The end result is probably going to be longer than your regexp (and there are regexp libraries in Haskell if you prefer that route), but it's far more testable and readable. I'll demonstrate some of that while I outline how to build one for this file format.
I'll use Attoparsec. We'll also need the <code>ByteString</code> data type (and the <code>OverloadedStrings</code> pragma, which lets Haskell interpret string literals as either <code>String</code> or <code>ByteString</code>) and some combinators from <code>Control.Applicative</code> and <code>Control.Monad</code>.
<pre><code>{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.ByteString.Char8
import Control.Applicative
import Control.Monad
import qualified Data.ByteString.Char8 as S
</code></pre>
First, we'll build a data type representing each record.
<pre><code>data YearMonthDay =
  YearMonthDay { ymdYear  :: Int
               , ymdMonth :: Int
               , ymdDay   :: Int
               }
  deriving ( Show )

data Line =
  Line { agent     :: Int
       , name      :: S.ByteString
       , st        :: Int
       , ud        :: Int
       , targetNum :: Int
       , xyz       :: Int
       , xDate     :: YearMonthDay
       , year      :: Int
       , co        :: S.ByteString
       , encoding  :: S.ByteString
       }
  deriving ( Show )
</code></pre>
You could fill in more descriptive types for each field if desired, but this isn't a bad start. Since each line can be parsed independently, I'll do just that. The first step is to build a value of type <code>Parser Line</code>; read that as a parser which returns a <code>Line</code> if it succeeds. To do this, we'll build our <code>Line</code> type "inside of" the <code>Parser</code> using its <code>Applicative</code> interface. That sounds complex, but it's simple and looks quite pretty.
We'll start with the <code>YearMonthDay</code> type as a warm-up.
<pre><code>parseYMDWrong :: Parser YearMonthDay
parseYMDWrong = YearMonthDay <$> decimal <*> decimal <*> decimal
</code></pre>
Here, <a href="http://hackage.haskell.org/package/attoparsec-0.10.4.0/docs/Data-Attoparsec-ByteString-Char8.html#v%3adecimal" rel="nofollow"><code>decimal</code></a> is a built-in Attoparsec parser which parses an integral type like <code>Int</code>. You can read this parser as nothing more than "parse three decimal numbers and use them to build my <code>YearMonthDay</code> type" and you'd be basically correct. The <code>(<*>)</code> operator (read as "apply") sequences the parses and collects their results into our <code>YearMonthDay</code> constructor function.
Unfortunately, as I indicated in the name, it's a little bit wrong. Specifically, we're currently ignoring the <code>'/'</code> characters which delimit the numbers inside of our <code>YearMonthDay</code>. We fix this by using the "sequence and throw away" operator <code>(<*)</code>. It's a visual pun on <code>(<*>)</code>, and we use it when we want to perform a parsing action but don't want to keep the result.
We use <code>(<*)</code> to augment the first two <code>decimal</code> parsers with their following <code>'/'</code> characters using the built-in <code>char8</code> parser.
<pre><code>parseYMD :: Parser YearMonthDay
parseYMD = YearMonthDay <$> (decimal <* char8 '/')
                        <*> (decimal <* char8 '/')
                        <*> decimal
</code></pre>
And we can test that this is a valid parser using Attoparsec's <code>parseOnly</code> function.
<pre><code>>>> parseOnly parseYMD "2013/12/12"
Right (YearMonthDay {ymdYear = 2013, ymdMonth = 12, ymdDay = 12})
</code></pre>
<hr>
We'd like to now generalize this technique to the entire <code>Line</code> parser. There's one hitch, however. We'd like to parse <code>ByteString</code> fields like <code>"SMITH, JOHN"</code> which might contain spaces... 
while also delimiting each field of our <code>Line</code> by double spaces. This means that we need a special <code>ByteString</code> parser which consumes any character, including single spaces, but quits the moment it sees two spaces in a row. We can build this using the <code>scan</code> combinator. <code>scan</code> allows us to accumulate a state while consuming characters in our parse and to decide on the fly when to stop that parse. We'll keep a boolean state ("was the last character a space?") and stop whenever we see a new space while knowing the previous character was a space too.
<pre><code>parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -> Char -> Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _               = Just False
</code></pre>
We can again test this little piece using <code>parseOnly</code>. Let's try parsing three string fields.
<pre><code>>>> let p = (,,) <$> parseStringField <*> parseStringField <*> parseStringField
>>> parseOnly p "foo  bar  baz"
Right ("foo "," bar "," baz")
>>> parseOnly p "foo bar  baz quux  end"
Right ("foo bar "," baz quux "," end")
>>> parseOnly p "a sentence with no double space delimiters"
Right ("a sentence with no double space delimiters","","")
</code></pre>
Depending on your actual file format, this might be perfect. It's worth noting that it leaves leading and trailing spaces on the fields (these could be trimmed if desired) and that it allows some space-delimited fields to be empty. It's easy to keep fiddling with this piece to change that behavior, but I'll leave it for now.
We can now build our <code>Line</code> parser. As with <code>parseYMD</code>, we'll follow each field's parser with a delimiting parser, <code>someSpaces</code>, which consumes two or more spaces. 
We'll use the <code>MonadPlus</code> interface to <code>Parser</code> to build this atop the built-in parser <code>space</code> by (1) parsing <code>some space</code>s and (2) checking to be sure that we got at least two of them.
<pre><code>someSpaces :: Parser Int
someSpaces = do
  sps <- some space
  let count = length sps
  if count >= 2 then return count else mzero

>>> parseOnly someSpaces "  "
Right 2
>>> parseOnly someSpaces "    "
Right 4
>>> parseOnly someSpaces " "
Left "Failed reading: mzero"
</code></pre>
And now we can build the line parser.
<pre><code>lineParser :: Parser Line
lineParser = Line <$> (decimal <* someSpaces)
                  <*> (parseStringField <* someSpaces)
                  <*> (decimal <* someSpaces)
                  <*> (decimal <* someSpaces)
                  <*> (decimal <* someSpaces)
                  <*> (decimal <* someSpaces)
                  <*> (parseYMD <* someSpaces)
                  <*> (decimal <* someSpaces)
                  <*> (parseStringField <* someSpaces)
                  <*> (parseStringField <* some space)

>>> parseOnly lineParser "0007  SMITH, JOHN   43  3  1234567  001  12/06/2013  2004  ABC   SIZE XL  "
Right (Line { agent = 7
            , name = "SMITH, JOHN "
            , st = 43
            , ud = 3
            , targetNum = 1234567
            , xyz = 1
            , xDate = YearMonthDay {ymdYear = 12, ymdMonth = 6, ymdDay = 2013}
            , year = 2004
            , co = "ABC "
            , encoding = "SIZE XL "
            })
</code></pre>
Note that the sample date is written month/day/year, so <code>parseYMD</code> fills the <code>YearMonthDay</code> fields in the wrong order here; reorder the parser if your file uses that layout.
And then we can just cut off the header and parse each line.
<pre><code>parseFile :: S.ByteString -> [Either String Line]
parseFile = map (parseOnly lineParser) . drop 14 . S.lines
</code></pre>
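Two loose ends remain from the pieces above: string fields keep their trailing space, and the sample date is month-first. Here is a minimal sketch of how both might be handled. It is only an illustration: <code>Date</code>, <code>parseMDY</code>, and <code>trimmedField</code> are hypothetical names that do not appear in the code above, and it assumes the <code>Data.Attoparsec.ByteString.Char8</code> module from the <code>attoparsec</code> package together with <code>bytestring</code>.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as S

-- Hypothetical record; the field names are illustrative only.
data Date = Date
  { dYear  :: Int
  , dMonth :: Int
  , dDay   :: Int
  } deriving (Show, Eq)

-- Parse "MM/DD/YYYY" and reorder the pieces into year, month, day.
parseMDY :: Parser Date
parseMDY = mk <$> (decimal <* char '/') <*> (decimal <* char '/') <*> decimal
  where
    mk m d y = Date y m d

-- Same scan-based idea as parseStringField: consume characters until
-- the second of two consecutive spaces, which is left unconsumed.
parseStringField :: Parser S.ByteString
parseStringField = scan False step
  where
    step :: Bool -> Char -> Maybe Bool
    step True ' ' = Nothing
    step _    ' ' = Just True
    step _    _   = Just False

-- Assumption: trailing spaces carry no meaning, so strip them from the field.
trimmedField :: Parser S.ByteString
trimmedField = fst . S.spanEnd (== ' ') <$> parseStringField

main :: IO ()
main = do
  print (parseOnly parseMDY "12/06/2013")
  -- prints: Right (Date {dYear = 2013, dMonth = 12, dDay = 6})
  print (parseOnly trimmedField "SMITH, JOHN   43")
  -- prints: Right "SMITH, JOHN"
```

If <code>trimmedField</code> replaced <code>parseStringField</code> inside <code>lineParser</code>, the delimiter parsers would not need to change, since trimming only affects the returned value and not how much input is consumed.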
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POParsing Printable Text File in Haskell
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USJ. Abrahamson
UserOwnerUserId
1. USJ. Abrahamson
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COThank you for the excellent answer. I was, unfortunately, unable to get it to work. I kept getting complaints about type conversions between Char, ByteString, [Char], and String. But I was eventually able to get the code to compile, starting from the simpler answer I marked as accepted. Thank you again!
 singulars
 PostPostId
 PO
 UserUserId
 USJeff Maner
2. COIf the compiler is confusing `ByteString`, `[Char]`, and `String`, the most likely cause is the `OverloadedStrings` pragma.
 singulars
 PostPostId
 PO
 UserUserId
 USJ. Abrahamson
