Use Scala parser combinator to parse CSV files
<p>I'm trying to write a CSV parser using Scala parser combinators. The grammar is based on <a href="http://tools.ietf.org/html/rfc4180#page-2">RFC4180</a>. I came up with the following code. It almost works, but I cannot get it to separate records correctly. What did I miss?</p>
<pre><code>import scala.language.postfixOps // newer Scala versions require this for the postfix ? and *
import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ =&gt; "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = ((record~((CRLF~&gt;record)*))&lt;~(CRLF?)) ^^ {
    case r~rs =&gt; r::rs
  }
  def record: Parser[List[String]] = (field~((COMMA~&gt;field)*)) ^^ {
    case f~fs =&gt; f::fs
  }
  def field: Parser[String] = escaped|nonescaped
  def escaped: Parser[String] = (DQUOTE~&gt;((TXT|COMMA|CR|LF|DQUOTE2)*)&lt;~DQUOTE) ^^ {
    case ls =&gt; ls.mkString("")
  }
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls =&gt; ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) =&gt; res
    case _ =&gt; List[List[String]]()
  }
}

println(CSV.parse(""" "foo", "bar", 123""" + "\r\n" +
  "hello, world, 456" + "\r\n" +
  """ spam, 789, egg"""))

// Output: List(List(foo, bar, 123hello, world, 456spam, 789, egg))
// Expected: List(List(foo, bar, 123), List(hello, world, 456), List(spam, 789, egg))
</code></pre>
<h1>Update: problem solved</h1>
<p>By default, RegexParsers skips whitespace (spaces, tabs, carriage returns, and line feeds) before every token, using the regular expression <code>\s+</code>. That is why the parser above cannot separate records: the line breaks between them are skipped as whitespace, so the CRLF separator never matches. We need to disable skipWhitespace mode. Redefining whiteSpace as just <code>[ \t]</code> does not solve the problem, because the parser then drops all spaces inside fields ("foo bar" in the CSV becomes "foobar"), which is not what we want.
The updated parser source is therefore:</p>
<pre><code>import scala.language.postfixOps // newer Scala versions require this for the postfix ? and *
import scala.util.parsing.combinator._

// A CSV parser based on RFC4180
// http://tools.ietf.org/html/rfc4180
object CSV extends RegexParsers {
  override val skipWhitespace = false   // meaningful spaces in CSV

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ =&gt; "\"" }  // combine 2 dquotes into 1
  def CRLF    = "\r\n" | "\n"
  def TXT     = "[^\",\r\n]".r
  def SPACES  = "[ \t]+".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) &lt;~ (CRLF?)

  def record: Parser[List[String]] = repsep(field, COMMA)

  def field: Parser[String] = escaped|nonescaped

  // A quoted field; blanks around the quotes are dropped
  def escaped: Parser[String] = {
    ((SPACES?)~&gt;DQUOTE~&gt;((TXT|COMMA|CRLF|DQUOTE2)*)&lt;~DQUOTE&lt;~(SPACES?)) ^^ {
      case ls =&gt; ls.mkString("")
    }
  }

  // An unquoted field; leading and trailing blanks are kept as part of the value
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls =&gt; ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) =&gt; res
    case e =&gt; throw new Exception(e.toString)
  }
}
</code></pre>
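<p>To see the whitespace behaviour described above in isolation, here is a minimal sketch. The names <code>FieldGrammar</code>, <code>DefaultSkipping</code>, <code>BlankSkipping</code>, <code>NoSkipping</code> and <code>WhitespaceDemo</code> are made up for this illustration and are not part of the CSV parser; the sketch assumes the scala-parser-combinators library is on the classpath. It runs the same one-field grammar under the three whitespace settings discussed.</p>

```scala
import scala.util.parsing.combinator._

// Hypothetical demo grammar (not part of the CSV parser above):
// a field is a run of characters other than comma, CR and LF.
trait FieldGrammar extends RegexParsers {
  def ch: Parser[String] = "[^,\r\n]".r
  def field: Parser[String] = rep(ch) ^^ (_.mkString)
}

// Default settings: whiteSpace is \s+, skipped before every token.
object DefaultSkipping extends FieldGrammar

// Only blanks and tabs count as skippable whitespace.
object BlankSkipping extends FieldGrammar {
  override val whiteSpace = "[ \t]+".r
}

// No whitespace skipping at all.
object NoSkipping extends FieldGrammar {
  override val skipWhitespace = false
}

object WhitespaceDemo {
  val input = "foo bar\r\nbaz"

  def main(args: Array[String]): Unit = {
    // The line break is skipped, so the field runs into the next line.
    println(DefaultSkipping.parse(DefaultSkipping.field, input).get) // foobarbaz
    // The field stops at the line break, but the inner space is lost.
    println(BlankSkipping.parse(BlankSkipping.field, input).get)     // foobar
    // The field stops at the line break and keeps its space.
    println(NoSkipping.parse(NoSkipping.field, input).get)           // foo bar
  }
}
```

<p>The first result mirrors the <code>123hello</code> run-together in the original output, and the second shows why narrowing <code>whiteSpace</code> alone is not enough.</p>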
 
