To quickly extract the values for a particular key, I personally like to use `grep -o`, which only returns the regex's match. For example, to get the "text" field from tweets, something like:

```
grep -Po '"text":.*?[^\\]",' tweets.json
```

This regex is more robust than you might think; for example, it deals fine with strings that have embedded commas and escaped quotes inside them. I think with a little more work you could make one that is actually guaranteed to extract the value, if it's atomic. (If it has nesting, then a regex can't do it, of course.)

And to further clean it up (albeit keeping the string's original escaping) you can use something like: `| perl -pe 's/"text"://; s/^"//; s/",$//'`. (I did this for [this analysis](https://gist.github.com/1024217).)

To all the haters who insist you should use a real JSON parser -- yes, that is essential for correctness, but

1. To do a really quick analysis, like counting values to check on data-cleaning bugs or to get a general feel for the data, banging out something on the command line is faster. Opening an editor to write a script is distracting.
2. `grep -o` is orders of magnitude faster than the Python standard `json` library, at least when doing this for tweets (which are ~2 KB each). I'm not sure whether this is just because `json` is slow (I should compare to yajl sometime); but in principle, a regex should be faster, since it's finite-state and much more optimizable, whereas a parser has to support recursion and, in this case, spends lots of CPU building trees for structures you don't care about. (If someone wrote a finite-state transducer that did proper, depth-limited JSON parsing, that would be fantastic! In the meantime we have `grep -o`; see the sketches at the end of this answer.)

To write maintainable code, I always use a real parsing library. I haven't tried [jsawk](https://github.com/micha/jsawk), but if it works well, that would address point #1.

One last, wackier solution: I wrote a script that uses the Python `json` module and extracts the keys you want, into tab-separated columns; then I pipe that through a wrapper around `awk` that allows named access to columns. [In here: the json2tsv and tsvawk scripts](https://github.com/brendano/tsvutils). So for this example it would be:

```
json2tsv id text < tweets.json | tsvawk '{print "tweet " $id " is: " $text}'
```

This approach doesn't address point #2, is less efficient than a single Python script, and is a little brittle: it forces normalization of newlines and tabs in string values, to play nice with awk's field/record-delimited view of the world. But it does let you stay on the command line, with more correctness than `grep -o`.
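
For concreteness, here is a minimal Python sketch of the kind of finite-state extraction point #2 wishes for. The function name and structure are my own illustration (not how `grep -o` works internally); it handles only atomic string values, assumes compact `"key":"value"` formatting (no space after the colon), and assumes the occurrence of the key it finds really is the field, not text embedded inside another value:

```
def extract_string_value(line, key):
    """Copy characters after '"key":"' until an unescaped closing quote.

    A two-state scan (normal vs. just-saw-backslash): no recursion,
    no tree-building. Returns the raw, still-escaped string, or None.
    """
    needle = '"%s":"' % key
    start = line.find(needle)
    if start == -1:
        return None
    i = start + len(needle)
    out = []
    while i < len(line):
        c = line[i]
        if c == '\\':               # escape: keep the backslash pair verbatim
            out.append(line[i:i + 2])
            i += 2
        elif c == '"':              # unescaped quote ends the value
            return ''.join(out)
        else:
            out.append(c)
            i += 1
    return None                     # unterminated string

# One-object-per-line usage, e.g.:
# for line in open('tweets.json'):
#     print(extract_string_value(line, 'text'))
```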
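
And if you want to put a number on the speed claim in point #2 for your own data, here is a rough, hypothetical `timeit` comparison on a synthetic tweet-sized record (the record and the exact speedup are illustrative; results will vary by machine and data):

```
import json
import re
import timeit

# A synthetic, roughly 2 KB tweet-like record; real tweets differ.
record = json.dumps({'id': 123,
                     'text': 'a tweet, with "quotes" and a \\ backslash',
                     'padding': 'x' * 2000})
pattern = re.compile(r'"text":.*?[^\\]",')

n = 10000
t_json = timeit.timeit(lambda: json.loads(record)['text'], number=n)
t_re = timeit.timeit(lambda: pattern.search(record), number=n)
print('json.loads: %.3fs   regex search: %.3fs' % (t_json, t_re))
```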