<p>For this answer I'll be using this as my sample input:</p>
<pre><code>Hello, my ;name is Holmes.
This is a test, of a question on SO.
Holmes, again.
</code></pre>
<p>When I'm writing a script for the first time, I find it really helpful to <a href="http://pig.apache.org/docs/r0.10.0/test.html#describe" rel="nofollow"><code>DESCRIBE</code></a> and <a href="http://pig.apache.org/docs/r0.10.0/test.html#dump" rel="nofollow"><code>DUMP</code></a> each step with some sample data so I know exactly what is happening. Doing that with your script shows:</p>
<pre><code>A = load './SherlockHolmes.txt' using PigStorage(' ');
-- Schema for A unknown.
-- (Hello,,my,;name,is,Holmes.)
-- (This,is,a,test,,of,a,question,on,SO.)
-- (Holmes,,again.)
</code></pre>
<p>So the output from <code>A</code> is a 'tuple' (really it is a schema) with an unknown number of values. Generally, if you don't know how many values are in a tuple, you should use a <a href="http://pig.apache.org/docs/r0.10.0/basic.html#bag" rel="nofollow">bag</a> instead.</p>
<pre><code>B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(LOWER((chararray)$0),'([A-Za-z]+)')) as word;
-- B: {word: bytearray}
-- ()
-- (this)
-- ()
</code></pre>
<p>When you use <code>$0</code> you are referring not to all of the words in the schema, but only to the first word. So you are only applying <code>LOWER</code> and <code>REGEX_EXTRACT_ALL</code> to the first word. The empty results appear because the pattern has to match the whole string, and the trailing comma in <code>hello,</code> and <code>holmes,</code> prevents that. Also, note that the <a href="http://pig.apache.org/docs/r0.10.0/basic.html#flatten" rel="nofollow"><code>FLATTEN</code></a> operator is being applied to a tuple, which does not produce the output that you want. You want to <code>FLATTEN</code> a bag.</p>
<p><code>C</code>, <code>D</code>, and <code>E</code> should all work as you expect, so it is all about massaging the data into a format that they can use.</p>
<p>Knowing this, you can do it like this:</p>
<pre><code>-- Load in the line as a chararray so that TOKENIZE can convert it into a bag
A = load './tests/sh.txt' AS (foo:chararray);
B1 = FOREACH A GENERATE TOKENIZE(foo, ' ') AS tokens: {T:(word: chararray)} ;
-- Output from B1:
-- B1: {tokens: {T: (word: chararray)}}
-- ({(Hello,),(my),(;name),(is),(Holmes.)})
-- ({(This),(is),(a),(test,),(of),(a),(question),(on),(SO.)})
-- ({(Holmes,),(again.)})

-- Now inside a nested FOREACH we apply the appropriate transformations.
B2 = FOREACH B1 {
    -- Inside a nested FOREACH you can go over the contents of a bag
    cleaned = FOREACH tokens GENERATE
              -- The .*? are needed to match (and discard) the leading and trailing punctuation.
              FLATTEN(REGEX_EXTRACT_ALL(LOWER(word),'.*?([a-z]+).*?')) as word ;
    -- cleaned is a bag, so when we FLATTEN it we get one word per line
    GENERATE FLATTEN(cleaned) ;
}
</code></pre>
<p>So now the output of <code>B2</code> is:</p>
<pre><code>B2: {cleaned::word: bytearray}
(hello)
(my)
(name)
(is)
(holmes)
(this)
(is)
(a)
(test)
(of)
(a)
(question)
(on)
(so)
(holmes)
(again)
</code></pre>
<p>Feeding this into <code>C</code>, <code>D</code>, and <code>E</code> will give the desired output.</p>
<p>Let me know if you need me to clarify anything.</p>
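<p>If you want to check the two regular expressions without running Pig, here is a minimal Python sketch. It assumes that Python's <code>re.fullmatch</code> is a fair stand-in for the whole-string match that <code>REGEX_EXTRACT_ALL</code> performs; <code>extract_word</code> is a hypothetical helper written for this illustration, not a Pig function.</p>

```python
import re

# Pattern from the original script: only matches tokens made purely of letters.
STRICT = r"([a-z]+)"
# Pattern from the corrected script: lazy wildcards absorb surrounding punctuation.
LENIENT = r".*?([a-z]+).*?"


def extract_word(token, pattern):
    """Hypothetical helper: lower-case the token, then require the pattern
    to match the whole string (assumed to mirror REGEX_EXTRACT_ALL)."""
    match = re.fullmatch(pattern, token.lower())
    return match.group(1) if match else None


# First line of the sample input, split on single spaces like TOKENIZE(foo, ' ').
tokens = "Hello, my ;name is Holmes.".split(" ")

strict = [extract_word(t, STRICT) for t in tokens]
lenient = [extract_word(t, LENIENT) for t in tokens]

print(strict)
print(lenient)
```

<p>With the strict pattern, every token that carries punctuation fails to match; with the lenient pattern, each token yields its first run of letters.</p>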