<p>(I really hope you mean a unix/linux based shell)</p>
<p>It will help in the future to get a reasonable answer if you edit your post to include examples of expected inputs (2 small sample files would be perfect) AND the output you need for your 'bag of words'. Also, you are allowed to use 5 tags on a question, so indicate the basic OS (unix/linux/Windows/other) and a possible programming language. Note that when you hover over a tag, you'll see how many followers it has. Don't waste valuable tags on something with only a few followers (not that you have done that). The more followers, the more likely it is that someone who can help will see your question.</p>
<p>That said, the 2 data sets you included in your original message and a comment, plus my best guess at 'bag of words', lead me to propose the following:</p>
<pre><code>cat carFile
other stuff
Keywords: engine, motor, car
other stuff

cat cameraFile
other stuff
Keywords: photo, camera, color
more other stuff
Keywords: road, highway, oinker
final other stuff

awk '{
  if ($0 ~ /Keywords:/) {
    line=$0
    sub(/Keywords: /, "", line)
    array[FILENAME] = array[FILENAME] ? array[FILENAME] ", " line : line
  }
}
END {
  for (key in array) {
    printf("%s:\t%s\n", key, array[key])
  }
}
' carFile cameraFile
</code></pre>
<p>output</p>
<pre><code>carFile:        engine, motor, car
cameraFile:     photo, camera, color, road, highway, oinker
</code></pre>
<p>Note that I have deliberately put a second Keywords line, with car-related terms, in the cameraFile. The idea is that any file you include as an argument to the awk script is processed, and any line with 'Keywords:' is added to the list <strong>based on the input filename</strong>. awk does not guarantee the order in which <code>for (key in array)</code> visits the keys, so the two output lines may appear in either order.</p>
<p>Note also that you could easily change the output to show just the values retrieved from the Keywords lines (without displaying the source file name) by dropping the key from the output processing in the END block, like</p>
<pre><code>  for (key in array) {
    printf("%s\n", array[key])
  }
</code></pre>
<hr>
<p><strong>some details on awk processing</strong></p>
<p>FILENAME is an automatically supplied awk variable that holds the name of the file currently being processed.</p>
<p>array is a user-defined name for an awk associative array. It could have been 'a' or 'arr' or any name that meets the variable naming convention for awk (the same as all C language derived var name rules).</p>
<p>sub( ... ) is the awk function for 'substitute'. I have copied the input line '$0' to a var called line, and then deleted the 'Keywords: ' part of that copy.</p>
<p>awk runs the code inside the initial '{ ... }' block once for every input line, as an implicit loop.</p>
<p>We scan for keyword lines with <code>if ($0 ~ /Keywords:/)</code> and then process only those lines in the conditional block.</p>
<p>The <code>END { ... }</code> block is 'run' only after all input files have been read. In this case, we loop over the keys of the array and print out key/value pairs. Because we append to the existing array value (the <code>array[FILENAME] = ...</code> assignment), you get both sets of keywords showing up for the cameraFile.</p>
<p>I hope this helps.</p>
<p>P.S. Welcome to StackOverflow (S.O.). Please remember to read the FAQs, <a href="http://tinyurl.com/2vycnvr" rel="nofollow">http://tinyurl.com/2vycnvr</a> , vote for good Q/A by using the gray triangles, <a href="http://i.imgur.com/kygEP.png" rel="nofollow">http://i.imgur.com/kygEP.png</a> , and to accept the answer that best solves your problem, if any, by pressing the checkmark sign, <a href="http://i.imgur.com/uqJeW.png" rel="nofollow">http://i.imgur.com/uqJeW.png</a></p>
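<p>If 'bag of words' turns out to mean a count of each individual keyword per file rather than one joined list, the same FILENAME idea can be extended with awk's split(). The following is only a sketch under that assumption: the helper name <code>count_keywords</code>, the array names <code>words</code> and <code>count</code>, and the scratch directory are mine, not part of the question, and the sample files are the two shown above.</p>

```shell
#!/bin/sh
# Recreate the two sample files from the answer in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'other stuff' 'Keywords: engine, motor, car' 'other stuff' > carFile
printf '%s\n' 'other stuff' 'Keywords: photo, camera, color' 'more other stuff' \
    'Keywords: road, highway, oinker' 'final other stuff' > cameraFile

# count_keywords is a hypothetical helper name used only for this sketch.
# It prints one "file<TAB>keyword<TAB>count" line per distinct keyword.
count_keywords() {
    awk '
    /Keywords:/ {
        line = $0
        sub(/.*Keywords: */, "", line)      # drop the label, keep the list
        n = split(line, words, /, */)       # one array element per keyword
        for (i = 1; i <= n; i++)
            if (words[i] != "")
                count[FILENAME "\t" words[i]]++
    }
    END {
        for (key in count)
            printf("%s\t%d\n", key, count[key])
    }' "$@" | LC_ALL=C sort                 # for-in order is unspecified, so sort
}

count_keywords carFile cameraFile
```

<p>With the sample files every keyword occurs once, so each count is 1. A keyword repeated on a later Keywords line in the same file would simply increment its existing entry.</p>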