    <p>I don't think gawk can necessarily handle this 500 MB file, but you can accomplish the same thing in most languages (Java, Perl, etc.). I'm also assuming that just listing the edges is fine without recombining them into a single line.</p>
    <p>I'll assume an input file ("input") like this (where the URLs are names now):</p>
    <pre><code>joe;proj2,proj9,pro8
erin;proj1,proj7,pro8
sue;proj47,pro7
elmo;pro7
</code></pre>
    <p>Run the following:</p>
    <pre><code>cat input | gawk -F"," '{
    split( $1, arr, ";" );                # arr[1] = user, arr[2] = first project
    printf( "%s;%s\n", arr[1], arr[2] );
    for( i = 2; i &lt;= NF; i++ ) {       # remaining projects on the line
        printf( "%s;%s\n", arr[1], $i );
    }
}'
</code></pre>
    <p>which produces the following output:</p>
    <pre><code>joe;proj2
joe;proj9
joe;pro8
erin;proj1
erin;proj7
erin;pro8
sue;proj47
sue;pro7
elmo;pro7
</code></pre>
    <p>These appear to be edges according to <a href="https://gephi.org/users/supported-graph-formats/csv-format/" rel="nofollow">gephi csv-formats</a>.</p>
    <p>Or you could further process the output.</p>
    <p>What I don't remember is whether or not awk/gawk consumes the whole file up front. If the output's alright, then processing each line into multiple lines would work in many other languages.</p>
    <hr>
    <p>Okay - while I still think my first attempt is more useful, here's a 2nd version where everything is slightly rewritten and which includes a new gawk script that writes a dot file digraph (which lets me check it out in yEd - see comment).</p>
    <p>"input" file like this (typos in the first attempt dropped the "j" in "proj"):</p>
    <pre><code>joe;proj2,proj9,proj8
erin;proj1,proj7,proj8
sue;proj47,proj7
elmo;proj7
</code></pre>
    <p>Make an executable file out of the following (I'll call it "explode"):</p>
    <pre><code>gawk -F"," '
{
    count = split( $1, sc, ";" );          # sc[1] = user, sc[2] = first project
    printf( "%s %s\n", sc[2], sc[1] );     # project first, so sort groups by project
    for( i = 2; i &lt;= NF; i++ ) {
        printf( "%s %s\n", $i, sc[1] );
    }
}
'
</code></pre>
    <p>which changes the output of the original gawk because it's going to be sorted and fed to the next gawk script (another executable file; I called this one "combine"):</p>
    <pre><code>gawk -F" " '
BEGIN {
    printf( "digraph similar_users (\n" );
    project_name = "";
    users = "";
}
{
    # exact comparison: a regex match ( ~ ) would treat proj47 as part of proj4
    if( $1 == project_name )
        build_users_string( $2 );
    else {
        # project changed: emit the finished group, then start a new one
        make_user_nodes( users );
        users = "";
        build_users_string( $2 );
    }
    project_name = $1
}
function build_users_string( u ) {
    users = sprintf( "%s%s%s", users, length( users ) == 0 ? "" : " ", u );
}
function make_user_nodes( u ) {
    # a project with fewer than two users produces no edges
    if( (count = split( u, arr, " " )) &lt;= 1 )
        return 1;
    for( i = 1; i &lt;= count; i++ ) {
        printf( "%s -&gt; ", arr[ i ] );
        for( j = 1; j &lt;= count; j++ )
            if( j != i ) {
                end = (i == count ) ? count-1 : count;
                printf( "%s%s", arr[j], j != end ? "," : ";\n" );
            }
    }
    return( count );
}
END {
    make_user_nodes( users );
    printf( ")\n" );
}
'
</code></pre>
    <p>which reads the sorted output of the "explode" script and, when run as follows:</p>
    <pre><code>cat input | explode | sort | combine &gt; output.dot
</code></pre>
    <p>produces the file "output.dot", where each line reads "user -&gt; list of users who were associated with the same project":</p>
    <pre><code>digraph similar_users (
elmo -&gt; erin,sue;
erin -&gt; elmo,sue;
sue -&gt; elmo,erin;
erin -&gt; joe;
joe -&gt; erin;
)
</code></pre>
    <p>The memory used should only be as bad as the sort plus the largest project/user conversion in the last script, because it only processes the users grouped by project whenever the project name changes. A more memory-intensive "single pass" would put all users in a map of proj -&gt; (list of users) and do all the processing of the map at the end. Notice that a project with only a single user produces no edges, so a user who shares no project with anyone is dropped.</p>
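    <p>For comparison, here is a minimal sketch of the more memory-intensive "single pass" idea: build a map of proj -&gt; (list of users) and process it at the end. The function name <code>similar_users</code> and the array name <code>members</code> are made up for illustration. The sketch assumes no user lists the same project twice, is written for plain awk (so it should also run under gawk), and prints only the edge lines, not the digraph wrapper. Since awk walks the map in no particular order, the result is sorted to make it repeatable, and users within a line appear in input order rather than sorted order.</p>

```shell
# Hypothetical helper: single-pass variant of the explode | sort | combine pipeline.
# Reads "user;proj,proj,..." lines on stdin and prints one
# "user -> other,other;" line per user for every project shared by 2+ users.
similar_users() {
    awk -F';' '
    {
        # $1 is the user, $2 the comma-separated project list
        n = split($2, projs, ",")
        for (i = 1; i <= n; i++) {
            p = projs[i]
            # append this user to the space-separated list kept for project p
            members[p] = (p in members) ? (members[p] " " $1) : $1
        }
    }
    END {
        for (p in members) {
            count = split(members[p], u, " ")
            if (count < 2) continue       # a lone user has nobody to point to
            for (i = 1; i <= count; i++) {
                line = u[i] " -> "
                sep = ""
                for (j = 1; j <= count; j++) {
                    if (j == i) continue
                    line = line sep u[j]
                    sep = ","
                }
                print line ";"
            }
        }
    }' | LC_ALL=C sort
}
```

    <p>Usage would look like <code>similar_users &lt; input</code>. The whole map stays in memory until the end, which is the trade-off against the sort-based pipeline.</p>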