I thought this would be a fun little toy project, so I wrote a little hack to read an input file like yours from stdin, count and format the output recursively, and spit out output that looks a little like yours, but in a nested format, like so:

```
Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)
```

I also added some junk data to my table so that I could see an extra entry in Cluster 0, separated from the other two. The idea here is that you should be able to see a top-level "Cluster" entry and then nested, indented entries for genus, species, and family. It shouldn't be hard to extend for deeper trees, either, I hope.

```python
# sys for stdio stuff
import sys
# re for re.split -- this can go if you find another way to parse your data
import re

# A global (shame on me) for storing the data we're going to parse from stdin
data = []

# Read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on whitespace (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of the input (strip)
    entry = re.split(r"\s+", line.strip())
    # Throw the array into my global data array; it'll look like this:
    # [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)

# Sort: the first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert it to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of the keys we care about for each item.
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))

# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# Parameters:
#   start - an integer telling us what line to begin working from, so we needn't
#           walk the whole tree each time to figure out where we are.
#   super - an array that captures where we are in the search. This array
#           has more elements in it as we deepen our traversal of the "tree".
#           Initially it is [], in the next ply of the tree it is [ '0' ],
#           then something like [ '0', 'Brucella' ], and so on.
#   data  - the global data structure -- this never mutates after the sort
#           above, so I could have just used the global directly.
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth = len(super)

    # Count entries in the super class, starting from the "start" index in the
    # array: for the records in the data file that match our "super" exactly,
    # we count occurrences.
    count = 0
    if depth != 0:
        for i in range(start, len(data)):
            if data[i][0:depth] == data[start][0:depth]:
                count = count + 1
            else:
                # We can stop counting as soon as a match fails,
                # because of the way our input data is sorted
                break
    else:
        count = len(data)

    # At depth == 1 we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if depth == 1:
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif depth > 0:
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) + data[start][depth - 1] +
                         '(' + str(count) + ')\n')

    # Recursion: figure out a new depth and a new "super" and then call
    # ourselves again. We break out at depth == 4 because of one other
    # assumption (I lied before about the abstract thing) I'm making about our
    # input data here. This could be made more robust/flexible without a lot
    # of work.
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i]  # The original record from our data
        newdepth = depth + 1
        if newdepth > 4:
            break
        # Array slicing creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
        # Track our new "subsuper" for subsequent comparisons
        # as we loop through matches
        subsuper = newsuper
        # Track our position in the data for the next recursion, so we can
        # start on the right line
        substart = substart + 1

# The first call to groupedReport starts the recursion
groupedReport(0, [], data)
```

If you save my Python code to a file like "classifier.py", you can run your input.txt file (or whatever you call it) through it like so:

```
cat input.txt | python classifier.py
```

Most of the magic of the recursion, if you care, is implemented using slices of arrays: it leans heavily on the ability to compare array slices, as well as on the fact that I can order the input data meaningfully with my sort routine. You may want to convert your input data to all-lowercase if it is possible that case inconsistencies could yield mismatches.
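To illustrate those last two points, here is a minimal sketch, separate from the script above: it shows one possible way to lowercase each field while reading (the list comprehension is my suggestion, not something the script already does), and it demonstrates the element-by-element list-slice comparison the recursion relies on.

```python
import re
import sys

data = []
for line in sys.stdin:
    # Lowercase every field as it's read, so "Brucella" and "brucella"
    # end up grouped together (assumes case carries no meaning in your data)
    entry = [field.lower() for field in re.split(r"\s+", line.strip())]
    data.append(entry)

# List slices compare element by element, which is what groupedReport
# relies on when it checks data[i][0:depth] == data[start][0:depth]
print(["0", "brucella"] == ["0", "brucella"])             # True
print(["0", "brucella"][0:1] == ["1", "brucella"][0:1])   # False
```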