StackOverflow2013

plurals

PO
primarykey
Id: 14288877
data
AcceptedAnswerId: 0
AnswerCount: 0
ClosedDate:
CommentCount: 18
CommunityOwnedDate:
CreationDate: 2013-01-12T00:30:42.280
FavoriteCount: 0
LastActivityDate: 2013-01-17T19:25:02.100
LastEditDate: 2013-01-17T19:25:02.100
LastEditorUserId: 1955509
OwnerUserId: 1955509
ParentId: 14288669
PostTypeId: 2
Score: 0
ViewCount: 0
LastEditorDisplayName:
text
Body
EDIT

After your comments below, I think this is what you want to do. I've left the original post below in case anything in it is useful to you.

First, this code reads every separate synonym from file1 into a <code>set</code>. This is a useful structure because it automatically removes duplicates and is very fast for lookups. It's like a dictionary with only keys and no values. If you don't want to remove duplicates, we'll need to change things slightly.

<pre><code>file1_data = set()
with open("file1.txt", "r") as fd:
    for line in fd:
        file1_data.update(i.strip() for i in line.split("///") if i.strip())
</code></pre>

Then you want to run through file2 looking for matches:

<pre><code>with open("file2.txt", "r") as in_fd:
    with open("output.txt", "w") as out_fd:
        for line in in_fd:
            # Strip the newline here so it is written back exactly once per output line
            items = line.rstrip("\n").split("\t")
            if len(items) < 5:
                # This is so we don't crash if we find a line that's too short
                continue
            synonyms = set(i.strip() for i in items[4].split("|"))
            overlap = synonyms & file1_data
            if overlap:
                # Build string of columns from file2, stripping out 5th column.
                output_str = "\t".join(items[:4] + items[5:])
                for item in overlap:
                    out_fd.write("\t".join((item, output_str)) + "\n")
</code></pre>

So what this does is open file2 and an output file. It goes through each line in file2 and first checks that it has enough columns to at least have a column 5. If not, it ignores that line (you might want to print an error). Then it splits column 5 by <code>|</code> and builds a <code>set</code> from that list (called <code>synonyms</code>). The <code>set</code> is useful because we can very quickly find its intersection with the earlier set of all the synonyms from file1. This intersection is stored in <code>overlap</code>. We then check whether there was any overlap. If not, we ignore this line because no synonym was found in file1.
This check is mostly for speed, so we don't bother building the output string if we're not going to use it for this line. If there was an overlap, we build a string containing the full list of columns to append to the synonym. We can build it once even if there are multiple matches, because it is the same for every match: it all comes from the line in file2. This is faster than rebuilding it for each match. Then, for each synonym that matched in file1, we write an output line consisting of the synonym, a tab, and the rest of the line from file2. Because we split by tabs, we have to put them back in with <code>"\t".join(...)</code>.

This assumes I am correct that you want to remove column 5. If you do not want to remove it, it's even easier, because you can just use the line from file2 with the trailing newline stripped off.

Hopefully that's closer to what you need?

ORIGINAL POST

You don't give any indication of the size of the files, but I'm going to assume they're small enough to fit into memory. If not, your problem becomes slightly trickier.

So, the first step is probably to open file #2 and read in the data. You can do it with code something like this:

<pre><code>file2_data = {}
with open("file2.txt", "r") as fd:
    for line in fd:
        items = line.split("\t")
        file2_data[frozenset(i.strip() for i in items[0].split("|"))] = items[1:]
</code></pre>

This will create <code>file2_data</code> as a dictionary which maps the set of words in the first column of each line (stored as a <code>frozenset</code> so that it can be used as a dictionary key) on to a list of the remaining items on that line. You should also consider whether words can repeat and how you wish to handle that, as I mentioned in my earlier comment.
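If repeated word sets do occur, one possible way to handle them is to keep every row instead of letting a later line overwrite an earlier one. The following is only a sketch under that assumption; the function name <code>load_rows_by_key</code> and the sample lines are hypothetical and not taken from the question.

```python
from collections import defaultdict


def load_rows_by_key(lines):
    """Group the remaining columns of each line under its first-column word set.

    Lines whose first column holds the same set of words are all kept,
    in input order, instead of the last one overwriting the others.
    """
    rows_by_key = defaultdict(list)
    for line in lines:
        items = line.rstrip("\n").split("\t")
        # frozenset is hashable, so it can serve as a dictionary key
        key = frozenset(w.strip() for w in items[0].split("|") if w.strip())
        rows_by_key[key].append(items[1:])
    return rows_by_key


# Hypothetical sample lines standing in for the contents of file2.txt.
sample = ["a|b\t1\t2\n", "b|a\t3\t4\n", "c\t5\t6\n"]
grouped = load_rows_by_key(sample)
```

With this layout each dictionary value is a list of rows rather than a single row, so the loop that reads file1 would need to iterate over every stored row for a matching key.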
After this, you can then read the first file and attach the data to each word in that file:

<pre><code>with open("file1.txt", "r") as fd:
    with open("output.txt", "w") as fd_out:
        for line in fd:
            words = set(i.strip() for i in line.split("///"))
            # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
            for file2_words, file2_cols in file2_data.items():
                overlap = file2_words & words
                if overlap:
                    row = "///".join(overlap) + "\t" + "\t".join(file2_cols)
                    # Make sure each output row ends with exactly one newline
                    fd_out.write(row.rstrip("\n") + "\n")
</code></pre>

What you should end up with is an <code>output.txt</code> in which each row corresponds to a pair of lines, one from each file, whose word lists had at least one word in common. The first item is the words in common, separated by <code>///</code>. The other columns in that output file are the other columns from the matched row in file #2. If that's not what you want, you'll need to be a little more specific.

As an aside, there are probably more efficient ways to do this than the O(N^2) approach I outlined above (i.e. it runs across one entire file as many times as there are rows in the other), but that requires more detailed information on how you want to match the lines. For example, you could construct a dictionary mapping a word to a list of the rows in which that word occurs. This makes it a lot faster to check for matching rows than the complete scan performed above. It is rendered slightly fiddly by the fact that you seem to want the overlaps between the rows, however, so I thought the simple approach outlined above would be sufficient without more specifics.
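To illustrate the word-to-rows dictionary idea, here is a minimal sketch. It assumes the rows of file #2 have already been parsed into (set of words, remaining columns) pairs; the helper names <code>build_word_index</code> and <code>find_overlaps</code> and the sample data are hypothetical, not part of the original code.

```python
from collections import defaultdict


def build_word_index(file2_rows):
    """Map each word to the numbers of the file2 rows whose word set contains it."""
    index = defaultdict(list)
    for row_number, (words, _cols) in enumerate(file2_rows):
        for word in words:
            index[word].append(row_number)
    return index


def find_overlaps(file1_words, index):
    """Return (row_number, sorted common words) for every file2 row sharing a word."""
    common = defaultdict(set)
    for word in file1_words:
        # Only rows that actually contain this word are visited
        for row_number in index.get(word, ()):
            common[row_number].add(word)
    return [(n, sorted(ws)) for n, ws in sorted(common.items())]


# Hypothetical sample data standing in for the parsed contents of file2.
file2_rows = [
    ({"alpha", "beta"}, ["x1", "y1"]),
    ({"gamma"}, ["x2", "y2"]),
    ({"beta", "delta"}, ["x3", "y3"]),
]
index = build_word_index(file2_rows)
matches = find_overlaps({"beta", "delta", "omega"}, index)
```

Each line of file1 then costs work proportional to the number of matching rows instead of a full scan of file2. The overlap per row is collected as the matches are found, which is the slightly fiddly part mentioned above.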
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PORemoving unwanted characters in each line of a file then matching what is left to another file in Python
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USCartroo
UserOwnerUserId
1. USCartroo
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. PORemoving unwanted characters in each line of a file then matching what is left to another file in Python
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
CommentsPostId
