StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
2421378
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
6
CommunityOwnedDate
CreationDate
2010-03-10T23:09:45.410
FavoriteCount
0
LastActivityDate
2010-03-11T21:19:31.753
LastEditDate
2010-03-11T21:19:31.753
LastEditorUserId
84270
OwnerUserId
84270
ParentId
2420412
PostTypeId
2
Score
22
ViewCount
0
LastEditorDisplayName
text
Body
Before you read on, have you looked at <a href="http://biopython.org/wiki/Main_Page" rel="noreferrer">biopython</a>? It appears that you want to find approximate matches with one substitution error, and zero insertion/deletion errors i.e. a Hamming distance of 1. If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match: <pre><code>any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome))) </code></pre> but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error (2) after failure, it advances the cursor by one rather than skipping ahead based on what it saw (like a Boyer-Moore search does). You can overcome (1) with a function like this: <pre><code>def Hamming_check_0_or_1(genome, posn, sequence): errors = 0 for i in xrange(25): if genome[posn+i] != sequence[i]: errors += 1 if errors >= 2: return errors return errors </code></pre> Note: that's intentionally not Pythonic, it's Cic, because you'd need to use C (perhaps via Cython) to get reasonable speed. Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4. When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language. Update: Working code for a bit-parallel method I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case. The bit-parallel method has these features: guaranteed linear behaviour O(N) where N is text length. Note naive method is O(NM) as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C. Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem. Details from <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854" rel="noreferrer">this technical report</a>. The algorithm there is for approximate search usin Levenshtein distance. To convert to using Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of reference to "R" with a "d" superscript. "d" is distance. We need only 0 and 1. These "R"s become the R0 and R1 variables in the code below. <pre><code># coding: ascii from collections import defaultdict import re _DEBUG = 0 # "Fast Text Searching with Errors" by Sun Wu and Udi Manber # TR 91-11, Dept of Computer Science, University of Arizona, June 1991. # http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854 def WM_approx_Ham1_search(pattern, text): """Generate (Hamming_dist, start_offset) for matches with distance 0 or 1""" m = len(pattern) S_table = defaultdict(int) for i, c in enumerate(pattern): S_table[c] |= 1 << i R0 = 0 R1 = 0 mask = 1 << (m - 1) for j, c in enumerate(text): S = S_table[c] shR0 = (R0 << 1) | 1 R0 = shR0 & S R1 = ((R1 << 1) | 1) & S | shR0 if _DEBUG: print "j= %2d msk=%s S=%s R0=%s R1=%s" \ % tuple([j] + map(bitstr, [mask, S, R0, R1])) if R0 & mask: # exact match yield 0, j - m + 1 elif R1 & mask: # match with one substitution yield 1, j - m + 1 if _DEBUG: def bitstr(num, mlen=8): wstr = "" for i in xrange(mlen): if num & 1: wstr = "1" + wstr else: wstr = "0" + wstr num >>= 1 return wstr def Ham_dist(s1, s2): """Calculate Hamming distance between 2 sequences.""" assert len(s1) == len(s2) return sum(c1 != c2 for c1, c2 in zip(s1, s2)) def long_check(pattern, text): """Naively and understandably generate (Hamming_dist, start_offset) for matches with distance 0 or 1""" m = len(pattern) for i in xrange(len(text) - m + 1): d = Ham_dist(pattern, text[i:i+m]) if d < 2: yield d, i def Paul_McGuire_regex(pattern, text): searchSeqREStr = ( '(' + pattern + ')|(' + ')|('.join( pattern[:i] + "[ACTGN]".replace(c,'') + pattern[i+1:] for i,c in enumerate(pattern) ) + ')' ) searchSeqRE = re.compile(searchSeqREStr) for match in searchSeqRE.finditer(text): locn = match.start() dist = int(bool(match.lastindex - 1)) yield dist, locn if __name__ == "__main__": genome1 = "TTTACGTAAACTAAACTGTAA" # 01234567890123456789012345 # 1 2 tests = [ (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"), ("T" * 10, "TTTT"), ("ACGTCGTAAAA", "TCGT"), # partial match can shadow an exact match ] nfailed = 0 for genome, patterns in tests: print "genome:", genome for pattern in patterns.split(): print pattern a1 = list(WM_approx_Ham1_search(pattern, genome)) a2 = list(long_check(pattern, genome)) a3 = list(Paul_McGuire_regex(pattern, genome)) print a1 print a2 print a3 print a1 == a2, a2 == a3 nfailed += (a1 != a2 or a2 != a3) print "***", nfailed </code></pre>
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POSearch for string allowing for one mismatch in any location of the string
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USJohn Machin
UserOwnerUserId
1. USJohn Machin
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POSearch for string allowing for one mismatch in any location of the string
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.