StackOverflow2013


plurals

PO
primarykey
Id
18905306
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2013-09-19T21:34:55.517
FavoriteCount
0
LastActivityDate
2013-09-23T21:32:12.617
LastEditDate
2013-09-23T21:32:12.617
LastEditorUserId
411326
OwnerUserId
411326
ParentId
18841541
PostTypeId
2
Score
1
ViewCount
0
LastEditorDisplayName
text
Body
My suggestion would be to generate a list of n-grams from each phrase and calculate the edit distance between each n-gram and the key phrase. Example:
<pre><code>key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"
</code></pre>
A possible matching n-gram would be between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the string by removing punctuation and lowercasing everything.
<pre><code>phrase 1 3-grams: "hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams: "hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"
phrase 2 3-grams: "my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams: "my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"
</code></pre>
Next, compute the Levenshtein distance between each n-gram and the key phrase; this should solve the use case you presented above. If you need to normalize each word further, you can use phonetic encoders such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case it didn't show a significant improvement; phonetic encoders are more suitable for names.
I have limited experience with PHP, but here is a code example:
<pre><code><?php
// Build every n-gram of $min_words to $max_words words from a phrase.
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    // Lowercase and split into words; str_word_count() drops most punctuation
    // but keeps apostrophes, so "wht's" stays a single word.
    $words = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);
    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

// Return true as soon as one n-gram is within edit distance 4 of the key.
function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
    "hi, my name is john doe. I live in new york. What is your name?",
    "My name is Bruce. wht's your name"
);
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams, $key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>
</code></pre>
And the output is something like this:
<pre>
Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]
</pre>
EDIT: I noticed some suggestions to add phonetic encoding to each word in the generated n-gram. I'm not sure phonetic encoding is the best answer to this problem, as the encoders are mostly tuned for names (American, German or French, depending on the algorithm) and are not very good at handling plain words.
I actually wrote a test to validate this in Java (as the encoders are more readily available there). Here is the output:
<pre>
===========================
Created new phonetic matcher
Engine: Caverphone2
Key Phrase: what is your name
Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
Engine: DoubleMetaphone
Key Phrase: what is your name
Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
Engine: Nysiis
Key Phrase: what is your name
Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
Engine: Soundex
Key Phrase: what is your name
Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
Engine: RefinedSoundex
Key Phrase: what is your name
Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true
</pre>
I used a Levenshtein distance of 4 when running these tests, but I am pretty sure you can find multiple edge cases where using a phonetic encoder will fail to match correctly. Looking at the example, you can see that because of the stemming-like reduction done by the encoders, you are actually more likely to get false positives when using them in this way. Keep in mind that these algorithms were originally intended to find people in a population census who have the same name, not to decide which English words 'sound' the same.
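For anyone who wants to try the phonetic variant without leaving PHP, here is a minimal sketch using PHP's built-in soundex() in place of the Java encoders. The helper name encode_phrase is made up for this illustration, and the threshold of 4 is simply the value quoted for the Java test, not a tuned constant.

```php
<?php
// Hypothetical helper (not part of the original answer): encode every word
// of a phrase with PHP's built-in soundex() and join the codes with spaces.
function encode_phrase(string $phrase): string {
    $words = str_word_count(strtolower($phrase), 1);
    return implode(' ', array_map('soundex', $words));
}

$key       = encode_phrase("what is your name"); // "W300 I200 Y600 N500"
$candidate = encode_phrase("wht's your name");   // one 4-character code per word

// Same edit-distance check as before, but on the encoded strings.
// Assumption: a maximum distance of 4, as in the Java test.
$distance = levenshtein($key, $candidate);

echo "key:       $key\n";
echo "candidate: $candidate\n";
echo "distance:  $distance\n";
echo $distance <= 4 ? "MATCH\n" : "NO MATCH\n";
```

Every Soundex code is four characters long, so dropping a single word already costs five edits (the code plus a space). That is why this 3-gram cannot pass a threshold of 4, which is consistent with the Soundex result in the Java output. PHP also ships metaphone() if you want to compare a second encoder the same way.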
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POString key phrase matching
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USAsaf
UserOwnerUserId
1. USAsaf
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POString key phrase matching
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTBountyClose
CommentsPostId
1. COthis is even better than iterating on all characters
 singulars
 PostPostId
 PO
 UserUserId
 USSnippet
