Most efficient way to index words in a document?
<p>This came up in another question, but I figured it is best to ask it as a separate question. Given a large list of sentences (on the order of hundreds of thousands):</p>
<pre><code>[
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
]
</code></pre>
<p>what is the best way to code the following function?</p>
<pre><code>def GetSentences(word1, word2, position):
    return ""
</code></pre>
<p>where, given two words <code>word1</code> and <code>word2</code> and a position <code>position</code>, the function should return the list of all sentences satisfying that constraint (that is, sentences in which <code>word2</code> appears <code>position</code> words after <code>word1</code>). For example:</p>
<pre><code>GetSentences("sentence", "another", 3)
</code></pre>
<p>should return <code>1</code> and <code>3</code>, the indices of the matching sentences. My current approach uses a nested dictionary like this:</p>
<pre><code>from collections import defaultdict

# Index[word][word2][offset] -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # i + 1 is how many words after `word` the word `word2` appears
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)
</code></pre>
<p>But this quickly blows everything out of proportion on a dataset that is about 130 MB in size: my 48 GB of RAM is exhausted in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?</p>
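For comparison, here is a minimal sketch of one possible alternative, assuming a positional inverted index is acceptable. It is not taken from the question: the names `build_index` and `get_sentences` are hypothetical, and `position` is assumed to mean "`word2` occurs `position` words after `word1`", as in the example above. The nested dictionary in the question stores one entry per ordered word pair in each sentence, which grows quadratically with sentence length. The sketch below stores one entry per word occurrence and does the pairing at query time.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each word to {sentence index: [positions of that word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for sentence_index, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            index[word][sentence_index].append(position)
    return index


def get_sentences(index, word1, word2, position):
    """Return sorted indices of sentences where word2 occurs
    `position` words after word1 (assumed meaning of `position`)."""
    # .get avoids creating empty entries for unknown words.
    postings1 = index.get(word1, {})
    postings2 = index.get(word2, {})
    result = []
    # Only sentences that contain both words can match.
    for sentence_index in postings1.keys() & postings2.keys():
        positions2 = set(postings2[sentence_index])
        if any(p + position in positions2 for p in postings1[sentence_index]):
            result.append(sentence_index)
    return sorted(result)


sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
index = build_index(sentences)
print(get_sentences(index, "sentence", "another", 3))  # [1, 3]
```

The trade-off is that each query now intersects two posting lists instead of doing a single dictionary lookup, so queries on very frequent words cost more. Whether that is acceptable depends on the query load, which the question does not specify.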
 
