StackOverflow2013


How to design Appengine Search API Index for shared documents?
<p>I'm trying to design a good schema for use with the App Engine Search API (Java) and would love some opinions given the following use case:</p> <p>In our applications, we want users to be able to search for objects of type Foo. A Foo object looks like:</p> <pre><code>{ groupId: "x1", name: "somename", someFieldA: "somevalue", someFieldB: "somevalue", someFieldC: "somevalue" } </code></pre> <p>However, a different Foo object could look like:</p> <pre><code>{ groupId: "x2", name: "somename", someFieldD: "somevalue", someFieldE: "somevalue", someFieldF: "somevalue" } </code></pre> <p><strong>The group IDs are important:</strong> </p> <ul> <li>the groupId field determines which properties each Foo object has (i.e. someFieldA, someFieldB, someFieldC only exist for Foos with a groupId of x1)</li> <li>each user only has access to Foos with certain group IDs</li> </ul> <p>So, the use case I want to solve is that a user should be able to search for Foos (by any of their fields), restricted to the Foos they have access to. Here are two solutions that both have drawbacks:</p> <p><strong>Solution 1:</strong></p> <ul> <li>Create one index for all Foos. </li> <li>The fields of this index are the superset of every field in every Foo. 
</li> <li>This works well because a user's search can be translated to: <code>userquery AND (groupId:X OR groupId:Y OR groupId:Z)</code></li> <li>It's also good because all Foos, regardless of their groupId, are ranked and sorted relative to each other.</li> <li><strong>I don't think this approach works though</strong>, because there is a 1000-field limit on every schema, and there could be enough groupIds that the superset of all the fields of all the Foos exceeds 1000 fields</li> </ul> <p><strong>Solution 2:</strong></p> <ul> <li>Create one index <strong>per groupId</strong></li> <li>A user's search is therefore translated into async searches (one per groupId that the user has access to), and the results must then be combined</li> <li>This has the benefit that we won't run into the 1000-field limit</li> <li>One downside is that this potentially costs more, as you're doing more than one query to the Search API</li> <li>The more important downside is that there does not seem to be an easy way to combine the results of each individual query. 
If each of the queries returns results, the score for each returned document is normalized relative to the other results of that query, so how would you combine results from different queries?</li> </ul> <p>It seems like solution 2 is the better option, but I can't figure out how to get around the issue of combining and ranking the results.</p> <p><strong>Any ideas?</strong></p> <p><strong>--UPDATE 1--</strong></p> <p>Here are some more concrete examples of what the documents would look like:</p> <pre><code>{ groupId:"Hiring Process", name: "Bob Smith", position: "Software Engineer", yearsOfExperience: 6 } { groupId:"Sales Process", name: "Frank J", company: "Engineering Engineer Inc.", contactInfo: "555-555-5555" } { groupId:"Hiring Process", name: "Jane Doe", position: "Marketing", yearsOfExperience: 3 } { groupId:"Sales Process", name: "Jane Moe", company: "Google", contactInfo: "666-666-6666" } </code></pre> <p>As you can see above, these documents represent people. Each object has a group ID of either "Sales Process" or "Hiring Process". Notice that the fields of a document differ depending on its groupId. A single user in our system has access to all of the information about all of the people in both processes. </p> <p>So let's say our user searches for <code>engineer</code>. That should return two results, one for <code>Bob Smith</code> and one for <code>Frank J</code>. However, the <code>Frank J</code> result should be ranked/scored higher because the word "engineer" appears twice in the document.</p> <p>Because the superset of the fields of all the documents may have a size >1000, I don't think I can put all the documents in one index. If I shard the indexes (one per group ID), how do I rank/score across several sets of search results?</p> <p><strong>--UPDATE 2--</strong></p> <p>The reason that we would exceed the 1000-field limit is that the schema of Foo objects is user-configurable. 
So for example, a user could create a groupId called "Sales Process" and add a few user-defined fields like "lead source", "product interested in", "close date", etc.</p> <p>Because each user can customize their groups, across millions of users the superset of all fields is definitely >1000. The example Foo objects listed above are a little simplistic. The groupId is actually an ID pointing to a user-created schema of all the custom fields they want, and the Foo objects contain the values for those fields.</p>
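<p>To make the ranking problem in solution 2 concrete, here is a minimal sketch of one possible workaround: ignore the per-query scores and re-score every hit on the client with a single shared function before merging. It is plain Java with no Search API calls, and it assumes each hit has already been converted to a map of field name to string value. The class and method names (<code>CrossIndexMerge</code>, <code>ScoredHit</code>, <code>countOccurrences</code>, <code>merge</code>) are hypothetical, and the score is a crude case-insensitive substring count, not the Search API's own scoring.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: merges hits from several per-group result lists using one shared score. */
final class CrossIndexMerge {

    /** A hit paired with the score recomputed on the client side. */
    static final class ScoredHit {
        final Map<String, String> fields;
        final int score;

        ScoredHit(Map<String, String> fields, int score) {
            this.fields = fields;
            this.score = score;
        }
    }

    /** Counts case-insensitive, non-overlapping substring matches of term across all field values. */
    static int countOccurrences(Map<String, String> fields, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (String value : fields.values()) {
            String haystack = value.toLowerCase(Locale.ROOT);
            int from = 0;
            while ((from = haystack.indexOf(needle, from)) >= 0) {
                count++;
                from += needle.length();
            }
        }
        return count;
    }

    /** Flattens the per-group result lists and sorts by the shared score, highest first. */
    static List<ScoredHit> merge(List<List<Map<String, String>>> perGroupResults, String term) {
        List<ScoredHit> merged = new ArrayList<>();
        for (List<Map<String, String>> groupResults : perGroupResults) {
            for (Map<String, String> doc : groupResults) {
                merged.add(new ScoredHit(doc, countOccurrences(doc, term)));
            }
        }
        // List.sort is stable, so hits with equal scores keep their original order.
        merged.sort(Comparator.comparingInt((ScoredHit h) -> h.score).reversed());
        return merged;
    }
}
```

<p>With the UPDATE 1 documents and the term <code>engineer</code>, this ordering would put <code>Frank J</code> ahead of <code>Bob Smith</code>, since "Engineering Engineer Inc." contains the substring twice. The trade-off is that every candidate hit has to be fetched and scored client-side, so it may only be practical when each per-group query returns a small number of results.</p>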
