diagramming software wireframing/prototyping software

Monday, January 4, 2010

An App Engine limitation you didn't know about

You probably know all of App Engine's documented datastore limitations. An important one is that a single entity can't have more than 5000 index entries. Usually this means that you have to be careful with queries that have multiple filters on a ListProperty (or StringListProperty, of course). If the query needs a composite index (in index.yaml) this can lead to an exploding index. Basically, when filtering on a ListProperty you should neither order() your query nor use inequality filters on any properties.

In other words, you're safe when you only use equality filters and no ordering. Right? Wrong, unfortunately. And this leads us to a limitation most people don't know about. In some cases you need a composite (and possibly exploding) datastore index even if you follow that rule! In fact, this is also true if you don't work with a ListProperty. It's just much less likely.

What are the circumstances that lead to this misery? This is the problem. There's no simple rule that you could follow. I'll try to explain it as well as I can.

Merge-join

When you run a query that only has equality filters the datastore (normally) doesn't need a composite index. Instead, it tries to match the filters by scanning each individual property's (automatically-generated) index. The matching range of each index is marked with a yellow bar on the left:

Each index entry consists of the property value(s) and the corresponding key. Within the matching range of each property index, the datastore jumps to the next matching entity key on each index until all individual indexes point to the same entity key:

This process is repeated until enough results are found. See Ryan Barrett's presentation and slides for a more detailed example of how the index and merge-joins work (the images are taken from his slides, BTW).

While this merge-join technique works well enough in most cases, it's possible that your data makes it very difficult to find a match on all property indexes. In this case, the datastore won't find enough results within a certain time limit and instead raise a NeedIndexError, suggesting that you create a composite datastore index in order to make the query efficient enough. But wait, you've probably never seen this exception in the situation I described, right? That's because small datasets, even if they're problematic, can be merge-joined quickly enough. The problem only appears with big datasets that have an index shape as described in this post.

Who's affected?

One application affected by this limitation is full-text search based on StringListProperty. This includes SearchableModel and almost all the other full-text search solutions for App Engine. The search terms seem to lead to a lot of merge-join situations that match some of the terms, but rarely all of them. The query gets slowed down by all those "almost" matches.

You could imagine the same problem with a query that has filters on two non-list properties if almost every second entity matched one filter and almost every other second entity matched the other filter. The merge-join would constantly jump to the next match on each index, but rarely find a full match on both indexes together.

From my experience, the problem starts to appear with full-text search at several 10,000s searchable documents (think 50,000 Wikipedia articles) in combination with queries containing at least five words. The bigger your index the fewer words are needed to cause the exception.

In case you see hope for alternative full-text search solutions that are not based on StringListProperty: they are still limited in scalability by other factors like the number of queries and BlobProperties you can process without causing deadline exceptions. However, I don't know how soon you'd hit those limits in practice.

Implications

How bad is the situation, really? First of all, only few queries and datasets have the properties needed for this problem. Very big datasets are more likely to be affected by it because your data might, at a large scale, look similar to the non-list example I gave above.

Still, in most cases the practical implication is that you don't have to care about it, at all. Yeah, you might need a composite index for your queries on non-list properties. So what? In fact, this is a valid optimization technique which you might want to use for big datasets, anyway.

You're only in trouble if you use queries with multiple filters on list properties and you expect your datastore to become big enough and a composite index would hit the 5000 index entries limit. This doesn't have to be the case if the ListProperty is small and you only have two or three filters on that property. It's also possible that the number of entities which could "almost" match each of your queries is relatively small and independent of the total number of entities in the datastore. For example, one of your filters could restrict the result set to a few 1000 entities and the remaining filters would match in that subset (e.g., "all documents created by user U and with label L and ...").

In any case, design your queries carefully when working with ListProperty.

Did you already run into this issue? Do you have a great example situation that explains the problem better? You should add your comment below! I'd love to hear your input.

3 comments:

  1. I'm confused - maybe you can clarify. From Brett's talk at Google I/O last year - he explained how these "zig-zag queries" are implemented.

    I was under the impression that they always use the instance key as a secondary ordering for all indexes. So, if you are walking down two indexes to find the first match - you can quickly pass all the records that have a smaller instance key (than any of the "other" indices).

    Thinking out loud, I guess you can have a case with many instances "partially matching" a query - but very few "completely matching" - so the query engine is walking very slowly through the parallel indices but not making any progress in returning a result set.

    But here I would think you would need AT LEAST THREE properties in your query - such that many records match 2 out of 3 search terms - but few match all 3.

    ReplyDelete
  2. On second thought - I think I see what can happen even for 2 properties:

    You have 1 million "John ____" in the database and one million "______ Smith" in the database - but NO "John Smith".

    Both the Firstname index and the Lastname index could interleave records so that scanning progress is slow.

    An example (ordered by instance key):

    1. John Apple
    2. Alvin Smith
    3. John Zucker
    4. Zed Smith
    5. John Smith

    The zig zag query will have to visit each of these records in order - and even the instance index won't allow it to skip any records.

    ReplyDelete
  3. Mike, thanks for the more detailed example! This should help everyone to understand the problem better.

    Also, thanks for fav'ing the Django non-relational project on your site! More attention is definitely helpful.

    ReplyDelete