diagramming software wireframing/prototyping software

Thursday, January 28, 2010

Django Meetup slides and little update

Due to a busy week, we can only provide a short update this time.

Andy Smith from Google (and a Jaiku developer) has done a little presentation on Django on App Engine at the San Francisco Django Meetup on January 27th. Unfortunately, there is no video, but you can at least take a look at his slides. The talk was mostly about App Engine and the last third was about Django on App Engine. He also talked about the Django non-relational project, but nothing in depth and there were no demos. There were also a few other Google developers at the meetup, so hopefully we could get some more attention from Google and the Django community. We'd love to have some more impressions from people who attended the meetup. Were there any interesting questions?

Another little update is that recently I added "startswith" query support to the App Engine backend. It's based on the unicode trick described in the App Engine documentation. This means you'll need a composite index for this kind of query if you combine it with other filters and you can't order by a field other than the "startswith" filter's field. Still, this should help with getting more Django apps supported without any modifications.

If you want to write a backend for other non-relational DBs you might want to take another look at the App Engine backend. I've cleaned up the most important parts of the code and moved the reusable bits into a separate module. It's still part of the App Engine backend because we need some more backends before we can agree on a common NonrelCompiler base class for all non-relational backends. The current NonrelCompiler provides functions for in-memory filtering and sorting of results. This is useful if your DB provides a separate API for fetching entities by primary key and doesn't support additional filter rules in that case.

Tuesday, January 19, 2010

Sharding with Django on App Engine

When developing a scalable application for App Engine you need to pay attention to how often a database entity is updated because you can only update any single entity with a maximal frequency of 1-5 times per second. If the frequency of updates for any single entity is higher than this limit you can expect your application to have contention. In order to prevent such situations you can use a technique called sharding. Sharding takes advantage of the datastore's ability to handle many parallel updates on multiple distinct entities efficiently and to handle reads much faster than writes.

Let's get to the example of a simple counter for which the frequency of updates is too high for a single entity (such a counter could be used for counting the number of views for a YouTube video):
class SimpleCounterShard(models.Model):
"""Shards for the counter"""
count = models.IntegerField(default=0)
name = models.CharField(primary_key=True,
max_length=500)

NUM_SHARDS = 20
@classmethod
def get_count(cls):
"""
Retrieve the value for a given sharded counter.
"""
total = 0
for counter in SimpleCounterShard.objects.all():
total += counter.count
return total

@classmethod
@commit_locked
def increment(cls):
"""
Increment the value for a given sharded counter.
"""
index = random.randint(0,
SimpleCounterShard.NUM_SHARDS - 1)
shard_name = 'shard' + str(index)
counter = SimpleCounterShard.objects.get_or_create(
pk=shard_name)[0]
counter.count += 1
counter.save()

The basic idea is to divide the counter into N sub-counters and to compute the counter's real value (get_count) by summing over all values of the N sub-counters. While trying to increment the counter's real value (increment_count) we select one of the N sub-counters at random. As a result we avoid contention. By increasing the number of shards you can increase the maximum throughput.

There exists an App Engine specific article about sharding counters from Joe Gregorio. If you want to learn more about sharding counters take a look at that article.

I ported the code (including GeneralCounterShard) from that article to native Django such that it can theoretically be used for any database supporting optimistic transactions. The code is available here and you can test the live demo.

Django's advantage

Now we can get to an exciting example of how Django's ORM can get us out of the stone age of non-relational database development practice. Let's see an example using F() objects:
YouTubeVideo.objects.filter(pk=keyname).update(
views_count=F('views_count')+1)
The database backend can detect such "counting updates" and use sharding for App Engine or other techniques like updates via background tasks for other databases automatically in order to perform such an update. Thus Django gives us the advantage of an additional layer of abstraction. Moreover, we can formulate "counting updates" independently of the database used, so we can switch to a different database without having to change the code.

Keep in mind that sharding is a technique which can be used for much more than just counters. This is our first App Engine app ported to native Django. If you plan to port some other App Engine app to Django you can use the sharding counter port to get an idea of how to do it. And please let us know about it.

Tuesday, January 12, 2010

Native Django on App Engine

Update: Our main branch has changed to django-nonrel which is based on a much simpler strategy that became possible with Django's new multi-db support. All links in this article have been updated. The old branch is still available.

About a few months ago we started to port Django to support non-relational databases and to implement an App Engine database backend for this port. So far we ended up supporting basic Django functionality on App Engine. This post is intended to get you started using our port and to let you know what you can do and want you can't do with it at the moment. So let's start!

Installation

In order to use our port on App Engine you have to clone a few repositories first. These are the non-relational django port, djangoappengine and the django-testapp. So what are all these repos for?
First in order to let you start a new Django project as fast as possible we created the django-testapp repo. It basically consists of a modified manage.py file to support the App Engine development server, and all corresponding settings in order to specify App Engine as the backend for Django. So the first step is to clone the django-testapp repo.
The djangoappengine repo contains all App Engine related backends for the non-relational port e.g. the email backend and the query backend for Django. So as the next step clone the djangoappengine package into the "common-apps" folder of the testapp repo.
The non-relational Django repo is our modified Django port. It contains changes in Django itself to support non-relational databases. Clone it and link the repository's "django" folder into the "common-apps" folder.
Your folder structure should now look similar to this:

.../non-relational_django_port
.../django-testapp
.../django-testapp/common-apps/djangoappengine
.../django-testapp/common-apps/link_to_django_folder

Now you should be able to create a new app in the testapp folder and use native Django models on App Engine. Try it!

Supported and unsupported features for the App Engine backend

Field types
First we added support for most Django field types but not all. Here is the list of Django field types which currently are not supported (but there are chances that a few of them will get ported in the near future):
  • FileField
  • FilePathField
  • OneToOneField
  • ManyToManyField
  • DecimalField
  • ImageField
All other ones are supported fully. In addition we support all Django field options except for options which can't be used on App Engine. These are unique, unique_for_date, unique_for_month and unique_for_year.

Additionally there exist App Engine properties which do not have been ported to Django field types yet, so you cannot use them. Maybe the most mentionable are ListPorperties. So any of your projects using ListPorperties can't be easily converted to a Django project.
Here is the list of not explicitly supported App Engine properties:
  • ByteStringProperty
  • ListProperty and StringListProperty
  • ReferenceProperty and SelfReferenceProperty
  • blobstore.BlobReferenceProperty
  • UserProperty
  • BlobProperty
  • GeoPtProperty
  • RatingProperty
Nevertheless you can use many-to-one relationships, just use a ForeignKey. But internally the ForeignKey does not store an App Engine db.Key value. The lack of a GAEKeyField in django storing db.Keys makes it impossible to create entity groups with the current App Engine backend.

The following App Engine properties can be emulated by using a CharField in Django:
  • CategoryProperty
  • LinkProperty
  • EmailProperty
  • IMProperty
  • PhoneNumberProperty
  • PostalAddressProperty
QuerySet methods
We do not support any filters that are not supported by App Engine itself. This means that you can't use all field lookup types from Django. Nevertheless you can use
  • __lt less than
  • __lte less than or equal to
  • __exact equal to
  • __gt greater than
  • __gte greater than or equal to
  • is_null
You can even use the exclude function in combination with the listed filters above. In addition you can use an IN-filter on primary keys and Q-objects to formulate more complex queries. Slicing is possible too.
But in all cases you have to keep App Engine restrictions in mind. While you can perform a filter like
Model.objects.exclude(Q(integer_field__lt=5) | Q(integer_field__gte=9))
you can't do
Model.objects.filter(Q(integer_field__lt=5) | Q(integer_field__gte=9))
The reason for this is that the first filter can be translated into an AND-filter without using a logical OR but the second one cannot.
Inequality filters are not supported for now. But many additional non return sets like count(), reverse(), ... do work.
Maybe you are wondering how to set a keyname for your model instance in Django. Just set the primary key of your entity and it will be used as the keyname.
model_instance.pk = u'keyname'
model_instance.save()
but remember to set the primary_key field option of the CharField to True.

Many Django features are not supported at the moment. A few of them are joins, many-to-many relations, multi table inheritance and aggregates. On the other hand there are App Engine features not supported by the Django port mainly because of the lack of corresponding features in Django. Two of them are entity groups and batch put.
But we plan to add some features for both sides while working on our own project. As the last point we do not support any Django transaction for the App Engine backend but we implemented a custom transaction decorator @commit_locked.

Differences in Django
itself

While changing the Django code itself we had to hack some parts in. Django should behave as described in the Django documentation in all situations except for one: when deleting a model instance the related objects will not be deleted. This had to be done because such a deletion process can take too much time.
In order to support deletion of related objects for non-relational Django in a clean manner Django should be modified to allow the backend to handle such a deletion process. For App Engine it can be done using background tasks.

Advantages using native Django

So maybe you are wondering why to use our current Django port instead of other projects providing you Django features on App Engine like our last project app-engine-patch. There are a few reasons for that.
First of all and maybe the most important advantage of using native Django is that you will avoid the vendor lock-in problem. I am not saying that you should not use App Engine but you can find yourself in a situation where App Engine is not sufficient enough for your needs. Using native Django code will allow you to switch to another hosting/cloud provider with having minimal costs related to porting the existent code to the new database. Second using native Django will result in almost full reusability of existing Django apps and makes non-relational Django apps fully portable to any platform (including SQL). Maybe these apps have to be modified a little bit but changing these apps is a lot easier than to port an App Engine project to Django.
In addition the model layer gives you many new features like select_related() which are not supported by other Django projects on App Engine. Another example is while using App Engine models we found it annoying to pass required fields to the model constructor. In Django you do not have to set required field types when creating a model instance. Only when saving the entity such fields have to be valid. So these are only a few reasons.

If you want to see more source code about all these things mentioned in this post take a look at our unit tests in djangoappengine. There you can find model definitions and queries such that you can get a better idea of how to use the App Engine backend for the non-relational Django port.

Help


A few days ago we cleaned up the non-relational django port itself. Now it should be easy to write a backend without modifying the django port. If you want to help improving our non-relational Django port, you can check out the wiki and the task list. We are happy about any other help too. You can help by testing the port, reaching out to other communities like MongoDB and SimpleDB, building a greater community around this project and also attracting developers who can contribute code.

While this post is mostly about how to use our port on App Engine it's possible to implement support for other non-relational databases using our non-relational Django port too. Again if you are planning to help, take a look at the djangoappengine repo to get an idea of how to implement your own non-relational database backend for Django and join the django non-relational group.

In order to help you write non-relational Django code we plan to write posts about how to write SQL-independent Django code and how to port existing app-engine-patch/webapp/Django apps to the new non-relational port.

So my first post got a little bit longer than i thought but i hope you enjoyed to read it. We'd love to hear your comments.

Thursday, January 7, 2010

First steps towards a Query backend interface

Our post about how to use the Django non-relational port and its supported features is already in the pipeline, but in the meantime I'd like to post a little update for those who want to contribute. In case you haven't heard of it, yet: the non-relational port adds support for Django's ORM (Model, QuerySet, etc.), so you can use the same code on App Engine, other cloud platforms, and on SQL.

You can find a short introduction to the internal architecture on our project wiki. Please read that before you continue.

Yesterday I finally moved out all hacked-in App Engine code into our djangoappengine package. This means that it's now possible to write real non-relational database backends for our Django port. On the long run it would be great to also have SimpleDB, MongoDB, and CouchDB support. Of course, the API is not set in stone, yet, so expect changes, especially in QueryData's internal format. If you want to implement a backend and get involved with the port you should fork the djangoappengine package and adapt the backends to your platform. I think you could write a simple backend in a few days. Beware, though, the djangoappengine code isn't clean, yet.

The new backend API

The backend's DatabaseOperations object (accessible via connection.ops) now provides a query_class() function which returns the QueryBackend class. This function is used by QuerySet and Model. Normally, you shouldn't override it. In order to specify the backend's Query class you just need to add a query_backend attribute to your DatabaseOperations like this:
class DatabaseOperations(BaseDatabaseOperations):
    query_backend = 'djangoappengine.db.backend.QueryBackend'

The backend itself then has to implement the database operations on top of QueryData. The BaseQueryBackend class provides a default constructor which takes the querydata instance and the connection instance and stores them in self.querydata and self.connection, respectively. Use them in your QueryBackend implementation. For example, the results_iter() function should convert self.querydata into the actual database query expression, execute it, and return the results. I won't go into the details, here, because the API is still not finished. Just look at the source if you want to know more.

The next step is to complete the low-level QueryBackend interface (e.g., QuerySet.update() isn't supported) and port Django's SQL layer to our backend and QueryData API. Yes, SQL is turned off right now. The incomplete SQL backend is in the django.db.models.sql.backend module. Once SQL is working, again, the Django team can comment on our QueryData solution and hopefully add non-relational backend support as a must-have feature for Django 1.3.

Since QueryData stores QuerySet function calls almost unmodified (except where it saves work for the backend), the translation to sql.Query shouldn't be very difficult, but it requires that you understand what's going on in the sql.Query class. QueryData copied a lot of code directly from that class, but modified the intermediate representation of the query expression to be more portable. The intermediate representation must be easily translatable by the sql.Query class and at the same time independent enough of SQL that you can write non-relational backends with it. If the representation is too simple we make the backends unnecessarily complicated. If the representation is too SQL-specific we can't built non-relational backends on top of it. That's the nut we have to crack. We're not sure, yet, if our current representation hits the sweet spot.

Finally, there are still a few TODOs in the Django source (search for "GAE"). For example, Django expects that when you delete an entity all related entities get deleted, too. This can take too much time, so the backend must be able to defer this process to a background task, for example. Also, multi-table inheritance isn't supported on non-relational backends. Currently it just raises an exception if you want to save/query a model on such a backend, but Django could alternatively implement the PolyModel concept to emulate multi-table inheritance. For now, we've just extended the backend API, so it can turn those features off for certain connections, but ideally these features should still be emulated somehow.

If you want to see more and faster progress you should help us. Fork our repositories (djangoappengine and django-nonrel-multidb), join our discussion group, and tell us what you're working on and feel free to ask for feedback.

Monday, January 4, 2010

An App Engine limitation you didn't know about

You probably know all of App Engine's documented datastore limitations. An important one is that a single entity can't have more than 5000 index entries. Usually this means that you have to be careful with queries that have multiple filters on a ListProperty (or StringListProperty, of course). If the query needs a composite index (in index.yaml) this can lead to an exploding index. Basically, when filtering on a ListProperty you should neither order() your query nor use inequality filters on any properties.

In other words, you're safe when you only use equality filters and no ordering. Right? Wrong, unfortunately. And this leads us to a limitation most people don't know about. In some cases you need a composite (and possibly exploding) datastore index even if you follow that rule! In fact, this is also true if you don't work with a ListProperty. It's just much less likely.

What are the circumstances that lead to this misery? This is the problem. There's no simple rule that you could follow. I'll try to explain it as well as I can.

Merge-join

When you run a query that only has equality filters the datastore (normally) doesn't need a composite index. Instead, it tries to match the filters by scanning each individual property's (automatically-generated) index. The matching range of each index is marked with a yellow bar on the left:

Each index entry consists of the property value(s) and the corresponding key. Within the matching range of each property index, the datastore jumps to the next matching entity key on each index until all individual indexes point to the same entity key:

This process is repeated until enough results are found. See Ryan Barrett's presentation and slides for a more detailed example of how the index and merge-joins work (the images are taken from his slides, BTW).

While this merge-join technique works well enough in most cases, it's possible that your data makes it very difficult to find a match on all property indexes. In this case, the datastore won't find enough results within a certain time limit and instead raise a NeedIndexError, suggesting that you create a composite datastore index in order to make the query efficient enough. But wait, you've probably never seen this exception in the situation I described, right? That's because small datasets, even if they're problematic, can be merge-joined quickly enough. The problem only appears with big datasets that have an index shape as described in this post.

Who's affected?

One application affected by this limitation is full-text search based on StringListProperty. This includes SearchableModel and almost all the other full-text search solutions for App Engine. The search terms seem to lead to a lot of merge-join situations that match some of the terms, but rarely all of them. The query gets slowed down by all those "almost" matches.

You could imagine the same problem with a query that has filters on two non-list properties if almost every second entity matched one filter and almost every other second entity matched the other filter. The merge-join would constantly jump to the next match on each index, but rarely find a full match on both indexes together.

From my experience, the problem starts to appear with full-text search at several 10,000s searchable documents (think 50,000 Wikipedia articles) in combination with queries containing at least five words. The bigger your index the fewer words are needed to cause the exception.

In case you see hope for alternative full-text search solutions that are not based on StringListProperty: they are still limited in scalability by other factors like the number of queries and BlobProperties you can process without causing deadline exceptions. However, I don't know how soon you'd hit those limits in practice.

Implications

How bad is the situation, really? First of all, only few queries and datasets have the properties needed for this problem. Very big datasets are more likely to be affected by it because your data might, at a large scale, look similar to the non-list example I gave above.

Still, in most cases the practical implication is that you don't have to care about it, at all. Yeah, you might need a composite index for your queries on non-list properties. So what? In fact, this is a valid optimization technique which you might want to use for big datasets, anyway.

You're only in trouble if you use queries with multiple filters on list properties and you expect your datastore to become big enough and a composite index would hit the 5000 index entries limit. This doesn't have to be the case if the ListProperty is small and you only have two or three filters on that property. It's also possible that the number of entities which could "almost" match each of your queries is relatively small and independent of the total number of entities in the datastore. For example, one of your filters could restrict the result set to a few 1000 entities and the remaining filters would match in that subset (e.g., "all documents created by user U and with label L and ...").

In any case, design your queries carefully when working with ListProperty.

Did you already run into this issue? Do you have a great example situation that explains the problem better? You should add your comment below! I'd love to hear your input.