Logo

Django and Big Data Part 1 — Primary Keys

Only one blog post a year, geez… let’s try to change that…

Assumptions

You have a Django site and your data is growing, or you want to add data. At some point, the defaults are not going to be enough, but you’re not big enough to hire a DBA, and NoSQL isn’t right for you. What do you do? This series will explore some things you can do as your data grows to improve performance, scalability, and keep things running. Assumptions: You are using Django and the Django ORM, you are using PostgreSQL, and your data is growing to a point you are concerned about performance, disk space, or both.

Django Default Primary Keys

When you create a Django model, by default it adds a primary key. This primary key is a column of class AutoField, and it has the flag “primary_key”. Were you to define it explicitly in code, it would look like:

id = models.AutoField(primary_key=True)

A primary key by definition must be unique. This uniqueness is implemented via an index in PostgreSQL. If there wasn’t an index to enforce uniqueness, the database would have to scan every record to ensure that the value you were inserting was unique. Obviously, that’s not optimal. This implementation takes disk space, and even with the sequence generator, updating the index takes CPU and disk time. When you are looking for ways to reduce disk space, CPU, and disk time, this may be a penalty you don’t have to take.

Natural Keys

You might not need the primary key that Django creates. Your data might have a natural key. A natural key is a piece of data that is already unique that you can use as the primary key instead of an arbitrary auto-generated numeric sequence. If your data has a natural key, it probably doesn’t need the primary key that Django generates, and you can save the disk space and the CPU and disk time you would lose by having it.

Here are some examples of possible natural keys that in certain use cases can act as a unique primary key. However, in all the examples, there are cases you can think of where the example doesn’t make sense as a unique key, so you’ll need to look at your own data and determine what columns make sense for you. There are many more than the following:

  • Social Security Number
  • Timestamp
  • ISBN Number
  • ICAO Airport Code
  • Vehicle VIN Number
  • Another Table’s Primary Key (One-to-one table relationship)
  • A combination of data such as first name and last name

In order to stop using the the default primary key in Django and specify your own, all you have to do is add “primary_key=True” to the field you want to use. This field can be of any type, even a ForeignKey . However, if you add it to a ForeignKey, it often makes sense to change the definition from ForeignKey to OneToOneField). For example:

user = models.OneToOneField(User, primary_key=True)

or:

check_number = models.IntegerField(primary_key=True)

Potential Problems with Natural Keys

There are some database designers that insist that primary keys always be surrogate keys, that is, a column that isn’t directly associated with the data itself, such as an auto-generated sequence like the Django default primary key. The reason they say this because “all data can change, even if you don’t think it will, and it will change in a way that breaks your assumption of uniqueness”. For example, in theory SSNs are unique to an individual, and they have never been recycled. However, depending on your data source, you may have individuals using fake or misappropriated SSNs, the most famous one being 078-05-1120. How do you handle that in your model? Do you reject the duplicate, or is the duplicate the correct one? Only you can answer the question of if a field is appropriate as a natural key.

Another problem is using non-integer primary keys in Django. If you select a char field as primary key, Django will add both a standard index and an index with an operator type of varchar_pattern_ops, which is used for “like” and regular expression queries. You might want this capability, or you might not. If you don’t, and the field wasn’t previously indexed, you lose all the benefits of using that field as the primary key instead of a sequence. And unfortunately, there is no way to turn that off within Django.

Finally, Django doesn’t currently support multi-column (composite) primary keys, even though PostgreSQL does. Django developers understand that there is a need and have been actively looking into it. If your natural key is a composite of columns, it’ll be tough to make that work in Django. You can have unique constraints defined in Django, and you must have a primary key column defined. But you can’t make the unique constraint the primary key yet.

A Practical Example

I’ve been going through the ForecastWatch database looking for areas of performance improvement and disk savings. Over the past two years, the number of forecasts collected and analyzed each day has more than doubled, and we want to grow it even more. But the only way it can grow significantly from it’s currently nearly 1TB size and add a slew of new features cost-effectively is if we ensure it’s a lean, mean, fighting machine.

The ForecastWatch database has been around in its current form since 2005 and I started using Django pre-magic-removal. There are definitely things I would do differently today, and by going through the entire system I found a number of places where auto-generated primary keys were just taking up space.

The best example of this are the calculated forecast error tables. Each forecast in the forecasts table is compared to observations and error values are calculated and stored in “score” tables. There are five score tables, one each for high temperature, low temperature, precipitation, wind, and opacity or sky condition. These are broad tables, with some storing 30 or so calculated error metrics. These metrics are sometimes costly to compute so by storing them in a table, we are able to efficiently calculate aggregated error statistics and display individual forecast error metrics online quickly. For each forecast, there is one, and only one row in each of the score tables.

With forecasts from more than a dozen providers, from most countries in the world, for almost ten years, we now have hundreds of millions of forecasts in the database. Each of the score tables data size is about 20GB. On top of that, there were three indexes on each of the tables, each 7GB on disk:

  • The auto-generated id
  • The forecast id associated with the score
  • The observation id associated with the score

And the Django code looked something like this:

class HighScore(models.Model):
    # Data
    # ... fields for each of the calculated error metrics ...
 
    # Keys
    forecast = models.ForeignKey(Forecast)
    actual = models.ForeignKey(Actual)

Because a primary key isn’t explicitly specified, Django creates one for you. That’s the auto-generated id that’s the first bullet point above. And it’s completely unnecessary because each row has a one-to-one relationship with a forecast, so we can just use the forecast id as the primary key.

We can do that simply by replacing adding “primary_key=True” to the parameters of the forecast ForeignKey, and we can optionally change the field class of the forecast column from ForeignKey to OneToOneField. Using the OneToOneField makes it clearer the nature of the table relationship, and provides some niceties, such as only returning a single object if you query the reverse relationship (getting a score from a forecast). So the new code looks like:

class HighScore(models.Model):
    # Data
    # ... fields for each of the calculated error metrics ...
 
    # Keys
    forecast = models.OneToOneField(Forecast, primary_key=True)
    actual = models.ForeignKey(Actual)

This simple change netted a small but notable insert performance improvement, no decrease in select or update performance, and 35GB of disk space saved. If you are going to have big tables in your application, see if you can improve performance and decrease your disk requirements by looking for alternatives to an auto-generated id field, but don’t force your design into a bad architecture simply to avoid one, only change if it makes sense.


Comments RSS You can leave a response, or trackback from your own site.

One Response to “Django and Big Data Part 1 — Primary Keys”

  1. Tobias says:

    Thanks for the practical advice. I’m striving for such Django “best practices”. I’ve successfully removed in 3 tables the autoindex row and replaced it with a Foreign Key.


Leave a Reply