Logo

Archive for the ‘Python’ Category

Django and Big Data Part 1 — Primary Keys

Only one blog post a year, geez… let’s try to change that…

Assumptions

You have a Django site and your data is growing, or you want to add data. At some point, the defaults are not going to be enough, but you’re not big enough to hire a DBA, and NoSQL isn’t right for you. What do you do? This series will explore some things you can do as your data grows to improve performance, scalability, and keep things running. Assumptions: You are using Django and the Django ORM, you are using PostgreSQL, and your data is growing to a point you are concerned about performance, disk space, or both.

Django Default Primary Keys

When you create a Django model, by default it adds a primary key. This primary key is a column of class AutoField, and it has the flag “primary_key”. Were you to define it explicitly in code, it would look like:

id = models.AutoField(primary_key=True)

A primary key by definition must be unique. This uniqueness is implemented via an index in PostgreSQL. If there wasn’t an index to enforce uniqueness, the database would have to scan every record to ensure that the value you were inserting was unique. Obviously, that’s not optimal. This implementation takes disk space, and even with the sequence generator, updating the index takes CPU and disk time. When you are looking for ways to reduce disk space, CPU, and disk time, this may be a penalty you don’t have to take.

Natural Keys

You might not need the primary key that Django creates. Your data might have a natural key. A natural key is a piece of data that is already unique that you can use as the primary key instead of an arbitrary auto-generated numeric sequence. If your data has a natural key, it probably doesn’t need the primary key that Django generates, and you can save the disk space and the CPU and disk time you would lose by having it.

Here are some examples of possible natural keys that in certain use cases can act as a unique primary key. However, in all the examples, there are cases you can think of where the example doesn’t make sense as a unique key, so you’ll need to look at your own data and determine what columns make sense for you. There are many more than the following:

  • Social Security Number
  • Timestamp
  • ISBN Number
  • ICAO Airport Code
  • Vehicle VIN Number
  • Another Table’s Primary Key (One-to-one table relationship)
  • A combination of data such as first name and last name

In order to stop using the the default primary key in Django and specify your own, all you have to do is add “primary_key=True” to the field you want to use. This field can be of any type, even a ForeignKey . However, if you add it to a ForeignKey, it often makes sense to change the definition from ForeignKey to OneToOneField). For example:

user = models.OneToOneField(User, primary_key=True)

or:

check_number = models.IntegerField(primary_key=True)

Potential Problems with Natural Keys

There are some database designers that insist that primary keys always be surrogate keys, that is, a column that isn’t directly associated with the data itself, such as an auto-generated sequence like the Django default primary key. The reason they say this because “all data can change, even if you don’t think it will, and it will change in a way that breaks your assumption of uniqueness”. For example, in theory SSNs are unique to an individual, and they have never been recycled. However, depending on your data source, you may have individuals using fake or misappropriated SSNs, the most famous one being 078-05-1120. How do you handle that in your model? Do you reject the duplicate, or is the duplicate the correct one? Only you can answer the question of if a field is appropriate as a natural key.

Another problem is using non-integer primary keys in Django. If you select a char field as primary key, Django will add both a standard index and an index with an operator type of varchar_pattern_ops, which is used for “like” and regular expression queries. You might want this capability, or you might not. If you don’t, and the field wasn’t previously indexed, you lose all the benefits of using that field as the primary key instead of a sequence. And unfortunately, there is no way to turn that off within Django.

Finally, Django doesn’t currently support multi-column (composite) primary keys, even though PostgreSQL does. Django developers understand that there is a need and have been actively looking into it. If your natural key is a composite of columns, it’ll be tough to make that work in Django. You can have unique constraints defined in Django, and you must have a primary key column defined. But you can’t make the unique constraint the primary key yet.

A Practical Example

I’ve been going through the ForecastWatch database looking for areas of performance improvement and disk savings. Over the past two years, the number of forecasts collected and analyzed each day has more than doubled, and we want to grow it even more. But the only way it can grow significantly from it’s currently nearly 1TB size and add a slew of new features cost-effectively is if we ensure it’s a lean, mean, fighting machine.

The ForecastWatch database has been around in its current form since 2005 and I started using Django pre-magic-removal. There are definitely things I would do differently today, and by going through the entire system I found a number of places where auto-generated primary keys were just taking up space.

The best example of this are the calculated forecast error tables. Each forecast in the forecasts table is compared to observations and error values are calculated and stored in “score” tables. There are five score tables, one each for high temperature, low temperature, precipitation, wind, and opacity or sky condition. These are broad tables, with some storing 30 or so calculated error metrics. These metrics are sometimes costly to compute so by storing them in a table, we are able to efficiently calculate aggregated error statistics and display individual forecast error metrics online quickly. For each forecast, there is one, and only one row in each of the score tables.

With forecasts from more than a dozen providers, from most countries in the world, for almost ten years, we now have hundreds of millions of forecasts in the database. Each of the score tables data size is about 20GB. On top of that, there were three indexes on each of the tables, each 7GB on disk:

  • The auto-generated id
  • The forecast id associated with the score
  • The observation id associated with the score

And the Django code looked something like this:

class HighScore(models.Model):
    # Data
    # ... fields for each of the calculated error metrics ...
 
    # Keys
    forecast = models.ForeignKey(Forecast)
    actual = models.ForeignKey(Actual)

Because a primary key isn’t explicitly specified, Django creates one for you. That’s the auto-generated id that’s the first bullet point above. And it’s completely unnecessary because each row has a one-to-one relationship with a forecast, so we can just use the forecast id as the primary key.

We can do that simply by replacing adding “primary_key=True” to the parameters of the forecast ForeignKey, and we can optionally change the field class of the forecast column from ForeignKey to OneToOneField. Using the OneToOneField makes it clearer the nature of the table relationship, and provides some niceties, such as only returning a single object if you query the reverse relationship (getting a score from a forecast). So the new code looks like:

class HighScore(models.Model):
    # Data
    # ... fields for each of the calculated error metrics ...
 
    # Keys
    forecast = models.OneToOneField(Forecast, primary_key=True)
    actual = models.ForeignKey(Actual)

This simple change netted a small but notable insert performance improvement, no decrease in select or update performance, and 35GB of disk space saved. If you are going to have big tables in your application, see if you can improve performance and decrease your disk requirements by looking for alternatives to an auto-generated id field, but don’t force your design into a bad architecture simply to avoid one, only change if it makes sense.


Wind Observation Calculations in FORTRAN and Python

The National Climatic Data Center (NCDC) is the government organization in charge of preserving and archiving weather observations in the United States. One of the most popular datasets they provide is the Local Climatic Data (LCD) product. It contains daily summaries and raw hourly reports from about 1,600 stations. The daily summary includes high temperature, low temperature, amount of precipitation, atmospheric pressure, and wind summaries. An example can be found here.

The daily summary has several wind observations. There are maximum five-second and two-minute speeds, and the average wind direction during those gusts. And then there are three aggregate observations: “Resultant wind speed”, “Resultant wind direction”, and “Average wind speed”. What do these observations mean, and what is the difference between “resultant” and “average”?

The average wind speed is a simple average of wind velocity, regardless of direction, over the day. If the wind blew from the north at 10 knots for half the day, and from the south at 20 knots for the other half, the average wind speed that day would be 15 knots.

Resultant wind speed and resultant wind direction are the vector average (the resultant vector) wind speed and direction. Imagine the wind observation as a vector starting from the origin (the observing station) pointing in the direction the wind is originating from and with a length equal to the wind speed. If you take all those vectors and add them up, the resulting direction of the vector would be the resultant wind direction, and the magnitude of the resulting vector divided by the number of vectors is the resultant wind speed.

Adding all of the hourly wind vectors and averaging them results in the values expressed in the daily resultant wind speed and resultant wind direction observations. That is how the NCDC calculates those values, and they do it in FORTRAN. I am going to walk through the specific code that generates those daily averages from hourly data. I’ll also show the same code in Python. Finally, using the complex math library in the Python standard library, I’ll show how the vector average can be calculated with far less code. The full FORTRAN example can be downloaded here, while the Python source can be found here.

The following FORTRAN code was supplied by the NCDC, and is the actual code they use to compute daily resultant wind direction and speed. The Python code was created by me to exactly match the FORTRAN code, to compare and contrast the differences between FORTRAN and Python.

1
2
3
4
5
6
7
8
9
10
11
12
13
      SUBROUTINE R_WIND (N_RWD, RRDD, RWND, RWDIR, RWSPD)
 
      IMPLICIT NONE
 
      REAL*8 RRDD(31), RWND(31), R_DIR, R_SPD, TOTRX, TOTRY
 
      INTEGER X, WI1, WI2, N_RWD
 
      CHARACTER*2 RWDIR
      CHARACTER*5 RWSPD, WC1
 
      TOTRX = 0.0
      TOTRY = 0.0

The subroutine R_WIND takes three parameters, and returns two. N_RWD is the number of hourly wind observations. RRDD is the array of wind directions, while RWND is the array of wind speeds. The wind direction angles passed in are degrees plus 180 converted into radians.

The subroutine returns two results, RWDIR, which is the resultant wind direction, and RWSPD, which is the resultant wind speed. These values are returned as character strings.

The Python equivalent is much shorter in code, because in Python you don’t need to explicitly declare variable types. You do have to import the math library, since the sine and cosine functions that will be used are not implicitly available, as with FORTRAN. There are also no explicitly declared output variables. Because of this dynamic nature, documentation is even more important. The text between triple-quotes immediately following the function definition is called a “docstring.” A docstring is used in Python to document the code, and is used by documentation generators like Sphinx to build web-based documentation for your code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import math
 
def r_wind(n_rwd, rrdd, rwnd):
    """
    Takes an array of houry wind directions and speeds and returns the
    resultant average wind speed and direction, or None for speed and direction
    if there are fewer than 16 hourly observations.
 
    n_rwd -- number of hourly wind observations
    rrdd  -- array of wind directions plus 180 degrees, in radians
    rwnd  -- array of wind speeds
 
    Returns a tuple of resultant wind direction and wind speed.  Both are
    returned as strings.  Wind direction is in tens of degrees.
    """"
    rwdir = rwspd = None
    totrx = 0.0
    totry = 0.0

Once the variables have been declared, the subroutine then checks to see if there are at least 16 hourly observations. If there are not at least 16 hourly observations out of 24 (or two-thirds), it won’t try to calculate a daily resultant wind vector. Then, it loops through the direction and speed arrays, converting into Cartesian coordinates. Converting polar (angle and magnitude, in this case direction and speed) coordinates into Cartesian coordinates to sum them up, then average, is the normal way to add vectors.

15
16
17
18
19
20
21
      IF (N_RWD .GT. 15) THEN
         X = 1
         DO WHILE (X .LE. N_RWD)
            TOTRX = (RWND(X)*(DSIN(RRDD(X)))) + TOTRX  ! TOT X & Y COOR.
            TOTRY = (RWND(X)*(DCOS(RRDD(X)))) + TOTRY
            X = X + 1
         END DO

There are a couple of unique things happening in this code. First, adding 180 degrees to the wind directions prior to passing in the array. Second, using sine for the X coordinate conversion and cosine for the Y coordinate conversion is the opposite of the general way to convert from polar to Cartesian coordinates (see here). The reason this is done, even though it isn’t required in this subroutine, is to create a Cartesian plane that matches a map. When wind is measured at 0 degrees, it means it is coming out of the north. The visualized vector would be an arrow pointing down (the negative y-axis in a standard Cartesian plane). Additionally, the angle would be measured clockwise (a 90 degree wind would be coming from the east, and thus a vector pointing left along the negative x-axis). Using the standard Cartesian coordinate conversion, the zero angle vector would point right along the positive x-axis, with angles measured counter-clockwise.

20
21
22
23
24
25
    if n_rwd > 15:
        x = 0
        while x < n_rwd:
            totrx = (rwnd[x]*(math.sin(rrdd[x]))) + totrx # tot x & y coor.
            totry = (rwnd[x]*(math.cos(rrdd[x]))) + totry
            x = x + 1

The Python version is quite similar. The biggest difference is that Python arrays are zero indexed, while FORTRAN has arrays indexed starting at one. Another difference is that Python arrays use brackets to indicate the index, with parentheses reserved for function calls. Also, no “end while” statement is required, because in Python, indentation has syntactical meaning. Specficially, anything indented past the while statement are considered part of the while block. Once code returns to the same indentation as the while, it is not longer part of the while block. Finally, other differences are that in Python, parentheses are not required around conditional expressions, and comments in Python are indicated with a hash character rather than a bang.

The next section of the FORTRAN code finds the angle of the resultant x,y coordinates and averages the resultant length (which is speed).

23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
         R_DIR = DATAN(TOTRX/TOTRY) ! AVG R_WIND DIR
         R_DIR = R_DIR / (3.14156/180)   ! convert radians back to degrees
 
         IF ((TOTRX*TOTRY) .LT. 0) THEN
            IF (TOTRX .LT. 0) THEN
               R_DIR = 180 + R_DIR
            ELSE
               R_DIR = 360 + R_DIR
            END IF
         ELSE
            IF (TOTRX .GT. 0) R_DIR = 180 + R_DIR
         END IF
 
         R_SPD = (((TOTRX**2) + (TOTRY**2)) / N_RWD**2)**.5
 
         IF (R_DIR .EQ. 0 .AND. R_SPD .NE. 0) R_DIR = 360
         IF (R_SPD .EQ. 0) R_DIR = 0

Lines 23 and 24 take the arctangent of the x,y coordinate to compute the angle to convert back into polar coordinates. Line 27-35 convert the angle to the original solution space, based on the quadrant the resultant x,y coordinate resides in. Line 37 is simply the distance between the origin and the averaged x,y coordinate, which is the vector length of the averaged resultant vector, which is resultant wind speed. This is calculated by realizing that x and y are just legs of a right triangle, with hypotenuse being the distance between the origin and the point. Line 39 and 40 are just clean up. The subroutine will return an angle of 360 for a north wind, and zero only if there is resultant wind. Line 40 is required because the DATAN function will return NaN (Not A Number) if x and y both sum to zero, which is the only time the speed would be zero.

The Python version is quite similar. The only major difference is that there needs to be a check that totry isn’t zero, as that would result in a ZeroDivisionError Python exception. When a division by zero is encountered in Python, it will interrupt code flow and throw an exception. FORTRAN on the other hand returns the discrete value NaN which is then passed to the DATAN function, which knows how to handle NaN. There actually is a NaN in Python, but Python chooses to throw an exception rather than return NaN on a division by zero.

35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
        if totry == 0.0:
            r_dir = 0
        else:
            r_dir = math.atan(totrx/totry) # avg r_wind dir
        r_dir = r_dir / (3.14156/180) # convert radians back to degrees
 
        if (totrx*totry) < 0:
            if totrx < 0:
                r_dir = 180 + r_dir
            else:
                r_dir = 360 + r_dir
        else:
            if totrx > 0:
                r_dir = 180 + r_dir
 
        r_spd = ((totrx**2 + totry**2) / n_rwd**2)**.5
 
        if r_dir == 0 and r_spd != 0:
            r_dir = 360
 
        if r_spd == 0:
            r_dir = 0

The final section of the subroutine just rounds and formats. It rounds the degrees to tens of degrees, and returns in tens of degrees. Thus, 260 degrees would be returned as 26. Speed is returned as tenths of a knot. The values are returned as strings, as they will be ending up in a CSV file, not used in further processing.

41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
         WRITE (WC1, FMT='(F4.0)') R_DIR
 
         READ (WC1, 10) WI1, WI2
 10      FORMAT (I2, I1)
 
         IF (WI2 .GT. 4) THEN
            WI1 = WI1 + 1
         END IF
 
         WRITE (RWDIR, FMT='(I2)') WI1
         WRITE (RWSPD, FMT='(F5.1)') R_SPD
      END IF
 
      RETURN
      END

The Python version is noticeably different as formatting and substring extraction is quite different between FORTRAN and Python. Python uses C-style format strings. The FORTRAN code is also using CHARACTER arrays rather than STRING arrays (perhaps when this was written the FORTRAN dialect used did not support STRING), which doesn’t support substring notation. Python strings are character arrays and provide all the same functionality as other arrays, including sub-array indexing, which is used here. Finally, in Python don’t declare the return variables in the function declaration, you return them explicitly, and anonymously, which gives the programmer a lot of power, but also adds to the responsibility of the programmer to document what is returned in docstrings.

58
59
60
61
62
63
64
65
66
67
68
        wc1 = '%#4.0f' % r_dir
        wi1 = int(wc1[0:2])
        wi2 = int(wc1[2:3])
 
        if wi2 > 4:
            wi1 = wi1 + 1
 
        rwdir = '%2i' % wi1
        rwspd = '%#5.1f' % r_spd
 
    return rwdir, rwspd

That’s it. Of course, if we were to rewrite this function today in Python, we wouldn’t need to do all of this calculation in our code, but we can instead rely on higher level functions in the standard library to help us out. This is probably also true of FORTRAN today, but I’ve not actively written FORTRAN in many years, so don’t know.

Python has a complex math library, called cmath that we can import in order to give us the functionality we need. We can convert polar coordinates into a Cartesian complex number, add up all the complex numbers and average them, then convert the complex number back to polar coordinates. This will accomplish the task of the original function, which is to return the resultant wind vector angle and speed.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def r_wind_new(n_rwd, rrdd, rwnd):
    wind_vector_sum = None
 
    if n_rwd > 15:
        for i in range(0, n_rwd):
            wind_polar = cmath.rect(rwnd[i], rrdd[i] - math.pi)
            if wind_vector_sum is None:
                wind_vector_sum = wind_polar
            else:
                wind_vector_sum += wind_polar
 
        r, phi = cmath.polar(wind_vector_sum / n_rwd)
 
        rwdir = int(round(int(round(math.degrees(phi) % 360)) / 10.0))
        rwspd = int(round(r*10))/10.0
    else:
        return None, None
 
    return '%2i' % rwdir, '%#5.1f' % rwspd

This new Python function, r_wind_new is a drop-in replacement for the original r_wind function. It uses cmath.rect() to convert a vector angle and length to a Cartesian complex number. These complex numbers are added together and then averaged. Then, cmath.polar() is used to convert the complex number into an angle and length, which is the resultant wind vector.