Rapid prototyping of APIs using Marshmallow

Marshmallow is a lightweight object serialization/deserialization library for converting complex objects to and from native Python datatypes. By defining a marshmallow schema, app-level objects can be serialized to native Python datatypes (before rendering to JSON), and input data can be validated and deserialized to app-level objects.

Integrating Marshmallow with Flask allows APIs to be created quickly while keeping dependencies light and the code easy to follow. When building a prototype API, the aims are usually to support the basic HTTP methods and CRUD operations, and to serve data in JSON format. The code should also be flexible enough to modify easily, since at the outset we may not know exactly what is needed.

A basic API could be created using Flask on its own, together with an ORM such as SQLAlchemy to support a database, as in the following example of an endpoint which returns recent air quality measurements for a specified site within a collection of monitoring sites:

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
db = SQLAlchemy(app)

class Site(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(100), unique=True)
    site_code = db.Column(db.String(10), unique=True)
    hourly_data = db.relationship('HourlyData',
                      backref='owner', lazy='dynamic')

class HourlyData(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    ozone = db.Column(db.String(10))
    no2 = db.Column(db.String(10))
    pm10 = db.Column(db.String(10))
    time = db.Column(db.String(20))
    site_id = db.Column(db.Integer, db.ForeignKey('site.id'))

@app.route('/data/<site_code>/<int:days>')
def site_aq_values(site_code, days):
    qs = HourlyData.query.join(Site).filter(
             Site.site_code == site_code.upper()).order_by(
             HourlyData.time.desc()).limit(days * 24).all()
    if qs:
        data_keys = ['ozone', 'no2', 'pm10']
        data_list = [{'time': a.time,
                      'values': {b: getattr(a, b) for b in data_keys}}
                     for a in qs]
        site_keys = ['name', 'site_code']
        site_data = {b: getattr(qs[0].owner, b) for b in site_keys}
        all_data = {site_code.upper(): {'aq_data': data_list,
                                        'site_data': site_data}}
        return jsonify(all_data)
    return jsonify({'message': 'no data'})

Clearly, manually converting models into dictionaries for each route would involve a lot of repetition; the code would be hard to read and slow to write. Marshmallow is ORM/framework agnostic and works with, for example, complex Python dictionary structures as well as various ORM objects. The SQLAlchemy query in the example above can be serialized to return the same response using Marshmallow:

from flask_marshmallow import Marshmallow

ma = Marshmallow()

class SiteSchema(ma.Schema):
    id = ma.Integer(dump_only=True)
    # a string reference is used since HourlyDataSchema is defined below
    data = ma.Nested('HourlyDataSchema', many=True, allow_none=True,
                     dump_only=True, attribute='hourly_data')
    url = ma.URLFor('Site.site_detail', id='<id>')

    class Meta:
        additional = ('name', 'site_code')

class HourlyDataSchema(ma.Schema):
    id = ma.Integer(dump_only=True)
    site_name = ma.Function(lambda obj: obj.owner.name)
    site_code = ma.Function(lambda obj: obj.owner.site_code)
    time = ma.Method(serialize='format_time')

    def format_time(self, obj):
        # convert 'dd/mm/yyyy hh:mm' into 'yyyy-mm-dd hh:mm'
        return '{}-{}-{} {}'.format(
            *obj.time.split(' ')[0].split('/')[::-1],
            obj.time.split(' ')[1])

    class Meta:
        additional = ('ozone', 'no2', 'pm10')

site_schema = SiteSchema()
many_data_schema = HourlyDataSchema(many=True)

The schemas can then be re-used across multiple views:

from flask import Blueprint, jsonify
from app.models import HourlyData, Site
from app.schemas import current_hour_schema, many_data_schema, site_schema

hourly_data = Blueprint('hourly_data', __name__, url_prefix='/data')

@hourly_data.route('/recent')
def recent_aq():
    data = HourlyData.query.group_by(HourlyData.site_id)
    return current_hour_schema.jsonify(data)

@hourly_data.route('/<site_code>/<int:number>')
def get_aq_data(site_code, number):
    data = HourlyData.query.join(Site).filter(
               Site.site_code == site_code.upper()).order_by(
               HourlyData.time.desc()).limit(number).all()
    return jsonify({
        'site info': site_schema.dump(data[0].owner),
        'aq data': many_data_schema.dump(data)})

Python unit testing with Mock

Unit testing is used to check that a certain unit of code behaves as expected. A unit should have a narrow, well-defined scope, and it is important that units are tested in isolation, for example by stubbing or mocking interactions with the outside world. By testing individual units in isolation from the external code they depend upon, failures in the code base can be identified more easily. To avoid individual tests breaking unnecessarily, this principle of keeping unit tests decoupled extends to, for example, building expected return values from the values of object attributes rather than hard-coding them.

In the following example, the unittest.mock library allows a function to be tested in isolation from a helper function, which appends a suffix to the day of the month to produce a date in a custom format. Here, the call to the helper function is mocked using the convenient patch decorator:

# views/login_view.py
from datetime import datetime

def suffix(day):
    # 11th, 12th and 13th are irregular; otherwise the suffix
    # depends on the last digit of the day
    if day in (11, 12, 13):
        return 'th'
    return {1: 'st', 2: 'nd', 3: 'rd'}.get(day % 10, 'th')

def welcome_msg(greet):
    dt = datetime.now()
    day = str(dt.day) + suffix(dt.day)
    today = dt.strftime('{d} %B %Y').replace('{d}', day)
    return '{}. Today is {}'.format(greet, today)

# tests.py
from datetime import datetime
import unittest
from unittest.mock import patch
from views.login_view import welcome_msg

class LoginViewTestCase(unittest.TestCase):

    @patch('views.login_view.suffix', return_value='th')
    def test_greeting(self, suffix_patch):
        expected = '{}. Today is {}{} {}'.format(
            'Hello', datetime.now().day, 'th',
            datetime.now().strftime('%B %Y'))
        self.assertEqual(expected, welcome_msg('Hello'))

In doing so, the mocked function is replaced with a Mock object created by applying the decorator. When called, a Mock object returns its return_value attribute, which can easily be set but by default is a new Mock object.
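As a minimal, self-contained illustration of return_value:

```python
from unittest.mock import Mock

m = Mock()
m.return_value = 42
print(m())  # 42: a Mock returns its return_value when called

default = Mock()
result = default()  # no return_value was set, so a new Mock is returned
print(isinstance(result, Mock))  # True
```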

It is desirable in many cases to test whether, and how many times, a mocked callable was called. The boolean and integer values provided by the called and call_count attributes are useful for this:

from unittest.mock import Mock

mock = Mock(return_value=None)
a = mock.called       # False: not yet called
mock()
mock()
b = mock.called       # True: the mock has now been called
c = mock.call_count   # 2

>>> print(a, b, c)
False True 2

A side_effect can also be set; this is useful for raising exceptions in order to test error handling:

from django.http import Http404

@patch('views.login_view.requests.get', side_effect=Http404)
def test_my_func_raises_http_exception(self, my_patch):
    with self.assertRaises(Http404):
        my_func()  # the function under test, assumed to call requests.get

side_effect is also useful where your mock is going to be called several times and you want each call to return a different value:

def adder(val):
    return val + 5

def adder_squared(val):
    return adder(val) ** 2

# 'my_module' stands in for the module where adder is defined
@patch('my_module.adder', side_effect=[1, 2])
def test_repeat_caller(self, test_patch):
    resp = adder_squared(3)
    self.assertEqual(resp, 1)   # first call: mocked adder returns 1
    resp2 = adder_squared(3)
    self.assertEqual(resp2, 4)  # second call: returns 2, squared gives 4

Lazy evaluation of Django ORM objects

When you create a Django QuerySet object, no database activity occurs until you do something to evaluate the queryset. Evaluation is forced by the following: iterating over it, calling len(), list(), or repr() on it, pickling it, slicing it with the ‘step’ parameter, or testing it in a boolean context.
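A sketch of the idea, using the Entry model from the caching example below (this assumes a configured Django project; the field names in the filters are hypothetical):

```python
# Building up a queryset performs no database activity
qs = Entry.objects.filter(headline__contains='python')  # no SQL executed yet
qs = qs.exclude(id__lt=100)                             # still lazy

entries = list(qs)   # list() forces evaluation: the SQL runs here
count = len(qs)      # len() would also force evaluation
if qs:               # so would a boolean test
    for entry in qs: # and iteration
        pass
```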

How data is held in memory

When a queryset is created, its cache is empty. When it is evaluated and the database interaction occurs, the results of the query are stored and the requested results returned. In many cases the evaluated queryset should be stored and re-used, in order to avoid unnecessary database lookups:

entry = Entry.objects.get(id=1)
entry.blog  # Blog object is retrieved from the database
entry.blog  # cached version, no DB lookup

entry.authors.all()    # query performed
entry.authors.all()    # query performed again

Since caching objects can involve significant memory usage, if a queryset will not need to be re-used then there is no need for it to be cached. As well as the caching of querysets, attributes of ORM objects are also cached: in general, non-callable attributes of ORM objects are cached, whereas callable attributes cause a DB lookup every time.

Retrieve everything you need in one hit

But not the things you don’t need. Using QuerySet.values() can significantly reduce the overhead of a database lookup and is useful when you just need a dictionary or a list of the values rather than the ORM model objects. QuerySet.select_related() is useful for lookups spanning multiple tables:

class Album(models.Model):
    title = models.CharField(max_length=50)
    year = models.IntegerField()

class Song(models.Model):
    name = models.CharField(max_length=50)
    album = models.ForeignKey(Album, on_delete=models.CASCADE)

song = Song.objects.get(id=5)  # query performed
album = song.album             # query performed again

song = Song.objects.select_related('album').get(id=5)
song.album  # database query not required

QuerySet.select_related() works by creating an SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to FK and one-to-one relationships.

QuerySet.prefetch_related() serves a similar purpose, but the strategy is quite different: it does a separate lookup for each relationship and does the “joining” in Python. This allows it to prefetch many-to-many and many-to-one objects, which cannot be done using select_related.
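A sketch using the Album and Song models above (this assumes a configured Django project; song_set is the default reverse accessor Django generates for the Song-to-Album foreign key):

```python
# Two queries in total, however many albums are iterated over
albums = Album.objects.prefetch_related('song_set')
for album in albums:
    names = [song.name for song in album.song_set.all()]  # no extra queries
```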

Plotting live data using Highcharts and a REST API

Highcharts is a JavaScript charting framework, similar to D3.js, plotly.js and Google Charts. It enables the creation of various types of interactive charts which can easily be integrated into a web site.

The King’s College London API provides live air quality data for sites across London. This REST API exposes data from the database in either JSON or XML format. Calling the API returns data in JSON format (as opposed to HTML), allowing the data to be used directly in Python. The following chart was created using this API together with Highcharts and Flask.

Flask is used since Highcharts is written in HTML5/JavaScript and therefore requires a web browser. The code for this web app is contained within this GitHub repository: https://github.com/paulos84/airapp3

Within the charts.py file in the views directory, the get_json function returns a dictionary of air quality monitoring data requested from the London Air API. The function takes values which specify the site and the number of previous days’ data the user is interested in. String formatting is then used to generate the desired endpoint as a string, which is passed to the requests get method.

Before the requests library was released, sending HTTP requests relied upon the verbose and cumbersome urllib2 library. The requests library greatly reduces the lines of code needed and is well suited to making RESTful API calls. The get method requires a URL as an argument and allows you to pass optional parameters such as HTTP request headers (e.g. login credentials). Requests’ built-in JSON decoder, called via response.json(), converts the JSON response into a Python dictionary, which in this case contains many layers of nesting.
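The pattern looks something like the following sketch (the URL here is a placeholder for illustration, not the real London Air endpoint):

```python
import requests

def get_json(site_code, days):
    # Placeholder URL: substitute the actual API endpoint
    url = 'https://example.com/AirQuality/{}/{}/Json'.format(site_code, days)
    response = requests.get(url)
    return response.json()  # decode the JSON body into a Python dict
```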

The get_data function uses list comprehensions to create lists of pollutant values and the hours (for the y and x axes). To avoid any KeyErrors, empty strings are returned instead of None for missing data points. The get_data function passes a dictionary of these lists to the make_chart function, which has a decorator specifying the URL. By providing ‘detail.html’ as a positional argument, Flask’s render_template method passes the key-value pairs required by Highcharts in order to create the desired chart. This HTML template containing the Highcharts JavaScript code is contained within the templates directory.

Dictionaries in Python

Dictionaries are Python’s way of storing key-value pairs, a fundamental data structure in computer science. The data type is summarized in the official documentation as “an unordered set of key: value pairs, with the requirement that the keys are unique”. Dictionaries can be indexed by any immutable data type, and the stored values can be accessed in the following ways:

value = d[key]

value = d.get(key)

value = d.get(key, "no data")

Whereas using [key] will raise a KeyError if the key does not exist, the .get method will either return None or, if specified as an optional second argument, a default value. Values within nested dictionaries, such as deserialized JSON data, can be accessed by the successive use of [key] or .get(key):

sales = {'data': {'orders': {'january': 240}}}

value = sales['data']['orders']['january']

value = sales.get('data', {}).get('orders', {}).get('january')

The following are all valid ways of creating dictionaries:

my_dict = {'key1': 'value1', 'key2': 'value2'}

my_dict = dict(key1='value1',key2='value2')

my_dict = {x: x**2 for x in values}

my_dict = dict(zip(keys, values))

When the keys are simple strings, it can be convenient to pass them as keyword arguments to the dict() constructor. Dict comprehensions are useful for generating keys and values programmatically, and using the zip function inside the dict() constructor is particularly useful for creating dictionaries from separate lists of keys and values.
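For instance, a small sketch using made-up month labels and values:

```python
keys = ['jan', 'feb', 'mar']
values = [240, 198, 310]

monthly = dict(zip(keys, values))
print(monthly)  # {'jan': 240, 'feb': 198, 'mar': 310}

squares = {x: x ** 2 for x in values}  # keys generated programmatically
```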

Dictionaries are unordered, except in Python 3.6+ (where insertion order is preserved, and guaranteed by the language from 3.7). In earlier versions, to preserve the insertion order of keys the dictionary sub-class OrderedDict can be used after importing it from the collections module in the standard library.
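A quick illustration:

```python
from collections import OrderedDict

d = OrderedDict()
d['banana'] = 3
d['apple'] = 1
print(list(d.keys()))  # ['banana', 'apple']: insertion order is kept
```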

Data visualization libraries for Python

Matplotlib and pandas (a library built on top of NumPy) are a powerful combination for processing and plotting data. The default plotting styles of matplotlib are somewhat basic, but with recent versions the aesthetics can be improved using the style sub-package. A list of available styles can be obtained using the style.available attribute:

from matplotlib import pyplot as plt
>>> print(plt.style.available)
['seaborn-deep', 'seaborn-dark', 'fivethirtyeight', 'dark_background', 'seaborn-colorblind', 'seaborn-bright', 'seaborn-notebook', 'seaborn-whitegrid', 'seaborn-dark-palette', 'seaborn-ticks', 'seaborn-pastel', 'seaborn-poster', 'classic', 'seaborn-white', 'grayscale', 'seaborn-paper', 'seaborn-muted', 'seaborn-talk', 'ggplot', 'seaborn-darkgrid', 'bmh']

Then just call style.use() within the code used to generate a plot:

plt.style.use('fivethirtyeight')

Seaborn is a library built on top of matplotlib. It provides various useful plotting functions and the plots it produces tend to be visually attractive. Seaborn is especially useful for exploring statistical data and for use with more complex data sets.

The choice of library should largely depend upon the desired visualization. Matplotlib on its own is very powerful and should be used for simple bar, line, pie, scatter plots etc. More complicated plots will require significantly more lines of code and seaborn will usually be more appropriate in these cases.

Bokeh was created with the aim of providing attractive and interactive plots in the style of the JavaScript D3.js library. Since Bokeh is higher level than D3.js, interactive visualizations can generally be created with much less effort. The documentation is fairly comprehensive, however the library is still under heavy development so may best be avoided if future compatibility is a potential issue.

Avoiding multi-table inheritance in Django Models

Model inheritance does not have a natural translation to relational database architecture, so models in Django should be designed to avoid impacting database performance. When there is no need for the base model to be translated into a table, abstract inheritance should be used instead of multi-table inheritance.

Given the following model:

class Person(Model):
    name = CharField(max_length=100)

class Employee(Person):
    department = CharField(max_length=100)

Two tables will be created, and what looks like a simple query on the Employee child class will actually involve a join being created automatically. The same example with abstract = True in the Meta class uses abstract inheritance:

class Person(Model):
    name = CharField(max_length=100)

    class Meta:
        abstract = True

class Employee(Person):
    department = CharField(max_length=100)

By setting abstract = True, the extra table for the base model is not created and the fields of the base model are created on each child model’s table. This avoids unnecessary joins when accessing those fields. Using model inheritance in this way also avoids repetition of code within the child classes.

PyCharm Auto-import works differently to PHPStorm and IntelliJ

Having become used to developing in JetBrains’ PhpStorm and IntelliJ IDEs, it now seems tedious to break out of the programming flow to manually type out imports every time we introduce a new dependency.

However, in that company’s Python IDE, PyCharm, auto-complete works differently. The

  [ctrl] + [space]

keyboard shortcut still auto-suggests, but doesn’t include non-imported classes. The

  [ctrl] + [alt] + [space]

keyboard shortcut does, displaying all available classes and auto-generating the import statement for you, just like in JetBrains’ other IDEs.

Quickly get memcached working in Python Django

As with most frameworks, Django can make use of caching to greatly improve performance for many common requests. Here we will look at using memcached, as it enjoys good Django support and widespread production use, although Redis is also supported and improves on memcached in some respects, such as data persistence.

  1. Install memcached on your server.

     RedHat Linux:

       yum install memcached

     Ubuntu / Debian Linux:

       apt-get install memcached

  2. Let Django know how to access memcached. In Django’s settings.py file, add the following (the LOCATION shown is memcached’s default address):

       CACHES = {
           'default': {
               'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
               'LOCATION': '127.0.0.1:11211',
           }
       }

  3. Load the cache within your application:

       from django.core.cache import cache

  4. Save a value to the cache:

       cache.set('exampleValue', exampleValue)

  5. Retrieve the value from the cache:

       exampleValue = cache.get('exampleValue')

The beauty being that exampleValue can be anything from a computed / database retrieved value to large blocks of static text or a URL etc.

The only problem with caches is that they don’t always contain the data you expect: what if the value was flushed, or hasn’t yet been stored? Let’s rewrite step 5 to handle the case where the value is not available in the cache:

exampleValue = cache.get('exampleValue')
if not exampleValue:
    exampleValue = exampleValueLookup()
    cache.set('exampleValue', exampleValue)

Here we see the value exampleValue being retrieved, with a backup regeneration if the value has not been set. In a real application this would usually be encapsulated in a getExampleValue function or somewhere appropriate.
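Django (1.9+) also ships a helper that wraps this lookup-and-populate pattern; a sketch, where compute_example_value is a hypothetical callable that regenerates the value:

```python
from django.core.cache import cache

# get_or_set returns the cached value, or calls the default and stores it
exampleValue = cache.get_or_set('exampleValue', compute_example_value, timeout=300)
```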