100. Sorting and sub-grouping dictionary items with itemgetter and groupby

In data science we often use Pandas to store mixed text and numbers, as it makes sorting and grouping easy. Sometimes, though, you may want to stick to pure Python and use lists of dictionaries. These can be sorted with the help of itemgetter (from the operator module) and sub-grouped with groupby (from itertools).

Sorting lists of dictionary items with itemgetter

# Set up a list of dictionary items and add content

people = []
people.append({'born':1966, 'gender':'male', 'name':'Bob'})
people.append({'born':1966, 'gender':'female', 'name':'Anne'})
people.append({'born':1966, 'gender':'male', 'name':'Adam'})
people.append({'born':1970, 'gender':'male', 'name':'John'})
people.append({'born':1970, 'gender':'female', 'name':'Daisy'})
people.append({'born': 1968, 'gender':'male', 'name':'Steve'}) 

# import methods

from operator import itemgetter

# Sort by 'born' and 'name'
people.sort(key=itemgetter('born','name'))

# Print out sorted list
for item in people:
    print(item)

Output:

{'born': 1966, 'gender': 'male', 'name': 'Adam'}
{'born': 1966, 'gender': 'female', 'name': 'Anne'}
{'born': 1966, 'gender': 'male', 'name': 'Bob'}
{'born': 1968, 'gender': 'male', 'name': 'Steve'}
{'born': 1970, 'gender': 'female', 'name': 'Daisy'}
{'born': 1970, 'gender': 'male', 'name': 'John'}
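As a small variation on the sort above, the sketch below shows what itemgetter returns for a single dictionary, and how the same key can be reused for a descending sort. The shortened people list, the use of sorted() with reverse=True, and the variable names are illustrative choices rather than part of the original example.

```python
from operator import itemgetter

people = [
    {'born': 1966, 'gender': 'male', 'name': 'Bob'},
    {'born': 1970, 'gender': 'female', 'name': 'Daisy'},
    {'born': 1966, 'gender': 'female', 'name': 'Anne'},
]

# itemgetter with two keys returns a tuple; tuples compare item by item,
# so 'born' is compared first and 'name' is used to break ties
sort_key = itemgetter('born', 'name')
first_key = sort_key(people[0])
print(first_key)

# sorted() returns a new list and leaves the original list untouched.
# reverse=True reverses the whole ordering (latest year first, and
# names in reverse alphabetical order within a year)
newest_first = sorted(people, key=sort_key, reverse=True)
names = [person['name'] for person in newest_first]
print(names)
```

Using sorted() rather than .sort() is handy when the original order of the list still matters elsewhere in the program.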

Sub-grouping the sorted list of dictionary items with groupby

Note: We must always sort our list of dictionary items in the required order before sub-grouping them.

# Set up a list of dictionary items and add content
people = []
people.append({'born':1966, 'gender':'male', 'name':'Bob'})
people.append({'born':1966, 'gender':'female', 'name':'Anne'})
people.append({'born':1966, 'gender':'male', 'name':'Adam'})
people.append({'born':1970, 'gender':'male', 'name':'John'})
people.append({'born':1970, 'gender':'female', 'name':'Daisy'})
people.append({'born': 1968, 'gender':'male', 'name':'Steve'})

# import methods

from operator import itemgetter
from itertools import groupby

# First sort by required field
# Groupby only finds groups that are collected consecutively
people.sort(key=itemgetter('born'))

# Now iterate through groups (here we will group by the year born)
for born, items in groupby(people, key=itemgetter('born')):
    print (born)
    for i in items:
        print(' ', i)

Output:

1966
  {'born': 1966, 'gender': 'male', 'name': 'Bob'}
  {'born': 1966, 'gender': 'female', 'name': 'Anne'}
  {'born': 1966, 'gender': 'male', 'name': 'Adam'}
1968
  {'born': 1968, 'gender': 'male', 'name': 'Steve'}
1970
  {'born': 1970, 'gender': 'male', 'name': 'John'}
  {'born': 1970, 'gender': 'female', 'name': 'Daisy'}
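To see why the sort matters, here is a minimal sketch (with a shortened, hypothetical people list) that runs groupby on unsorted data and then collects the sorted groups into a dictionary of names. The dictionary comprehension is just one convenient way of keeping the groups; it is not part of the original example.

```python
from itertools import groupby
from operator import itemgetter

people = [
    {'born': 1966, 'name': 'Bob'},
    {'born': 1970, 'name': 'John'},
    {'born': 1966, 'name': 'Anne'},
]

by_year = itemgetter('born')

# Without sorting, 1966 is reported as two separate groups because
# groupby only merges items that sit next to each other
unsorted_keys = [year for year, _ in groupby(people, key=by_year)]
print(unsorted_keys)

# After sorting, each year appears once. The groups are iterators that
# are consumed as we go, so we copy the names into lists to keep them
names_by_year = {
    year: [person['name'] for person in group]
    for year, group in groupby(sorted(people, key=by_year), key=by_year)
}
print(names_by_year)
```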

99. Parallel processing functions and loops with dask ‘delayed’ method

For a full SciPy conference video on dask see: SciPy 2018

Dask is a Python library that allows parts of a program to run in parallel in separate CPU threads to speed up the program.

Here we will look at using dask to run a normal function in parallel when we need to call the function more than once in one part of a program. We will mimic a slow function by using the Python sleep() function to make the function take one second each time it is run. Normally it would take 3 seconds to run this function 3 times, but here we will see that with dask all three calls to the function will be complete in about one second (assuming you have at least a dual-core, 4-thread CPU).

We will first import our required libraries.

from time import time # to time the program
from time import sleep # to mimic a slow function
from dask import delayed # to allow parallel computation

Next we define a normal function (there is no use of dask at this point). Here we will write a function that returns double the number passed to the function, but we’ll add a 1 second sleep in it to mimic a longer running function.

# Define a function normally
def my_function(x):
    # mimic a slow function with sleep for 1 second
    sleep(1)
    return x*2

Now we will call that function three times.

Normally this would take three seconds as each function must complete before the next one can start. But by using the decorator ‘delayed’ we mark this as a function call that may be run in parallel with others.

Note the syntax amendment. We would normally call this function with my_function(x), but we amend the syntax to delayed(my_function)(x).

We then calculate the sum of the three returned numbers from our function. But when using dask this does not actually give us our answer. If we print the type of this object we see that it is a ‘delayed’ object. To get the actual result we must then use the .compute() method as shown below.

Then we see how long these three 1 second function calls take. If you have a processor with at least 2 cores and 4 threads you should see it takes close to one second rather than three!

# Record time at start of run
start = time()

# Run function in parallel when calling three times
# Syntax of my_function(x) is replaced with delayed(my_function)(x)
a = delayed(my_function)(1)
b = delayed(my_function)(2)
c = delayed(my_function)(3)

# Total will sum results. But at this point we generate a 'delayed' object   
total = a + b + c

# Show object type of total
print ('Object type of total', type(total))

# To get the result we must use 'compute':
final_result = total.compute()

# Calculate time taken and print results
time_taken = time()-start
print ('Process took %0.2f seconds' %time_taken)
print('Final result ',final_result)
Out:
Object type of total <class 'dask.delayed.Delayed'>
Process took 1.01 seconds
Final result  12
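The text above describes delayed as a decorator, while the example wraps each call with delayed(my_function)(x). As a sketch of the decorator form, delayed can also be applied with @ when the function is defined, so that every call is lazy. The function name slow_double and the shorter 0.1 second sleep are illustrative assumptions.

```python
from time import sleep
from dask import delayed

# Decorating the definition makes every call return a Delayed object
@delayed
def slow_double(x):
    sleep(0.1)  # shorter sleep than in the text, just to keep the demo quick
    return x * 2

# Nothing has been run yet; this only builds up the work to be done
total = slow_double(1) + slow_double(2) + slow_double(3)

# compute() runs the three calls and adds up the results
final_result = total.compute()
print(final_result)
```

The trade-off is that a decorated function can no longer be called in the normal, immediate way, which is one reason to prefer the delayed(my_function)(x) form when both behaviours are needed.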

Using dask ‘delayed’ in a loop

We can also use dask delayed to process data in parallel in a loop (so long as an iteration of the loop does not depend on previous results). Here we will call our function 10 times in a loop. Note the use of .compute again to get the actual result. This would take 10 seconds without dask. On a 4-core/8-thread CPU it takes two seconds!

start = time()

# Example loop will add results to a list and calculate total
results = []
for i in range(10):
    # Call normal function with dask delayed decorator
    x = delayed(my_function)(i)
    results.append(x)

total = sum(results)
final_result = total.compute()

# Calculate time taken and print results
time_taken = time()-start
print ('Process took %0.2f seconds' %time_taken)
print('Final result ',final_result)
Out:
Process took 2.01 seconds
Final result  90
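If the individual results are wanted rather than their sum, one option is to pass the list of delayed objects to dask.compute, which returns a tuple of results. The sketch below assumes the same my_function as above, with a shorter sleep and only five calls to keep the run brief.

```python
from time import sleep

import dask
from dask import delayed

def my_function(x):
    sleep(0.1)  # shortened sleep for a quick demonstration
    return x * 2

# Build a list of delayed calls; nothing is executed at this point
results = [delayed(my_function)(i) for i in range(5)]

# dask.compute evaluates all the delayed objects in one go and returns
# their results as a tuple, in the same order as they were passed in
values = dask.compute(*results)
print(values)
```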

 

96. Passing arguments to Python from the command line (or other programs)

Many Python programs have all the variables defined within the Python code (or the Python code reads input files). It may be useful at times, however, to be able to pass one or more arguments to Python when calling the program from the command line or another program. This is simple in Python. Below is an example (saved as ‘mycode.py’) that takes two arguments and multiplies them together:

import sys

x,y = int(sys.argv[1]), int(sys.argv[2])

print (x * y)

Now we can call that from the command line, passing our variables:

python mycode.py 3 4

>12

This type of code may allow us to write generic Python code, and call it as part of an automated sequence from another piece of code which passes the required variables to Python.
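Reading sys.argv directly gives an unhelpful error if an argument is missing or is not a number. As a sketch of a slightly more robust alternative, the standard library argparse module can check and convert the arguments for us. The function name parse_args and the help texts are illustrative, not part of the original code.

```python
import argparse

def parse_args(argv=None):
    """Parse two integers; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(description='Multiply two integers')
    parser.add_argument('x', type=int, help='first integer')
    parser.add_argument('y', type=int, help='second integer')
    return parser.parse_args(argv)

# Here we pass the arguments as a list so the example can be run anywhere.
# In a script called from the command line we would call parse_args()
# with no argument to pick up the values typed by the user.
args = parse_args(['3', '4'])
product = args.x * args.y
print(product)
```

With this approach argparse also generates a usage message, shown when the script is run with -h or with the wrong number of arguments.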

92. Queues

Python queues allow objects of any kind to be added to and removed from queues.

Here are three basic types of queue:

  1. FIFO (First In First Out)
  2. LIFO (Last In First Out)
  3. Priority (each item is assigned a priority in the queue)

Note: Python queues are not iterable; we cannot look through objects in the queue. It may at times be useful to maintain a separate list of queued items.

FIFO (First In First Out) queue

Here we will add numbers to a queue, but any Python object may be added to a queue.

# FIFO (First In First Out) queue

import queue
print ('FIFO (First In First Out) Queue')

q = queue.Queue()

# Add items to queue (add numbers 0 to 4)
for item in range(5):
    q.put(item) 

# Retrieve items from queue using a loop. 
# See additional method below of continually retrieving
# until queue is empty.
    
for item in range(5):
    x = q.get() # get item from queue
    print (x, end = ' ')

OUT:
FIFO (First In First Out) Queue
0 1 2 3 4

LIFO (Last In First Out) queue

# LIFO (Last In First Out) queue

import queue
    
print ('\nLIFO (Last In First Out) Queue')

q = queue.LifoQueue()

for item in range(5):
    q.put(item) # put item in queue
    
# Alternative way of emptying queue

while not q.empty():
    x = q.get() # get item from queue
    print (x, end = ' ')

OUT:
LIFO (Last In First Out) Queue
4 3 2 1 0 
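A plain q.get() waits (blocks) if the queue is empty, which is why the examples above either count the items or check q.empty() first. Another option, sketched below, is to use get_nowait() and catch the queue.Empty exception. The list called retrieved is just there to collect the items for illustration.

```python
import queue

q = queue.Queue()
for item in range(3):
    q.put(item)

# qsize() reports the approximate number of items in the queue
print(q.qsize())

# get_nowait() raises queue.Empty rather than waiting when the queue
# has nothing left, so we can use the exception to end the loop
retrieved = []
while True:
    try:
        retrieved.append(q.get_nowait())
    except queue.Empty:
        break

print(retrieved)
```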

Priority

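In a priority queue an object is passed to the queue in a tuple. The first item of the tuple is the priority (a lower number means higher priority by default), and the second item is the object to be queued. Items of equal priority are not returned in FIFO order: the tuples are compared as a whole, so ties are broken by comparing the queued objects themselves (which raises a TypeError if those objects cannot be compared).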

# Priority queue
import queue
import random
print ('\nPriority Queue')

q = queue.PriorityQueue()

for i in range(5):
    priority = random.randint(1,100)
    name = 'Item ' + str(i)
    item = (priority, name) # A tuple is used to pass priority and item
    q.put(item)

while not q.empty():
    x = q.get()
    print (x, end = ' ')

OUT:

Priority Queue
(15, 'Item 2') (47, 'Item 1') (52, 'Item 0') (54, 'Item 3') (81, 'Item 4')
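If items of equal priority should come out in the order they were added, a common pattern is to put an increasing counter between the priority and the object. The sketch below uses itertools.count for the counter; the task names are made up for illustration.

```python
import itertools
import queue

q = queue.PriorityQueue()
counter = itertools.count()  # gives 0, 1, 2, ... each time next() is called

tasks = [(2, 'write report'), (1, 'zebra task'), (1, 'apple task')]

for priority, name in tasks:
    # The counter is the second item in the tuple, so items with equal
    # priority are ordered by when they were added, not by their name
    q.put((priority, next(counter), name))

order = []
while not q.empty():
    priority, _, name = q.get()
    order.append(name)

print(order)
```

Without the counter, 'apple task' would be returned before 'zebra task' because the names themselves would be compared.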

91. Speed up Python by 1,000 times or more using numba!

If you have installed Python via anaconda (https://www.anaconda.com/) then numba is already available to you. Otherwise numba may be installed using pip (pip install numba).

Functions written in pure Python or NumPy may be sped up by using the numba library and adding the decorator @jit before a function. This is especially useful for loops, where standard Python interprets the code again on every iteration of the loop. With numba the function is compiled into machine code (the language the CPU understands) just once, the first time it is called.

Let’s look at an example:

from numba import jit
import numpy as np
import timeit

# Define a function normally without using numba

def test_without_numba():
    for i in np.arange(1000):
        x = i ** 0.5
        x *= 0.5

# Define a function using numba jit. Using the argument nopython=True gives the
# fastest possible run time, but will error if numba cannot pre-compile all the
# code. Using just @jit will allow the code to mix pre-compiled and normal code
# but will not be as fast as possible

@jit(nopython=True)
def test_with_numba():
    for i in np.arange(1000):
        x = i ** 0.5
        x *= 0.5

# Run functions first time without timing (compilation included in first run)
test_without_numba()
test_with_numba()

# Time functions with timeit (100 repeats).
# Multiply by 1000 to give milliseconds

timed = timeit.timeit(stmt = test_without_numba, number=100) * 1000
print ('Milliseconds without numba: %.3f' %timed)

timed = timeit.timeit(stmt = test_with_numba, number=100) * 1000
print ('Milliseconds with numba: %.3f' %timed)

OUTPUT:
Milliseconds without numba: 183.771
Milliseconds with numba: 0.025

We have a roughly 7,000-fold increase in speed!!

Note: not all code will be sped up by numba. Pandas code, for example, is not helped by numba, and using numba will actually slow Pandas code down a little (because it looks for what can be pre-compiled, which takes time). So always test numba to see which functions it can speed up (and consider breaking larger functions down into smaller ones so that blocks that can use numba may be separated out).

If the default decorator @jit is used, with no other arguments, numba will allow a mix of code that can be pre-compiled with code that can’t. For the fastest execution use @jit(nopython=True), but you may need to break your function down because this mode will error if parts of the function cannot be pre-compiled by numba.

 

84. Function decorators

Decorators (identified by @ in line above function definition) allow code to be run before and after the function (hence ‘decorating’ it). An example of use of a decorator is shown below when a decorator function is used to time two different functions. This removes the need for duplicating code in different functions, and also keeps the functions focussed on their primary objective.