Python itertools with izip and count
A common usage scenario in Python occurs when retrieving records from a database and adding them into an appropriate data structure. One such structure involves a list of dictionaries, with each record in the DB represented by an element of the list, with a key-value pair for each field in the dictionary. Here is an example for a stock price time series:
It is often necessary to compare items in two of these lists. In this instance the stock price could be compared to another asset on each day to see which one is higher. An alternative consideration might be to assess whether a stock is in a portfolio or not on a particular day.
The naive approach to solve this list comparison task is to run the comparison within a nested loop iterating across both lists:
This is a lot of computational effort to compare two equal sized data structures that are already ordered in the correct fashion. Python provides a far speedier alternative with the zip() function. The function accepts two lists and produces a pairwise list of tuples of their elements. If the size of the original lists is N, then the iteration over the zipped data structure reduces the comparison from an O(N^2) operation to O(N).
So far I've just discussed a simple case of list comparison via the zip function. Nothing crazy as of yet. Although, the operation has reduced the computational time significantly. The fun doesn't stop there though. We can bring in the itertools library to provide a further speed increase. Itertools ships with two functions, izip and count, which can speed up this operation even further. The following code provides the same result as the above, but in less time and with less memory usage:
I've run tests on actual financial data and have seen each of the above three scenarios reduce respectively from a 30 second process to 1s and then finally 0.7s. So, next you time you feel yourself reaching for the zip function, consider using izip instead.
If anybody believes they can speed this operation up further, either by an alternative data structure approach or with further refinement of the iteration, I'm keen to hear suggestions.
