In the first part of this series, we saw how to build a Julia fractal Set using Python, matplotlib and numpy. In today’s post, we will take the code we built and improve its performance by parallelizing the computations using Python’s multiprocessing module.

## Data Parallel Problems

My friend has given a good explanation of Concurrency vs Parallelism in a previous post in this blog. So I will not dwelve on this aspect a lot. Instead let us take a quick look at what a data parallel problem is.

A data parallel problem is one where the computation can be parallelized along an axis or dimension in the data structure so that the calculations can be computed parallely in n nodes or n CPUs and the results combined together to form the final output. This will usually offer a speed-up by F(n) where F is an arbitrary function.

Many data structures such as arrays and matrices lend themselves to such computations.

Let us take a quick look at a well known data parallel problem - Matrix Multiplication.

### Matrix Multiplication

As you know, we can multiply a matrix M1 of order $P x Q$ with another matrix M2 of order say $Q x R$. In other words, the number of rows of M2 should be equal to the number of columns of M1. The result would be a matrix M3 of order $P x R$

The following example illustrates this. We multiply a $3 x 3$ matrix with a $3 x 2$ matrix to get the result, another $3 x 2$ matrix.

$$\left[ \begin{array}{c} p & q & r\newline s & t & u\newline v & w & x\newline \end{array} \right] \times \left[ \begin{array}{cc} a & d\newline b & e\newline c & f\newline \end{array} \right] == \begin{bmatrix} pa + qb + rc & pd + qe + rf\newline sa + tb + uc & sd + te + uf\newline va + wb + xc & vd + we + xf\end{bmatrix}$$

The multiplication can be expressed by the pseudo-code (courtesy: Wikipedia)

M3 = make_matrix(M1.num_rows, M2.num_columns)

for i in range(M1.num_rows):
for k in range(M2.num_columns):
sum = 0
for j in range(M1.num_columns):
sum += M1[i][j]*M2[j][k]
M3[i][k]=sum;


In this case, since each row of the resultant matrix can be computed independently of any other row, the computation is data parallel. We can write code so that each row’s computation is done in a separate compute node or CPU offering us a parallelism and speedup as a funciton of P - the number of rows in matrix M1.

We will use a similar technique to parallelize and speedup the computation of the fractal Set.

## Code Performance

Let us copy the code from last time with a slight change and redo the performance tests.

def julia_set(width, height, zoom=1, niter=256):
""" A julia set of geometry (width x height) and iterations 'niter' """

w,h = width, height
pixels = np.arange(w*h, dtype=np.uint16).reshape(h, w)

# Pick some defaults for the real and imaginary constants
# This determines the shape of the Julia set.
c_real, c_imag = -0.7, 0.27

for x in range(w):
for y in range(h):
# calculate the initial real and imaginary part of z,
# based on the pixel location and zoom and position values
zx = 1.5*(x - w/2)/(0.5*zoom*w)
zy = 1.0*(y - h/2)/(0.5*zoom*h)

for i in range(niter):
# Iterate till the point is outside
# the circle with radius 2.
# Calculate new positions
zy,zx = 2.0*zx*zy + c_imag, zx*zx - zy*zy + c_real

color = (i >> 21) + (i >> 10)  + i * 8
pixels[y,x] = color

return pixels

def display(width=1024, height=768):
""" Display a julia set of width width and height height """

pixels = julia_set(width, height)
# to display the created fractal
plt.imshow(pixels)
plt.show()


As you may have noticed, the change we made was to move out the display code from the julia_set function to a separate display function. This allows us to do performance tests easier without needing to comment out code. It also allows us to easily plug-in the new parallelized julia set function later.

Here are the performance results for 10 iterations for an image of size 2048x1536, double the default size.

>>> import julia_set
>>> import timeit
>>> from functools import partial
>>> julia_large = partial(julia_set.julia_set, 2048, 1536)
>>> timeit.timeit(julia_large, number=10)/10
28.229172645998187


## CPU Usage

I tracked the CPU usage of my laptop when the above test was running by using the GNOME system monitor. You can install this program for a Debian based system using,

$sudo apt-get install gnome-system-monitor  Launch the program and focus on the second tab named Resources. As you can see the CPUs are mostly idle before running the test. The CPU usage shows a spike and goes up to 100% just after the test is started. However you may have noted that all computations are done in just one of the CPU cores (CPU5 in this case). All other CPUs are idling! This is further confirmed by a capture approximately half-way into the test. The test results show that we are just using a fraction of the available CPU power of the system (1/8 in this case). This is not an ideal scenario as we can speedup the computation and achieve faster results by making use of all the CPUs (cores). Let us see how to do this! ## Multiprocessing Python provides the multiprocessing module which allows one to write programs tailored towards maximizing CPU usage of the system and circumventing the Global Intepreter Lock (GIL) by making use of subprocesses to fan out computation to multiple CPU cores. The most useful primitive provided by this module is the Pool object which maps a given function to any given number of CPU cores. Here is a simple example. def square_roots(numbers): return [i**0.5 for i in numbers]  This is a function taking a list of integers and returning their square roots. Here is howto map it to multiple CPUs on your machine by using the Pool object. import multiprocessing as mp def square_root(number): return number ** 0.5 def square_roots_mp(numbers): """ Calculate square roots of a list of numbers using multiprocessing """ p = mp.Pool(mp.cpu_count()) return p.map(square_root, numbers)  The method cpu_count returns the number of CPUs (cores) in your machine. The Pool object will start so many number of processes to process its input. ## Parallel fractal Set using multiprocessing Let us see how to rewrite our simple 1-CPU fractal program to scale parallely to multiple CPU cores. If you analyze the program, you can see it has two for loops, an outer one iterating over the width and an inner one iterating over the height. for x in range(w): for y in range(h): ...  This is very similar to the matrix multiplication where we iterate over rows in an outer loop and over columns in an inner loop. Here we will parallelize the row calculations using a separate function using multiprocessing. In other words, each pixel row of the final image will be calculated by a Pool of processes running parallely on all CPU cores of your machine. The main workhorse of this rewritten program is a function that will calculate a given row of the fractal Set. def julia_set_calc_row(y, w, h, c_real, c_imag, niter=256, zoom=1,): """ Calculate one row of the julia set with size wxh """ row_pixels = np.arange(w) for x in range(w): # calculate the initial real and imaginary part of z, # based on the pixel location and zoom and position values zx = 1.5*(x - w/2)/(0.5*zoom*w) zy = 1.0*(y - h/2)/(0.5*zoom*h) for i in range(niter): radius_sqr = zx*zx + zy*zy # Iterate till the point is outside # the circle with radius 2. if radius_sqr > 4: break # Calculate new positions zy,zx = 2.0*zx*zy + c_imag, zx*zx - zy*zy + c_real color = (i >> 21) + (i >> 10) + i * 8 row_pixels[x] = color return y,row_pixels  Let us analyze this function, step by step. 1. It accepts an extra argument y over and above what the previous julia_set function which represents the row position (y) that it is currently calculating. If the image is compared to a matrix you can visualize that the height or y co-ordinate as similar to number of rows of the matrix and the width or x co-ordinate as equal to the number of columns. 2. It goes in a loop, iterating over the width using the x co-ordinate and performs the calculations for the fractal Set for that row. For each x value it calculates the pixel color value and assigns it to the index (equal to x) on a row pixel array. 3. It returns a tuple of (y, row_pixels). The y value will be used in the outer for loop which will build the final pixel matrix. How does this change the original function ? Here is the rewritten julia_set function which uses the parallel julia_set_calc_row function. To distinguish it from the original function we are renaming it as julia_set_mp (mp here indicates multiprocessing). def julia_set_mp(width, height, cx=-0.7, cy=0.27, zoom=1, niter=256): """ A julia set of geometry (width x height) and iterations 'niter' """ w,h = width, height pixels = np.arange(w*h, dtype=np.uint32).reshape(h, w) # Pick some defaults for the real and imaginary constants # This determines the shape of the Julia set. c_real, c_imag = cx, cy # Create a Pool of processes pool = mp.Pool() # Create a partial where we vary 'y' julia_partial = functools.partial(julia_set_calc_row, w=w,h=h, c_real=c_real, c_imag=c_imag, niter=niter,zoom=zoom) for y,row_pixel in pool.map(julia_partial, range(h)): pixels[y] = row_pixel return pixels  You can see that the original function has shrunk a lot! This is because most of its core has been moved out to the new parallelized julia_set_calc_row function. The changes here should be pretty straightforward to understand. Here is a quick overview. 1. We create a partial function over julia_set_calc_row where we keep everything constant but vary the y value. 2. We create a Pool object and map the partial function on it, passing it the range of y values. Since the y values range over the height of the image, this is equal to range(h). 3. We receive the result of every iteration as the (y, row_pixel) tuple. We assign this to the pixels matrix, a 2-D numpy array, at the yth index. 4. At the end of the iteration our pixels matrix has all the values computed to draw the fractal Set. ### Performance Improvement Let us test drive our new function to see if it has improved the performance! >>> import julia_set >>> import timeit >>> from functools import partial >>> julia_large = partial(julia_set.julia_set_mp, 2048, 1536) >>> timeit.timeit(julia_large, number=10)/10 24.51141026301775  We got around 4 seconds improvement over the previous version which took around 28 seconds. That is a performance improvement of about 14%. But is that all to it ? Well not exactly. Here are some screenshots of the CPU usage while the test was running. Here is the CPU usage within a few seconds after the start of the test. The main difference, as you may observe is that all the 8 CPU cores are occupied! The multiprocessing code is doing its job effectively spanning the execution across all CPU cores. Zooming into one of the resource utilization graphs shows that the what looks like one big red line is actually 8 separate lines overlapping together. But can we do better than this ? Let us increase the number of iterations by 4X and test again. Here are the test results for the original function with niter=2048. >>> julia_large = partial(julia_set.julia_set, 2048, 1536, niter=2048) >>> timeit.timeit(julia_large, number=10)/10 44.4610462485929  Here are the results for the parallel version. >>> julia_large_mp = partial(julia_set.julia_set_mp, 2048, 1536, niter=2048) >>> timeit.timeit(julia_large_mp, number=10)/10 35.00739418609301  Now there is a performance improvement in wall clock time of about 20%. You will find that the more the number of iterations, the better the parallel version performs when compared to the regular version. ### Wall Clock vs CPU time The timeit module measures the actual time elapsed in real or wall-clock time for the program execution. In this case the performance story is a bit more than that. It is also about utilization of the CPUs. So can we measure this CPU clock time ? Yes, we can. We can use the time program on the linux terminal to do this. This program can time any command supplied to it and it returns 3 separate timing data namely the wall-clock time (real time), user-space (user CPU time) and kernel-space (sys CPU time) respectively. For example, $ time ls > /dev/null

real    0m0.005s
user    0m0.005s
sys     0m0.000s


Let is run this on the julia_set as a script where it is configured to run the single process version. The display method is defined as follows for this test.

def display(width=2048, height=1536):
""" Display a julia set of width width and height height """

pixels = julia_set(width, height)
# to display the created fractal
# plt.imshow(pixels)
# plt.show()


Let us time this on the console.

$time python julia_set.py > /dev/null real 0m27.199s user 0m27.437s sys 0m0.700s  Now let us modify display function so it calls the multiprocessing version. def display(width=2048, height=1536): """ Display a julia set of width width and height height """ pixels = julia_set_mp(width, height) # to display the created fractal # plt.imshow(pixels) # plt.show()  And time it as well. $ time python julia_set.py > /dev/null

real    0m25.329s
user    3m13.846s
sys     0m0.807s


The real-time is a bit less than earlier. However the main difference is between the user CPU time in both cases. In the first case both the real (wall-clock time) and the user (CPU time) are almost the same at approx 27s.

In the multiprocessing version, the CPU (user) time is ~= 3m14s = 194s and the real time is 25s. If you perform the ratio of the first to second and round it to int, you will find that,

>>> n = round(194.0/25.0)
>>> n
8


That is equal to the number of CPU cores. In other words the code has spent approximately 8 times more in the CPU than on the wall clock as it occupied all the 8 CPUs parallely. You can repeat this test multiple times and you will always find that this rounded ratio would equal the number of CPU cores on your system.

## Code

All the code developed today is available at Anvetsu:Visualization Repository

## Summary

We took a CPU bound program - one that generates a Julia fractal Set - and wrote a parallel version of it which was able to take advantage of multiple CPUs (cores) by using data parallel compute techniques. We measured the performance of the code and found it to function better in real-time and CPU time when compared to the original version. The code was able to occupy and make use of all CPUs whereas the original code was able to run on only a single CPU at a given time.

Hopefully this was educative. In the next post of this series, we will build an interactive fractal explorer using matplotlib and the parallel version of the program so that you can play around with the various parameters of the fractal Set and see it getting updated in real-time.

Note that name and e-mail are required for posting comments