by Anand | tags: [ fractal python julia visualization matplotlib numpy multiprocessing concurrency parallelism ]
Fractals with Python - Part II - Concurrency
In the first part of this series, we saw how to build a Julia set fractal using Python, matplotlib and numpy. In today's post, we will take the code we built and improve its performance by parallelizing the computations using Python's multiprocessing module.
Data Parallel Problems
My friend has given a good explanation of Concurrency vs Parallelism in a previous post on this blog, so I will not dwell on that aspect much. Instead, let us take a quick look at what a data parallel problem is.
A data parallel problem is one where the computation can be parallelized along an axis or dimension of the data structure, so that the calculations can be performed in parallel on n nodes or n CPUs and the results combined to form the final output. This usually offers a speed-up of F(n), where F is some function of n - linear in the ideal case.
Many data structures such as arrays and matrices lend themselves to such computations.
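As a toy illustration of the pattern (the names `chunked` and `parallel_sum` are mine, and the chunks here are processed sequentially for simplicity - each one could equally be handed to a separate process): split the data along one axis, compute each piece independently, and combine the partial results.

```python
# Split-apply-combine: the essence of a data parallel computation.
# Each chunk's sum is independent of the others, so the chunks could
# be computed on separate CPUs and the partial results combined.

def chunked(data, n):
    """Split `data` into `n` roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum(data, n=4):
    # map step: each partial sum is independent of the others
    partials = [sum(chunk) for chunk in chunked(data, n)]
    # reduce step: combine the partial results
    return sum(partials)

print(parallel_sum(list(range(100))))  # 4950, same as sum(range(100))
```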
Let us take a quick look at a well known data parallel problem - Matrix Multiplication.
Matrix Multiplication
As you know, we can multiply a matrix M1 of order $P \times Q$ with another matrix M2 of order $Q \times R$. In other words, the number of rows of M2 should be equal to the number of columns of M1. The result is a matrix M3 of order $P \times R$.
The following example illustrates this. We multiply a $3 \times 3$ matrix with a $3 \times 2$ matrix to get the result, another $3 \times 2$ matrix.
$$\left[ \begin{array}{ccc} p & q & r\newline s & t & u\newline v & w & x \end{array} \right] \times \left[ \begin{array}{cc} a & d\newline b & e\newline c & f \end{array} \right] = \begin{bmatrix} pa + qb + rc & pd + qe + rf\newline sa + tb + uc & sd + te + uf\newline va + wb + xc & vd + we + xf\end{bmatrix}$$
The multiplication can be expressed by the following pseudo-code (courtesy: Wikipedia).

```python
M3 = make_matrix(M1.num_rows, M2.num_columns)

for i in range(M1.num_rows):
    for k in range(M2.num_columns):
        sum = 0
        for j in range(M1.num_columns):
            sum += M1[i][j] * M2[j][k]
        M3[i][k] = sum
```
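To see this concretely, here is the triple loop as runnable Python (the function name `matmul` and the example matrices are my own illustration), checked against numpy's built-in matrix product:

```python
import numpy as np

def matmul(M1, M2):
    """Naive matrix multiplication following the pseudo-code above."""
    rows, cols, inner = len(M1), len(M2[0]), len(M2)
    M3 = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(cols):
            total = 0
            for j in range(inner):
                total += M1[i][j] * M2[j][k]
            M3[i][k] = total
    return M3

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 3 x 3
B = [[1, 2], [3, 4], [5, 6]]            # 3 x 2
C = matmul(A, B)                        # 3 x 2

# Sanity check against numpy's matrix product
assert np.array_equal(np.array(A) @ np.array(B), np.array(C))
print(C)  # [[22, 28], [49, 64], [76, 100]]
```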
In this case, since each row of the resultant matrix can be computed independently of any other row, the computation is data parallel. We can write code so that each row's computation is done on a separate compute node or CPU, offering us parallelism and a speedup as a function of P, the number of rows of matrix M1.
We will use a similar technique to parallelize and speed up the computation of the Julia set.
Code Performance
Let us copy the code from last time with a slight change and redo the performance tests.
```python
import numpy as np
import matplotlib.pyplot as plt

def julia_set(width, height, zoom=1, niter=256):
    """ A julia set of geometry (width x height) and iterations 'niter' """

    w, h = width, height
    pixels = np.arange(w*h, dtype=np.uint16).reshape(h, w)

    # Pick some defaults for the real and imaginary constants
    # This determines the shape of the Julia set.
    c_real, c_imag = -0.7, 0.27

    for x in range(w):
        for y in range(h):
            # calculate the initial real and imaginary part of z,
            # based on the pixel location and zoom and position values
            zx = 1.5*(x - w/2)/(0.5*zoom*w)
            zy = 1.0*(y - h/2)/(0.5*zoom*h)

            for i in range(niter):
                radius_sqr = zx*zx + zy*zy
                # Iterate till the point is outside
                # the circle with radius 2.
                if radius_sqr > 4:
                    break
                # Calculate new positions
                zy, zx = 2.0*zx*zy + c_imag, zx*zx - zy*zy + c_real

            color = (i >> 21) + (i >> 10) + i*8
            pixels[y, x] = color

    return pixels

def display(width=1024, height=768):
    """ Display a julia set of width `width` and height `height` """

    pixels = julia_set(width, height)
    # to display the created fractal
    plt.imshow(pixels)
    plt.show()
```
As you may have noticed, the change we made was to move the display code out of the julia_set function into a separate display function. This makes performance tests easier to run without needing to comment out code. It also allows us to easily plug in the new parallelized julia set function later.
Here are the performance results, averaged over 10 runs, for an image of size 2048x1536, double the default size.

```python
>>> import julia_set
>>> import timeit
>>> from functools import partial
>>> julia_large = partial(julia_set.julia_set, 2048, 1536)
>>> timeit.timeit(julia_large, number=10)/10
28.229172645998187
```
CPU Usage
I tracked the CPU usage of my laptop while the above test was running using the GNOME system monitor. You can install this program on a Debian-based system using,

```shell
$ sudo apt-get install gnome-system-monitor
```
Launch the program and focus on the second tab, named Resources.
As you can see the CPUs are mostly idle before running the test.
The CPU usage shows a spike and goes up to 100% just after the test is started.
However, you may have noticed that all computations are done on just one of the CPU cores (CPU5 in this case). All the other CPUs are idling!
This is further confirmed by a capture approximately half-way into the test.
The test results show that we are using just a fraction of the available CPU power of the system (1/8 in this case). This is not ideal, as we can speed up the computation and achieve faster results by making use of all the CPUs (cores).
Let us see how to do this!
Multiprocessing
Python provides the multiprocessing module, which allows one to write programs that maximize CPU usage and circumvent the Global Interpreter Lock (GIL) by fanning out computation to multiple CPU cores using subprocesses.
The most useful primitive provided by this module is the Pool object, which maps a given function across any given number of CPU cores. Here is a simple example.

```python
def square_roots(numbers):
    return [i**0.5 for i in numbers]
```
This is a function taking a list of numbers and returning their square roots. Here is how to map it onto multiple CPUs on your machine using the Pool object.
```python
import multiprocessing as mp

def square_root(number):
    return number ** 0.5

def square_roots_mp(numbers):
    """ Calculate square roots of a list of numbers
    using multiprocessing """

    p = mp.Pool(mp.cpu_count())
    return p.map(square_root, numbers)
```
The function cpu_count returns the number of CPUs (cores) on your machine. The Pool object will start that many processes to work through its input.
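One practical detail worth noting (my addition, not part of the original snippet): on platforms that spawn rather than fork worker processes, the Pool must be created under an `if __name__ == '__main__'` guard, and for many small work items the optional `chunksize` argument of `Pool.map` reduces inter-process overhead. A sketch:

```python
import multiprocessing as mp

def square_root(number):
    return number ** 0.5

def square_roots_mp(numbers):
    # The 'with' block ensures worker processes are cleaned up;
    # chunksize batches work items to cut down on IPC overhead.
    with mp.Pool(mp.cpu_count()) as pool:
        return pool.map(square_root, numbers, chunksize=64)

if __name__ == '__main__':
    # The guard matters on platforms (e.g. Windows, macOS) where
    # child processes re-import this module on startup.
    print(square_roots_mp([1, 4, 9, 16]))  # [1.0, 2.0, 3.0, 4.0]
```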
Parallel Julia set using multiprocessing
Let us see how to rewrite our simple 1-CPU fractal program to scale in parallel across multiple CPU cores.
If you analyze the program, you can see it has two for loops: an outer one iterating over the width and an inner one iterating over the height.

```python
for x in range(w):
    for y in range(h):
        ...
```
This is very similar to the matrix multiplication, where we iterate over rows in an outer loop and over columns in an inner loop. Here we will parallelize the row calculations in a separate function using multiprocessing. In other words, each pixel row of the final image will be calculated by a Pool of processes running in parallel on all CPU cores of your machine.
The main workhorse of this rewritten program is a function that calculates a given row of the Julia set.
```python
def julia_set_calc_row(y, w, h, c_real, c_imag, niter=256, zoom=1):
    """ Calculate one row of the julia set with size (w x h) """

    row_pixels = np.arange(w)

    for x in range(w):
        # calculate the initial real and imaginary part of z,
        # based on the pixel location and zoom and position values
        zx = 1.5*(x - w/2)/(0.5*zoom*w)
        zy = 1.0*(y - h/2)/(0.5*zoom*h)

        for i in range(niter):
            radius_sqr = zx*zx + zy*zy
            # Iterate till the point is outside
            # the circle with radius 2.
            if radius_sqr > 4:
                break
            # Calculate new positions
            zy, zx = 2.0*zx*zy + c_imag, zx*zx - zy*zy + c_real

        color = (i >> 21) + (i >> 10) + i*8
        row_pixels[x] = color

    return y, row_pixels
```
Let us analyze this function, step by step.
- It accepts an extra argument `y` over and above what the previous `julia_set` function took, representing the row position that it is currently calculating. If the image is compared to a matrix, you can visualize the height or y co-ordinate as the number of rows of the matrix, and the width or x co-ordinate as the number of columns.
- It goes in a loop, iterating over the width using the x co-ordinate, and performs the Julia set calculations for that row. For each `x` value it calculates the pixel color value and assigns it to the index (equal to `x`) in a row pixel array.
- It returns a tuple of `(y, row_pixels)`. The `y` value will be used in the outer loop which builds the final pixel matrix.
How does this change the original function? Here is the rewritten `julia_set` function which uses the parallel `julia_set_calc_row` function. To distinguish it from the original, we rename it `julia_set_mp` (`mp` here indicates multiprocessing).
```python
import functools
import multiprocessing as mp

import numpy as np

def julia_set_mp(width, height, cx=-0.7, cy=0.27, zoom=1, niter=256):
    """ A julia set of geometry (width x height) and iterations 'niter' """

    w, h = width, height
    pixels = np.arange(w*h, dtype=np.uint32).reshape(h, w)

    # Pick some defaults for the real and imaginary constants
    # This determines the shape of the Julia set.
    c_real, c_imag = cx, cy

    # Create a Pool of processes
    pool = mp.Pool()
    # Create a partial where we vary 'y'
    julia_partial = functools.partial(julia_set_calc_row,
                                      w=w, h=h,
                                      c_real=c_real, c_imag=c_imag,
                                      niter=niter, zoom=zoom)

    for y, row_pixel in pool.map(julia_partial, range(h)):
        pixels[y] = row_pixel

    return pixels
```
You can see that the original function has shrunk a lot! This is because most of its core has been moved out to the new parallelized `julia_set_calc_row` function.
The changes should be fairly straightforward to follow. Here is a quick overview.
- We create a partial function over `julia_set_calc_row` where we keep everything constant but vary the `y` value.
- We create a `Pool` object and map the partial function over it, passing in the range of y values. Since the y values range over the height of the image, this is `range(h)`.
- We receive the result of each iteration as a `(y, row_pixel)` tuple. We assign it to the `pixels` matrix, a 2-D numpy array, at the `y`-th index.
- At the end of the iteration our `pixels` matrix has all the values computed to draw the Julia set.
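As a side note, since each row's result carries its own `y` index, the order in which rows complete does not matter, so `Pool.imap_unordered` could hand rows back as soon as they finish. Here is a minimal sketch of that pattern with a toy row function (`calc_row` and `build_image` are illustrative stand-ins, not the fractal code):

```python
import functools
import multiprocessing as mp

import numpy as np

def calc_row(y, w):
    # Toy stand-in for julia_set_calc_row: each row is independent
    # and tagged with its own index.
    return y, np.full(w, y * y)

def build_image(h, w):
    pixels = np.zeros((h, w), dtype=np.int64)
    row_func = functools.partial(calc_row, w=w)
    with mp.Pool() as pool:
        # Rows may finish in any order; the returned y index
        # puts each one back in its proper place.
        for y, row in pool.imap_unordered(row_func, range(h)):
            pixels[y] = row
    return pixels

if __name__ == '__main__':
    print(build_image(4, 8))
```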
Performance Improvement
Let us test drive our new function to see if it has improved the performance!
```python
>>> import julia_set
>>> import timeit
>>> from functools import partial
>>> julia_large = partial(julia_set.julia_set_mp, 2048, 1536)
>>> timeit.timeit(julia_large, number=10)/10
24.51141026301775
```
We got about a 4-second improvement over the previous version, which took around 28 seconds. That is a performance improvement of about 14%. But is that all there is to it?
Well, not exactly. Here are some screenshots of the CPU usage while the test was running.
Here is the CPU usage within a few seconds of the start of the test.
The main difference, as you may observe, is that all 8 CPU cores are occupied! The multiprocessing code is doing its job, effectively spreading the execution across all CPU cores.
Zooming into one of the resource utilization graphs shows that what looks like one big red line is actually 8 separate lines overlapping.
But can we do better than this? Let us increase the number of iterations to 2048 (8X the default of 256) and test again.
Here are the test results for the original function with `niter=2048`.

```python
>>> julia_large = partial(julia_set.julia_set, 2048, 1536, niter=2048)
>>> timeit.timeit(julia_large, number=10)/10
44.4610462485929
```

Here are the results for the parallel version.

```python
>>> julia_large_mp = partial(julia_set.julia_set_mp, 2048, 1536, niter=2048)
>>> timeit.timeit(julia_large_mp, number=10)/10
35.00739418609301
```
Now there is a performance improvement in wall-clock time of about 20%. You will find that the higher the number of iterations, the better the parallel version performs compared to the regular version.
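The improvement can also be expressed as a speedup ratio and a parallel efficiency (speedup divided by core count). Plugging in the timings measured above:

```python
# Speedup and parallel efficiency from the measured wall-clock times.
t_serial = 44.4610462485929     # julia_set, niter=2048
t_parallel = 35.00739418609301  # julia_set_mp, niter=2048
n_cores = 8

speedup = t_serial / t_parallel
efficiency = speedup / n_cores

print(f"speedup    = {speedup:.2f}x")    # ~1.27x
print(f"efficiency = {efficiency:.3f}")  # ~0.159
```

The modest efficiency suggests that per-row overheads (process startup, pickling rows back to the parent) eat into the gains at this problem size, which matches the observation that more iterations per pixel favour the parallel version.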
Wall Clock vs CPU Time
The timeit module measures the actual elapsed real or wall-clock time of the program execution. In this case, however, the performance story is about more than that; it is also about utilization of the CPUs. So can we measure this CPU time?
Yes, we can.
We can use the time program in the Linux terminal to do this. It can time any command supplied to it, and it reports 3 separate timings: the wall-clock time (real), the user-space CPU time (user) and the kernel-space CPU time (sys).
For example,

```shell
$ time ls > /dev/null

real    0m0.005s
user    0m0.005s
sys     0m0.000s
```
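A similar measurement can also be made from within Python itself (this is an aside, not part of the original workflow): `time.perf_counter` gives wall-clock time, while `os.times()` reports CPU time including that of terminated child processes, which matters for multiprocessing since `time.process_time` only counts the current process. A sketch:

```python
import os
import time

start_wall = time.perf_counter()
start_cpu = os.times()

# Some CPU-bound busy work standing in for the fractal computation.
total = sum(i * i for i in range(2_000_000))

end_wall = time.perf_counter()
end_cpu = os.times()

wall = end_wall - start_wall
# user + system time of this process, plus that of reaped children
cpu = ((end_cpu.user - start_cpu.user)
       + (end_cpu.system - start_cpu.system)
       + (end_cpu.children_user - start_cpu.children_user)
       + (end_cpu.children_system - start_cpu.children_system))

print(f"wall-clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```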
Let us run this on `julia_set` as a script, configured to run the single-process version. The display method is defined as follows for this test.
```python
def display(width=2048, height=1536):
    """ Display a julia set of width `width` and height `height` """

    pixels = julia_set(width, height)
    # to display the created fractal
    # plt.imshow(pixels)
    # plt.show()
```
Let us time this on the console.

```shell
$ time python julia_set.py > /dev/null

real    0m27.199s
user    0m27.437s
sys     0m0.700s
```
Now let us modify the display function so it calls the multiprocessing version.
```python
def display(width=2048, height=1536):
    """ Display a julia set of width `width` and height `height` """

    pixels = julia_set_mp(width, height)
    # to display the created fractal
    # plt.imshow(pixels)
    # plt.show()
```
And time it as well.

```shell
$ time python julia_set.py > /dev/null

real    0m25.329s
user    3m13.846s
sys     0m0.807s
```
The real time is a bit less than earlier. However, the main difference between the two cases is in the user CPU time.
In the first case, both the real (wall-clock) time and the user (CPU) time are almost the same, at approximately 27s.
In the multiprocessing version, the CPU (user) time is ~3m14s = 194s while the real time is 25s. If you take the ratio of the first to the second and round it to an integer, you will find,

```python
>>> n = round(194.0/25.0)
>>> n
8
```

That is equal to the number of CPU cores. In other words, the code spent approximately 8 times longer on the CPUs than on the wall clock, as it occupied all 8 CPUs in parallel. You can repeat this test multiple times, and you will find that this rounded ratio roughly equals the number of CPU cores on your system.
Code
All the code developed today is available at Anvetsu:Visualization Repository
Summary
We took a CPU-bound program, one that generates a Julia set fractal, and wrote a parallel version of it which takes advantage of multiple CPUs (cores) using data parallel computing techniques. We measured the performance of the code and found that it improved in both real time and CPU utilization compared to the original version. The new code was able to occupy and make use of all CPUs, whereas the original code ran on only a single CPU at a time.
Hopefully this was instructive. In the next post of this series, we will build an interactive fractal explorer using matplotlib and the parallel version of the program, so that you can play around with the various parameters of the Julia set and see it updated in real time.