Get to Know a Feature: Automatic Multithreading

A number of operations in Igor Pro 7 contain code that automatically executes using multiple execution threads.  In this blog post I'll examine the benefits of automatic multithreading and discuss some limits of this technology.  

What is automatic multithreading?

Most modern computers contain two or more central processing units, or cores.  Each core can execute program code in an independent sequence of instructions known as a thread.  Ideally, your program would split any task into multiple threads so that all available cores work simultaneously to complete the task.  Theoretically, such distribution of processing could achieve at most an N-fold performance increase, where N is the number of threads that can run concurrently on your computer.  Note that N is at least as large as the number of cores but can be larger with hyper-threading or similar technology.

It is easy to see that not all tasks lend themselves to multithreading.  Consider for example the simple task of computing the sum of wave data.  A basic code snippet might look like this:

    Variable theSum=0, i
    for(i=0; i<npoints; i+=1)
        theSum += wave1[i]
    endfor

If you have a very long wave you can divide it into N segments and have each segment summed by a separate thread.  When all threads complete their execution you can combine the various segment sums into a grand sum for the whole wave.  However, the same one-step multithreading approach can't be used to sort the wave or even to calculate any statistical moment higher than the mean.
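In Igor Pro itself, the segment-sum idea can be sketched with the MultiThread keyword, which distributes a wave assignment across the available cores.  This is just an illustrative sketch (the function name and segmentation scheme are mine, and it relies on the built-in sum function); it is not how Igor implements its internal multithreading:

Function SegmentedSum(wave1, nSeg)
    Wave wave1
    Variable nSeg
    
    Variable npoints=numpnts(wave1)
    Variable segLen=ceil(npoints/nSeg)
    Make/FREE/N=(nSeg) partials
    // Each point of partials holds the sum of one segment; the per-segment
    // sums are independent, so MultiThread can compute them in parallel.
    MultiThread partials = sum(wave1, pnt2x(wave1,p*segLen), pnt2x(wave1,min((p+1)*segLen-1,npoints-1)))
    return sum(partials)    // combine the segment sums into the grand sum
End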

In practice, the actual performance increase is usually lower than N-fold because it takes time to set up each thread (moving instructions into each core's instruction cache and data into its data cache).

For the Igor Pro operations and functions that support automatic multithreading, we wrote the code so that Igor performs the calculations using a single thread or multiple threads depending on the size of the data.

Examples

Here are three examples of automatic multithreading, presented in order of increasing complexity.

The first example uses WaveStats, whose default execution involves at least two automatic multithreading steps.  Figure 1 shows the ratio of the execution time of single-threaded WaveStats to the execution time with automatic multithreading on a 4-core Macintosh.  As you can see, when the input wave has only 1000 points the multithreaded execution actually takes longer than the single-threaded execution.  This is because the overhead of setting up the various threads is large compared to the time it takes to calculate the actual statistics.  As I increase the size of the input wave, the multithreaded execution shows improved performance, up to a factor of approximately 4.  Further increases in the size of the input wave (1e6 to 1e7 points) do not improve the ratio.

Figure 1: Ratio of execution times of WaveStats for different lengths of input.

Here is our testing code:

Function timer1(num, size)
    Variable num, size
    
    Make/FREE/N=(size) ddd=enoise(1)
    Make/FREE/N=(num) results1, results2
    Variable i
    
    for(i=0; i<num; i+=1)
        MultiThreadingControl setMode=8     // force multithreading on
        results1[i]=doFunction1(ddd)
        MultiThreadingControl setMode=0     // force multithreading off
        results2[i]=doFunction1(ddd)
    endfor
    MatrixOP/O ratio=results2/results1      // single-threaded time / multithreaded time
End

Function doFunction1(inWave)
    Wave inWave
    
    Variable t1=StartMSTimer
    WaveStats/Q inWave
    return StopMSTimer(t1)
End

Note that I'm using MultiThreadingControl to force multithreading on or off.  You can read more about this operation later in this post.

The second example involves 2D data, where I test the performance of ImageLineProfile.  Here the input is a relatively large image (1e4 x 1e4), the profile is defined along the diagonal, and I tested the effect of the profile width on computation time.  The results are shown in Figure 2 below.  In this case, even at width=0 there is approximately a factor of 2 speed improvement in the multithreaded calculation.  The improvement ratio saturates near width ≅ 10, but the best ratio is only 3.  This lackluster performance may be explained by the amount of memory each thread accesses compared to the size of the available cache.
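A minimal version of this test might look like the following sketch (the function and wave names are mine; the image is filled with noise just to have data to profile):

Function ProfileTiming(wid)
    Variable wid    // profile width to test
    
    Make/O/S/N=(10000,10000) bigImage=enoise(1)     // 1e4 x 1e4 test image
    Make/O xTrace={0,9999}, yTrace={0,9999}         // endpoints of the diagonal
    Variable t1=StartMSTimer
    ImageLineProfile xWave=xTrace, yWave=yTrace, srcWave=bigImage, width=wid
    return StopMSTimer(t1)      // elapsed time in microseconds
End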

Figure 2: ImageLineProfile calculated along the diagonal of a 1e4 x 1e4 image using different profile widths.

Automatic multithreading provides a significant performance enhancement for user-defined curve fitting but only a marginal improvement for fitting with built-in functions.  For the third example, I test execution times when fitting data to a user-defined function.  Single-threaded execution is equivalent to multithreaded execution when the data consists of 100 points.  When the data consists of 10,000 points or more, the performance ratio reaches its maximum.
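To benefit from automatic multithreading during a fit, the user-defined fit function must be declared ThreadSafe so that Igor can evaluate the model at many points in parallel.  Here is a minimal sketch (the function name, model, and coefficient values are mine):

// A ThreadSafe user-defined fit function: a Gaussian on a constant baseline.
ThreadSafe Function myGauss(w, x) : FitFunc
    Wave w
    Variable x
    
    // w[0]=baseline, w[1]=amplitude, w[2]=center, w[3]=width
    return w[0] + w[1]*exp(-((x-w[2])/w[3])^2)
End

Function FitDemo()
    Make/O/N=10000 data
    SetScale/I x, -5, 5, data
    data = 1 + 3*exp(-((x-0.5)/1.2)^2) + gnoise(0.05)   // synthetic noisy data
    Make/D/O coefs={0.5, 2, 0, 1}       // initial guesses
    FuncFit/Q myGauss, coefs, data
End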

Figure 3: Ratio of execution times of single-threaded to multithreaded user-defined curve fitting.


Automatic Multithreading Support

The following operations and functions support automatic multithreading:

  • WaveStats
  • ImageThreshold
  • ImageInterpolate
  • Variance
  • ICA
  • ImageEdgeDetection
  • ImageRestore
  • ImageMorphology
  • ImageFilter
  • ImageBlend
  • ImageWindow
  • ImageStats
  • ImageRotate
  • MatrixOP (selected functions)
  • CurveFit
  • FuncFit (partial support)
  • Integrate1d
  • Smooth (binomial)
  • DSPPeriodogram
  • StatsKDE
  • ImageLineProfile

Each of the operations/functions above has two execution options: one using a single thread and another using multiple threads.  By default Igor determines the execution path based on the number of underlying computations.  Each operation has a default threshold for the number of data points at which Igor switches to multithreading.  

You can use the MultiThreadingControl operation to determine the default thresholds for the various operations/functions that support automatic multithreading.  You can also use MultiThreadingControl to provide your own set of thresholds or completely override the threshold values with the setMode keyword.
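As a sketch of this workflow, the setMode keyword shown in the timing code above can bracket a measurement to compare forced-on and forced-off behavior (the function name is mine; consult the MultiThreadingControl documentation for the keywords that query and set per-operation thresholds):

Function CompareModes(size)
    Variable size
    
    Make/FREE/N=(size) ddd=enoise(1)
    MultiThreadingControl setMode=8     // force multithreading on
    Variable t1=StartMSTimer
    WaveStats/Q ddd
    Variable multiTime=StopMSTimer(t1)
    MultiThreadingControl setMode=0     // force multithreading off
    t1=StartMSTimer
    WaveStats/Q ddd
    Variable singleTime=StopMSTimer(t1)
    printf "multithreaded: %g µs, single-threaded: %g µs\r", multiTime, singleTime
End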

Note that by default automatic multithreading is disabled for code executed from user threads.


Overhead and Processor Cycles

According to Intel, a loop could benefit from automatic multithreading if it takes more than 1 million clock cycles to execute.  In practice this statement is not particularly helpful, because different processors require different numbers of clock cycles to execute the same set of instructions.  Also, from a user's point of view it is often difficult to estimate the number of cycles required to execute a generic iteration of a loop.

If you are optimizing the performance of your code, it is best to run tests on the target hardware, similar to the code I showed above, so that you can find the optimal multithreading thresholds for the specific operations/functions that you need.

I hope this blog post helps demystify Igor's automatic multithreading.