Simple, efficient code profiling technique for C/C++ (part 2 of 2)


Welcome to Part 2 of this article (part 1). In this part, we continue our discussion of code profiling. As already noted, existing profiling tools are either too complicated (both in their mechanisms and their output) or too simple to be truly useful.

In order to solve a specific timing issue, the average developer often resorts to the simplest solution: easy-to-integrate, compile-time source instrumentation. At the expense of learning a handful of routines, the target application is instrumented, compiled, and run, and the timing results are shown in a very simple and rather uninformative way: a MessageBox, a log file, etc.

In this article, the idea is still to build up from this simple form of profiling, but hopefully into a genuinely useful approach. There are three goals to meet:

  • Simplicity: for this point, we have to write powerful yet compact instrumentation interface routines (IIRs) that are easy to use and do not require a big learning effort. Developers who freed themselves from existing profilers because of their overly complicated output are reluctant to trade that for a new kind of complication: an IIR that is difficult to set up.
  • Efficient output and easily interpretable results: after all, this is our essential aim. Part of this role is played by the design of the IIRs themselves: if the developer is obliged to gather source-level information in a specific way, then the results can be streamlined into specific output patterns from which interpretation is easy to draw.
  • Comparison and optimization tracking: by this we mean the ability to take two test results and get an immediate idea of the performance differences, hence whether there has been an improvement or not.

This kind of solution reminds us of a past experience: we had a multithreaded server application whose model was based on the Connector-Acceptor-Reactor pattern, a concept introduced in the book "Pattern Languages of Program Design 3", edited by Robert Martin, Frank Buschmann and Dirk Riehle (Addison-Wesley, 1997); you may also want to check the great pattern diagram drawn by Khang Duong in his Master's thesis (fig. 5.5). A listener thread is started first and loops on incoming connection requests. Each IO request on each accepted connection is sent to a FIFO queue. A pool of threads is spawned to wait on the queue, handle any new request, and then return to the blocking state. Of course, system events, critical sections, and semaphores work together to make this mechanism possible.

Servicing code is organized into several ServiceHandlers. For instance, one service handler may be responsible for routing a particular chat message to a particular chat room. Another handler could be responsible for the request of joining a pending game: the request being initiated by one player and the game belonging to another, and so on.

Some interesting questions were raised after development:

  • What is the average time spent in each type of ServiceHandler?
  • How much time does each ServiceHandler spend on subtasks like decoding the request, and encoding and buffering the responses?
  • What is the load on the thread pool? As an indicator, how much time are our threads spending in their “running” state inside those ServiceHandlers?

To solve all this, a dedicated dashboard was developed to receive and render the key indicators. This regrettable effort could have easily been invested in a generic solution that gathers performance data without hard restrictions on what that data consists of, and that also prepares an output view that dynamically adapts to the data.

Note the correlation between what the answers to the above questions would imply and the very notions of model throughput, performance, system responsiveness, etc.

No wonder, then, that the proposed IIRs are designed to enable exactly these metrics:

IIR list:

extern "C" void WINAPI InitProfiler(char* testName, char* flushType, char* output);

extern "C" void WINAPI WaitForObserver();
extern "C" void WINAPI FlushData(char* outputPath);

extern "C" void WINAPI beginActivity(char* activityName);
extern "C" void WINAPI endActivity(char* activityName);

extern "C" void WINAPI insertCheckpoint(char* pointName, char* activityName);

extern "C" void WINAPI threadRun(char* threadName);
extern "C" void WINAPI threadPause(char* threadName);

extern "C" void WINAPI beginTask(char* taskName);
extern "C" void WINAPI endTask(char* taskName);

We can surround this list of functions with a macro definition (#ifdef ...) to allow the instrumentation code to be removed easily.
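Here is a minimal sketch of such a wrapper; the PROFILING_ENABLED flag and the PROF_* macro names are our own illustrative choices, not part of the interface above:

#ifdef PROFILING_ENABLED
  #define PROF_INIT(test, flush, out)  InitProfiler(test, flush, out)
  #define PROF_BEGIN_ACTIVITY(name)    beginActivity(name)
  #define PROF_END_ACTIVITY(name)      endActivity(name)
  #define PROF_CHECKPOINT(pt, act)     insertCheckpoint(pt, act)
#else
  /* Instrumentation compiles away entirely when profiling is disabled. */
  #define PROF_INIT(test, flush, out)  ((void)0)
  #define PROF_BEGIN_ACTIVITY(name)    ((void)0)
  #define PROF_END_ACTIVITY(name)      ((void)0)
  #define PROF_CHECKPOINT(pt, act)     ((void)0)
#endif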

Let's now explain what each IIR is supposed to do in the implementation module (a DLL in the above example). InitProfiler mainly marks the beginning of profiling. Its flushType parameter is important because it must be able to handle the following three scenarios (a hypothetical usage sketch follows the list):

  1. The collected timing information is streamed to RAM and then manually flushed to a file (FlushData).
  2. A scenario where the collected data cannot be recorded at full detail. This is the case, for example, when profiling a chess engine: we cannot record all the millions of calls to key tasks like position evaluation or move generation. Instead, we only record the cumulative sums of the periods spent in each routine.
  3. A scenario where we need to stream the collected data to a remote observer. This is the case when we want to extract performance knowledge from a service application deployed on a remote server.
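As a hypothetical usage sketch (the exact flushType strings accepted by the implementation are not specified here, so the values below are assumptions):

/* Scenario 1: record to RAM, flush manually at the end of the test. */
InitProfiler("ServerLoadTest", "manual", NULL);
/* ... instrumented run ... */
FlushData("C:\\perf\\server_load.dat");

/* Scenario 2: keep only cumulative sums per routine (the chess engine case). */
InitProfiler("ChessEngineTest", "cumulative", "C:\\perf\\chess.dat");

/* Scenario 3: stream measurements to a remote observer as they occur.
   Presumably WaitForObserver blocks until the observer has connected. */
InitProfiler("RemoteServiceTest", "remote", "observer-host:9100");
WaitForObserver();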

An activity is a sort of macro-block whose duration we want to measure, or whose particular inner events we want to track (insertCheckpoint).

A task is a subactivity that completes after a finite amount of time. Developers mark the beginning and the end of a particular task with beginTask and endTask.

Finally, threadRun and threadPause are to be called whenever a particular worker routine executes concurrently with, or separately from, the main application thread.
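To make the intended usage concrete, here is a minimal sketch of a worker routine instrumented with these IIRs; the handler and helper names are illustrative, not taken from the real server, and the IIR declarations above are assumed to be available through a header:

/* Stand-ins for real servicing code. */
void DecodeRequest(void)  { /* ... */ }
void EncodeResponse(void) { /* ... */ }

void ServiceChatMessage(void)
{
    threadRun("Worker-1");                 /* the worker leaves its blocking state */
    beginActivity("ChatMessageHandler");   /* the high-level activity */

    beginTask("DecodeRequest");            /* a measured subtask */
    DecodeRequest();
    endTask("DecodeRequest");

    insertCheckpoint("RequestDecoded", "ChatMessageHandler");

    beginTask("EncodeResponse");
    EncodeResponse();
    endTask("EncodeResponse");

    endActivity("ChatMessageHandler");
    threadPause("Worker-1");               /* back to waiting on the FIFO queue */
}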

 

In a particularly complex model, we often have some sort of distributed jobs, thread pools, etc. For a particular high-level task (an activity in our terminology), this can be due to the inner precedence scheme governing its detailed instructions. This model has been clearly illustrated by Cilk on page 4 of this document, which is also great material for learning more about the “Multicore Software Revolution”. Interleaving the execution of the subtasks of the main activity can lead to an overall improvement (on parallel hardware, of course) in the time spent in that activity.
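As a sketch of what this looks like with the IIRs, using the Win32 thread API in keeping with the WINAPI flavour of the interface (the subtask names T1 and T2 are invented for the example):

#include <windows.h>

DWORD WINAPI SubTaskT1(LPVOID unused)
{
    (void)unused;
    beginTask("T1");  /* ... parallel work ... */  endTask("T1");
    return 0;
}

DWORD WINAPI SubTaskT2(LPVOID unused)
{
    (void)unused;
    beginTask("T2");  /* ... parallel work ... */  endTask("T2");
    return 0;
}

void ActivityA(void)
{
    HANDLE h[2];
    beginActivity("A");

    /* Run the two independent subtasks concurrently; on parallel hardware,
       the measured duration of A should shrink accordingly. */
    h[0] = CreateThread(NULL, 0, SubTaskT1, NULL, 0, NULL);
    h[1] = CreateThread(NULL, 0, SubTaskT2, NULL, 0, NULL);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    CloseHandle(h[0]);
    CloseHandle(h[1]);

    endActivity("A");
}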

 

It is essential to design such a high-level IIR model: existing sophisticated profilers drown you in thousands of indicators and thousands of obscure functions you have never heard of and will never optimize. Their information is displayed at the procedural level: kernel+user time and call count for each method, API calls, system module stats, etc. The above IIRs, once correctly instrumented into a target application, easily reflect the meta-functions of the program.

 

An immediate advantage is that the output is freed from unrelated implementation details. Hence even a manager or marketing engineer can analyze and critique existing behaviour, or communicate an idea to the developer, just by looking at the profiling results.

 

In addition, we should remember that optimization often succeeds at the model level, i.e. the architectural level: that is at least what any C++ developer can confirm after the likely disappointments of trying to improve a program's performance by inlining functions and applying all the other micro-level tricks C++ developers know.

 

The plots below relate to a basic demonstration use case: a simple serial block that performs a certain job A. Within the code, checkpoints were inserted to mark events P1, P2, and P3.

T1, T2, and T3 each correspond to particular functions that are called many times by the code in A.

 

Visualizing the timeline history of A would give:

 

A streamlined output answering the question of how much time job A consumed in task T2 would give, for example:

(We plotted the differences P(i+1) − P(i) against the sum of the time spent in T2 within each interval [P(i), P(i+1)].)
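For illustration, here is a sketch of the kind of post-processing involved, assuming the flushed data has already been parsed into checkpoint timestamps and per-call task intervals (this data layout is our assumption, not the profiler's actual file format):

#include <stdio.h>

typedef struct { double begin, end; } Interval;  /* one beginTask/endTask pair of T2 */

/* checkpoints: timestamps of P1..Pn inside activity A;
   t2Calls: every T2 interval recorded during the same run. */
void ReportT2PerSegment(const double* checkpoints, int nPoints,
                        const Interval* t2Calls, int nCalls)
{
    int i, j;
    for (i = 0; i + 1 < nPoints; ++i) {
        double segment = checkpoints[i + 1] - checkpoints[i];   /* P(i+1) - P(i) */
        double inT2 = 0.0;
        for (j = 0; j < nCalls; ++j)   /* sum the T2 time falling inside [P(i), P(i+1)] */
            if (t2Calls[j].begin >= checkpoints[i] && t2Calls[j].end <= checkpoints[i + 1])
                inT2 += t2Calls[j].end - t2Calls[j].begin;
        printf("P%d -> P%d : segment = %.3f ms, time in T2 = %.3f ms\n",
               i + 1, i + 2, segment, inT2);
    }
}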

In the same way, you could generate as many outputs (and thus plots) as you have questions. Try to imagine the plots that would answer the following:

  • What is the difference between the timeline of A before and after optimization?
  • What is the change in the contribution of the task T1 to the activity A?

Leave us a comment if you would like help figuring these out. If you want to try the profiler on your own programs, we will be glad to offer it for free! (Drop a comment or send an email to blog[at]iptech-group[dot]com.)