What is the best approach when writing functions for embedded software in order to get better performance?

Arguably, in your example the performance would not matter, as the code is only run once at startup.

A rule of thumb I use: write your code as readably as possible and only start optimizing if you notice that your compiler isn't properly doing its magic.

The cost of a function call in an ISR might be the same as that of a function call during startup in terms of storage and timing. However, the timing requirements during that ISR might be a lot more critical.

Furthermore, as others have already noted, the cost (and the meaning of 'cost') of a function call differs by platform, compiler, compiler optimization settings, and the requirements of the application. There will be a huge difference between an 8051 and a Cortex-M7, and between a pacemaker and a light switch.


There is no advantage I can think of (but see the note to JasonS at the bottom) in wrapping up one line of code as a function or subroutine, except perhaps that you can give the function a "readable" name. But you can just as well comment the line. And since wrapping a line of code in a function costs code memory, stack space, and execution time, it seems to me mostly counter-productive. In a teaching situation? It might make some sense. But that depends on the class of students, their preparation beforehand, the curriculum, and the teacher. Mostly, I think it's not a good idea. But that's my opinion.

Which brings us to the bottom line. Your broad question area has been a matter of some debate for decades and remains so today. So, at least as I read your question, it seems to me to be an opinion-based question (as you asked it.)

It could be moved away from being as opinion-based as it is, if you were to be more detailed about the situation and carefully described the objectives you held as primary. The better you define your measurement tools, the more objective the answers may be.


Broadly speaking, you want to do the following for any coding. (For below, I'll assume that we are comparing different approaches all of which achieve the goals. Obviously, any code that fails to perform the needed tasks is worse than code that succeeds, regardless of how it is written.)

  1. Be consistent about your approach, so that someone else reading your code can develop an understanding of how you approach your coding process. Being inconsistent is probably the worst possible crime. It not only makes it difficult for others, it also makes it difficult for you when you come back to the code years later.
  2. To the degree possible, try and arrange things so that initialization of various functional sections can be performed without regard to ordering. Where ordering is required, if it is due to close coupling of two highly related subfunctions, then consider a single initialization for both so that it can be reordered without causing harm. If that isn't possible, then document the initialization ordering requirement.
  3. Encapsulate knowledge in exactly one place, if possible. Constants should not be duplicated all over the place in the code. Equations that solve for some variable should exist in one and only one place. And so on. If you find yourself copying and pasting some set of lines that perform some needed behavior in a variety of locations, consider a way to capture that knowledge in one place and use it where needed. For example, if you have a tree structure that must be walked in a specific way, do not replicate the tree-walking code at each and every place where you need to loop through the tree nodes. Instead, capture the tree-walking method in one place and use it. This way, if the tree changes and the walking method changes, you have only one place to worry about and all the rest of the code "just works right." (A short sketch follows this list.)
  4. If you spread out all of your routines onto a huge, flat sheet of paper, with arrows connecting them as they are called by other routines, you will see in any application there will be "clusters" of routines that have lots and lots of arrows between themselves but only a few arrows outside the group. So there will be natural boundaries of closely coupled routines and loosely coupled connections between other groups of closely coupled routines. Use this fact to organize your code into modules. This will limit the apparent complexity of your code, substantially.
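
As a minimal sketch of point 3, assuming a hypothetical binary tree of Node structures (walkTree, clearValue, and the other names are mine, not from the question), the walking knowledge lives in one function and callers pass in whatever per-node action they need:

    struct Node {
        int   value;
        Node* left;
        Node* right;
    };

    // The tree-walking knowledge lives in exactly one place; callers supply
    // the per-node action as a plain function pointer.
    void walkTree(Node* node, void (*visit)(Node*))
    {
        if (node == 0)
            return;
        walkTree(node->left, visit);
        visit(node);
        walkTree(node->right, visit);
    }

    // Example actions reuse the one walker instead of duplicating it.
    void clearValue(Node* n)        { n->value = 0; }
    void clearAllValues(Node* root) { walkTree(root, clearValue); }

If the tree representation changes, or the walk becomes iterative to save stack, only walkTree() is touched and every caller "just works right."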

The above is just generally true about all coding. I didn't discuss the use of parameters, local versus static or global variables, and so on. The reason is that, for embedded programming, the application space often imposes extreme and very significant new constraints, and it's impossible to discuss all of them without discussing every embedded application. And that's not happening here, anyway.

These constraints may be any (and more) of these:

  • Severe cost limitations requiring extremely primitive MCUs with minuscule RAM and almost no I/O pin-count. For these, whole new sets of rules apply. For example, you may have to write in assembly code because there isn't much code space. You may have to use ONLY static variables because the use of local variables is too costly and time-consuming. You may have to avoid the excessive use of subroutines because (for example, on some Microchip PIC parts) there are only a few hardware stack levels in which to store subroutine return addresses. So you may have to dramatically "flatten" your code. Etc.
  • Severe power limitations requiring carefully crafted code to start up and shut down most of the MCU and placing severe limitations on the execution time of code when running at full speed. Again, this might require some assembly coding, at times.
  • Severe timing requirements. For example, there are times where I've had to make sure that the transmission of an open-drain 0 took EXACTLY the same number of cycles as the transmission of a 1. And that sampling this same line also had to be performed with an exact relative phase to this timing. This meant that C could NOT be used here. The ONLY possible way to make that guarantee is to carefully craft assembly code. (And even then, not always on all ALU designs.)

And so on. (Writing code for life-critical medical instrumentation has a whole world of its own, as well.)

The upshot here is that embedded coding often isn't some free-for-all, where you can code like you might on a workstation. There are often severe, competitive reasons for a wide variety of very difficult constraints. And these may strongly argue against the more traditional and stock answers.


Regarding readability, I find that code is readable if it is written in a consistent fashion that I can learn as I read it. And where there isn't a deliberate attempt to obfuscate the code. There really isn't much more required.

Readable code can be quite efficient and it can meet all of the above requirements I've already mentioned. The main thing is that you fully understand what each line of code you write produces at the assembly or machine level, as you code it. C++ places a serious burden on the programmer here because there are many situations where identical snippets of C++ code actually generate different snippets of machine code that have vastly different performances. But C, generally, is mostly a "what you see is what you get" language. So it's safer in that regard.


EDIT per JasonS:

I've been using C since 1978 and C++ since about 1987, and I've had a lot of experience using both on mainframes, minicomputers, and (mostly) embedded applications.

Jason brings up a comment about using 'inline' as a modifier. (From my perspective, this is a relatively "new" capability because it simply didn't exist for perhaps half of my life or more using C and C++.) The use of inline functions can actually make such calls (even for one line of code) quite practical. And it's far better, where possible, than using a macro because of the type checking the compiler can apply.
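
As a sketch of that typing point (the names MAX_MACRO and maxInt are mine, purely for illustration):

    // The macro silently double-evaluates whichever argument "wins" and
    // accepts operands of any type without a diagnostic.
    #define MAX_MACRO(a, b)  ((a) > (b) ? (a) : (b))

    // The inline function is type checked and evaluates each argument exactly
    // once, while still being a candidate for expansion at the call site.
    static inline int maxInt(int a, int b) { return a > b ? a : b; }

    int f(volatile int* statusReg, int limit)
    {
        int m1 = MAX_MACRO(*statusReg, limit);  // *statusReg may be read twice
        int m2 = maxInt(*statusReg, limit);     // *statusReg is read exactly once
        return m1 + m2;
    }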

But there are limitations, as well. The first is that you cannot rely on the compiler to "take the hint." It may, or it may not. And there are good reasons not to take the hint. (For an obvious example, if the address of the function is taken, an out-of-line instance of the function must exist, and a call made through that address will require an actual call; the code cannot be inlined there.) There are other reasons, as well. Compilers may have a wide variety of criteria by which they judge how to handle the hint. And as a programmer, this means you must spend some time learning about that aspect of the compiler, or else you are likely to make decisions based upon flawed ideas. So it adds a burden to the writer of the code, to any reader, and also to anyone planning to port the code to some other compiler.
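
Here's a small, hedged example of one such reason, again with made-up names:

    static inline int readStatus(void) { return 0x5A; }   // trivially inlinable

    // Storing the function's address forces the compiler to emit a real,
    // out-of-line instance of readStatus().
    int (* const handlers[])(void) = { readStatus };

    int pollDirect(void)
    {
        return readStatus();   // the compiler remains free to inline this call
    }

    int pollViaTable(unsigned i)
    {
        return handlers[i]();  // a call through a pointer: not inlined here
    }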

Also, C and C++ compilers support separate compilation. This means that they can compile one piece of C or C++ code without compiling any other related code for the project. In order to inline code, assuming the compiler otherwise might choose to do so, it must have not only the declaration "in scope" but the definition as well. Usually, programmers will work to ensure that this is the case if they are using 'inline'. But it is easy for mistakes to creep in.
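
The usual way to keep the definition "in scope" is to put the inline function's body in a header that every compilation unit includes. A minimal sketch, with file and function names of my own invention:

    // ---- adc_conversion.h : included by every unit that calls the function ----
    #ifndef ADC_CONVERSION_H
    #define ADC_CONVERSION_H

    // The definition, not just a declaration, travels with the header, so each
    // separately compiled unit has something it can actually expand in place.
    static inline unsigned adcToMillivolts(unsigned raw)
    {
        return (raw * 3300u) / 4095u;   // assumes a 12-bit ADC and a 3.3 V reference
    }

    #endif

If a unit only saw a declaration, its compiler would have nothing to expand, whatever the hint says.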

In general, while I also use inline where I think it is appropriate, I tend to assume that I cannot rely on it. If performance is a significant requirement, and I think the OP has already clearly written that there was a significant performance hit when they took a more "functional" route, then I certainly would choose to avoid relying upon inline as a coding practice and would instead follow a slightly different, but entirely consistent, pattern of writing code.

A final note about 'inline' and definitions being "in scope" for a separate compilation step. It is possible (not always reliable) for the work to be performed at the linking stage. This can occur if and only if a C/C++ compiler buries enough detail into the object files to allow a linker to act on 'inline' requests. I personally haven't experienced a linker system (outside of Microsoft's) that supports this capability. But it can occur. Again, whether or not it should be relied upon will depend on the circumstances. But I usually assume this hasn't been shoveled onto the linker, unless I know otherwise based on good evidence. And if I do rely on it, it will be documented in a prominent place.


C++

For those interested, here's an example of why I remain fairly cautious of C++ when coding embedded applications, despite its ready availability today. I'll toss out some terms that I think all embedded C++ programmers need to know cold:

  • partial template specialization
  • vtables
  • virtual base object
  • activation frame
  • activation frame unwind
  • use of smart pointers in constructors, and why
  • return value optimization

That's just a short list. If you don't already know everything about those terms and why I listed them (and many more I didn't list here), then I'd advise against the use of C++ for embedded work, unless avoiding it is not an option for the project.

Let's take a quick look at C++ exception semantics to get just a flavor.

A C++ compiler must generate correct code for compilation unit \$A\$ when it has absolutely no idea what kind of exception handling may be required in separate compilation unit \$B\$, compiled separately and at a different time.

Take this sequence of code, found as part of some function in some compilation unit \$A\$:

   .
   .
   foo ();
   String s;
   foo ();
   .
   .

For discussion purposes, compilation unit \$A\$ doesn't use 'try..catch' anywhere in its source. Neither does it use 'throw'. In fact, let's say that it doesn't use any source that couldn't be compiled by a C compiler, except for the fact that it uses C++ library support and can handle objects like String. This code might even be a C source code file that was modified slightly to take advantage of a few C++ features, such as the String class.

Also, assume that foo() is an external procedure located in compilation unit \$B\$ and that the compiler has a declaration for it, but does not know its definition.

The C++ compiler sees the first call to foo() and can just allow a normal activation frame unwind to occur, if foo() throws an exception. In other words, the C++ compiler knows that no extra code is needed at this point to support the frame unwind process involved in exception handling.

But once String s has been created, the C++ compiler knows that it must be properly destroyed before a frame unwind can be allowed, if an exception occurs later on. So the second call to foo() is semantically different from the first. If the 2nd call to foo() throws an exception (which it may or may not do), the compiler must have placed code designed to handle the destruction of String s before letting the usual frame unwind occur. This is different than the code required for the first call to foo().

(It is possible to add additional decorations in C++ to help limit this problem. But the fact is, programmers using C++ simply must be far more aware of the implications of each line of code they write.)

Unlike C's malloc, C++'s new uses exceptions to signal when it cannot perform raw memory allocation. So does 'dynamic_cast', when it is used to cast a reference (there is no null pointer to return in that case). (See Stroustrup's 3rd ed., The C++ Programming Language, pages 384 and 385 for the standard exceptions in C++.) Compilers may allow this behavior to be disabled. But in general you will incur some overhead due to properly formed exception handling prologues and epilogues in the generated code, even when the exceptions actually do not take place and even when the function being compiled doesn't actually have any exception handling blocks. (Stroustrup has publicly lamented this.)
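
For instance, where exceptions are unaffordable, the nothrow placement form keeps new looking much more like malloc. A minimal sketch (the Packet type is just an example):

    #include <new>      // std::nothrow

    struct Packet { unsigned char payload[64]; };

    Packet* allocatePacket()
    {
        Packet* p = new (std::nothrow) Packet;   // returns a null pointer on failure
        if (p == 0) {                            // instead of throwing std::bad_alloc
            // handle the failure locally, much as C code would after malloc()
        }
        return p;
    }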

Without partial template specialization (not all C++ compilers support it), the use of templates can spell disaster for embedded programming. Without it, code bloat is a serious risk which could kill a small-memory embedded project in a flash.
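
A hedged illustration of the bloat risk; the container and element types here are only an example:

    #include <cstddef>

    // Every distinct <T, N> pair stamps out a complete, independent copy of
    // all of this class's member functions in the image.
    template <typename T, std::size_t N>
    class RingBuffer {
        T           data_[N];
        std::size_t head_, tail_;
    public:
        RingBuffer() : head_(0), tail_(0) {}
        void push(const T& v) { data_[head_++ % N] = v; }
        T    pop()            { return data_[tail_++ % N]; }
    };

    RingBuffer<unsigned char,  64> rxBytes;     // one copy of push()/pop()
    RingBuffer<unsigned short, 32> adcSamples;  // a second, separate copy
    RingBuffer<float,          16> filterTaps;  // and a third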

When a C++ function returns an object an unnamed compiler temporary is created and destroyed. Some C++ compilers can provide efficient code if an object constructor is used in the return statement, instead of a local object, reducing the construction and destruction needs by one object. But not every compiler does this and many C++ programmers aren't even aware of this "return value optimization."
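
A brief sketch of the difference, using a made-up Vec3 class:

    class Vec3 {
        float x_, y_, z_;
    public:
        Vec3() : x_(0), y_(0), z_(0) {}
        Vec3(float x, float y, float z) : x_(x), y_(y), z_(z) {}
    };

    Vec3 makeOriginNamed()
    {
        Vec3 v;          // named local object
        return v;        // some compilers copy v into the hidden return temporary
    }

    Vec3 makeOriginInPlace()
    {
        return Vec3(0, 0, 0);   // constructing in the return statement lets most
                                // compilers build the result directly in place
    }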

Providing an object constructor with a single parameter type may permit the C++ compiler to find a conversion path between two types in ways completely unexpected by the programmer. This kind of "smart" behavior isn't part of C.
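
A small example of such a surprise conversion (the class and function names are mine):

    class Milliseconds {
        int count_;
    public:
        Milliseconds(int count) : count_(count) {}   // single-argument constructor;
                                                     // 'explicit' here would forbid
                                                     // the silent conversion below
    };

    void delay(const Milliseconds&) { /* busy-wait elided */ }

    void f()
    {
        delay(42);   // compiles: 42 is quietly converted via Milliseconds(42),
                     // whether or not 42 was ever meant to be a duration
    }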

A catch clause specifying a base type will "slice" a thrown derived object, because the thrown object is copied using the catch clause's "static type" rather than the object's "dynamic type." This is a not-uncommon source of exception misery (when you feel you can even afford exceptions in your embedded code).
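
A short sketch of the slicing trap, with a made-up BusFault type:

    #include <stdexcept>

    class BusFault : public std::runtime_error {
    public:
        explicit BusFault(unsigned addr)
            : std::runtime_error("bus fault"), address(addr) {}
        unsigned address;                 // derived-only state
    };

    void f()
    {
        try {
            throw BusFault(0x40000000u);
        }
        catch (std::runtime_error e) {
            // Caught by VALUE: 'e' is a sliced copy and the BusFault part,
            // including 'address', is gone. Catching
            // 'const std::runtime_error& e' would preserve the dynamic type.
        }
    }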

C++ compilers can automatically generate constructors, destructors, copy constructors, and assignment operators for you, with unintended results. It takes time to gain facility with the details of this.
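
A hedged example of the kind of surprise the generated members can produce:

    class DmaBuffer {
        unsigned char* data_;            // owns a heap-allocated buffer
    public:
        DmaBuffer()  : data_(new unsigned char[512]) {}
        ~DmaBuffer() { delete[] data_; }
        // No copy constructor or assignment operator is written, so the compiler
        // generates memberwise copies: copying a DmaBuffer produces two objects
        // whose destructors both delete the same buffer.
    };

    void f()
    {
        DmaBuffer a;
        DmaBuffer b = a;   // compiles silently; double delete when a and b both die
    }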

Passing arrays of derived objects to a function accepting arrays of base objects rarely generates compiler warnings but almost always yields incorrect behavior.
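
A short sketch of why it goes wrong:

    struct Base    { int id; };
    struct Derived : Base { int extra; };   // sizeof(Derived) > sizeof(Base)

    void clearIds(Base* p, int n)
    {
        for (int i = 0; i < n; ++i)
            p[i].id = 0;      // pointer arithmetic here strides by sizeof(Base)
    }

    void f()
    {
        Derived table[4];
        clearIds(table, 4);   // accepted with no warning by most compilers, but
                              // p[1], p[2], ... land part-way through Derived
                              // elements and corrupt them
    }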

Since C++ doesn't invoke the destructor of a partially constructed object when an exception occurs in that object's constructor, handling exceptions in constructors usually mandates "smart pointers" in order to guarantee that fragments already constructed in the constructor are properly destroyed if an exception does occur there. (See Stroustrup, pages 367 and 368.) This is a common issue in writing good classes in C++, but it is of course avoided in C, since C doesn't have the semantics of construction and destruction built in. Writing proper code to handle the construction of subobjects within an object means writing code that must cope with this unique semantic issue in C++; in other words, "writing around" C++ semantic behaviors.
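
A minimal sketch of the idiom, assuming hypothetical Uart and Dma peripheral wrappers and using std::unique_ptr (std::auto_ptr played this role in the era Stroustrup describes):

    #include <memory>

    struct Uart { Uart(); };    // definitions elsewhere;
    struct Dma  { Dma();  };    // assume either constructor may throw

    class Controller {
        std::unique_ptr<Uart> uart_;   // subobject held through a smart pointer
        Dma                   dma_;
    public:
        Controller()
            : uart_(new Uart),   // if Dma's constructor below throws, uart_'s own
              dma_()             // destructor still runs and deletes the Uart
        {
            // A raw 'Uart*' member would simply leak in that case, because
            // ~Controller() is never called for a partially constructed object.
        }
    };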

C++ may copy objects passed as object parameters. For example, in the following fragment, the call "rA(x);" may cause the C++ compiler to invoke the copy constructor to build parameter p from object x, then another copy construction for the return object (an unnamed temporary) of function rA, which of course is copied from parameter p. Worse, if class A has its own objects which need construction, this can telescope disastrously. (A C programmer would avoid most of this garbage, hand optimizing, since C programmers don't have such handy syntax and have to express all the details one at a time.)

    class A {...};
    A rA (A p) { return p; }   // p is copy-constructed from the caller's argument;
                               // the return temporary is then copied from p
    // .....
    { A x; rA(x); }            // x itself is never modified; all the work is in copies

Finally, a short note for C programmers. longjmp() doesn't have portable behavior in C++. (Some C programmers use it as a kind of "exception" mechanism.) Some C++ compilers will actually attempt to set things up so that constructed objects are cleaned up when the longjmp is taken, but that behavior isn't portable in C++. If the compiler does clean up constructed objects, its behavior is non-portable. If it doesn't clean them up, then the objects aren't destructed when the code leaves their scope as a result of the longjmp, and the behavior is undefined. (If the use of longjmp in foo() doesn't leave such a scope, then the behavior may be fine.) This isn't used too often by embedded C programmers, but they should make themselves aware of these issues before using it.
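
A hedged sketch of the kind of code in question (the Logger class is hypothetical, and a matching setjmp() is assumed to exist further up the call chain):

    #include <csetjmp>

    std::jmp_buf recoveryPoint;    // assume setjmp(recoveryPoint) was called
                                   // somewhere further up the call chain
    class Logger {
    public:
        Logger();
        ~Logger();                 // non-trivial destructor (flushes a buffer, say)
    };

    void worker()
    {
        Logger log;                        // local object with a destructor
        std::longjmp(recoveryPoint, 1);    // jumps out of log's scope: because this
                                           // would skip a non-trivial destructor,
                                           // standard C++ leaves the behavior
                                           // undefined, and what compilers actually
                                           // do here varies
    }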


1) Code for readability and maintainability first. The most important aspect of any codebase is that it is well-structured. Nicely written software tends to have fewer errors. You may need to make changes in a couple of weeks/months/years, and it helps immensely if your code is nice to read. Or maybe someone else has to make that change.

2) Performance of code that runs once does not matter very much. Care about style, not about performance.

3) Even code in tight loops needs to be correct first and foremost. If you face performance issues, then optimize once the code is correct.

4) If you need to optimize, you have to measure! It does not matter whether you think, or someone tells you, that static inline is just a recommendation to the compiler: you have to look at what the compiler actually does. You also have to measure whether inlining improved performance. In embedded systems, you also have to measure code size, since code memory is usually pretty limited. This is THE most important rule that distinguishes engineering from guesswork. If you didn't measure it, it didn't help. Engineering is measuring. Science is writing it down ;)
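
As one hedged example of what "measure" can mean for timing: on a Cortex-M part that provides the DWT cycle counter, the CMSIS device header exposes it, and a before/after read brackets the code under test (the header name and functionUnderTest are placeholders):

    #include "stm32f4xx.h"   // placeholder: any CMSIS device header for your MCU

    extern void functionUnderTest(void);   // whatever you want to time

    unsigned measureCycles(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // enable the trace block
        DWT->CYCCNT = 0;                                  // reset the cycle counter
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             // start counting

        unsigned start = DWT->CYCCNT;
        functionUnderTest();
        unsigned stop  = DWT->CYCCNT;

        return stop - start;   // cycles spent, including a little measurement overhead
    }

Code size, the other half of the measurement, comes from the linker's map file or the toolchain's size utility rather than from the source.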