Before going on with my graphics library for the LED matrix, I think it’s time to optimize the code a bit in order to get the Netduino running faster.
My job is programming desktop applications using .NET, but a PC is very rich in resources such as RAM and processor speed. The Micro Framework, instead, offers a very small environment where every extra byte might have an impact on the final result.

Here is a brief set of tests comparing different approaches to the same task. Sometimes you don’t care about the best way to write the code, but the interesting thing is actually knowing how things work. You will be surprised, as I was.
The test bench.
The base program for the tests is very simple: it is an endless loop where the code under test runs, followed by a short pause of 100 ms. The comparison is mostly between different yet commonly used types, such as Int32, Single, Double and Byte.
The timings are taken with a scope, by watching two output ports as they change state.
Except for the very first, each test cycles 50 times over a set of 20 operations, in order to minimize the overhead due to the “for-loop”. The first test, by the way, is meant just to measure how heavy the “for-loop” itself is.
Here is the test program template:
using System.Threading;
using Microsoft.SPOT.Hardware;
using SecretLabs.NETMF.Hardware.Netduino;

namespace PerformanceTest
{
    public class Program
    {
        // number of for-loop cycles in each test
        private const int Count = 50;

        // QTest stays high for the whole test; QPulse marks the boundaries between sub-tests
        private static OutputPort QTest = new OutputPort(Pins.GPIO_PIN_D0, false);
        private static OutputPort QPulse = new OutputPort(Pins.GPIO_PIN_D1, false);

        public static void Main()
        {
            byte b;
            byte bx = 50;
            byte by = 16;

            int i;
            int ix = 50;
            int iy = 16;

            float f;
            float fx = 50.0f;
            float fy = 16.0f;

            double d;
            double dx = 50.0;
            double dy = 16.0;

            while (true)
            {
                //start of the test
                QTest.Write(true);

                // ... operations to test ...

                //end of the test
                QTest.Write(false);
                Thread.Sleep(100);
            }
        }

        // short marker pulse, visible on the second scope channel
        private static void Pulse()
        {
            QPulse.Write(true);
            QPulse.Write(false);
        }
    }
}
The basic for-loop.
Since every test will use the “for-loop”, we should measure how much overhead that introduces.
Here is the snippet…
for (int n = 0; n < 1000; n++)
{
    //do nothing
}
…and here is the timing:

Roughly speaking, we could say that every for-loop cycle takes about 7 microseconds.
What do the IL opcodes generated by the compiler look like (restricted to the for-loop only)?
Well, it is pretty interesting to dig a bit behind (or under?) the scenes. I will take advantage of the awesome ILSpy, which is a free, open-source decompiler, disassembler and much more, provided by the SharpDevelop team.
IL_0042: ldc.i4.0
IL_0043: stloc.s n
IL_0045: br.s IL_004f
// loop start (head: IL_004f)
IL_0047: nop
IL_0048: nop
IL_0049: ldloc.s n
IL_004b: ldc.i4.1
IL_004c: add
IL_004d: stloc.s n
IL_004f: ldloc.s n
IL_0051: ldc.i4 1000
IL_0056: clt
IL_0058: stloc.s CS$4$0000
IL_005a: ldloc.s CS$4$0000
IL_005c: brtrue.s IL_0047
// end loop
Notice how the final branch-on-true jumps back to the first opcode of the loop body, so that every cycle also executes a couple of “nop”s: why?
Anyway, we are not going to optimize the for-loop yet.
Addition.
The addition will be performed over three common types: Int32, Single and Double.
Here is the snippet…
for (int n = 0; n < Count; n++)
{
    i = ix + iy; //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    f = fx + fy; //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    d = dx + dy; //repeated 20 times
}
…and here is the timing:

Again, an “average” addition takes about 2 microseconds.
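To turn a block width read on the scope into a per-operation figure, you subtract the loop overhead and divide by the number of operations. Here is a minimal sketch of that arithmetic; the helper name is hypothetical and is not part of the test program.

```csharp
public static class TimingMath
{
    // Hypothetical helper: average time of a single operation, in microseconds.
    //   totalMicros:        width of the measured block, as read on the scope
    //   loopCount:          number of for-loop cycles (Count = 50 in the tests)
    //   opsPerCycle:        operations repeated inside each cycle (20 in the tests)
    //   loopOverheadMicros: cost of one empty for-loop cycle (about 7 in the first test)
    public static double PerOperationMicros(
        double totalMicros, int loopCount, int opsPerCycle, double loopOverheadMicros)
    {
        // remove the time spent just cycling the loop
        double netMicros = totalMicros - loopCount * loopOverheadMicros;

        // spread what is left over every single operation
        return netMicros / (loopCount * opsPerCycle);
    }
}
```

For instance, a made-up reading of 2350 microseconds (not a value taken from the charts) gives (2350 - 50 x 7) / (50 x 20) = 2 microseconds per operation.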
Many users complain about the poor speed of a board like the Netduino, because its core can run at over 200 MIPS. Two microseconds for an addition (integer or floating-point) seems a waste of performance, but please bear in mind that such a small and inexpensive board performs about the same as an old 1984 IBM PC-AT machine (estimated price US$5000).
The interesting thing is that there’s almost no difference between using Int32 or Single, which are both 32-bit types. Surprisingly, even with Double the calculation takes only insignificantly longer than in the other cases. However, a Double takes 8 bytes.
Below are the parts of the IL that depict the operations:
// ...
IL_004e: ldloc.s ix
IL_0050: ldloc.s iy
IL_0052: add
IL_0053: stloc.3
// ...
IL_00eb: ldloc.s fx
IL_00ed: ldloc.s fy
IL_00ef: add
IL_00f0: stloc.s f
// ...
IL_019c: ldloc.s dx
IL_019e: ldloc.s dy
IL_01a0: add
IL_01a1: stloc.s d
// ...
Multiplication.
Here is the snippet…
for (int n = 0; n < Count; n++)
{
    i = ix * iy; //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    i = ix << 4; //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    f = fx * fy; //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    d = dx * dy; //repeated 20 times
}
…and here is the timing:

As with the addition, the multiplication takes almost the same time to perform across the different data types, with no significant loss of performance.
There is one extra special case, which performs the multiplication using the left-shift operator. It’s a very particular case, but it is noticeably faster than an ordinary multiplication. Is it worth choosing a shift over a real multiplication? I don’t think so…
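The shift only works when the multiplier is a power of two, which is why it is such a particular case. Here is a small sketch of the equivalence used in the test (16 is 2 to the 4th); the helper name is hypothetical.

```csharp
public static class ShiftMath
{
    // Hypothetical helper: multiplies by 16 using a left shift.
    // For an int, (x << 4) and (x * 16) give the same result in C#'s
    // default unchecked context, negative values and overflow included.
    public static int TimesSixteen(int x)
    {
        return x << 4;
    }
}
```

With the test values, TimesSixteen(50) yields 800, the same as 50 * 16. A multiplier such as 10 has no single-shift equivalent, so the trick does not generalize.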
Below are the parts of the IL that depict the operations:
// ...
IL_004e: ldloc.s ix
IL_0050: ldloc.s iy
IL_0052: mul
IL_0053: stloc.3
// ...
IL_00e8: ldloc.s ix
IL_00ea: ldc.i4.4
IL_00eb: shl
IL_00ec: stloc.3
// ...
IL_016e: ldloc.s fx
IL_0170: ldloc.s fy
IL_0172: mul
IL_0173: stloc.s f
// ...
IL_021f: ldloc.s dx
IL_0221: ldloc.s dy
IL_0223: mul
IL_0224: stloc.s d
// ...
Logical AND.
Here is the snippet…
for (int n = 0; n < Count; n++)
{
    i = ix & iy; //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    b = (byte)(bx & by); //repeated 20 times
}
…and here is the timing:

It is clear that a logical operation like AND takes almost the same time as an ordinary addition between Int32s. The interesting thing, instead, is seeing how different working with Int32 and Byte is.
Any .NET Framework operates on operands of at least 32 bits (where possible it uses 64 bits). Thus, when you constrain your variables to a tiny byte, most operations will promote the values to Int32 and then convert the result back to a byte. That takes much more time, and it demonstrates why the space-saving habits typical of small CPUs are wrong in the .NET world.
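The promotion is visible in the C# source too: the expression bx & by has type int, so storing it into a byte needs an explicit cast. The sketch below, with hypothetical names, contrasts narrowing after every step with keeping an int working variable and narrowing once at the end. Whether the second form is measurably faster on the Netduino is an assumption that would need the same scope test.

```csharp
public static class ByteMath
{
    // Hypothetical example: combines three mask bytes,
    // converting back to byte after every operation.
    public static byte MaskEachStep(byte a, byte b, byte c)
    {
        byte t = (byte)(a & b);   // int result, converted back to byte
        t = (byte)(t | c);        // promoted and converted again
        return t;
    }

    // Same result using an int working variable and one final conversion.
    public static byte MaskOnce(byte a, byte b, byte c)
    {
        int t = a & b;
        t = t | c;
        return (byte)t;
    }
}
```

Both methods return the same value for any input; only the number of conversions differs.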
Below are the parts of the IL that depict the operations:
// ...
IL_004e: ldloc.s ix
IL_0050: ldloc.s iy
IL_0052: and
IL_0053: stloc.3
// ...
IL_00e8: ldloc.1
IL_00e9: ldloc.2
IL_00ea: and
IL_00eb: conv.u1
IL_00ec: stloc.0
// ...
Min/Max calculation.
Here is the snippet…
for (int n = 0; n < Count; n++)
{
    i = System.Math.Min(ix, iy);
    i = System.Math.Max(ix, iy);
    // ... repeated 10 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    i = ix < iy ? ix : iy;
    i = ix > iy ? ix : iy;
    // ... repeated 10 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    i = ix; if (ix < iy) i = iy;
    i = ix; if (ix > iy) i = iy;
    // ... repeated 10 times
}
…and here is the timing:

Please bear in mind that the time scale here is 5x that of the charts above.
Using a library function is usually preferable: we should avoid “reinventing the wheel”, and most of the time a library function embeds native code and yields faster results. However, when the function is particularly simple, another approach can be better, as in this example.
The timings clearly show that calling the framework’s Min/Max functions takes about three times as long as a trivial ternary-if. Even the third way of calculating the min/max yields no better results than the most trivial one.
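As an illustration, limiting a coordinate to a range can be written with two ternaries instead of nested Math.Min/Math.Max calls. This is only a sketch with hypothetical names. Wrapping the ternaries in a method, as done here for readability, adds a call of its own, so in a time-critical loop the two lines would rather be written inline.

```csharp
public static class ClampMath
{
    // Hypothetical helper: limits value to the range [low, high].
    // Assumes low <= high.
    public static int Clamp(int value, int low, int high)
    {
        int v = value < low ? low : value;   // same as max(value, low)
        return v > high ? high : v;          // same as min(v, high)
    }
}
```

With the test values, Clamp(50, 0, 16) returns 16, while a value already inside the range is returned unchanged.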
Let’s have a peek at the IL assembly:
// ...
IL_004e: ldloc.s ix
IL_0050: ldloc.s iy
IL_0052: call int32 [mscorlib]System.Math::Min(int32, int32)
IL_0057: stloc.3
// ...
IL_013b: ldloc.s ix
IL_013d: ldloc.s iy
IL_013f: blt.s IL_0145
IL_0141: ldloc.s iy
IL_0143: br.s IL_0147
IL_0145: ldloc.s ix
IL_0147: stloc.3
// ...
IL_0264: ldloc.s ix
IL_0266: stloc.3
IL_0267: ldloc.s ix
IL_0269: ldloc.s iy
IL_026b: clt
IL_026d: ldc.i4.0
IL_026e: ceq
IL_0270: stloc.s CS$4$0000
IL_0272: ldloc.s CS$4$0000
IL_0274: brtrue.s IL_0279
IL_0276: ldloc.s iy
IL_0278: stloc.3
// ...
Sample expression.
Here is the snippet…
for (int n = 0; n < Count; n++)
{
    d = ix * (fx + dx) * (fy + dy); //repeated 20 times
}
Pulse();
for (int n = 0; n < Count; n++)
{
    d = ix;
    d *= fx + dx;
    d *= (fy + dy);
    // ... repeated 20 times
}
…and here is the timing:

The timings show that an inline expression performs better than compound assignment operators. That’s normal, because the compiler does exactly what the user wrote: it stores each intermediate result in the variable. That prevents the compiler from applying the optimizations that are possible with the inline syntax.
The IL opcodes demonstrate the longer task in the second case:
// ...
IL_004e: ldloc.s ix
IL_0050: conv.r8
IL_0051: ldloc.s fx
IL_0053: conv.r8
IL_0054: ldloc.s dx
IL_0056: add
IL_0057: mul
IL_0058: ldloc.s fy
IL_005a: conv.r8
IL_005b: ldloc.s dy
IL_005d: add
IL_005e: mul
IL_005f: stloc.s d
// ...
IL_01ef: ldloc.s ix
IL_01f1: conv.r8
IL_01f2: stloc.s d
IL_01f4: ldloc.s d
IL_01f6: ldloc.s fx
IL_01f8: conv.r8
IL_01f9: ldloc.s dx
IL_01fb: add
IL_01fc: mul
IL_01fd: stloc.s d
IL_01ff: ldloc.s d
IL_0201: ldloc.s fy
IL_0203: conv.r8
IL_0204: ldloc.s dy
IL_0206: add
IL_0207: mul
IL_0208: stloc.s d
// ...
Conclusion.
As a professional programmer, I am obsessed with well-written source code, patterns, good practices and so on. However, I also believe it’s useful to know when and how to put your finger on a program to get the most out of it.
That is also a good programming practice, IMHO.