How to get our Netduino running faster

Before moving on with my graphic library for LED matrices, I think it’s time to optimize the code a bit in order to get the Netduino running faster.
My job is programming desktop applications using .Net, where a PC is rich in resources such as RAM and processor speed. The Micro Framework, instead, offers a very small environment where every extra byte might have an impact on the final result.
Here is a brief set of tests comparing different approaches to the same task. Sometimes you don’t care about the best way to write the code, but it is interesting to know how things actually work. You will be surprised, as I was.

The test bench.

The base program for the tests is very simple: an endless loop where the code under test runs, interleaved with a short pause of 100 ms. The comparison is mostly between different yet commonly-used types, such as Int32, Single, Double and Byte.
The timings are taken with a scope, watching two output ports as they change state.
Except for the very first, each test cycles 50 times over a set of 20 operations: that minimizes the overhead due to the “for-loop”. By the way, the first test is targeted precisely at measuring the “for-loop” overhead.
Here is the test program template:

// usings assumed for a standard Netduino project
using System.Threading;
using Microsoft.SPOT.Hardware;
using SecretLabs.NETMF.Hardware.Netduino;

namespace PerformanceTest
{
    public class Program
    {
        private const int Count = 50;

        private static OutputPort QTest = new OutputPort(Pins.GPIO_PIN_D0, false);
        private static OutputPort QPulse = new OutputPort(Pins.GPIO_PIN_D1, false);


        public static void Main()
        {
            byte b;
            byte bx = 50;
            byte by = 16;

            int i;
            int ix = 50;
            int iy = 16;

            float f;
            float fx = 50.0f;
            float fy = 16.0f;

            double d;
            double dx = 50.0;
            double dy = 16.0;

            while (true)
            {
                //start of the test
                QTest.Write(true);


                // ... operations to test ...


                //end of the test
                QTest.Write(false);
                Thread.Sleep(100);
            }
        }


        private static void Pulse()
        {
            QPulse.Write(true);
            QPulse.Write(false);
        }

    }
}

The basic for-loop.

Since every test will use the “for-loop”, we should measure how much overhead it introduces.
Here is the snippet…

                for (int n = 0; n < 1000; n++)
                {
                    //do nothing
                }

…and here is the timing:
[scope capture UNIT0000]

Roughly speaking, we could say that every for-loop cycle takes about 7 microseconds.
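As a sanity check, the per-cycle figure is just the measured total divided by the iteration count. A minimal sketch of that arithmetic (the 7000 µs total is illustrative, inferred from the ~7 µs figure above, not a value read off the scope trace):

```csharp
using System;

public class LoopOverhead
{
    // Per-iteration cost derived from a measured total time.
    public static double PerIterationMicroseconds(double totalMicroseconds, int iterations)
    {
        return totalMicroseconds / iterations;
    }

    public static void Main()
    {
        // ~7000 us measured for 1000 empty iterations -> ~7 us each.
        Console.WriteLine(PerIterationMicroseconds(7000.0, 1000)); // 7
    }
}
```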

What do the IL opcodes generated by the compiler look like (restricted to the for-loop only)?
Well, it is pretty interesting to dig a bit behind (or under?) the scenes. I will take advantage of the awesome ILSpy, which is a free, open-source decompiler, disassembler and much more, provided by the SharpDevelop team.

		IL_0042: ldc.i4.0
		IL_0043: stloc.s n
		IL_0045: br.s IL_004f
		// loop start (head: IL_004f)
			IL_0047: nop
			IL_0048: nop
			IL_0049: ldloc.s n
			IL_004b: ldc.i4.1
			IL_004c: add
			IL_004d: stloc.s n

			IL_004f: ldloc.s n
			IL_0051: ldc.i4 1000
			IL_0056: clt
			IL_0058: stloc.s CS$4$0000
			IL_005a: ldloc.s CS$4$0000
			IL_005c: brtrue.s IL_0047
		// end loop

Notice how the final branch-on-true jumps back to the first opcode, which means executing a couple of “nop”s on every cycle: why?
Anyway, we are not going to optimize the for-loop yet.

Addition.

The addition is performed on three common types: Int32, Single and Double.
Here is the snippet…

                for (int n = 0; n < Count; n++)
                {
                    i = ix + iy; //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    f = fx + fy; //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    d = dx + dy; //repeated 20 times
                }

…and here is the timing:
[scope capture UNIT0001]

Again, an “average” addition takes about 2 microseconds.

Many users blame a board like the Netduino for poor speed, since its core can run at over 200 MIPS. Two microseconds for an addition (integer or floating-point) seems a waste of performance, but please bear in mind that such a small yet inexpensive board performs about the same as an old 1984 IBM PC-AT machine (estimated price US$5000).

The interesting thing is that there’s almost no difference between using Int32 or Single, which are both 32-bit based. Surprisingly, even choosing Double as the type, the calculation takes only insignificantly longer than the other cases. However, a Double takes 8 bytes.
Below are the parts of the IL that depict the operations:

                        // ...

			IL_004e: ldloc.s ix
			IL_0050: ldloc.s iy
			IL_0052: add
			IL_0053: stloc.3
                        
                        // ...

			IL_00eb: ldloc.s fx
			IL_00ed: ldloc.s fy
			IL_00ef: add
			IL_00f0: stloc.s f
                        
                        // ...

			IL_019c: ldloc.s dx
			IL_019e: ldloc.s dy
			IL_01a0: add
			IL_01a1: stloc.s d
                        
                        // ...
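To turn a raw scope reading into the per-operation figure quoted above, the for-loop overhead measured earlier has to be subtracted first. A hedged sketch of that bookkeeping (the 2350 µs total is illustrative, not a measured value):

```csharp
using System;

public class NetTiming
{
    // Net time per tested operation: strip the for-loop overhead,
    // then divide by the total number of repeated operations.
    public static double PerOperationMicroseconds(
        double totalUs, double loopOverheadUsPerCycle, int cycles, int opsPerCycle)
    {
        double net = totalUs - loopOverheadUsPerCycle * cycles;
        return net / (cycles * opsPerCycle);
    }

    public static void Main()
    {
        // e.g. 2350 us total, 7 us/cycle overhead, 50 cycles of 20 additions
        Console.WriteLine(PerOperationMicroseconds(2350.0, 7.0, 50, 20)); // 2
    }
}
```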

Multiplication.

Here is the snippet…

                for (int n = 0; n < Count; n++)
                {
                    i = ix * iy; //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    i = ix << 4; //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    f = fx * fy; //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    d = dx * dy; //repeated 20 times
                }

…and here is the timing:
[scope capture UNIT0002]

As with the addition, the multiplication takes almost the same time to perform, and it seems there’s no significant loss of performance across the different data types.
There is an extra special case, which performs the multiplication by leveraging the left-shift operator. It’s a very particular case, but the better speed compared to an ordinary multiplication is noticeable. Is it worthwhile choosing a shift over a real multiplication? I don’t believe so…
Below are the parts of the IL that depict the operations:

                        // ...

			IL_004e: ldloc.s ix
			IL_0050: ldloc.s iy
			IL_0052: mul
			IL_0053: stloc.3
                        
                        // ...

			IL_00e8: ldloc.s ix
			IL_00ea: ldc.i4.4
			IL_00eb: shl
			IL_00ec: stloc.3
                        
                        // ...

			IL_016e: ldloc.s fx
			IL_0170: ldloc.s fy
			IL_0172: mul
			IL_0173: stloc.s f
                        
                        // ...

			IL_021f: ldloc.s dx
			IL_0221: ldloc.s dy
			IL_0223: mul
			IL_0224: stloc.s d
                        
                        // ...
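Note that the shift trick only works when one factor is a constant power of two: shifting left by k multiplies by 2^k. A quick sketch of the equivalence used in the test above:

```csharp
using System;

public class ShiftVersusMultiply
{
    public static void Main()
    {
        int ix = 50;

        // Multiplying by a power of two equals shifting left by its
        // exponent: 16 == 2^4, so ix * 16 == ix << 4.
        int byMultiply = ix * 16;
        int byShift = ix << 4;

        Console.WriteLine(byMultiply == byShift); // True
    }
}
```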

Logical AND.

Here is the snippet…

                for (int n = 0; n < Count; n++)
                {
                    i = ix & iy; //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    b = (byte)(bx & by); //repeated 20 times
                }

…and here is the timing:
[scope capture UNIT0003]

It is clear that a logical operation like AND takes almost the same time as an ordinary addition between Int32s. The interesting thing, instead, is seeing how different working with Int32 and Byte is.
Any .Net Framework operates on operands of at least 32 bits (where possible it uses 64 bits). Thus, when you constrain your variables to a tiny Byte, most operations will cast the values to Int32. That takes much more time and demonstrates why, in the .Net world, the habits carried over from small CPUs are misleading.
Below are the parts of the IL that depict the operations:

                        // ...

			IL_004e: ldloc.s ix
			IL_0050: ldloc.s iy
			IL_0052: and
			IL_0053: stloc.3
                        
                        // ...

			IL_00e8: ldloc.1
			IL_00e9: ldloc.2
			IL_00ea: and
			IL_00eb: conv.u1
			IL_00ec: stloc.0
                        
                        // ...
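The promotion to Int32 is visible in the C# language itself: the & operator applied to two Byte operands yields an Int32, which is exactly why the snippet above needs the explicit (byte) cast (the conv.u1 seen in the IL). A minimal illustration:

```csharp
using System;

public class BytePromotion
{
    public static void Main()
    {
        byte bx = 50;
        byte by = 16;

        // bx & by is computed on Int32 operands; assigning the result
        // back to a byte requires an explicit cast.
        int asInt = bx & by;
        byte asByte = (byte)(bx & by);

        Console.WriteLine(asInt);  // 16
        Console.WriteLine(asByte); // 16
    }
}
```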

Min/Max calculation.

Here is the snippet…

                for (int n = 0; n < Count; n++)
                {
                    i = System.Math.Min(ix, iy);
                    i = System.Math.Max(ix, iy);
                    // ... repeated 10 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    i = ix < iy ? ix : iy;
                    i = ix > iy ? ix : iy;
                    // ... repeated 10 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    i = ix; if (ix < iy) i = iy;
                    i = ix; if (ix > iy) i = iy;
                    // ... repeated 10 times
                }

…and here is the timing:
[scope capture UNIT0005]

Please bear in mind that the time scale is 5x that of the charts above.

Using a library function is usually preferable: we should avoid “reinventing the wheel”, and most of the time a library function embeds native code and yields faster results. However, when the function is particularly simple, it may be better to choose another approach, as in this example.
The timings clearly show that calling the framework’s Min/Max functions takes about three times as long as using a trivial ternary-if. Even the third attempt at calculating the min/max yields no better results than the most trivial way.
Let’s have a peek at the IL assembly:

                        // ...

			IL_004e: ldloc.s ix
			IL_0050: ldloc.s iy
			IL_0052: call int32 [mscorlib]System.Math::Min(int32, int32)
			IL_0057: stloc.3
                        
                        // ...

			IL_013b: ldloc.s ix
			IL_013d: ldloc.s iy
			IL_013f: blt.s IL_0145

			IL_0141: ldloc.s iy
			IL_0143: br.s IL_0147

			IL_0145: ldloc.s ix

			IL_0147: stloc.3
                        
                        // ...

			IL_0264: ldloc.s ix
			IL_0266: stloc.3
			IL_0267: ldloc.s ix
			IL_0269: ldloc.s iy
			IL_026b: clt
			IL_026d: ldc.i4.0
			IL_026e: ceq
			IL_0270: stloc.s CS$4$0000
			IL_0272: ldloc.s CS$4$0000
			IL_0274: brtrue.s IL_0279

			IL_0276: ldloc.s iy
			IL_0278: stloc.3
                        
                        // ...
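Based on the timings above, a hand-rolled ternary helper (a sketch, not a framework API) can stand in for Math.Min/Math.Max in hot paths. Bear in mind that in the Micro Framework interpreter even the call itself has a cost, so writing the ternary inline at the call site is cheaper still:

```csharp
using System;

public class FastMinMax
{
    // Ternary-based min/max, mirroring the fastest variant measured above.
    public static int Min(int a, int b) { return a < b ? a : b; }
    public static int Max(int a, int b) { return a > b ? a : b; }

    public static void Main()
    {
        Console.WriteLine(Min(50, 16)); // 16
        Console.WriteLine(Max(50, 16)); // 50
    }
}
```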

Sample expression.

Here is the snippet…

                for (int n = 0; n < Count; n++)
                {
                    d = ix * (fx + dx) * (fy + dy); //repeated 20 times
                }

                Pulse();

                for (int n = 0; n < Count; n++)
                {
                    d = ix; 
                    d *= fx + dx; 
                    d *= (fy + dy);
                    // ... repeated 20 times
                }

…and here is the timing:
[scope capture UNIT0004]

The timings show that an inline expression performs better than compound operators. That’s normal, because the compiler actually does what the user wrote: it stores each intermediate result back into the variable. That denies the compiler the optimization that the inline syntax allows.
The IL opcodes show the extra work in the second case:

                        // ...

			IL_004e: ldloc.s ix
			IL_0050: conv.r8
			IL_0051: ldloc.s fx
			IL_0053: conv.r8
			IL_0054: ldloc.s dx
			IL_0056: add
			IL_0057: mul
			IL_0058: ldloc.s fy
			IL_005a: conv.r8
			IL_005b: ldloc.s dy
			IL_005d: add
			IL_005e: mul
			IL_005f: stloc.s d
                        
                        // ...

			IL_01ef: ldloc.s ix
			IL_01f1: conv.r8
			IL_01f2: stloc.s d
			IL_01f4: ldloc.s d
			IL_01f6: ldloc.s fx
			IL_01f8: conv.r8
			IL_01f9: ldloc.s dx
			IL_01fb: add
			IL_01fc: mul
			IL_01fd: stloc.s d
			IL_01ff: ldloc.s d
			IL_0201: ldloc.s fy
			IL_0203: conv.r8
			IL_0204: ldloc.s dy
			IL_0206: add
			IL_0207: mul
			IL_0208: stloc.s d
                        
                        // ...
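Both forms compute the same value, of course: the difference is only in the intermediate stores back into d. A quick equivalence check (with the same illustrative operand values as the test bench):

```csharp
using System;

public class InlineVersusCompound
{
    public static void Main()
    {
        int ix = 50;
        float fx = 50.0f, fy = 16.0f;
        double dx = 50.0, dy = 16.0;

        // Single expression: intermediates stay on the evaluation stack.
        double inline = ix * (fx + dx) * (fy + dy);

        // Compound form: each step stores the partial result into d
        // (the extra stloc/ldloc pairs seen in the IL).
        double d = ix;
        d *= fx + dx;
        d *= fy + dy;

        Console.WriteLine(inline == d); // True
    }
}
```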

Conclusion.

As a professional programmer, I am obsessed with well-written source code, patterns, good practices and so on. However, I also believe it’s useful to know when and how to put your finger on a program to get the most from it.
That is also good programming practice, IMHO.

13 thoughts on “How to get our Netduino running faster”

  1. Georgi

    Great post Mario!

One thing – you can go even deeper under the hood. If you place a breakpoint and start debugging, when the breakpoint is hit, right-click and “Go to Disassembly”. That will show you the actual CPU instructions. You might find out it is better to inline assembler (through importing an unsafe C++ library) yourself to squeeze maximum performance. 🙂

    • Mario Vernari

Hello Georgi. I’m not sure I understand what you mean. I do understand the attempt to see the disassembly on a break, but how does that make the program faster?
I also tried some “unsafe” ways to get the code faster, but it seems that pointers and the like are not supported.
      Cheers

      • Georgi

What I meant is the following: if speed is an absolute requirement you can use C++ to inline assembler code. You can even use MS Assembler itself. Once you have the DLL library, you can import it in C# and use its functions, thus gaining even more execution speed.

Just an idea of course. Whether it’s actually worth the effort is another thing 🙂

    • Mario Vernari

      Short answer: because I am dumb.
Long answer: I did not think to use the release build instead of the debug one, and I will perform some tests with the optimization on as well, just for comparison.
However, consider that the purpose of the article is the *relative* speed of the various operations. Even switching the optimization on, I’d expect an overall speed-up, but the ratios should remain the same.
Finally, I briefly checked the IL for the release build: the empty loop is preserved, although it is tighter than the debug one.
      Thank you very much for pointing the problem out.
      Cheers

      • Ark-kun (@Ark_kun)

        >However, consider that the purpose of the article is toward the *relative* speed of various operations.
        Turning optimizations on/off affects various operations very differently.
        For example, I don’t think that this will hold : “The timings are showing that an inline-expression performs better than a compound operator. ”

        >Finally, I briefly checked the IL for the release, but an empty loop is preserved although is more tight than the debug’s one.
        There is also the final optimization – JIT. I don’t know whether it’s JIT which should remove the unneeded code.

I’m doing my own performance tests right now. I’m measuring many types of calls (static, instance, virtual, delegate, dynamic, reflection, etc). It’s interesting that delegate calls have nearly the same speed as a normal call. But a delegate that has not one but two subscribers is not 2 but 3-5 times slower.

P.S. Running Release builds under Visual Studio is still slower than running them manually. I’ve had x1.5 – x2 speed boosts for small operations like method calls.

      • Mario Vernari

If you check the resulting IL, the compound operators produce more instructions than a single explicit expression. That’s true both for the debug and the release build. In the expression case there’s no need to store back an intermediate value, because the program does not ask for it. Instead, by composing the calculation from several smaller pieces, you implicitly tell the compiler to do that…and it does.

        There’s no JIT in the Micro Framework: the processor runs an interpreter of the IL code.

        What do you mean as “running release manually”?

  2. Ark-kun (@Ark_kun)

    >Instead, by composing the calculation over several smaller pieces, you implicitly say the compiler to do that…and it does.
If you think about it, the compiler preserves correctness. Suppose there is a parallel thread that constantly reads the d variable or writes some random stuff to it. In the single-expression case, another thread can never see the intermediate results or influence the computation by rewriting d. The multi-expression case allows such interactions. The C# compiler naturally preserves the behavior of the programs it compiles. Interesting find though.

    >What do you mean as “running release manually”?
    It means starting the program NOT from Visual Studio. Another option would be to disable JIT optimization suppression. Read about all of this here: http://blogs.msdn.com/b/vancem/archive/2006/02/20/535807.aspx

    I’ve looked at the optimized/non-optimized Release build disassembly. The non-optimized single-statement function has the entire loop body removed. The non-optimized multi-statement function has the loop body intact. When both functions are JIT-optimized nearly everything is removed – only the loop remains (all other variables are eliminated). I’ll try to post the disassemblies.

    P.S. You should use the loop variable inside the loop. More importantly you should do something with the result to prevent it from being optimized away.

    • Mario Vernari

      I believe you’re mistaking the context of my article: everything is related to the .Net *Micro* Framework, not the ordinary one.
      So:
– there’s no native compilation at all (nor any kind of JIT): the generated IL code is interpreted by the board itself;
– most of the boards running the code are ARM-based, not x86 or the like;
– the threading model is much simpler than the classic one, especially because the CPU does not have any native/hardware task switching: the threads have to be managed by the firmware;
– I’m not 100% sure, but I believe there’s no difference in terms of speed whether the IDE is attached or not. The app must be running on the board: at most the IDE is “attached” for debugging.

      I thank you very much for your explanation, as well as the code listings, but I guess we’re talking about two dramatically different things!
      Cheers

  3. Ark-kun (@Ark_kun)

    single-statement Release
    — c:\Users\Ark-kum\Documents\Visual Studio 11\Projects\ConsoleApplication1\Program.cs
    int ix = 50;
    00000000 push ebp
    00000001 mov ebp,esp
    00000003 sub esp,20h
    00000006 cmp dword ptr ds:[007B0AFCh],0
    0000000d je 00000014
    0000000f call 5717488A
    00000014 xor edx,edx
    00000016 mov dword ptr [ebp-20h],edx
    00000019 xor edx,edx
    0000001b mov dword ptr [ebp-4],edx
    0000001e fldz
    00000020 fstp qword ptr [ebp-14h]
    00000023 fldz
    00000025 fstp qword ptr [ebp-1Ch]
    00000028 fldz
    0000002a fstp dword ptr [ebp-0Ch]
    0000002d fldz
    0000002f fstp dword ptr [ebp-8]
    00000032 mov dword ptr [ebp-4],32h
    int iy = 16;

    float f;
    float fx = 50.0f;
    00000039 mov dword ptr [ebp-8],42480000h
    float fy = 16.0f;
    00000040 mov dword ptr [ebp-0Ch],41800000h

    double d;
    double dx = 50.0;
    00000047 fld dword ptr ds:[004728A8h]
    0000004d fstp qword ptr [ebp-14h]
    double dy = 16.0;
    00000050 fld dword ptr ds:[004728B0h]
    00000056 fstp qword ptr [ebp-1Ch]
    for (int n = 0; n < Count; n++) {
    00000059 xor edx,edx
    0000005b mov dword ptr [ebp-20h],edx
    0000005e nop
    0000005f jmp 00000064
    00000061 inc dword ptr [ebp-20h]
    00000064 cmp dword ptr [ebp-20h],32h
    00000068 jl 00000061
    d = ix * (fx + dx) * (fy + dy); //repeated 20 times
    }
    }
    0000006a nop
    0000006b mov esp,ebp
    0000006d pop ebp
    0000006e ret

  4. Ark-kun (@Ark_kun)

multi-statement Release
    — c:\Users\Ark-kum\Documents\Visual Studio 11\Projects\ConsoleApplication1\Program.cs
    int ix = 50;
    00000000 push ebp
    00000001 mov ebp,esp
    00000003 sub esp,28h
    00000006 cmp dword ptr ds:[009F0AFCh],0
    0000000d je 00000014
    0000000f call 571347FA
    00000014 xor edx,edx
    00000016 mov dword ptr [ebp-28h],edx
    00000019 xor edx,edx
    0000001b mov dword ptr [ebp-4],edx
    0000001e fldz
    00000020 fstp qword ptr [ebp-14h]
    00000023 fldz
    00000025 fstp qword ptr [ebp-1Ch]
    00000028 fldz
    0000002a fstp qword ptr [ebp-24h]
    0000002d fldz
    0000002f fstp dword ptr [ebp-0Ch]
    00000032 fldz
    00000034 fstp dword ptr [ebp-8]
    00000037 mov dword ptr [ebp-4],32h
    int iy = 16;

    float f;
    float fx = 50.0f;
    0000003e mov dword ptr [ebp-8],42480000h
    float fy = 16.0f;
    00000045 mov dword ptr [ebp-0Ch],41800000h

    double d;
    double dx = 50.0;
    0000004c fld dword ptr ds:[004B2960h]
    00000052 fstp qword ptr [ebp-1Ch]
    double dy = 16.0;
    00000055 fld dword ptr ds:[004B2968h]
    0000005b fstp qword ptr [ebp-24h]
    for (int n = 0; n < Count; n++) {
    0000005e xor edx,edx
    00000060 mov dword ptr [ebp-28h],edx
    00000063 nop
    00000064 jmp 00000087
    d = ix;
    00000066 fild dword ptr [ebp-4]
    00000069 fstp qword ptr [ebp-14h]
    d *= fx + dx;
    0000006c fld dword ptr [ebp-8]
    0000006f fadd qword ptr [ebp-1Ch]
    00000072 fmul qword ptr [ebp-14h]
    00000075 fstp qword ptr [ebp-14h]
    d *= (fy + dy);
    00000078 fld dword ptr [ebp-0Ch]
    0000007b fadd qword ptr [ebp-24h]
    0000007e fmul qword ptr [ebp-14h]
    00000081 fstp qword ptr [ebp-14h]
    for (int n = 0; n < Count; n++) {
    00000084 inc dword ptr [ebp-28h]
    00000087 cmp dword ptr [ebp-28h],32h
    0000008b jl 00000066
    // … repeated 20 times
    }
    }
    0000008d nop
    0000008e mov esp,ebp
    00000090 pop ebp
    00000091 ret

  5. Ark-kun (@Ark_kun)

    single-statement optimized Release
    — c:\Users\Ark-kum\Documents\Visual Studio 11\Projects\ConsoleApplication1\Program.cs
    int ix = 50;
    00000000 push ebp
    00000001 mov ebp,esp
    for (int n = 0; n < Count; n++) {
    00000003 xor eax,eax
    00000005 inc eax
    00000006 cmp eax,32h
    00000009 jl 00000005
    0000000b pop ebp
    d = ix * (fx + dx) * (fy + dy); //repeated 20 times
    }
    }
    0000000c ret
