Netduino SPI: “S” is for “Speed”


One of the most used peripherals of the Netduino microcontroller is the SPI. Its simplicity, together with its speed, makes it a very good medium for exchanging data between the microcontroller (and thus your program) and external devices. The market is full of shields and circuits that base their data exchange on SPI, and a vast range of ICs are SPI-ready, which makes wiring up your own logic a pretty easy task.
There are several articles, tutorials and programs demonstrating how to use the SPI, but most of them relate to some specific shield. That’s not what we are talking about here; instead, it is interesting to show how to interface our own logic and get good performance out of it.

Using the SPI.

Many times we need to expand the number of I/Os of our board. Sometimes we need a way to perform a parallel transfer, since the base framework does not offer that feature.

A simple way to achieve that is to use an ordinary shift-register chip. As the name suggests, a shift-register takes just one bit at a time as input and shifts it along a byte-wide register (8 cells). The shift is not automatic; it is driven by the “clock”: each clock event means a one-bit shift. There is no constraint on the number of clocks applied: the oldest bits simply overflow and are lost, so we must take care to apply the exact number of clocks.
This brief description depicts a “serial-in-parallel-out” logic, but there are “parallel-in-serial-out” and other flavors as well.
All of this is best described as “synchronous communication”, because the clock travels along with the data. An “asynchronous communication” (e.g. the UART), on the other hand, relies on an implicit matching of the data-rate: if the rates don’t match, the devices involved don’t understand each other.
When we have to connect chips together (i.e. over very short distances) the synchronous choice is surely better: it offers high throughput, reliability, frequency independence and a lot more. Its price is a little extra logic and a certain number of lines to manage it.
When is a synchronous choice ruled out? When we need to exchange data over relatively long distances, or when we cannot afford the burden of many wires to connect the devices.
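To make the shifting mechanism described above more concrete, here is a minimal sketch in plain desktop C# (not Netduino code) that models an 8-cell serial-in-parallel-out register in software. The class ShiftRegisterModel and its members are made up for this illustration, and sending the most significant bit first is an assumption.

```csharp
using System;

// Hypothetical software model of an 8-cell serial-in-parallel-out register.
public class ShiftRegisterModel
{
    private byte _cells;

    // The 8 cells as they would appear on the parallel side.
    public byte Cells { get { return _cells; } }

    // One clock event: every cell moves one position, the new bit enters,
    // and the oldest bit overflows and is lost.
    public void Clock(bool dataIn)
    {
        _cells = (byte)((_cells << 1) | (dataIn ? 1 : 0));
    }

    // Shifts a whole byte in, one bit per clock (MSB first: an assumption).
    public void ShiftIn(byte value)
    {
        for (int bit = 7; bit >= 0; bit--)
        {
            Clock(((value >> bit) & 1) != 0);
        }
    }
}

public static class ShiftRegisterDemo
{
    public static void Main()
    {
        var reg = new ShiftRegisterModel();

        reg.ShiftIn(0xA5);                            // exactly 8 clocks
        Console.WriteLine(reg.Cells.ToString("X2"));  // prints "A5"

        reg.Clock(true);                              // a 9th clock: data overflows
        Console.WriteLine(reg.Cells.ToString("X2"));  // prints "4B"
    }
}
```

The ninth clock shows why the exact number of clocks matters: one extra pulse is enough to corrupt the byte held in the register.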

Visual devices connected via SPI.

When we begin playing with Netduino, the first experiments are driving LEDs, LCDs and other “visual” stuff. That’s absolutely normal, because the visual way is the most intuitive and direct way to get concrete feedback on what our program is doing.
Why should the SPI be involved with visual devices?
Well, right after our “hello-world” program, which can only blink an LED, we will try connecting two, three or more LEDs. Luckily our Netduino provides many I/O ports, so it is very easy to add even a dozen LEDs. But what’s the sense of wasting all your precious ports just for driving LEDs?
The Mighty Stefan was among the first to join this “gold rush” around shift-registers. He enjoyed connecting a shift-register driving 8 LEDs so much that he wanted to go further, chaining several chips as a cascade of bits.

At the time of writing, I know he has connected up to 5 shifters linked together.

However, my very first attempt to use a shift-register with my Netduino was for driving an LCD module. I found reading text much more exciting than watching a psychedelic game of LEDs. Szymon presented a very good tutorial on how to interface a common 16×2 LCD module using a 74HC595 shift-register. It worked on the first run.
I would point out that here we will talk about the 74HC595 chip, but there is *nothing* specific to it. This model is often used due to its versatility, and it is preferable to keep the discussion on well-known devices, so everyone can test it easily.

Where is this article going?

Well, the “problem” may not be an actual problem; it depends on what we are looking for. However, there are situations where simply connecting a shift-register doesn’t solve our problem, or where the data speed turns out to be far from the “megabits promise” of the manufacturer. The microcontroller specifications say that the SPI could even reach the processor clock, but that sounds much more like a theoretical value than an effective data-rate.
We will analyze two hardware solutions to connect a shift-register: one is trivial, the other one is more sophisticated, but it offers a much higher performance.
A generic task will be considered in both contexts: the program should transfer a series of bytes to an external device, and each byte, as soon as it is available, should be “notified” to the target consumer. For example, suppose we feed a parallel DAC, where each 8-bit sample must be latched onto the converter so that the analog output is set accordingly. Please consider this example as merely illustrative: several DACs have the SPI interface built in.
The SPI clock frequency is the same for both circuits, and it has been chosen as 2 MHz.

The easy way.

Let’s first analyze how the simplest and most common connection of a 74HC595 shift-register works.
Here is the sample code for the test. Our goal is to transfer a series of bytes to the register as fast as possible. The buffer transfer is repeated indefinitely, with only a short pause between the cycles. During a buffer transfer we should expect a byte-rate of approximately 2 Mbit/s / 8 bits = 250 KBytes/s.
Will we be able to reach that?

    using System.Threading;
    using Microsoft.SPOT.Hardware;
    using SecretLabs.NETMF.Hardware.Netduino;

    public class Program
    {
        private static SPI SPIBus;

        public static void Main()
        {
            // Defines the first SPI slave device with pin 10 as SS
            SPI.Configuration Device1 = new SPI.Configuration(
                Pins.GPIO_PIN_D10, // SS-pin
                false,             // SS-pin active state
                0,                 // The setup time for the SS port
                0,                 // The hold time for the SS port
                true,              // The idle state of the clock
                true,              // The sampling clock edge (this must be "true" for the 74HC595)
                2000,              // The SPI clock rate in KHz
                SPI_Devices.SPI1   // The used SPI bus (refers to a MOSI MISO and SCLK pinset)
            );

            // Initializes the SPI bus, with the first slave selected
            SPIBus = new SPI(Device1);

            DoWorkSlow();
        }

        /// Send 8 bytes out to the SPI (one byte at once)
        private static void DoWorkSlow()
        {
            //set-up a one-byte buffer
            byte[] buffer = new byte[1];

            while (true)
            {
                for (int i = 0; i < 8; i++)
                {
                    buffer[0] = (byte)i;
                    // one Write call per byte: SSEL falls and rises around each of them
                    SPIBus.Write(buffer);
                }

                // short pause between two buffer cycles
                Thread.Sleep(5);
            }
        }
    }


Note: part of the code was “stolen” from Stefan’s tutorial on SPI, in the Netduino wiki section.

The code clearly shows some computational overhead, because there’s no way to send a single byte directly to the SPI driver. Instead, we must create a single-byte buffer and then populate its only cell with the desired value.
Another fault of this approach is that we *must* take care of sending every single byte, while it would be much more useful to spend that time on other operations.
Here is the logical timing sequence. Note that the chart is not drawn to scale.

For every byte (i.e. every Write call), the SSEL line (Slave Select) falls and stays low until the stream is over. Since the “stream” is just one byte, the SSEL rises after the 8th bit.
The clock (SCLK) pulses for 8 periods. The data (MOSI) is shifted out of the Netduino on each falling edge of the clock. That is because the 74HC595 needs to sample the data (MOSI) when its clock input rises. To avoid any misinterpretation of data, the best thing is to keep the data perfectly stable during the SCLK rising edge.
The rising edge of the SSEL line is also used to latch the byte shifted on the 74HC595 parallel output.
All that does what we expect, but…what is the real timing?

It is easy to see the 8 single-byte transfers, separated by the pause. It seems that the real delay is almost 6 ms instead of 5, but maybe this is not a problem.
Much more interesting is measuring the time that elapses between the beginning of one byte and the next. For example, this time can be taken as the period between two consecutive rising edges of the SSEL line.
The scope shows 368 µs, that is about 2.7 KBytes/s: around 100 times slower than the expected rate!

Finally, here are both the schematic and the breadboard layout, for those who want to build and test this circuit.

The smart way.

The program is almost the same as before, and even simpler and more efficient, because the buffer is sent “as-is”. Our application doesn’t care about how the bytes are shifted out: it feels like heaven!
Here is only the part that differs from the code above.

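        /// Send 8 bytes out to the SPI (buffer at once)
        private static void DoWorkFast()
        {
            //set-up the desired buffer (same eight values as in DoWorkSlow)
            byte[] buffer = new byte[] { 0, 1, 2, 3, 4, 5, 6, 7 };

            while (true)
            {
                SPIBus.Write(buffer);   // a single Write call shifts out the whole buffer
                Thread.Sleep(5);        // short pause between two buffer cycles
            }
        }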

Please bear in mind that the SPI clock frequency is still the same as before, so the expected throughput is still 250 KBytes/s.

The logical timing sequence of the SPI is similar to the one-byte transfer, but now there is a problem: the SSEL line rises only at the end of the last byte. How can the preceding bytes be latched onto the 74HC595 outputs?
We must add some logic to help the circuit. Here is the revised logical timing sequence.

The extra logic must provide a latch clock to the 74HC595 exactly after every 8th bit. To achieve this we need a counter; an ordinary 3-stage counter is enough, because we need to count just up to 8. This counter should also “trigger” some other logic so that we obtain a pulse, whose rising edge can finally latch the data on the register.

The counter is a 74HC4040. It is a 12-stage binary counter that increments its value on every falling edge of the clock, and it can be reset by pulling the related pin high. The reset input is connected to the SSEL line, so that when the transfer begins we are guaranteed that the counter is zero.
The SCLK from the Netduino must feed the counter, but we need to invert it. Remember? The 74HC595 shifts on the rising edge of the clock, so the counter must increment at the same time. To invert the SCLK line I used a 40106: it is an old CMOS part, not as fast as an HCMOS one, but it embeds 6 Schmitt-trigger inverters and it allows a very wide power supply range.
Here is a detail of the SCLK (light blue) and the output of the first stage of the 74HC4040 (yellow). Note that the propagation delay between the rising edge of the SCLK and the output change is almost comparable to half the clock period. That is not a good thing, and we should have chosen a chip that performs better than the 40106.

The 3-stage counter starts at 000, then 001, then 010, up to 111, then rolls over to zero. We will take advantage of the third bit, because it falls from 1 to 0 right after 8 clocks. At this point, the only problem is to create a pulse triggered by the falling edge of the counter output.
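The counting idea can be checked in software before touching the breadboard. The following is only a sketch in plain desktop C# (not Netduino code): it models the shift-register, a 3-bit counter that is zero when the transfer starts, and a latch that fires whenever the third counter bit falls. All the names are made up for the illustration, the analog pulse generator is reduced to an ideal edge detector, and the MSB-first bit order is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class LatchEveryEighthClock
{
    // Streams a buffer bit by bit and returns every value that gets latched
    // onto the (simulated) parallel output.
    public static List<byte> Stream(byte[] buffer)
    {
        var latched = new List<byte>();
        int counter = 0;   // 3-bit counter, held at zero until the transfer starts
        byte shift = 0;    // contents of the shift-register

        foreach (byte value in buffer)
        {
            for (int bit = 7; bit >= 0; bit--)
            {
                // One clock: the register shifts and the counter increments.
                shift = (byte)((shift << 1) | ((value >> bit) & 1));

                bool q2Before = (counter & 4) != 0;
                counter = (counter + 1) & 7;
                bool q2After = (counter & 4) != 0;

                // The third bit falls from 1 to 0 only on the 8th clock.
                if (q2Before && !q2After)
                {
                    latched.Add(shift);
                }
            }
        }
        return latched;
    }

    public static void Main()
    {
        byte[] buffer = { 0, 1, 2, 3, 4, 5, 6, 7 };
        Console.WriteLine(string.Join(",", Stream(buffer)));  // prints "0,1,2,3,4,5,6,7"
    }
}
```

Each byte of the buffer shows up exactly once on the output, even though the SSEL line never moves during the stream.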
The pulse generator is built with a small analog circuit (R, C and diode), along with a couple of 40106 inverters.

  • Consider the Q2 output of the 74HC4040 at logic level “1” (+5V), just before it drops to zero. The voltage across the capacitor is zero, because there is +5V on the counter side and +5V on the TRIG point, which is pulled up by the resistor.
  • As soon as Q2 falls to zero, the capacitor keeps its voltage drop, bringing the TRIG point toward zero volts as well.
  • However, current flows through the resistor and charges the capacitor, so the voltage on the TRIG point begins to rise with an exponential shape.
  • Now the Q2 output switches back to the high level. Again, the capacitor tries to keep its voltage drop, pushing the TRIG point above +5V, but the diode limits it.
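The exponential rise in the steps above follows the usual RC charging law, V(t) = Vcc · (1 − e^(−t/RC)), so the pulse lasts roughly until the TRIG voltage crosses the upper threshold of the Schmitt-trigger. Here is a rough sketch in plain desktop C# to estimate that width. The component values (10 kΩ, 100 pF) and the 2.9 V threshold are only assumptions for the sake of the example, not the values of the actual circuit, and the model ignores the output resistance of the counter.

```csharp
using System;

public static class TrigPulseEstimate
{
    // Voltage on the TRIG point, t seconds after Q2 has fallen to zero.
    public static double TrigVoltage(double t, double r, double c, double vcc)
    {
        return vcc * (1.0 - Math.Exp(-t / (r * c)));
    }

    // Time needed by the TRIG point to climb back to the given threshold.
    public static double PulseWidth(double r, double c, double vcc, double vThreshold)
    {
        return r * c * Math.Log(vcc / (vcc - vThreshold));
    }

    public static void Main()
    {
        const double R = 10e3;        // ohm (assumed value)
        const double C = 100e-12;     // farad (assumed value)
        const double Vcc = 5.0;       // volt, as in the article
        const double VThreshold = 2.9; // volt (assumed Schmitt-trigger upper threshold)

        double width = PulseWidth(R, C, Vcc, VThreshold);
        Console.WriteLine("Estimated pulse width: {0:F2} us", width * 1e6);
        Console.WriteLine("TRIG voltage at that time: {0:F2} V",
            TrigVoltage(width, R, C, Vcc));
    }
}
```

With a 2 MHz clock, Q2 goes high again four clocks (2 µs) after it falls, so whatever values are chosen, the estimated width should stay well below that.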

The Schmitt-trigger inverters are needed because the input signal is analog, and it is useful to handle it as cleanly as possible, to prevent unpredictable behavior during the transitions. Two inverters in cascade mean no logic inversion at all: we only need the Schmitt-trigger capability.
Here is a detail of the pulse that follows the 8-clock sequence. Again, note the not-so-sharp fall and rise of the pulse, due to the poor performance of the 40106.

The “smart” (and complex) solution has been built. What about the data transfer performance?
Here is the picture showing the buffer transfers separated by the Sleep pause. Again the spacing is closer to 6 ms than 5, but that’s not what we are looking for.

Here is the interesting chart: the scope clearly shows a huge increase in speed between one byte and the next. The actual byte period is about 5.5 µs, which is over 180 KBytes/s. This is still short of the theoretical value of 250K, but it seems to be the best the Netduino can do (keeping the clock frequency fixed at 2 MHz).
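For those who like to double-check the numbers, the arithmetic behind both measurements fits in a few lines. This is just a sketch in plain desktop C# (not Netduino code) with made-up names; the only inputs are the 2 MHz clock and the two byte periods read on the scope (368 µs and 5.5 µs).

```csharp
using System;

public static class SpiThroughput
{
    // Time needed to shift out 8 bits back-to-back at the given clock.
    public static double IdealBytePeriod(double clockHz)
    {
        return 8.0 / clockHz;
    }

    // Byte-rate that corresponds to a given byte period.
    public static double BytesPerSecond(double bytePeriodSeconds)
    {
        return 1.0 / bytePeriodSeconds;
    }

    public static void Main()
    {
        double ideal = IdealBytePeriod(2e6);  // 4 us per byte at 2 MHz
        double slow = 368e-6;                 // byte-at-a-time, measured
        double fast = 5.5e-6;                 // whole buffer, measured

        Console.WriteLine("Ideal:    {0:F0} bytes/s", BytesPerSecond(ideal));
        Console.WriteLine("Easy way: {0:F0} bytes/s", BytesPerSecond(slow));
        Console.WriteLine("Smart way: {0:F0} bytes/s", BytesPerSecond(fast));

        // How much of each byte period is spent idle in the smart way.
        Console.WriteLine("Idle gap per byte: {0:F1} us", (fast - ideal) * 1e6);

        // Gain of the smart way over the easy way.
        Console.WriteLine("Speed-up: {0:F0}x", slow / fast);
    }
}
```

In other words, the missing throughput in the smart way comes from about 1.5 µs of idle time between consecutive bytes, while the clock itself really runs at 2 MHz.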


Using this pretty simple extra logic we are able to hugely increase the actual performance of the Netduino SPI. The analog trick is not a reliable way to manage digital circuits, but sometimes it is acceptable.
This article was a halfway step, explaining a technique to improve the SPI throughput, because I need it for my next project.
Stay tuned!


17 thoughts on “Netduino SPI: “S” is for “Speed”

  1. Stefan Thoolen

    > At the moment I am writing, I know he connected up to 5 shifters linked together.
    Actually I’ve done more 🙂 I think the limit I did so far was 10 chips daisy chained, without much loss of performance.

    • Mario Vernari

      For sure there is a limit on the number of chips, based on the SPI speed, because we must take the propagation delays into account.
      Anyway, this post aims to point out the really bad performance without some hardware trick. As you read, even with the SPI clock set at 2MHz, the actual data rate is a few KBytes/sec.

  2. Androino

    Hi, I saw your post on Netduino about frequency shift keying through audio signals,
    and I would like to tell you that I’m working on something similar.

    I’m Alberto a student from Spain, who is working on his Final Project as
    Telecommunication Engineering. This is my blog where i explain what i’m
    doing and what are my objectives.

    If you look at Project Description, you can see that my project will need to use something quite
    similar to what you are talking about. I found this blog while I was looking for some
    information about the technologies used nowadays.

    I would like to keep in touch and listen to all the suggestion you
    can make in order to improve my project.

    Thank you very much,

    I’m looking forward to hearing from you.

    I also encourage to create your own blog where all your improvements can be analyzed .

  3. frabor

    Is there any reason why, in the code above, you did not use a byte type instead of an int?

    for (int i = 0; i < 8; i++)

    buffer[0] = (byte)i;

    • Mario Vernari

      The reason is that .Net always works internally with Int32. Almost any operation is converted to Int32 if it isn’t already.
      So I prefer to minimize the number of casts, and use them only when needed.
      In this particular case there’s no difference: the snippet below shows two loops with different index types.

      int dummy = 0;
      byte[] buffer = new byte[66];
      dummy = 77;

      for (byte i = 0; i < 44; i++)
          buffer[i] = i;

      dummy = 88;

      for (int i = 0; i < 55; i++)
          buffer[i] = (byte)i;

      dummy = 99;

      Now, let’s take a look at the IL-disassembly:

      IL_0008: ldc.i4.0
      IL_0009: stloc.0

      IL_000a: ldc.i4.s 66
      IL_000c: newarr [mscorlib]System.Byte
      IL_0011: stloc.1

      IL_0012: ldc.i4.s 77
      IL_0014: stloc.0

      //here is the first loop
      IL_0015: ldc.i4.0
      IL_0016: stloc.2
      IL_0017: br.s IL_0024
      IL_0019: nop
      IL_001a: ldloc.1
      IL_001b: ldloc.2
      IL_001c: ldloc.2
      IL_001d: stelem.i1
      IL_001e: nop
      IL_001f: ldloc.2
      IL_0020: ldc.i4.1
      IL_0021: add
      IL_0022: conv.u1
      IL_0023: stloc.2
      IL_0024: ldloc.2
      IL_0025: ldc.i4.s 44
      IL_0027: clt
      IL_0029: stloc.s CS$4$0000
      IL_002b: ldloc.s CS$4$0000
      IL_002d: brtrue.s IL_0019

      IL_002f: ldc.i4.s 88
      IL_0031: stloc.0

      //here is the second loop
      IL_0032: ldc.i4.0
      IL_0033: stloc.3
      IL_0034: br.s IL_0041
      IL_0036: nop
      IL_0037: ldloc.1
      IL_0038: ldloc.3
      IL_0039: ldloc.3
      IL_003a: conv.u1
      IL_003b: stelem.i1
      IL_003c: nop
      IL_003d: ldloc.3
      IL_003e: ldc.i4.1
      IL_003f: add
      IL_0040: stloc.3
      IL_0041: ldloc.3
      IL_0042: ldc.i4.s 55
      IL_0044: clt
      IL_0046: stloc.s CS$4$0000
      IL_0048: ldloc.s CS$4$0000
      IL_004a: brtrue.s IL_0036

      IL_004c: ldc.i4.s 99
      IL_004e: stloc.0

      However, if I were adding an offset, more operations would be needed.

      Finally, I prefer to avoid unsigned types because they are often tricky.

  4. frabor

    Thanks Mario. Really instructive 🙂 I am working on an SPI project and I have been going through and enjoying your posts on how to speed up the bus response.

    Still, what has me puzzled is how they could avoid slow bus arbitration on the Netduino+, since the card reader is attached to the second SPI port of the Atmel.

  5. CCa

    Thank you for this really instructive post. I am new to the Netduino community and I am trying to get the best from this hardware. Do you think it would be possible to set the clock frequency up to 4 or 8 MHz instead of 2 MHz to improve performance? I guess that the limiting factor would be the latency introduced by the 40106 circuitry… What do you think?
    (I will try it soon by myself; I am currently working on a 320*240 color TFT screen.)

    • Mario Vernari

      Oh, the limiting factor is not the 40106, which can be replaced by the 74HC14, for instance. You should only take care about the target device specs.
      Hope it helps.

  6. CCa

    Yes, it helps. I will try to drive my 320*240 pixel color TFT screen as quickly as possible with the Netduino using these tips, and keep you informed! Thank you!

  7. Steven Don

    I had some trouble building this because the inverters I have (74C04N) are not Schmitt-Triggers. This meant I couldn’t use it to generate the reset pulse from the analog pulse generator. A simple (2N2221A) transistor did the trick though. Got me beautifully square pulses!

    • Mario Vernari

      Hello Steven.
      Do you mean the transistor’s emitter to the ground and the collector with a pull-up?
      If so, that’s very nice: much better than the chip!

      • Steven Don

        Yep, analog pulse goes into base, ground to emitter, collector to +5V with a 330R and to the 74HC595’s pin 12 (STCP). It was the only 1-bit ADC I could think of with the components I had on hand.

  8. Sylvain Paris

    Great blog, interesting stuff!

    Do you have an explanation of why the measured SPI frequency is different from the programmed frequency? Isn’t this unrelated to the drawbacks of managed code, since the SPI commands are addressed directly to the microcontroller?

    Thanks !

    • Mario Vernari

      Hmm…not sure I understand what you mean.
      Yes, the actual clock-rate roughly matches what’s set in the code.
      The problem arises when you have to send several bytes, giving a sync pulse for each one. Either you send byte-by-byte (but it’s very, very slow), or you send a buffer (but you can’t give the sync). With this trick you can leverage the full-speed SPI clock, while still giving the sync to an external logic.
