FLEXIO - TeensyUser/doc GitHub Wiki

Written by @miciwan


Hi everyone,

I'm slowly working through different parts of the documentation, to build enough understanding of the i.MX RT1060 to use it in one of my projects. Two important components will be an LCD screen and a camera sensor, and my aim is to have them both running at the highest possible framerates, to provide a good user experience. For that, I would like to utilize the parallel interfaces of both the sensor and the screen, and I was looking for ways to do it efficiently on the microcontroller side.

I initially looked at just doing DMAs. When digging through the docs, I kept writing my thought process up, and I posted it here: https://forum.pjrc.com/threads/63353...l=1#post266991. It seemed, however, that with DMAs alone on the GPIO I simply won't be able to get the speeds I would like (the camera sensor I have in mind can run at 75MHz; DMA seems to be losing data if going faster than 10MHz), so I started looking into some other options.

(just for reference: the docs are here: https://www.pjrc.com/teensy/IMXRT1060RM_rev2.pdf, KurtE's pinout diagrams are here: https://github.com/KurtE/TeensyDocuments/blob/master/Teensy4.1%20Pins.pdf and here: https://github.com/KurtE/TeensyDocuments/blob/master/Teensy4.1%20Pins%20Mux.pdf)

After skimming through the docs, the CMOS sensor interface (Chapter 34 in the docs) and the Enhanced LCD interface (Chapter 35) seemed like an option. The first one has two problems: first, my sensor outputs raw, 12-bit data, but the CMOS interface can only support raw data in 10 bit width. Second, only 8 of its data lines are exposed as Teensy pins. These are actually the top 8 bits of the 10 bit bus supported by that interface, but I don't want to lose 4 bits of precision, so it's a no-go for me. The LCD interface has even fewer pins exposed (only 5 iirc), which makes it pretty much unusable. Oh well, back to the drawing board, or rather the docs.

If you scroll the table of contents further, however, there's this section on FlexIO - and it seemed exactly like something I could use. Since there is fairly little information on it available (I think the only bigger project is KurtE's SPI code on GitHub), I thought that I'd do the same thing as I did with the DMA and document how I approached my problem, step by step, with references to the docs, so if you're unfamiliar with the system, just like I was, you can follow along and adjust it as you go to your particular use case. Here it is.

Side comment: interestingly, once you get a general intuition for the microcontroller-docs-lingo, the FlexIO documentation is actually not that hard to parse. It's all there in Chapter 50. While it's not as nicely packaged as the Raspberry Pi Pico PIO (which I think is absolutely amazing from the concept perspective - moving to a dedicated interpreter for driving IO from this convoluted, fixed function setup reminds me of how GPUs moved from fixed function register combiners to assembly shaders back in the early 2000s), it doesn't seem to be any less expressive.

How to start using FlexIO

(I highly suggest going through the DMA example first, as it's simpler and introduces a lot of these concepts - they are not really complicated, but take a while to get used to if, like me, you don't have a lot of experience with microcontrollers.)

Under the hood, the FlexIO is a collection of shifters and timers. In the Teensy 4 version, there are 8 shifters and 8 timers. The numbers are actually reported in the PARAM register (page 2914). One caveat here: the docs only describe 4 shifters in the registers section (and only 4 shifters are exposed as defines in the Teensy core header files), while the textual description correctly talks about 8 shifters. You will need to add a couple of extra #defines if you want to access shifters 4-7. As usual, all the configuration is done through a number of memory mapped registers. The table with the list of them is on page 2912.

The shifters are 32 bit registers that can, well, shift the data. When shifting the data, the input appears on the top bits, everything already in the shifter shifts right, and the lowest bits are output by the shifter (nice diagram on page 2885). So if you configure the shifter to the receive mode, it will shift the incoming data in at the top bits. If you set it up to the transmit mode, it will spit out the low bits. The cool feature here is that the data can be sourced from or spit out to not only the pins, but also a neighboring shifter. This makes it possible to create a chain of shifters, one outputting to the next, with only the last one actually driving the pins.

Why would you do something like that? For buffering. This way you can buffer the incoming/outgoing data and the CPU only has to take care of it every now and then, instead of every time a new piece of data appears (as was the case with the DMA) - so exactly what we wanted. You might however wonder what happens with the data between the point in time when the shifters are already full and the point when the CPU picks the data up (or between the point when all the previous data is shifted out and the point when new data is provided by the CPU). There's actually one more, crucial element there: the shifter buffer. Every now and then (it's for you to decide!) the data is moved from the shifter to the shifter buffer (in case of receiving the data; or from the shifter buffer to the shifter in case of transmitting) and the CPU gets notified (in the form of an interrupt request, a DMA request, or you can also manually check the status bit in a register). The new data can be fetched from the shifter buffer, while the shifters continue their job and receive more data. After the data is read from the shifter buffer, the notification is automatically cleared and the shifters can put the new data there when it's ready.

In short:

  • data comes from the external pins into the shifter
  • after a configurable number of shifts it gets transferred to the shifter buffer and the CPU gets notified
  • the CPU reads the data from the shifter buffer
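To make the three steps above a bit more tangible, here is a tiny host-side C++ model of a single shifter in receive mode. This is only a sketch for building intuition, not FlexIO code: the `ShifterModel` type and the `receiveAll` helper are made-up names, and the model assumes one bit per shift and a fixed number of shifts between transfers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of one FlexIO shifter in receive mode (illustrative names only).
struct ShifterModel {
    uint32_t shifter = 0;        // the 32-bit shift register
    uint32_t buffer = 0;         // the shifter buffer (what a SHIFTBUF read would return)
    bool statusFlag = false;     // set when fresh data lands in the buffer
    unsigned shiftsPerTransfer;  // how many shifts happen between transfers
    unsigned shiftCount = 0;

    explicit ShifterModel(unsigned n) : shiftsPerTransfer(n) { assert(n >= 1 && n <= 32); }

    // One shift clock: the new bit enters at bit 31, everything else moves right.
    void clockIn(bool bit) {
        shifter = (shifter >> 1) | (bit ? 0x80000000u : 0u);
        if (++shiftCount == shiftsPerTransfer) {
            buffer = shifter;    // shifter -> shifter buffer
            statusFlag = true;   // this is what would raise the IRQ / DMA request
            shiftCount = 0;
        }
    }

    // CPU/DMA side: reading the buffer clears the notification.
    uint32_t readBuffer() {
        statusFlag = false;
        return buffer;
    }
};

// Feed a bit stream and collect every word the "CPU" picks up.
std::vector<uint32_t> receiveAll(ShifterModel& s, const std::vector<bool>& bits) {
    std::vector<uint32_t> words;
    for (bool b : bits) {
        s.clockIn(b);
        if (s.statusFlag) words.push_back(s.readBuffer());
    }
    return words;
}
```

Note how, after 8 shifts, the first received bit sits at bit 24 and the last one at bit 31: the data always ends up in the top bits of the word, which matters later when decoding.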

The shifter buffers are available as SHIFTBUFa registers. They are contiguous in memory, which is useful when filling them, and they also come in a bunch of versions with swapped individual bits, nibbles, bytes. Pages 2929+ describe all the variations.

All of the shifters can shift in/out one bit at a time from/to the external pins. This doesn't really mean that you can only implement serial protocols - anything parallel just needs the data to be transposed first and transmitted through some number of pins at the same time. Additionally, shifters 0 and 4 can shift the data OUT to multiple pins at a time, and shifters 3 and 7 can shift data IN from multiple pins at a time. For example, if the shifter is set up for parallel output of 4 bits, instead of shifting by 1 bit, it's shifted by 4, and the lowest 4 bits are put on the output pins.

As usual, there are caveats - the pins used for parallel input/output must be contiguous - so if you want 8 bit parallel output, you need 8 consecutive FlexIO pins. To save you the trouble of looking it up: out of the three FlexIO modules, only the third one (FlexIO3) has more than 4 consecutive pins exposed on the Teensy (it actually has 20 - FlexIO3:0 - FlexIO3:19). The problem is that FlexIO3 is nerfed and cannot generate DMA requests, which makes it more of a pain in the butt to use (because you either have to busy wait, polling the status and filling/reading the shifter buffers as necessary, or you have to use interrupts for that, but at any decent data rate you pretty much just sit in the interrupt handler - so you might just as well do the busy polling). FlexIO2 has 2 runs of 4 consecutive pins (0-3, 16-19). FlexIO1 has a run of 5 consecutive pins (4-8), but they are just as good as 4, because FlexIO can only do parallel outputs in power-of-two counts.

An important thing to note is that even though only certain shifters support parallel transfer to/from the pins, ALL of them support parallel transfer from/to neighboring shifters. So you can set up shifter 0 to shift the data out 4 bits at a time to the pins, shifter 1 to shift 4 bits at a time to shifter 0, shifter 2 to shifter 1 and so on, creating a long chain of shifters, effectively getting an output queue 8 x 32 bits = 32 bytes in length. The number of bits shifted by each shifter is controlled by the PWIDTH field in the shifter configuration register SHIFTCFGa, page 2927.
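The chaining idea can be sketched with a small host-side C++ model. Again, this is just an illustration under my own assumptions (the `shiftOutChain` name is made up, and the last shifter in the chain is simply fed zeros): on every shift, the low bits of shifter 0 go to the pins, and each shifter takes in at its top whatever falls out of the next one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a transmit chain: shifter 0 drives the pins, shifter k+1 feeds shifter k.
// Returns the value put on the pins at every shift.
template <std::size_t N>
std::vector<uint32_t> shiftOutChain(std::array<uint32_t, N> shifters,
                                    unsigned width, unsigned shifts) {
    assert(width == 1 || width == 4 || width == 8 || width == 16);
    const uint32_t mask = (1u << width) - 1u;
    std::vector<uint32_t> pins;
    for (unsigned s = 0; s < shifts; ++s) {
        pins.push_back(shifters[0] & mask);  // the lowest bits go out to the pins
        for (std::size_t k = 0; k < N; ++k) {
            // What falls out of the next shifter enters this one at the top.
            // shifters[k + 1] has not been shifted yet at this point.
            const uint32_t incoming = (k + 1 < N) ? (shifters[k + 1] & mask) : 0u;
            shifters[k] = (shifters[k] >> width) | (incoming << (32 - width));
        }
    }
    return pins;
}
```

With two shifters loaded with 0x76543210 and 0xFEDCBA98 and a width of 4, sixteen shifts put the nibbles 0, 1, 2 ... 15 on the pins, in that order, without the CPU touching anything in between.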

There are two more pieces to the puzzle: when to shift and when to transfer from the shifter to the shifter buffer (or from the shifter buffer to the shifter, when transmitting out). Both of these tasks are controlled by the internal timers. There are 8 timers, and each shifter can pick which timer it uses. Each timer has a corresponding counter and an output. The counter is decremented based on some signal - either the FlexIO clock, a pin input or an external trigger. When the counter is zero and it decrements again, the timer output is toggled. The shifter clock can be either that timer output, or alternatively the pin or external trigger input (TIMDEC setting in the TIMCFGa register, page 2934). Each shifter can choose to shift on the positive or negative edge of that shifter clock signal (TIMPOL setting in the shifter control register SHIFTCTLa on page 2925).

To make it a bit more convoluted, there are three different modes for decrementing the timer's counter, and they, together with the decrement logic, define when the data is moved between the shifter and its corresponding shifter buffer. In the simplest, 16-bit mode, whenever the counter is at zero and decrements, the timer toggles the output and the data is moved from shifter to buffer (or vice versa). In the 8-bit mode, the counter is divided into two 8 bit parts. Whenever the lower half gets to zero and decrements, the timer output toggles, the higher half decrements and the lower half is reloaded. When the higher half gets to zero, the data is transferred between the shifter and the buffer. The third mode is the PWM mode, where again, the counter is divided into two 8-bit parts. The lower eight bits decrement while the timer output is high, and when the value gets to zero, the timer output toggles to low and the high 8 bits start decrementing. When they reach zero, the output toggles again and the data is copied between the shifter and the buffer. Technically the timer output is available as an output from the FlexIO module (section 50.2.6.1 on page 2895), but it doesn't seem to be available as an input to any of the XBARs, so I'm not sure how useful this is. The timer output can, however, be routed back to FlexIO, as a trigger for another timer.
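The 16-bit mode is simple enough to model in a few lines of host-side C++. This is a sketch of the behaviour as described in the paragraph above, not of the actual hardware: `Timer16Model` and `transfersAfter` are made-up names, and I'm assuming the counter is reloaded from the compare value at the same moment the output toggles.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a timer in 16-bit counter mode (illustrative names only).
struct Timer16Model {
    uint16_t compare;        // reload value (what would go into TIMCMP)
    uint16_t counter;        // current counter value
    bool output = false;     // timer output, toggled on every expiry
    unsigned transfers = 0;  // number of shifter <-> buffer transfers so far

    explicit Timer16Model(uint16_t cmp) : compare(cmp), counter(cmp) {}

    // One decrement event (FlexIO clock, pin edge or trigger, depending on TIMDEC).
    void tick() {
        if (counter == 0) {
            output = !output;   // counter is zero and "decrements again"
            ++transfers;        // data moves between shifter and shifter buffer
            counter = compare;  // assumed: reload happens together with the toggle
        } else {
            --counter;
        }
    }
};

// How many transfers happen within a given number of decrement events.
unsigned transfersAfter(uint16_t cmp, unsigned ticks) {
    Timer16Model t(cmp);
    for (unsigned i = 0; i < ticks; ++i) t.tick();
    return t.transfers;
}
```

The takeaway is the off-by-one: a compare value of N gives a transfer every N + 1 decrement events, so `transfersAfter(31, 32)` is 1 while `transfersAfter(31, 31)` is still 0.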

And to make it more flexible/confusing, timers also get yet another independent input - a trigger. A trigger can be used as the clocking signal for the timer, but more interestingly, it can enable or disable a given timer. The TIMCTLa register (page 2932) lets us pick an internal trigger or an external one (TRGSRC field). The internal ones are pin inputs, shifter status flags (indicating if the data has been loaded from/to the shifter from the corresponding buffer) or other timer outputs. External triggers can be hooked up through the XBAR. Selecting the trigger is done with the TRGSEL field, and the logic is funky to say the least - see page 2933. For the external trigger you simply select the index (corresponding to the FLEXIOa_TRIGGER_INb outputs from the XBAR - see page 72). Then, the timer configuration register TIMCFGa allows us to set when to enable or disable the given timer with the TIMDIS and TIMENA fields (page 2936). This makes it possible to implement really useful functionality, where the entire send/receive process is only performed when some signal is high or low (something like ENABLE/DATA_VALID/whatever).
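Since the internal TRGSEL encoding is the funky part, here is how I read it, written down as three small C++ helpers. Treat this as my assumption rather than gospel: the helper names are made up, the formulas (4*N for pin 2*N, 4*N+1 for shifter N's status flag, 4*N+2 for pin 2*N+1, 4*N+3 for timer N's output) are my reading of the TRGSEL description, and they only apply to internal triggers (TRGSRC set). Please double check against page 2933 before relying on them.

```cpp
#include <cassert>
#include <cstdint>

// TRGSEL values for INTERNAL triggers, as I understand the encoding (assumption!).

// Pin input: pin 2*N maps to 4*N and pin 2*N+1 maps to 4*N+2, i.e. simply pin * 2.
constexpr uint32_t trgselPinInput(uint32_t pin) { return pin << 1; }

// Status flag of shifter N maps to 4*N + 1.
constexpr uint32_t trgselShifterFlag(uint32_t shifter) { return (shifter << 2) | 1u; }

// Trigger output of timer N maps to 4*N + 3.
constexpr uint32_t trgselTimerOutput(uint32_t timer) { return (timer << 2) | 3u; }

static_assert(trgselPinInput(0) == 0, "pin 0 is the first selectable trigger");
```

For an external trigger none of this applies: TRGSEL is just the index of the FLEXIOa_TRIGGER_INb signal, as described above.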

The above describes only the most basic functionality of the shifters - the transmit and receive modes. They can also run in other modes - for instance you can define a small state machine, flipping between different shifters based on the inputs, but for now we can ignore that - if you need it, all the details are described on pages 2886 and on. Then there's also some extra functionality for inserting extra start and stop bits, resetting the timers on different occasions and some other more exotic stuff. The pretty unusual bit (in a good sense, this doesn't really happen too often in this documentation...) is section 50.4 starting on page 2896 - it goes over the configuration for a number of popular protocols, including UART, SPI, I2C and some others. It's a good start when you want to look for some examples.

Fairly realistic use case

Now, we'll go through implementing a high speed, 12-bit wide parallel interface (it can totally run 50MHz, probably faster too, though even at 50MHz it's tricky to process the data online - though perfectly fine if you just want to store it - assuming it fits in your memory budget).

The setup will be as follows:

  • there are 12 parallel data lines that will present new data every clock cycle
  • there's also a clock signal
  • and a DATA_VALID signal, that is high when the data presented on the parallel lines is valid; the data should be ignored when it's low. The clock signal still clocks when DATA_VALID is low.

which is a pretty common way of accessing CMOS sensors (which is why I even started looking into any of this).

First of all we need to get these 12 data lines through. As noted before, to use the parallel input/output functionality, the pins need to be contiguous. Let's disregard FlexIO3, because it doesn't support DMA (and has some other shortcomings - for instance we cannot route signals through XBAR to it). FlexIO1 only has a few pins exposed as Teensy pins, not enough for us. FlexIO2 has 13 pins available - exactly the 12 data lines plus the clock, and as it turns out DATA_VALID doesn't need a FlexIO pin of its own, so that's enough.

The problem is that the pins available are not contiguous. There's the 0-3 range and the 16-19 range, and the remaining pins are scattered all over the place. We definitely will not be able to use the parallel input functionality, at least not with all our 12 bits of data (and parallel input only does power-of-two counts anyway). There's however a fairly simple way to work around this, it just requires massaging the data a bit.

To output n bits over n serial interfaces, instead of a single n-bit parallel one, we need to transpose the data first. We take the 0th bits from 32 words and pack them in a single unsigned int. Then we do the same with the 1st bit from each of these words and so on. In the end, we get twelve 32bit words representing 32 elements from the data stream - and we can just output them at the same time, bit by bit - getting a parallel data stream without the need for the parallel output. Same goes for input, just in the opposite order.
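The transposition described above can be sketched in plain C++ as follows. This is a naive, unoptimized illustration with made-up names (`toBitPlanes`, `fromBitPlanes`), and it assumes sample s ends up in bit s of every plane, which is not necessarily the bit order a real shifter would use.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kBits = 12;     // width of the parallel bus
constexpr unsigned kSamples = 32;  // one bit per sample fills a 32-bit word

// Transpose 32 samples into 12 bit-planes:
// plane b collects bit b of every sample, sample s lands in bit s of the plane.
std::array<uint32_t, kBits> toBitPlanes(const std::array<uint16_t, kSamples>& samples) {
    std::array<uint32_t, kBits> planes{};
    for (unsigned s = 0; s < kSamples; ++s) {
        assert(samples[s] < (1u << kBits));
        for (unsigned b = 0; b < kBits; ++b)
            planes[b] |= static_cast<uint32_t>((samples[s] >> b) & 1u) << s;
    }
    return planes;
}

// The inverse: rebuild the 32 samples from the 12 bit-planes.
std::array<uint16_t, kSamples> fromBitPlanes(const std::array<uint32_t, kBits>& planes) {
    std::array<uint16_t, kSamples> samples{};
    for (unsigned s = 0; s < kSamples; ++s)
        for (unsigned b = 0; b < kBits; ++b)
            samples[s] |= static_cast<uint16_t>(((planes[b] >> s) & 1u) << b);
    return samples;
}
```

For transmitting you would run `toBitPlanes` and hand one plane to each single-bit shifter; for receiving you would collect one word per shifter and run `fromBitPlanes`.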

The i.MX RT1060 however has only 8 shifters, so we can only get an 8-bit wide bus this way. But this is where the parallel shifters come in handy - we can process some data in parallel, using parallel shifters (instead of transposing the data bit by bit, we need to do it in chunks of the width of our parallel input/output), and some serially. Teensy has these two 4-bit ranges of contiguous FlexIO2 pins - and we can use them both - which will give us 8 bits - and the remaining 4 bits will be processed serially.

The setup will be as follows:

  • Teensy pins 6, 9, corresponding to the pads B0_10, B0_11, will get the first two bits, each of them processing a single bit. They represent FlexIO2 pins 10, 11
  • Teensy pins 10, 12, 11, 13, corresponding to the native pads B0_00 - B0_03, which represent FlexIO2 pins 0-3, will get the next four bits, grabbing them using a parallel shift 4 bits wide
  • Teensy pins 35, 34, corresponding to the pads B1_12, B1_13, will get the next two bits, each of them processing a single bit. They represent FlexIO2 pins 28, 29
  • Teensy pins 8, 7, 36, 37, corresponding to pads B1_00 - B1_03, representing FlexIO2 pins 16-19, will get four bits again, in parallel.

It's quite possible that there's a better way of choosing that setup, that will result in a simpler decoding code later, but I didn't bother to investigate that avenue.

To set it up, we first have to remember to actually enable the FlexIO2 clock:

CCM_CCGR3 |= CCM_CCGR3_FLEXIO2(CCM_CCGR_ON);

Another thing is the actual clock speed for FlexIO2, which defaults to 30MHz (see the nice diagram on page 1016 - the default paths are marked with dots on the multiplexers!). This is way too slow for us, we want it at maximum. According to the table on page 1031, that maximum for FlexIO2 is 120MHz, so we need to find how to set the dividers that sit between PLL3 (that's 480MHz) and FlexIO2 on the clocking diagram. We want the overall divider to be 4, so we set CS1CDR[FLEXIO2_CLK_PODF] to 1 - which makes it divide by 2 - and together with CS1CDR[FLEXIO2_CLK_PRED], which also divides by 2 by default, this divides our PLL3 frequency by four.

CCM_CS1CDR &= ~( CCM_CS1CDR_FLEXIO2_CLK_PODF( 7 ) );
CCM_CS1CDR |= CCM_CS1CDR_FLEXIO2_CLK_PODF( 1 );
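The divider arithmetic is easy to get wrong by one, so here it is as a tiny C++ helper. The `flexioClockHz` name is made up, and the sketch assumes that both fields store "divider minus one" and that PRED sits at 1 (divide by 2) by default, which is what the 30MHz default together with the default divide-by-8 PODF implies.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPll3Hz = 480000000u;  // PLL3 frequency quoted in the text

// Both fields hold "divider - 1", which is why PODF = 1 means "divide by 2".
constexpr uint32_t flexioClockHz(uint32_t pred, uint32_t podf) {
    return kPll3Hz / ((pred + 1u) * (podf + 1u));
}

// Default setup (assumed PRED = 1, PODF = 7): 480 / 2 / 8 = 30 MHz.
static_assert(flexioClockHz(1, 7) == 30000000u, "default FlexIO2 clock");
// After writing PODF = 1: 480 / 2 / 2 = 120 MHz.
static_assert(flexioClockHz(1, 1) == 120000000u, "maximum from the clock table");
```

By the same formula, PODF = 0 would give 240MHz, which is outside the documented maximum.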

Note about that maximum: I tried setting it to be clocked even higher (twice as high) and it worked too. I haven't thoroughly tested it, so maybe it becomes unstable at times, or maybe heats up like crazy, or maybe there's some limiter somewhere and it doesn't really go faster - but anyway, I tried and it doesn't blow up the chip instantly, so if you're in need, maybe you can actually run it faster.

Now we need to enable the FlexIO2 itself. For some reason, there's an extra "enable" bit in one of its registers, so let's flip it on (the FLEXEN bit of the CTRL register, which follows PARAM from page 2914):

FLEXIO2_CTRL |= 1;

Now that we have the FlexIO2 up and running, we can start setting up the shifters. First, let's set the external pads to actually use the FlexIO2 mode (see KurtE's diagrams linked earlier, or if you want to find it in the docs, the registers are, for instance, on page 508, and the table with all the modes for each pad is on page 293):

IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_10 = 4;		// 10
IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_11 = 4;		// 11

IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_00 = 4;		// 0
IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_01 = 4;		// 1
IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_02 = 4;		// 2
IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_03 = 4;		// 3

IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_12 = 4;		// 28
IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_13 = 4;		// 29

IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_00 = 4;		// 16
IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_01 = 4;		// 17
IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_02 = 4;		// 18
IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_03 = 4;		// 19

Setting up a shifter requires filling up two registers: SHIFTCTLa and SHIFTCFGa (pages 2925 and on). Let's start with SHIFTCTL for shifter 0. We want it to be controlled with timer 0, shifting on the positive edge, we disable the pin output (remember, we're reading the data in!), we want it to control pin 10 (that's the FlexIO2 pin index!), we want it active high and want it in receive mode:

FLEXIO2_SHIFTCTL0	= FLEXIO_SHIFTCTL_TIMSEL( 0 )  |  // timer 0
			  // FLEXIO_SHIFTCTL_TIMPOL	           |  // on positive edge
			  FLEXIO_SHIFTCTL_PINCFG( 0 )          |  // pin output disabled
			  FLEXIO_SHIFTCTL_PINSEL( 10 )         |  // FlexIO2 pin 10
			  // FLEXIO_SHIFTCTL_PINPOL            |  // active high
			  FLEXIO_SHIFTCTL_SMOD( 1 );                        // receive mode

Next the SHIFTCFG register: we want to shift in one bit at a time, from the pin, and we don't need any start or stop bits:

FLEXIO2_SHIFTCFG0	= FLEXIO_SHIFTCFG_PWIDTH( 0 )  |  // single bit
			  // FLEXIO_SHIFTCFG_INSRC		       |  // from pin
			  FLEXIO_SHIFTCFG_SSTOP( 0 )           |  // stop bit disabled
			  FLEXIO_SHIFTCFG_SSTART( 0 );            // start bit disabled

The settings are the same for shifters 1, 4 and 5, since they also receive bits one at a time. The only thing changing is the pin number:

FLEXIO2_SHIFTCTL1	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 11 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG1	= FLEXIO_SHIFTCFG_PWIDTH( 0 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

FLEXIO2_SHIFTCTL4	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 28 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG4	= FLEXIO_SHIFTCFG_PWIDTH( 0 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

FLEXIO2_SHIFTCTL5	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 29 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG5	= FLEXIO_SHIFTCFG_PWIDTH( 0 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

For the parallel input, the setup is slightly more tricky. So far we used 4 shifters, and we will now use another two that have the parallel input capability. This leaves two shifters unused. We will actually use them to receive the data from the two parallel input shifters. This way, when they are full, instead of just overflowing and losing the data (or forcing the CPU/DMA to pick it up), they will shift it into the neighboring shifters, effectively allowing for longer buffering. Shifters are 32 bits wide, so if we're shifting in 4 bits at a time, we will fill one in 8 cycles. Adding these extra shifters as additional buffers gives us 16 cycles between fills, giving some more headroom. There's this slight imbalance here, as the single-bit shifters can buffer through 32 cycles (because they are 32 bits wide), while the parallel ones, even with that extra backup, can only do 16 cycles - but oh well, shit happens, we need to live with that.

Shifter 3 is the parallel input one. For the pin we just select the first one, and for the width, we set 3 (anything between 1 and 3 indicates 4 bit shifts). Everything else stays the same:

FLEXIO2_SHIFTCTL3	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 0 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG3	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
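The PWIDTH encoding is a bit unusual, so here is my reading of it as a small C++ helper. The two cases used in this article (0 means 1 bit, 1-3 means 4 bits) come straight from the text above; the remaining ranges (4-7 for 8 bits, 8-15 for 16 bits, 16-31 for 32 bits) are my assumption from the register description, so check page 2927 before relying on them. The function names are made up.

```cpp
#include <cassert>

// Number of bits moved per shift for a given PWIDTH value.
// 0 and 1..3 are from the text; the wider ranges are an assumption.
constexpr unsigned bitsPerShift(unsigned pwidth) {
    return pwidth == 0  ? 1
         : pwidth <= 3  ? 4
         : pwidth <= 7  ? 8
         : pwidth <= 15 ? 16
                        : 32;
}

// How many shifts it takes to fill one 32-bit shifter.
constexpr unsigned shiftsToFill(unsigned pwidth) { return 32 / bitsPerShift(pwidth); }

static_assert(shiftsToFill(3) == 8, "a 4-bit wide shifter fills up in 8 cycles");
static_assert(shiftsToFill(0) == 32, "a single-bit shifter lasts 32 cycles");
```

This is where the 8 versus 32 cycle imbalance between the parallel and single-bit shifters comes from.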

Shifter 2 is configured to grab the data falling out of shifter 3. SHIFTCTL is set up the same (the pin number doesn't matter here), and in SHIFTCFG we set the bit indicating that we're grabbing data from shifter N+1 (so 3 here) instead of the pin:

FLEXIO2_SHIFTCTL2	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 0 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG2	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_INSRC | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

Shifters 6 and 7 are set up in the same way:

FLEXIO2_SHIFTCTL6	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 0 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG6	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_INSRC | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

FLEXIO2_SHIFTCTL7	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 16 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
FLEXIO2_SHIFTCFG7	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

That small caveat from above applies here: the docs and the header files mention only 4 shifters, so we want to add the following macros:

#define FLEXIO2_SHIFTCTL4		(IMXRT_FLEXIO2.offset090)
#define FLEXIO2_SHIFTCTL5		(IMXRT_FLEXIO2.offset094)
#define FLEXIO2_SHIFTCTL6		(IMXRT_FLEXIO2.offset098)
#define FLEXIO2_SHIFTCTL7		(IMXRT_FLEXIO2.offset09C)
#define FLEXIO2_SHIFTCFG4		(IMXRT_FLEXIO2.offset110)
#define FLEXIO2_SHIFTCFG5		(IMXRT_FLEXIO2.offset114)
#define FLEXIO2_SHIFTCFG6		(IMXRT_FLEXIO2.offset118)
#define FLEXIO2_SHIFTCFG7		(IMXRT_FLEXIO2.offset11C)

Now it's time for the clocking signal. We still have one FlexIO2 pin left so let's use it - it's Teensy pin 32, so pad B0_12, FlexIO2 pin 12. Set it to FlexIO2 mode first:

IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_12 = 4;

We also need to take care of the DATA_VALID line here, which should enable/disable the whole clocking. But we're out of FlexIO2 pins. Fortunately, signals can be routed into FlexIO2 as triggers through the XBAR. This is exactly the same as in the case of DMA. Let's take Teensy pin 4 again (pad EMC_06), like before.

Set it to XBAR mode:

IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 3;

Set it to be XBAR input:

IOMUXC_GPR_GPR6 &= ~(IOMUXC_GPR_GPR6_IOMUXC_XBAR_DIR_SEL_8);

Set the daisy-chaining to pick EMC_06:

IOMUXC_XBAR1_IN08_SELECT_INPUT = 0;

And connect it to FlexIO2 as trigger 0:

xbar_connect( XBARA1_IN_IOMUX_XBAR_INOUT08, XBARA1_OUT_FLEXIO2_TRIGGER_IN0 );

Now the timer setup, first the TIMCTLa register (page 2932). We want it to use trigger 0 (TRGSEL = 0), that's active high (TRGPOL not set), and it's an external trigger (TRGSRC not set), we disable the output (because the clocking comes from outside, we're not generating it - PINCFG = 0), and we set the clock pin to be pin 12 (FlexIO2 pin again!). We want the timer pin to be active high (PINPOL not set) and we want to use the 16 bit counter mode.

FLEXIO2_TIMCTL0		= FLEXIO_TIMCTL_TRGSEL( 0 )		|			// src trigger 0
			  // FLEXIO_TIMCTL_TRGPOL		|			// trigger active high
			  // FLEXIO_TIMCTL_TRGSRC		|			// external trigger
			  FLEXIO_TIMCTL_PINCFG( 0 )		|			// timer pin output disabled
			  FLEXIO_TIMCTL_PINSEL( 12 )		|			// timer pin 12
			  // FLEXIO_TIMCTL_PINPOL		|			// timer pin active high
			  FLEXIO_TIMCTL_TIMOD( 3 );					// timer mode 16bit

Next the TIMCFG register (page 2934). We don't really care about the timer output signal, so we set it to be low, and not affected by reset (whatever - TIMOUT = 0). Now the really important bit: timer decrement mode - we set it to 2 - to decrement on pin input, on every edge, and set the shift clock to be the pin input (in the shifter config we set it to shift on the positive edge of that signal). The clock signal supplied to pin 12 that we just set will cause our 16 bit counter to decrement on every edge, both rising and falling. We want the timer to get disabled on a falling edge of the trigger (TIMDIS = 6) and enabled again on the rising edge of the trigger (TIMENA = 6) - this way, the timer will be working only when the trigger signal (so our DATA_VALID) is high. We could reset the timer to the default value when the trigger goes high (TIMRST = 6), but we can just as well not do it (TIMRST = 0). We don't need any extra start or stop bit handling (TSTOP = 0, TSTART not set):

FLEXIO2_TIMCFG0		= FLEXIO_TIMCFG_TIMOUT( 0 )		|			// timer output = low, not affected by reset
			  FLEXIO_TIMCFG_TIMDEC( 2 )		|			// decrement on pin input (both edges), shift clock = pin input
			  FLEXIO_TIMCFG_TIMRST( 0 )		|			// don't reset timer
			  FLEXIO_TIMCFG_TIMDIS( 6 )		|			// disable timer on trigger falling
			  FLEXIO_TIMCFG_TIMENA( 6 )		|			// enable timer on trigger rising
			  FLEXIO_TIMCFG_TSTOP( 0 )		;			// stop bit disabled
			  //FLEXIO_TIMCFG_TSTART					// start bit disabled

We want to shift 16 times (to fully fill shifters doing parallel input) and then transfer data from shifters to the shifter buffer - so for the reset value for the timer we need to set 31 (remember, it's decremented on both edges, and it does the transfer when it's zero and tries to decrement):

FLEXIO2_TIMCMP0		= 31;	// move from shift to shiftbuf every 32 timer ticks (so 16 shift clock cycles)
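The relation between the number of shifts, the compare value and the resulting DMA load can be written down as a few one-liners. This is just arithmetic under the setup described above (16-bit mode, counter decremented on both clock edges, 8 shifter buffers of 4 bytes moved per request); the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Two decrements (rising + falling edge) per shift, and the transfer happens
// when the counter is at zero and decrements once more, hence the "- 1".
constexpr uint32_t timcmpForShifts(uint32_t shifts) { return shifts * 2u - 1u; }

// Time between two shifter -> buffer transfers (i.e. two DMA requests).
constexpr uint32_t dmaRequestPeriodNs(uint32_t shifts, uint32_t pixelClockMHz) {
    return shifts * 1000u / pixelClockMHz;
}

// Raw bytes the DMA has to move per second: 8 shifter buffers x 4 bytes per request.
constexpr uint64_t dmaBytesPerSecond(uint32_t shifts, uint32_t pixelClockMHz) {
    return static_cast<uint64_t>(pixelClockMHz) * 1000000u / shifts * 32u;
}

static_assert(timcmpForShifts(16) == 31, "the value written to TIMCMP0 above");
```

At a 50MHz pixel clock this gives a DMA request every 320ns and about 100MB/s of raw traffic, which gives a feel for why the headroom from the extra buffering shifters matters.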

We could also use 8 bit counter mode, setting the reset value low byte to 1 (so two edges for every shift) and the high byte to 15 (so 16 shifts until transfer to shifter buffers) - or at least I think so, I haven't tried that.

And that's pretty much it for the FlexIO setup. We however need to grab the received data from the shifter buffers and copy it to some buffer. And since we're already DMA champions, there's no better way than to use DMA. Because the setup is slightly more tricky this time, we'll do most of it by hand, directly in the registers (actually... it might be possible to do all that just with the available DMA interface, I was however experimenting with all this a bit, had to touch the internals by hand and it kind of stayed that way...)

In total, we want to transfer 32 bytes (8 shifter buffers, each 32 bits, so 4 bytes wide)

dmaChannel.TCD->NBYTES		= 8 * 4;

The start address for the transfer is of course the address of the 0th shifter buffer register, and each transfer moves forward by 4 bytes.

dmaChannel.TCD->SADDR		= &FLEXIO2_SHIFTBUF0;
dmaChannel.TCD->SOFF		= 4;

We want the transfer to be 32 bit, and we want our source address to be computed modulo 32 - so after reading 32 bytes, the address will effectively reset to the beginning. We could also set the SLAST to be -32, to move the start address back by 32 bytes after finishing:

dmaChannel.TCD->ATTR_SRC	= ( 5 << 3 ) | 2;								// 32 bit reads + 2^5 modulo
dmaChannel.TCD->SLAST		= 0;
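To see what the modulo setting does to the source address, here is a small host-side C++ sketch of the address calculation as I understand it: only the low `modBits` bits of the address are allowed to change, the rest stays frozen. The `nextModuloAddress` name is made up, and the address used in the usage note is my assumption about where SHIFTBUF0 of FlexIO2 lives, so treat it purely as an example value. Note that the trick only works when the start address is aligned to the modulo size (32 bytes here).

```cpp
#include <cassert>
#include <cstdint>

// Next source address after adding the offset, with the modulo feature applied:
// the upper bits of the address are kept, only the low modBits bits wrap around.
uint32_t nextModuloAddress(uint32_t addr, int32_t offset, unsigned modBits) {
    assert(modBits < 32);
    const uint32_t next = addr + static_cast<uint32_t>(offset);
    if (modBits == 0) return next;               // modulo feature disabled
    const uint32_t mask = (1u << modBits) - 1u;  // bits that are allowed to change
    return (addr & ~mask) | (next & mask);
}
```

Starting from an (assumed) address of 0x401B0200 with an offset of 4 and a 2^5 modulo, seven reads walk up to 0x401B021C and the eighth one lands back on 0x401B0200, which is why SLAST can stay at 0.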

For the destination, we just set our target buffer, also doing 4 byte writes, incrementing the address by 4 after every write and rewinding to the beginning after the entire transfer is done:

dmaChannel.TCD->DADDR		= dmaBuffer;
dmaChannel.TCD->DOFF		= 4;
dmaChannel.TCD->ATTR_DST	= 2;											// 32 bit writes
dmaChannel.TCD->DLASTSGA	= -DMABUFFER_SIZE * 4;

We set the total number of major loop iterations to be DMABUFFER_SIZE / 8 (DMABUFFER_SIZE is in DWORDs, and we copy 8 of them in every loop iteration):

dmaChannel.TCD->BITER		= DMABUFFER_SIZE / 8;
dmaChannel.TCD->CITER		= DMABUFFER_SIZE / 8;

We also don't want to disable that DMA after it finishes, we just want to keep it going, but we want to get an interrupt when the transfer is half done and fully done - effectively dividing the target buffer in half and doing a double buffering scheme - processing one half while the other is being filled:

dmaChannel.TCD->CSR	&= ~(DMA_TCD_CSR_DREQ);				// do not disable the channel after it completes - so it just keeps going
dmaChannel.TCD->CSR	|= DMA_TCD_CSR_INTMAJOR | DMA_TCD_CSR_INTHALF;	// interrupt at completion and at half completion
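The double buffering bookkeeping boils down to two small calculations, sketched here in C++. The function names are made up, and the buffer size of 1024 words in the usage note is a hypothetical value picked only for the example. The time budget assumes the layout used in this article: every 8 words of the DMA buffer hold 16 samples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Word offset of the half that has just been filled, given how many
// half/full interrupts have been handled so far (0, 1, 2, ...).
constexpr std::size_t readyHalfOffsetWords(uint32_t halfCount, std::size_t bufferWords) {
    return (bufferWords / 2) * (halfCount & 1u);
}

// Time available for processing one half before the DMA wraps around to it:
// half a buffer = bufferWords / 16 DMA requests of 16 samples each.
constexpr uint64_t halfBufferBudgetNs(uint32_t bufferWords, uint32_t pixelClockMHz) {
    return static_cast<uint64_t>(bufferWords / 16u) * 16u * 1000u / pixelClockMHz;
}
```

With a hypothetical 1024-word buffer and a 50MHz pixel clock, each half holds 1024 samples and has to be fully processed within about 20.5 microseconds, which is a tight budget for an interrupt handler.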

We now attach the interrupt handler and make the DMA be triggered by request 0 from FlexIO2:

dmaChannel.attachInterrupt( inputDMAInterrupt );
dmaChannel.triggerAtHardwareEvent( DMAMUX_SOURCE_FLEXIO2_REQUEST0 );

The final bit is actually enabling that DMA request in the FlexIO - we want it on the shifter status flag, which is set when the data is transferred from shifter to shifter buffer (SHIFTSDEN register, page 2923). Since all the shifters work in sync, we can just set one bit:

FLEXIO2_SHIFTSDEN |= 1 << 0;

We can now enable the DMA and the system is ready to receive the data:

dmaChannel.enable();

Of course we need to remember that the data received has all the bits intertwined, so we need to shuffle them around to get the actual values!
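Here is a host-side C++ sketch of that shuffling, which can be tested without any hardware. `packAsShifters` models what I expect the eight shifter buffers to contain after 16 shift clocks, and `unpack` undoes it; both names and the `Batch`/`Samples` types are made up. The bit-to-pin assignment is an assumption that mirrors the decoding in the interrupt handler of the full listing: bits 0-3 on FlexIO2 pins 0-3, bits 4-7 on pins 16-19, bits 8, 9 on pins 10, 11 and bits 10, 11 on pins 28, 29.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Batch = std::array<uint32_t, 8>;     // SHIFTBUF0..7 as copied by one DMA request
using Samples = std::array<uint16_t, 16>;  // 16 decoded 12-bit values

// Model of the shifter contents after 16 shift clocks (oldest sample is index 0).
Batch packAsShifters(const Samples& in) {
    Batch w{};
    for (unsigned i = 0; i < 16; ++i) {
        const uint32_t v = in[i];
        assert(v < 4096u);
        // Single-bit shifters: after 16 shifts sample i sits in bit 16 + i.
        w[0] |= ((v >> 8) & 1u) << (16 + i);   // pin 10
        w[1] |= ((v >> 9) & 1u) << (16 + i);   // pin 11
        w[4] |= ((v >> 10) & 1u) << (16 + i);  // pin 28
        w[5] |= ((v >> 11) & 1u) << (16 + i);  // pin 29
        // 4-bit chains: the 8 oldest nibbles have moved on to shifters 2 and 6.
        const unsigned shift = (i & 7u) * 4u;
        w[i < 8 ? 2 : 3] |= (v & 0xFu) << shift;         // pins 0-3
        w[i < 8 ? 6 : 7] |= ((v >> 4) & 0xFu) << shift;  // pins 16-19
    }
    return w;
}

// Rebuild the 16 samples from one batch of shifter buffer words.
Samples unpack(const Batch& w) {
    Samples out{};
    for (unsigned i = 0; i < 16; ++i) {
        const unsigned shift = (i & 7u) * 4u;
        const uint32_t low = (w[i < 8 ? 2 : 3] >> shift) & 0xFu;
        const uint32_t mid = (w[i < 8 ? 6 : 7] >> shift) & 0xFu;
        const uint32_t b8 = (w[0] >> (16 + i)) & 1u;
        const uint32_t b9 = (w[1] >> (16 + i)) & 1u;
        const uint32_t b10 = (w[4] >> (16 + i)) & 1u;
        const uint32_t b11 = (w[5] >> (16 + i)) & 1u;
        out[i] = static_cast<uint16_t>(low | (mid << 4) | (b8 << 8) | (b9 << 9) |
                                       (b10 << 10) | (b11 << 11));
    }
    return out;
}
```

A round trip through `packAsShifters` and `unpack` should give back the original samples, which makes it easy to unit test the decoder on a PC before chasing glitches on the real wires.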

The entire code is below. It includes the shuffling of the bits and basic consistency checks for the received values - I've been testing it with a binary counter. It works totally fine clocked up to 25MHz (after I hooked up enough ground signals... see here: https://forum.pjrc.com/threads/66170...141#post269141). At 50MHz it works fine too, but I'm starting to get glitches caused by insufficient grounding, and it's actually too fast to decode the incoming data on the fly. If you're just aiming to capture the data and process it later, it should be fine though (maybe you can get even faster than that).

Hope it's useful to anyone, and post a correction if you spot any errors!

void xbar_connect(unsigned int input, unsigned int output)
{
	if (input >= 88) return;
	if (output >= 132) return;

	volatile uint16_t *xbar = &XBARA1_SEL0 + (output / 2);
	uint16_t val = *xbar;
	if (!(output & 1)) {
		val = (val & 0xFF00) | input;
	} else {
		val = (val & 0x00FF) | (input << 8);
	}
	*xbar = val;
}


void inputDMAInterrupt()
{
	dataCorrect = true;

	prevTime = currTime;
	currTime = micros();  

	dmaStartTime = micros();

	uint32_t* dmaData = dmaBuffer + ( DMABUFFER_SIZE / 2 ) * ( dmaBufferHalfCount & 1 );
	uint32_t* processedData = processedBuffer + ( DMABUFFER_SIZE / 2 ) * ( dmaBufferHalfCount & 1 ) * 2;	

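	// layout of one batch of 8 words (one word per SHIFTBUF register):
	//   dmaData[2], dmaData[3] - pins 0-3   (4 bits per sample)
	//   dmaData[6], dmaData[7] - pins 16-19 (4 bits per sample)
	//   dmaData[0] - pin 10, dmaData[1] - pin 11 (1 bit per sample, in the upper 16 bits)
	//   dmaData[4] - pin 28, dmaData[5] - pin 29 (1 bit per sample, in the upper 16 bits)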


	for( int batch=0; batch< ( DMABUFFER_SIZE / 2 ) / 8; ++batch )
	{
		for( int i=0; i<16; ++i )
		{
			uint32_t pins_00_03 = ( ( ( i < 8 ) ? dmaData[2] : dmaData[3] ) >> ( ( i & 0x07 ) * 4 ) ) & 0x0F;
			uint32_t pins_16_19 = ( ( ( i < 8 ) ? dmaData[6] : dmaData[7] ) >> ( ( i & 0x07 ) * 4 ) ) & 0x0F;
			uint32_t pin_10 = ( dmaData[0] >> ( 16 + i ) ) &  1;
			uint32_t pin_11 = ( dmaData[1] >> ( 16 + i ) ) &  1;
			uint32_t pin_28 = ( dmaData[4] >> ( 16 + i ) ) &  1;
			uint32_t pin_29 = ( dmaData[5] >> ( 16 + i ) ) &  1;
	
			uint32_t outData = ( pins_00_03		 ) |
							   ( pins_16_19 << 4 ) |
							   ( pin_10 << 8 )	   |
							   ( pin_11 << 9 )	   |
							   ( pin_28 << 10 )	   |
							   ( pin_29 << 11 );
	
			processedData[i] = outData;

			if ( ( ( prevVal + 1 ) & 4095 ) != outData )
			{
				dataCorrect = false;
			}
			prevVal = outData;
		}
	
		dmaData += 8;
		processedData += 16;
	}

	dmaEndTime = micros();

	++dmaBufferHalfCount;

	dmaChannel.clearInterrupt();	// tell system we processed it.
	asm("DSB");						// this is a memory barrier
}



void setupFlexIOInput()
{
	// FlexIO2 works at 30MHz by default, we need it faster!

	// set the FlexIO2 clock divider to 2 instead of 8
	CCM_CS1CDR &= ~( CCM_CS1CDR_FLEXIO2_CLK_PODF( 7 ) );
	CCM_CS1CDR |= CCM_CS1CDR_FLEXIO2_CLK_PODF( 1 );

	//CCM_CS1CDR |= CCM_CS1CDR_FLEXIO2_CLK_PODF( 0 );	// even faster seems to work ;-)

	// enable clock for FlexIO2
	CCM_CCGR3 |= CCM_CCGR3_FLEXIO2(CCM_CCGR_ON); 

	// enable clock for XBAR1
	CCM_CCGR2 |= CCM_CCGR2_XBAR1(CCM_CCGR_ON);

	// enable FlexIO2
	FLEXIO2_CTRL |= 1;

	// fast mode - if it's on, the 0/1 DMA requests do not work
	// FLEXIO2_CTRL |= 1 << 2;	

	///////////////////////////////////////////////
	// set the pads corresponding to the FlexIO2 pins 0-3, 10, 11, 28, 29, 16-19 to proper mode

	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_00 = 4;		// 0
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_01 = 4;		// 1
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_02 = 4;		// 2
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_03 = 4;		// 3
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_10 = 4;		// 10
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_11 = 4;		// 11

	IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_00 = 4;		// 16
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_01 = 4;		// 17
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_02 = 4;		// 18
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_03 = 4;		// 19
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_12 = 4;		// 28
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B1_13 = 4;		// 29

	// set the mode for the clock pin
	IOMUXC_SW_MUX_CTL_PAD_GPIO_B0_12 = 4;		// 12

	//////////////////////////////////////////////
	// setup shifters and timer

	//		0 - single bits from pin 10
	FLEXIO2_SHIFTCTL0	= FLEXIO_SHIFTCTL_TIMSEL( 0 )		|			// timer 0
						  //FLEXIO_SHIFTCTL_TIMPOL			|			// on positive edge
						  FLEXIO_SHIFTCTL_PINCFG( 0 )		|			// pin output disabled
						  FLEXIO_SHIFTCTL_PINSEL( 10 )		|			// pin 10
						  //FLEXIO_SHIFTCTL_PINPOL			|			// active high
						  FLEXIO_SHIFTCTL_SMOD( 1 );					// receive mode
	
	FLEXIO2_SHIFTCFG0	= FLEXIO_SHIFTCFG_PWIDTH( 0 )		|			// single bit
						  //FLEXIO_SHIFTCFG_INSRC			|			// from pin
						  FLEXIO_SHIFTCFG_SSTOP( 0 )		|			// stop bit disabled
						  FLEXIO_SHIFTCFG_SSTART( 0 );					// start bit disabled
					
	//		1 - single bits from pin 11
	FLEXIO2_SHIFTCTL1	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 11 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG1	= FLEXIO_SHIFTCFG_PWIDTH( 0 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
	
	//		2 - 4 bits from shifter 3
	FLEXIO2_SHIFTCTL2	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 0 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG2	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_INSRC | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
	
	//		3 - 4 bits from pins 0-3
	FLEXIO2_SHIFTCTL3	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 0 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG3	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
	
	//		4 - single bit from pin 28
	FLEXIO2_SHIFTCTL4	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 28 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG4	= FLEXIO_SHIFTCFG_PWIDTH( 0 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
	
	//		5 - single bit from pin 29
	FLEXIO2_SHIFTCTL5	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 29 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG5	= FLEXIO_SHIFTCFG_PWIDTH( 0 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
	
	//		6 - 4 bits from shifter 7
	FLEXIO2_SHIFTCTL6	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 0 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG6	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_INSRC | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );
	
	//		7 - 4 bits from pins 16-19
	FLEXIO2_SHIFTCTL7	= FLEXIO_SHIFTCTL_TIMSEL( 0 ) | FLEXIO_SHIFTCTL_PINCFG( 0 ) | FLEXIO_SHIFTCTL_PINSEL( 16 ) | FLEXIO_SHIFTCTL_SMOD( 1 );
	FLEXIO2_SHIFTCFG7	= FLEXIO_SHIFTCFG_PWIDTH( 3 ) | FLEXIO_SHIFTCFG_SSTOP( 0 ) | FLEXIO_SHIFTCFG_SSTART( 0 );

	// timer 0 - clocked from pin 12, enabled by an external trigger rise, disabled by the external trigger fall
	FLEXIO2_TIMCTL0		= FLEXIO_TIMCTL_TRGSEL( 0 )		|			// src trigger 0
						  // FLEXIO_TIMCTL_TRGPOL		|			// trigger active high
						  // FLEXIO_TIMCTL_TRGSRC		|			// external trigger
						  FLEXIO_TIMCTL_PINCFG( 0 )		|			// timer pin output disabled
						  FLEXIO_TIMCTL_PINSEL( 12 )	|			// timer pin 12
						  // FLEXIO_TIMCTL_PINPOL		|			// timer pin active high
						  FLEXIO_TIMCTL_TIMOD( 3 );					// timer mode 16bit

	FLEXIO2_TIMCFG0		= FLEXIO_TIMCFG_TIMOUT( 0 )		|			// timer output = low, not affected by reset
						  FLEXIO_TIMCFG_TIMDEC( 2 )		|			// decrement on pin input (both edges), shift clock = pin input
						  FLEXIO_TIMCFG_TIMRST( 6 )		|			// timer reset on trigger rising (this resets the timer when line valid becomes asserted)
						  FLEXIO_TIMCFG_TIMDIS( 6 )		|			// disable timer on trigger falling
						  FLEXIO_TIMCFG_TIMENA( 6 )		|			// enable timer on trigger rising
						  FLEXIO_TIMCFG_TSTOP( 0 )		;			// stop bit disabled
						  //FLEXIO_TIMCFG_TSTART					// start bit disabled	

	FLEXIO2_TIMCMP0		= 31;										// move from shift to shiftbuf every 32 timer ticks (so 16 shift clock cycles)
			

	/////////////////////////////////////////////////////////
	// setup external trigger (line valid signal)

	// set the IOMUX mode to 3, to route it to XBAR
	IOMUXC_SW_MUX_CTL_PAD_GPIO_EMC_06 = 3;	
	
	// set XBAR1_IO008 to INPUT
	IOMUXC_GPR_GPR6 &= ~(IOMUXC_GPR_GPR6_IOMUXC_XBAR_DIR_SEL_8);
	
	// daisy chaining - select between EMC06 and SD_B0_04
	IOMUXC_XBAR1_IN08_SELECT_INPUT = 0;
	
	// connect the IOMUX_XBAR_INOUT08 to FlexIO2 trigger 0
	xbar_connect( XBARA1_IN_IOMUX_XBAR_INOUT08, XBARA1_OUT_FLEXIO2_TRIGGER_IN0 );
		
	////////////////////////////////////////////////////////
	// setup dma to pick up the data

	// configure DMA channels
	dmaChannel.begin();

	dmaChannel.TCD->SADDR		= &FLEXIO2_SHIFTBUF0;
	dmaChannel.TCD->SOFF		= 4;
	dmaChannel.TCD->ATTR_SRC	= ( 5 << 3 ) | 2;								// 32 bit reads + 2^5 modulo
	dmaChannel.TCD->SLAST		= 0;

	dmaChannel.TCD->DADDR		= dmaBuffer;
	dmaChannel.TCD->DOFF		= 4;
	dmaChannel.TCD->ATTR_DST	= 2;											// 32 bit writes
	dmaChannel.TCD->DLASTSGA	= -DMABUFFER_SIZE * 4;			

	dmaChannel.TCD->NBYTES		= 8 * 4;										// write 32 bytes - all the shiftbuf registers
	dmaChannel.TCD->BITER		= DMABUFFER_SIZE / 8;
	dmaChannel.TCD->CITER		= DMABUFFER_SIZE / 8;
	
	dmaChannel.TCD->CSR		   &= ~(DMA_TCD_CSR_DREQ);							// do not disable the channel after it completes - so it just keeps going 
	dmaChannel.TCD->CSR		   |= DMA_TCD_CSR_INTMAJOR | DMA_TCD_CSR_INTHALF;	// interrupt at completion and at half completion

	dmaChannel.attachInterrupt( inputDMAInterrupt );
	dmaChannel.triggerAtHardwareEvent( DMAMUX_SOURCE_FLEXIO2_REQUEST0 );
		 
	// enable DMA on shifter status
	FLEXIO2_SHIFTSDEN |= 1 << 0;	
	
	// enable DMA
	dmaChannel.enable();
}


void setup()
{
	Serial.begin(115200);	

	setupFlexIOInput();	
}


void loop()
{	
	delay( 100 );

	uint32_t halfCount = dmaBufferHalfCount;	// snapshot - the interrupt handler can change it at any time

	if ( lastSeenHalf != halfCount )
	{ 
		// the interrupt handler increments the counter after processing a half,
		// so the most recently decoded half is ( halfCount - 1 )
		uint32_t* decoded = processedBuffer + 2 * ( DMABUFFER_SIZE / 2 ) * ( ( halfCount - 1 ) & 1 );

		Serial.printf( "%s %8u, %8u, 0x%08X 0x%08X 0x%08X 0x%08X 0x%08X 0x%08X 0x%08X 0x%08X\n", dataCorrect ? "" : "DATA INCORRECT!", currTime - prevTime, dmaEndTime - dmaStartTime, decoded[0], decoded[1], decoded[2], decoded[3], decoded[4], decoded[5], decoded[6], decoded[7]  );

		lastSeenHalf = halfCount;
	}
	else
	{
		//Serial.printf("Nothing\n" );
	}
}