canbus - victronenergy/venus GitHub Wiki
Introduction
The main canbus driver in the Linux world is socketcan, and that is also what is used in Venus OS. Userland applications such as our VE.Can driver (vecan-dbus) as well as firmware update commandline executables (vup) all connect to the socketcan driver. For details on socketcan, see the socketcan documentation.
For our GX devices, we use different types of canbus hardware:
- controllers native in the System on Chip (SoC)
- external CAN controllers, connected to the SoC over a serial port, either on a real UART or over USB.
On the userland side, these are the main services & commandline executables used:
- vecan-dbus: interfaces between the canbus and D-Bus using the NMEA2000/VE.Can protocol.
- canbms-bms: interfaces between the canbus and D-Bus using the canbus-bms protocol, as used by BYD, Pylon, Freedomwon and more.
- vup: commandline firmware updating; it is used by the VRM Remote Firmware feature.
All of it is managed by the gui: depending on the bitrate & profile chosen by the user, the gui executable starts the right driver against the right can-interface(s) and sets the proper bitrate on that interface.
Setting the bitrate is done by calling the can-set-rate script.
For slcan, there is some further magic in udev. And then there are the /etc/venus/canbus_ports* configuration files.
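The contents of can-set-rate are not shown on this page. As a rough sketch, assuming it wraps the standard iproute2 commands, configuring a native CAN interface comes down to the sequence below. The function name print_can_setup is hypothetical, and it only prints the commands instead of running them, so it is safe to try without hardware. The restart-ms 100 value is the one mentioned under Requirements.

```shell
#!/bin/sh
# Hypothetical helper, NOT the real can-set-rate script: print the iproute2
# commands that take a CAN interface down, set its bitrate and the automatic
# bus-off restart delay, and bring it back up.
print_can_setup() {
    iface="$1"      # interface name, e.g. can0
    bitrate="$2"    # bits per second, e.g. 250000
    echo "ip link set $iface down"
    echo "ip link set $iface type can bitrate $bitrate restart-ms 100"
    echo "ip link set $iface up"
}
```

To apply the settings for real, the output can be piped into a shell as root, for example print_can_setup can0 250000 | sh.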
Requirements
- A real loopback function
- Auto-recovery from bus-off (normally done by Linux, through the restart-ms 100 parameter in can-set-rate, but the driver needs to support it)
- TX: no dropped frames, ever
- RX: no dropped frames at 250 kbit/s with dlc=8, and preferably at 500 kbit/s as well, to make sure.
- RX: no frame reordering, under the same conditions as the RX requirement above.
- Proper error reporting, such as bus-off events, as well as traffic counting
Testing
0. The venus-socketcan-test script
Jeroen has written a test script that covers all situations:
https://github.com/victronenergy/venus-socketcan-test
The remaining chapters below cover the same ground, but for testing manually.
1. Going to, and auto-recovery from, error passive
When the controller is unplugged from the bus, i.e. there is no other device to talk to (or to ack its frames), it goes into error-passive.
Test that:
1a) it really went into error passive
1b) it auto-recovers from that.
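One way to check 1a and 1b by hand is to look at the state field that ip prints for CAN devices. The sketch below pulls that field out of the ip output; the function name can_state is made up for this example, and it assumes the output layout shown in the Debugging chapter.

```shell
#!/bin/sh
# Print the controller state (for example ERROR-ACTIVE, ERROR-PASSIVE or
# BUS-OFF) found in `ip -details link show <iface>` output read from stdin.
# Only the line starting with "can" is inspected, so the generic
# "state UNKNOWN" on the first line is ignored.
can_state() {
    awk '$1 == "can" {
             for (i = 2; i < NF; i++)
                 if ($i == "state") { print $(i + 1); exit }
         }'
}
```

Usage: ip -details link show can0 | can_state. Run it once with the bus unplugged and once more after plugging it back in; the second run should report ERROR-ACTIVE again.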
2. Auto-recovery from bus-off mode
After enough transmit errors (the controller's transmit error counter exceeding 255), a CAN controller goes bus-off: it no longer transmits on the bus, and it needs a command from the CPU to get going again. Search online for more background.
How to provoke a bus-off is not 100% clear. One thing to try is putting another device on the same bus at a wrong bitrate.
Test that:
2a) it really went into bus-off
2b) it auto-recovers. In Venus OS we use the restart-ms 100 setting for that, see the socketcan documentation.
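For 2a and 2b, the device statistics printed by ip include a re-started and a bus-off counter (see the output in the Debugging chapter). The sketch below extracts both; the function name can_restart_counters is hypothetical, and whether the counters are maintained at all depends on the driver. After a bus-off followed by an automatic restart we would expect both numbers to have gone up.

```shell
#!/bin/sh
# Read `ip -details -statistics link show <iface>` output from stdin and print
# the "re-started" and "bus-off" counters from the CAN device statistics.
# The values are on the line directly below the header line.
can_restart_counters() {
    awk '$1 == "re-started" && $NF == "bus-off" {
             if ((getline) > 0) print "re-started=" $1, "bus-off=" $6
             exit
         }'
}
```

Usage: ip -details -statistics link show can0 | can_restart_counters, once before and once after provoking the bus-off.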
3. Check that the loopback works OK
A working loopback is a requirement of our userland applications. An application that sends data needs to receive its own messages back, as well as messages sent out by others (on other sockets) on the same can interface, and in the same order. Otherwise, when multiple applications send at the same time, data obtained via e.g. candump will be in a different order than the order in which the userland applications receive the frames [1, chapter 6.2].
The loopback is used for at least two cases:
- Determine whether a message was actually sent to the bus
- Handle broadcast requests: when asking all nodes to respond, the requesting node itself should answer as well (and, for completeness, so should all other processes).
Note: this can be done on a per-message basis, see VE_CAN_DO_LOOP, but Linux doesn't support that.
VE_CAN_LOCALHOST is used to filter out devices which are forwarded to the can-bus. E.g. if a battery monitor is connected by USB and its data is also sent to the CAN-bus, vup should not report it again as another device; otherwise it would be found twice.
Tests:
- 3a - driver supports loopback: the fallback in PF_CAN is not good enough
- 3b - when you send a message, it is also received back on the same socket.
- 3c - a message sent on another socket is looped back (run vup & candump on the same interface).
3a loopback sequence order
6.2 local loopback of sent frames
As described in chapter 3.2 the CAN network device driver should
support a local loopback functionality similar to the local echo
e.g. of tty devices. In this case the driver flag IFF_ECHO has to be
set to prevent the PF_CAN core from locally echoing sent frames
(aka loopback) as fallback solution:
dev->flags = (IFF_NOARP | IFF_ECHO);
From https://www.kernel.org/doc/Documentation/networking/can.txt
- sunxi does: https://github.com/torvalds/linux/blob/master/drivers/net/can/sun4i_can.c#L812
- ti_hecc does: https://github.com/torvalds/linux/blob/master/drivers/net/can/ti_hecc.c#L952
- slcan does not support proper loopback: https://github.com/torvalds/linux/blob/master/drivers/net/can/slcan.c#L449. Because of that ... tbd
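A quick check for 3a that does not require reading driver source: ip link lists ECHO among the interface flags when the driver sets IFF_ECHO (in the Debugging chapter, can0 has it and can1 does not). The sketch below tests for that flag; the function name has_echo_flag is made up for this example.

```shell
#!/bin/sh
# Succeed (exit status 0) when the first line of `ip link show <iface>` output,
# read from stdin, lists ECHO among the interface flags between < and >.
has_echo_flag() {
    head -n 1 | grep -Eq '<([^>]*,)?ECHO(,[^>]*)?>'
}
```

Usage: ip link show can0 | has_echo_flag && echo "driver does its own loopback". No output means the PF_CAN fallback is in use for that interface.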
3b & c
On a CCGX, a vup output looks like this:
root@ccgx:~# vup -d
000.140 > 6.0.0.EE.FF.F0| 0x4B 0x01 0xC0 0x2C 0x00 0x82 0x32 0xC0
984.160 { 6.0.0.EE.FF.F0| 0x4B 0x01 0xC0 0x2C 0x00 0x82 0x32 0xC0
000.145 < 7.0.0.EF.FF.E0| 0x66 0x99 0x01 0x20 0xFF 0xFF 0x00 0x00
000.398 > 6.0.0.EA.FF.F0| 0x00 0xEE 0x00
984.418 { 6.0.0.EA.FF.F0| 0x00 0xEE 0x00
000.403 < 6.0.0.EE.FF.E0| 0x64 0x00 0xC0 0x2C 0x00 0x82 0xF0 0xC0
000.404 [ 6.0.0.EE.FF.E1| 0xF4 0x01 0xC0 0x2C 0x00 0x82 0xF0 0xC0
000.405 > 6.0.0.EE.FF.F0| 0x4B 0x01 0xC0 0x2C 0x00 0x82 0x32 0xC0
984.425 { 6.0.0.EE.FF.F0| 0x4B 0x01 0xC0 0x2C 0x00 0x82 0x32 0xC0
000.671 [ 0.0.0.00.03.05| 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Querying devices
001.004 > 6.0.0.EA.FF.F0| 0x00 0xEE 0x00
985.022 { 6.0.0.EA.FF.F0| 0x00 0xEE 0x00
001.006 > 6.0.0.EE.FF.F0| 0x4B 0x01 0xC0 0x2C 0x00 0x82 0x32 0xC0
001.008 [ 6.0.0.EE.FF.E1| 0xF4 0x01 0xC0 0x2C 0x00 0x82 0xF0 0xC0
985.025 { 6.0.0.EE.FF.F0| 0x4B 0x01 0xC0 0x2C 0x00 0x82 0x32 0xC0
001.012 < 6.0.0.EE.FF.E0| 0x64 0x00 0xC0 0x2C 0x00 0x82 0xF0 0xC0
001.145 < 7.0.0.EF.FF.E0| 0x66 0x99 0x01 0x20 0xFF 0xFF 0x00 0x00
**** Explanation of used characters ****
> -> Frame was sent by us.
< -> Frame was received from other device.
{ = VE_CAN_IS_LOOP -> Frame is a looped back frame sent out on same socket.
So, in above example, sent by ourselves == vup.
[ = VE_CAN_LOCALHOST -> Frame was sent by someone on the same device but on
another socket.
Source: https://github.com/victronenergy/velib/blob/master/src/canhw/canhw_driver.c#L886
Note that the source also lists [ as a VE_CAN_RTR, but that's not used in our application, and
messages with the RTR bit set are dropped, see [1] & [2].
[1] https://git.victronenergy.com/can/can-bus-bms/blob/master/src/task.c#L459
[2] (add same link for velib application)
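For 3b, every frame marked > should come back once as a { frame. Rather than comparing the log by eye, the two can be counted. This is only a sketch: the function name vup_loop_summary is hypothetical, and it assumes the vup -d line layout shown above, with the marker in the second column.

```shell
#!/bin/sh
# Count frames in `vup -d` style output read from stdin, keyed on the marker in
# the second column: ">" is a frame sent by us, "{" is our own frame looped
# back on the same socket. All other lines are ignored.
vup_loop_summary() {
    awk '$2 == ">" { sent++ }
         $2 == "{" { looped++ }
         END { printf "sent=%d looped=%d\n", sent, looped }'
}
```

Usage: vup -d | vup_loop_summary. With a working loopback the two numbers should match.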
4. No TX messages lost
Under no circumstances should a driver accept a TX message from a userland application and then later drop it before putting it out on the canbus.
Userland applications rely on this behaviour: they keep pushing frames into the socket, expecting all of them to be sent, until told to stop & wait first.
Test by:
- tbd
5. No RX messages lost
- make sure to stop the victron can processes (svc -t, both vecan-dbus and dbus-can-bms)
- connect two devices together in the test; not more
Background info:
Messages can be dropped in multiple places. First of all, make sure they are properly sent on the bus, by hooking up a device that is known to work 100% and checking what it reads.
Then, in the slcan example, the whole pipe looks like this:
msg -> CAN-Controller -> uC -> uC-UART -> Linux-UART -> slcan -> socketcan -> userland.
Now, messages can be dropped in lots of places, and for different reasons. The obvious one is a full buffer, but a parity error on the UART is another example.
As far as I understood:
- anything dropped in the uC is not normally visible. There is no feature for it in the slcan protocol; nor in the slcan.c driver. Perhaps it could be done by using slcand to query the CAN Adapter, though I don't see how you can do it without causing an interruption or even having to stop slcan temporarily.
- anything dropped in the Linux-UART or tty system will be visible as an RX error in the ifconfig output.
- anything dropped by socketcan (the only reason for that would be that the connected userland application didn't read from the queue in time) is readable, for example by using candump -d. See also [3]. In the case of slcan, everything dropped on slcan level is visible as an rx-error in the ifconfig output.
Socketcan will also drop messages: it does so when the userland application isn't keeping up with the inflow. How to debug, and more info here:
- https://marc.info/?l=linux-can&m=152447184508310&w=2 (missing frames (but no errors))
- http://socket-can.996257.n3.nabble.com/Frame-Loss-during-RX-td2828.html
- http://socket-can.996257.n3.nabble.com/ANNOUNCE-new-socket-level-option-to-report-number-of-rx-queue-overflows-td1929.html
(the socket-can mailing list was moved to linux-can in 2011)
Using cansequence
For testing, we use cansequence, see pengutronix/can-utils.
An example:
You can use cansequence to generate or receive frames via the CAN interfaces.
sender (100k frames with ID:4):
./cansequence --loop=100000 -p -i4 can1
receiver (receive, look for ID:4 and be very verbose: this dumps each frame to stdout, which is redirected to a file in /tmp/):
./cansequence -r -i4 can0 -vv > /tmp/rx_dump_can0
After the test is done, we stop the cansequence program on each receiving device and count the number of received frames with:
grep "received frame." /tmp/rx_dump_can0 |wc -l
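To turn that count into a pass or fail, it can be subtracted from the number of frames the sender was told to send. A small sketch of that, using the same grep pattern; the function name missing_frames is made up for this example.

```shell
#!/bin/sh
# Print how many of the frames sent were not logged by the receiver.
#   $1: number of frames sent (the --loop value given to the sender)
#   $2: receiver dump file written by `cansequence -r ... -vv`
missing_frames() {
    sent="$1"
    # grep -c prints the number of matching lines, also when that is 0
    received=$(grep -c "received frame\." "$2")
    echo $((sent - received))
}
```

Usage, with the values from the example above: missing_frames 100000 /tmp/rx_dump_can0. Anything other than 0 means frames went missing somewhere in the pipe.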
To poll or not to poll
Running cansequence only works when using the -p option, which makes it use poll.
Set the tx_queue_len to 1000 or so, to make sure it works when testing.
The reason: the CAN-bus has a low tx_queue_len (which is good, otherwise there are priority inversion issues, response delays etc.), while poll expects a rather large buffer.
See https://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf paragraph 3.4; there is more about it at http://socket-can.996257.n3.nabble.com/Solving-ENOBUFS-returned-by-write-td2886.html
(but that patch was never accepted)
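The current transmit queue length shows up as qlen in the ip link output (10 for can0 and can1 in the Debugging chapter). The sketch below reads it back, which is handy for confirming the test setup; the function name tx_queue_len is hypothetical.

```shell
#!/bin/sh
# Print the transmit queue length ("qlen") from the first line of
# `ip link show <iface>` output read from stdin.
tx_queue_len() {
    awk 'NR == 1 {
             for (i = 1; i < NF; i++)
                 if ($i == "qlen") { print $(i + 1); exit }
         }'
}
```

Usage: ip link show can1 | tx_queue_len. To raise it for the duration of a test, the usual iproute2 command is ip link set can1 txqueuelen 1000, run as root.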
6. All messages must arrive in the same sequence as sent: no re-ordering
tbd
Debugging
Here are a few examples of how to see details on available interfaces (look for can0 and can1):
$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1450 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 02:81:48:ff:08:12 brd ff:ff:ff:ff:ff:ff
3: wifi0: <NO-CARRIER,BROADCAST,MULTICAST,DYNAMIC,UP> mtu 1450 qdisc mq state DOWN mode DORMANT group default qlen 1000
link/ether 0c:9a:42:3b:d2:45 brd ff:ff:ff:ff:ff:ff
4: ap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 0e:9a:42:3b:d2:45 brd ff:ff:ff:ff:ff:ff
5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can
6: ll-eth0@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether ae:6e:c2:3e:a3:0b brd ff:ff:ff:ff:ff:ff
7: can1: <NOARP,UP,LOWER_UP> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can
This command shows the details of a canbus. Note that not all of the info may be meaningful: not every driver necessarily supports all fields:
$ ip -details -statistics link show can0
5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state ERROR-ACTIVE restart-ms 100
bitrate 250000 sample-point 0.875
tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
kvaser_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
clock 8000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
1 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
1021893 127741 0 7 0 0
TX: bytes packets errors dropped carrier collsns
117056 16067 0 0 0 0
root@nanopi:/sys/class/net/can1# grep '.*' ./statistics/*
./statistics/collisions:0
./statistics/multicast:0
./statistics/rx_bytes:861984
./statistics/rx_compressed:0
./statistics/rx_crc_errors:0
./statistics/rx_dropped:0
./statistics/rx_errors:0
./statistics/rx_fifo_errors:0
./statistics/rx_frame_errors:0
./statistics/rx_length_errors:0
./statistics/rx_missed_errors:0
./statistics/rx_nohandler:0
./statistics/rx_over_errors:0
./statistics/rx_packets:109058
./statistics/tx_aborted_errors:0
./statistics/tx_bytes:351
./statistics/tx_carrier_errors:0
./statistics/tx_compressed:0
./statistics/tx_dropped:0
./statistics/tx_errors:0
./statistics/tx_fifo_errors:0
./statistics/tx_heartbeat_errors:0
./statistics/tx_packets:62
./statistics/tx_window_errors:0
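Most of these counters are normally zero, so during a test it is easier to list only the ones that are not. A minimal sketch of that, looking only at the *errors and *dropped files; the function name nonzero_error_counters is made up for this example.

```shell
#!/bin/sh
# Print every error or drop counter in a sysfs statistics directory that is
# not zero, as name:value.
#   $1: statistics directory, for example /sys/class/net/can1/statistics
nonzero_error_counters() {
    for f in "$1"/*errors "$1"/*dropped; do
        [ -r "$f" ] || continue        # skip patterns that matched nothing
        value=$(cat "$f")
        [ "$value" = "0" ] || echo "$(basename "$f"):$value"
    done
}
```

Usage: nonzero_error_counters /sys/class/net/can1/statistics. For the listing above it would print nothing, since every error and drop counter there is 0.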