Example Scenarios 4

Inconsistency in Latency

Here we document inconsistency in latency given the same robot configuration. As part of this study, we simplified, standardized, and upgraded our testbed setup as follows:

  • We removed the Linux pipe. Although tests show that the pipe does not impact measurements, removing it simplifies the testbed.
  • We now log using the ROS2 logger, which is the standard ROS2 provision for logging (see the sketch after this list).
  • We removed the testbed monitor GUI. Although it was coupled only through the Linux pipe, removing it simplifies the testbed. Our post-processing is now compatible with ROS2 log syntax. The GUI can still be used if the Linux pipe is enabled via the -p option.
  • We now use standard ns-3. Although latency detection does not impact performance, it is cleaner to say that we use the standard ns-3 configuration. Latency detection remains useful for identifying when ns-3 cannot keep up.
  • We use the latest ns-3. We upgraded from ns-3.29 to ns-3.30.1.
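
As a point of reference, here is a minimal sketch (not the testbed source) of what logging through the ROS2 logger looks like from a C++ node such as cpp_testbed_runner; the node name and the latency value are illustrative only.

#include "rclcpp/rclcpp.hpp"

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("latency_logger_sketch");

  // Illustrative latency value; in the testbed this would be computed from the
  // publish and receive timestamps of a message.
  double latency_s = 0.0015;

  // RCLCPP_INFO writes through the standard ROS2 logger, so the entry lands in
  // the ROS2 log output that our post-processing parses.
  RCLCPP_INFO(node->get_logger(), "latency %f seconds", latency_s);

  rclcpp::shutdown();
  return 0;
}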

Note that we must use network namespaces. Here is an article about network namespace scalability: https://netdevconf.info/1.2/papers/pj-netdev-1.2.pdf.
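
For context on how the namespaces reach the simulated network, here is a minimal sketch under the assumption that our ns-3 programs attach each namespace's tap device to a simulated Wifi node through ns-3's TapBridge, as the standard ns-3 tap examples do; the function and tap device name are illustrative.

#include <string>

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/tap-bridge-module.h"

using namespace ns3;

// Attach an existing OS tap device (which lives in a network namespace) to one
// simulated Wifi NetDevice; called once per node during topology setup.
void AttachTap (Ptr<Node> node, Ptr<NetDevice> wifiDevice, const std::string &tapName)
{
  TapBridgeHelper tapBridge;
  tapBridge.SetAttribute ("Mode", StringValue ("UseLocal"));
  tapBridge.SetAttribute ("DeviceName", StringValue (tapName));
  tapBridge.Install (node, wifiDevice);
}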

Start the ns-3 simulator with five nodes in range, one at the center and four one meter out:

cd ~/gits/ns3_testbed/ns3_programs/build
./ns3_mobility -s ns3_defaults.csv -c 5
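
The node placement itself comes from ns3_defaults.csv. As a sketch of what that placement corresponds to in ns-3 terms (an assumption for illustration, not the ns3_mobility source), fixed positions for one center node and four nodes one meter out can be set with the standard mobility helpers:

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/mobility-module.h"

using namespace ns3;

int main (int argc, char *argv[])
{
  NodeContainer nodes;
  nodes.Create (5);

  // One node at the origin and four nodes one meter away along the x and y axes.
  Ptr<ListPositionAllocator> positions = CreateObject<ListPositionAllocator> ();
  positions->Add (Vector (0.0, 0.0, 0.0));
  positions->Add (Vector (1.0, 0.0, 0.0));
  positions->Add (Vector (-1.0, 0.0, 0.0));
  positions->Add (Vector (0.0, 1.0, 0.0));
  positions->Add (Vector (0.0, -1.0, 0.0));

  MobilityHelper mobility;
  mobility.SetPositionAllocator (positions);
  mobility.SetMobilityModel ("ns3::ConstantPositionMobilityModel");
  mobility.Install (nodes);

  return 0;
}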

Start the robots as root, with one talker and four listeners using best_effort QoS:

sudo /bin/bash
cd ~/gits/ns3_testbed/cpp_testbed_runner/build/cpp_testbed_runner
./testbed_runner -c 5 -i example_keep_last_5.csv -n -v

Latency appears to vary between runs. Here is the latency for this robot configuration on 11/4:

[figure: 1.5 ms latency]

Here is the latency for the same robot configuration on 11/5:

[figure: 12 ms latency]

Here is the latency for the same robot configuration on 11/5 with ns-3 upgraded to v3.30.1:

[figure: new ns3 latency]

Here is the same plot as above but with ns-3 running CSMA Wired for the last few minutes:

[figure: new ns3 latency plus]

Here is the latency point plot after running for about 20 hours. Points for CSMA Wired are included, after which Wifi was resumed. Note that over time, some listeners stopped working; R1 to R3 and R1 to R5 stopped and resumed twice.

[figure: new ns3 latency long]

Here is the latency point plot measured the next day:

[figure: new ns3 latency point plot later]

and its associated averaged graph:

[figure: new ns3 latency plot later]

Examining 802.11 ACK

Above we examined inconsistency in latency. Here we show that our latency is impacted by 802.11 management and control. Specifically, when an 802.11 packet is transmitted, the sender must receive an 802.11 ACK or the packet is considered lost.

By configuring ns-3 to capture network traffic, we additionally capture 802.11 MAC packets, in particular the 802.11 ack response. Wireshark does not capture 802.11 MAC packets because they are consumed at the MAC layer. Some links about IEEE 802.11 are https://wiki.wireshark.org/Wi-Fi and https://wiki.wireshark.org/CaptureSetup/WLAN.
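
Here is a minimal sketch of how pcap capture is typically enabled on the ns-3 Wifi PHY; this is an assumption about how ns3_mobility does it, with the file prefix chosen to match the capture file names shown later on this page.

#include "ns3/network-module.h"
#include "ns3/wifi-module.h"

using namespace ns3;

// Enable pcap tracing on the Wifi PHY so 802.11 MAC frames, including acks,
// are written to files such as ns3_mobility_capture-0-0.pcap.
void EnableWifiCapture (YansWifiPhyHelper &wifiPhy, NetDeviceContainer devices)
{
  // Radiotap framing keeps the 802.11 headers visible in Wireshark.
  wifiPhy.SetPcapDataLinkType (WifiPhyHelper::DLT_IEEE802_11_RADIOTAP);
  wifiPhy.EnablePcap ("ns3_mobility_capture", devices);
}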

Using ns-3 pcap analysis, we see that when the sender in an ad hoc Wifi network does not receive an 802.11 ack in response, it retransmits. We see that this happens 8 times, after which the sender gives up. When the sender receives an ack, the transmission is considered received. This explains why there is typically either a 2 ms delay (the 802.11 ack came back) or a 16 ms delay (8 transmits were sent and no ack came back). This behavior can be manipulated by adjusting the number of Tap devices. When there are only five Tap devices, we see a 2 ms delay on the first transmit, then a 16 ms delay on the second, third, and fourth transmits. When there are 13 Tap devices, we see a 2 ms delay on all four transmits, which is what we expected in the first place.

The problem is that 802.11 ack packets are not being sent back. Whether 802.11 ack packets are sent seems to depend on the number of network Tap devices configured for the system. Here is an example of latency where the capture starts when there are five Tap devices and then more network Tap devices are added (increased from 5 to 13) during the capture:

[figure: latency]

Sim time vs Wall time

After restoring our network to 5 network devices, where we do not see 802.11 ack responses and everything is transmitted 8 times, we compare sim-time performance vs. wall-time performance. Here is the plot:

[figure: simtime latency]

The left side shows timing with ns-3 running in simulation time. The right side shows timing with ns-3 synchronized to wall time. Every 0.5 seconds of wall time, the ROS2 thread issues a DDS transmit from R1 to R2, R3, R4, and R5. In simulation time, ns-3 simulates the 48 ms of wall-time activity in about 17 ms. ns-3 pcap inspection verifies that in both sim time and wall time, packets are being resent 8 times and no 802.11 ack responses are being sent. Given this plot, it is evident that running ns-3 in sim time does not make the 802.11 ack responses show up; 802.11 ack responses show up by adding more network devices.
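
For reference, this is how an ns-3 program is typically switched between the default simulation-time scheduler and the wall-clock scheduler; it is a sketch of the standard ns-3 mechanism, not necessarily how ns3_mobility selects its mode.

#include "ns3/core-module.h"

using namespace ns3;

// Select wall-time or simulation-time scheduling before the topology is built.
void SelectScheduler (bool wallTime)
{
  if (wallTime)
    {
      // Events are dispatched in step with the wall clock.
      GlobalValue::Bind ("SimulatorImplementationType",
                         StringValue ("ns3::RealtimeSimulatorImpl"));
    }
  // Otherwise the default ns3::DefaultSimulatorImpl runs events as fast as it
  // can in simulation time, which is how ~48 ms of wall-time activity can be
  // simulated in about 17 ms.
}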

Remove the ACK

Above, we identified:

  • Receiving Wifi nodes may or may not return an 802.11 ack response.
  • The message is transmitted eight times when the 802.11 ack is not received.
  • Adding network devices to the OS helps nodes respond with 802.11 ack, for reasons not yet identified.

Now we examine performance when the retry count is 0. Specifically, we set the maximum number of retransmission attempts for RTS and DATA packets to 0. Using ns-3 file ~/repos/ns-allinone-3.30.1/ns-3.30.1/src/wifi/examples/wifi-manager-example.cc as an example, we add this configuration to our ns-3 program:

ns3::Config::SetDefault ("ns3::WifiRemoteStationManager::MaxSlrc", ns3::UintegerValue (0));
ns3::Config::SetDefault ("ns3::WifiRemoteStationManager::MaxSsrc", ns3::UintegerValue (0));
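// MaxSlrc is the maximum number of retransmission attempts for DATA packets and
// MaxSsrc the maximum for RTS; with both set to 0, each frame is transmitted
// once and never retried.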

The left side of this plot shows the result of removing retries:

[figure: no ack latency]

A pcap analysis of ns-3 traffic shows that no 802.11 ack packets are sent.

ACK vs no ACK

The left side of the plot above shows latency when ack capability is disabled, while the right side shows latency when ack is working correctly. The plot shows that there is more latency when ack is used correctly than when ack is not used at all. In an ad hoc Wifi network, we can expect the latency shown on the right.

Conclusions

  • There may be a timing bug in how ns-3 manages 802.11 ack responses, which we might want to report.
  • We have a workaround: by adding unused network Tap devices, ns-3 manages 802.11 ack responses correctly.

Comments

  • The ad hoc Wifi and DDS protocols both try to manage dynamic network configuration, resulting in network traffic that adds to any existing load. We recommend against using them together when scalability of swarm size may be an issue.

Additional Comments

Here are example latency values, in seconds, captured at R1 using ns-3. These values correspond to timing at the right of the graph:

.003311 (R5)
.003527
.003806
.004913
.006821
.009530
.009732
.011042

.011249 (R4)
.011337 ack

.011380 (R3)
.011468 ack

.011610 (R2)
.011698 ack

Latency values for R2 are the same as those for R1 except that their values are 44 microseconds longer, and except for the final ack timing, which does not make sense. For R1, the final ack, .011670, happens before its transmit time, .011698, which is impossible. For R3, R4, and R5, the final ack time is the same as the ack transmit time at R1, .011698.

Conditions causing ACK failure

Above we found that ns-3 is failing to send ACK signals. Here we examine conditions that cause ACK failure. Examination includes:

  • Increasing the number of Network devices.
  • Increasing the size of the network.
  • Adding ns-3 packet captures.
  • Adding ns-3 Wifi diagnostic logging.

Findings:

  • Maximum stable network size is 3 or 4 nodes.
  • Adding ns-3 diagnostics degrades performance.
  • Removing and adding network devices without rebooting degrades performance.
  • ns-3 diagnostics are available: ns-3 pcap reveals 802.11 MAC packets, and ns-3 Wifi logs create thousands of entries that map closely to the ns-3 Wifi code (see the logging sketch after this list).
  • As the node count increases to 5 or 6, DDS setup takes several minutes. Setup time increases further when adding ns-3 diagnostics.
  • There is no difference in performance between running ns-3 in wall-time mode and sim-time mode.
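
Here is a minimal sketch of how the ns-3 Wifi diagnostic logging mentioned above can be turned on; the component names are illustrative examples of ns-3 Wifi log components, not an exhaustive list.

#include "ns3/core-module.h"

using namespace ns3;

// Turn on verbose logging for a few ns-3 Wifi components. Expect thousands of
// entries that map closely to the ns-3 Wifi code, and expect a noticeable
// slowdown.
void EnableWifiDiagnostics ()
{
  LogComponentEnable ("WifiPhy", LOG_LEVEL_ALL);
  LogComponentEnable ("MacLow", LOG_LEVEL_ALL);                   // frame exchange, acks, retries
  LogComponentEnable ("WifiRemoteStationManager", LOG_LEVEL_ALL); // retry decisions
}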

Example Setup

Here are setup steps:

cd ~/gits/ns3_testbed/scripts/
sudo ./nns_setup.py setup -c 5

Start ns-3 with default fixed antenna positions:

cd ~/gits/ns3_testbed/ns3_programs/build
./ns3_mobility -c 5

Using example_keep_last_5.csv:

Publish,,,,,,,
Node,Subscription,Frequency,Size,History,Depth,Reliability,Durability
R1,odometry,2,10,keep_last,5,best_effort,volatile
,,,,,,,
Subscribe,,,,,,,
Node,Subscription,History,Depth,Reliability,Durability,,
R2-5,odometry,keep_last,5,best_effort,volatile,,
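
As a sketch of how these CSV QoS columns presumably map onto ROS2 (an assumption about how cpp_testbed_runner interprets them; the message type in the usage comment is illustrative), the matching rclcpp QoS profile would be:

#include "rclcpp/rclcpp.hpp"

// QoS matching the CSV columns: History=keep_last, Depth=5,
// Reliability=best_effort, Durability=volatile.
rclcpp::QoS MakeCsvQos ()
{
  rclcpp::QoS qos (rclcpp::KeepLast (5));
  qos.best_effort ();
  qos.durability_volatile ();
  return qos;
}

// Usage, for example a publisher on the odometry subscription:
//   auto pub = node->create_publisher<std_msgs::msg::String>("odometry", MakeCsvQos());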

Start the robots with one talker and four listeners:

sudo /bin/bash
cd ~/gits/ns3_testbed/cpp_testbed_runner/build/cpp_testbed_runner
./testbed_runner -c 5 -i example_keep_last_5.csv -n -v

Start Wireshark to perform packet analysis on the pcap files that ns-3 creates when running ns3_mobility:

wireshark

Look at one of the generated pcap files, for example ~/gits/ns3_testbed/ns3_programs/build/ns3_mobility_capture-0-0.pcap. Open it in Wireshark using menu File | Open and scroll down past the DDS and ad hoc setup traffic. Here we see the expected transmits: 4 messages with 8 repeats each. Increasing the packet size does not cause acks to show up; increasing it to 10,000 bytes causes packet fragmentation, with multiple re-sends of each fragment and still no ack.

Set up with 6 Tap devices:

sudo ./nns_setup.py setup -c 6

Now acks work fine. Increasing the packet size to 10,000 bytes fragments as expected, and acks still work fine.

Running with 6 Tap devices and 6 robots, acks appear for the first four transmits but are missing on the fifth. The same happens with 7 and 8 Tap devices.

Deleting and then reinstalling Tap devices produces a different setup because device indexes are not re-used. A reboot is required to recreate initial conditions.