Failure On Local Socket Bind When Wifi Drops

We are seeing a strange issue on a Raspberry Pi.

We run a service on a socket that should serve both local clients and remote clients connecting over Wi-Fi. The trouble is that when the Wi-Fi network goes down, local clients can no longer connect either.

Our Python server sets up a socket like this:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, 1)
s.settimeout(2)      # accept() gives up after 2 s if no client connects
s.bind(("", 8888))   # "" = INADDR_ANY, i.e. listen on 0.0.0.0
s.listen(5)

while True:
    try:
        conn, addr = s.accept()
    except socket.timeout:
        print("Socket timeout on s.accept(), continuing")
        continue

    # do stuff with conn, then close it
    conn.close()

We have a local Node.js client that runs something like this roughly every second (and actually sends data):

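const net = require('net');
setInterval(function () {  // every second, open a fresh connection
  const socket = new net.Socket();
  socket.on('connect', function () { /* do stuff */ });
  socket.on('error', function (ex) { });
  socket.connect(8888, 'localhost');  // port first, then host
}, 1000);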

Everything runs fine until we cut the Wi-Fi. Then the server side times out on s.accept() and we see the timeout message in our logs.

My guess is that the socket is listening on 0.0.0.0 but somehow stops accepting on 127.0.0.1, or that some strange routing situation occurs.
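One way to test that theory on the Pi is to take the Node client out of the picture. The sketch below binds to "" exactly like the server and then connects through 127.0.0.1. The helper name loopback_reaches_wildcard_bind is made up for illustration, and it uses port 0 (any free port) instead of 8888 so it won't collide with the running service. If it returns True with the Wi-Fi down, the wildcard bind is accepting loopback connections and the problem is more likely on the client side.

```python
import socket


def loopback_reaches_wildcard_bind() -> bool:
    """Bind to "" (0.0.0.0) like the server, then connect to it via 127.0.0.1."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", 0))          # port 0: let the OS pick a free port (test only)
    srv.listen(1)
    port = srv.getsockname()[1]
    try:
        # Connect over the loopback interface; wlan0 carries none of this traffic.
        cli = socket.create_connection(("127.0.0.1", port), timeout=2)
        srv.settimeout(2)
        conn, addr = srv.accept()
        ok = addr[0] == "127.0.0.1"
        conn.close()
        cli.close()
        return ok
    finally:
        srv.close()
```

Call loopback_reaches_wildcard_bind() from a Python shell once with the Wi-Fi up and once with it down, and compare.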

netstat -an | grep 8888 gives

tcp        0      0 0.0.0.0:8888            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:8888          127.0.0.1:52794         TIME_WAIT
tcp        0      0 127.0.0.1:8888          127.0.0.1:52724         TIME_WAIT
tcp        0      0 127.0.0.1:8888          127.0.0.1:52740         TIME_WAIT
tcp        0      0 127.0.0.1:8888          127.0.0.1:52778         TIME_WAIT

netstat -rn gives

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         192.168.1.1     0.0.0.0         UG    304    0        0 wlan0
192.168.1.0     0.0.0.0         255.255.255.0   U     304    0        0 wlan0

I'm guessing that we just need a localhost route?

The local connections are established again when the Wi-Fi comes back up, so I don't think the bind in the Python socket is being dropped permanently.

The hosts line in /etc/nsswitch.conf is:

hosts:          files mdns4_minimal [NOTFOUND=return] dns
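Because files comes first in that line, localhost should normally be answered from /etc/hosts before mdns4_minimal or dns are consulted. If that file maps localhost to ::1 as well as 127.0.0.1, a client that picks the IPv6 address will not reach the Python server above, which listens on AF_INET only. A minimal sketch for checking what the file says (hosts_addresses is a hypothetical helper, and it only handles the plain hosts(5) format):

```python
def hosts_addresses(name: str, path: str = "/etc/hosts") -> list[str]:
    """Return the addresses a hosts(5)-format file maps `name` to, in file order."""
    found = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blank lines
            if not line:
                continue
            addr, *names = line.split()            # address first, then hostnames/aliases
            if name in names:
                found.append(addr)
    return found
```

For example, hosts_addresses("localhost") on a stock Raspberry Pi OS install may well list both 127.0.0.1 and ::1; the order matters if the client just takes the first answer.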

We monitored ping to localhost during the test and it kept working. We also watched netstat, and the port stays in LISTEN on 0.0.0.0. Perhaps this is the issue?

Answer

Easiest Solution

It looks like you should avoid name resolution entirely by connecting to "127.0.0.1" instead of "localhost", as discussed in the comments.
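In Node the fix is simply to pass "127.0.0.1" as the host. If you want the same guarantee in a Python client, a minimal sketch is below. connect_numeric is a hypothetical helper; it relies on the AI_NUMERICHOST flag, which makes getaddrinfo() reject anything that is not a literal address instead of looking it up.

```python
import socket


def connect_numeric(ip: str, port: int, timeout: float = 2.0) -> socket.socket:
    """Connect to a literal IP address without ever falling back to name lookup.

    AI_NUMERICHOST makes getaddrinfo() raise socket.gaierror if `ip` is not a
    numeric address, so a name such as "localhost" is never resolved.
    """
    family, type_, proto, _, sockaddr = socket.getaddrinfo(
        ip, port, type=socket.SOCK_STREAM, flags=socket.AI_NUMERICHOST)[0]
    sock = socket.socket(family, type_, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock
```

For example, connect_numeric("127.0.0.1", 8888) would reach the server from the question, while connect_numeric("localhost", 8888) raises socket.gaierror rather than consulting /etc/hosts, mDNS or DNS.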

In more detail:

According to the Node.js source and docs, connect() first tests whether the host is already an IP address. If it is not, it checks whether you provided a lookup function in the connect options; if you did not, it calls dns.lookup() by default. Despite the name, dns.lookup() does not talk to DNS directly: it goes through the system's own name resolution (getaddrinfo), yet the result can still be subtly different from what other tools see; for example, it may prefer IPv6.
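To make that order concrete, here is a Python analogue of the decision. It is not Node's implementation: resolve_for_connect and the Lookup hook type are hypothetical names, and returning (address, family) pairs is an assumption chosen loosely to resemble what dns.lookup() reports.

```python
import ipaddress
import socket
from typing import Callable, Optional

# Hypothetical hook type: takes a hostname, returns (address, family) pairs,
# loosely modelled on the `lookup` option that net.connect() accepts.
Lookup = Callable[[str], list[tuple[str, int]]]


def resolve_for_connect(host: str, lookup: Optional[Lookup] = None) -> list[tuple[str, int]]:
    """Mirror the order described above: literal IP, then custom lookup, then the OS."""
    # 1. A literal IP address is used as-is; no resolver is involved at all.
    try:
        ip = ipaddress.ip_address(host)
        return [(str(ip), ip.version)]
    except ValueError:
        pass
    # 2. A caller-supplied lookup function takes precedence over the default.
    if lookup is not None:
        return lookup(host)
    # 3. Otherwise ask the system resolver (getaddrinfo), roughly what dns.lookup() uses.
    results: list[tuple[str, int]] = []
    for family, _, _, _, sockaddr in socket.getaddrinfo(host, None, type=socket.SOCK_STREAM):
        entry = (sockaddr[0], 6 if family == socket.AF_INET6 else 4)
        if entry not in results:
            results.append(entry)
    return results
```

The point of step 1 is the whole fix: with "127.0.0.1" as the host, neither a lookup hook nor the system resolver is ever called.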

To debug further, you could write a more direct test case that calls dns.lookup("localhost") and compare its result with the output of getent ahosts localhost, getent ahostsv4 localhost and getent ahostsv6 localhost on your different systems, both with the Wi-Fi up and with it down. You could also compare other configuration, such as /etc/gai.conf, to determine whether name resolution on this system behaves differently or is being given slightly different requests.
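A rough Python analogue of those getent comparisons is sketched below; it is not claimed to use getent's exact flags, and loopback_lookup_report is a hypothetical helper. One example of a "slightly different request" is the AI_ADDRCONFIG flag: per the getaddrinfo(3) man page, it returns IPv4 or IPv6 results only if a non-loopback address of that family is configured, so its answer could change when wlan0 loses its address. Some Node.js versions pass the dns.ADDRCONFIG hint from net.connect(); treat that as something to verify for your version.

```python
import socket


def loopback_lookup_report(host: str = "localhost") -> dict[str, list[str]]:
    """Resolve `host` several ways so the answers can be compared across network states."""
    requests = {
        "any": dict(family=socket.AF_UNSPEC),
        "v4": dict(family=socket.AF_INET),
        "v6": dict(family=socket.AF_INET6),
        # AI_ADDRCONFIG: loopback addresses don't count as "configured",
        # so this answer may differ when only lo has an address.
        "any+ADDRCONFIG": dict(family=socket.AF_UNSPEC, flags=socket.AI_ADDRCONFIG),
    }
    report: dict[str, list[str]] = {}
    for label, kwargs in requests.items():
        try:
            infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM, **kwargs)
            addrs: list[str] = []
            for *_, sockaddr in infos:
                if sockaddr[0] not in addrs:   # de-duplicate, keep resolver order
                    addrs.append(sockaddr[0])
            report[label] = addrs
        except socket.gaierror as exc:
            report[label] = [f"error: {exc}"]
    return report
```

Print loopback_lookup_report() with the Wi-Fi up, cut the Wi-Fi, print it again, and look for any entry that changes or turns into an error.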

source: stackoverflow.com