Troubleshooting the Network - theunissenlab/lab-documentation GitHub Wiki

Troubleshooting

  1. I can't access one of the lab's shared data drives (tdrive, zdrive, fhome, ...)
  2. I can't log in on my Ubuntu workstation / login loop
  3. I can't see the shared data drives
  4. I can't SSH to zebra and/or finch
  5. I can't read or save to files (or directories) on zdrive or tdrive that I should be able to
  6. I can't create new files

I can't access one of the lab's shared data drives (tdrive, zdrive, fhome, ...)

The remote mounting is configured on zebra. When connected to zebra, you can see a list of the drives configured for automount with the command ldapsearch.

To add a new entry, modify this template and create a new ldif file. Add the ldif entry using ldapadd. To do this, bind as root and use -W for a password prompt. The password is finchfinchfin. Use ldapsearch again to verify that the new entry is there.

I can't log in on my Ubuntu workstation / login loop

One common problem on Ubuntu is a login loop: you enter your password on the login screen, it looks like it's logging you in, but you end up back at the login screen. This typically happens when the user's home folder is inaccessible for some reason.

In all cases, the first thing to do is press Ctrl-Alt-F1 (Ctrl-Alt-F2 or F3 on Ubuntu 18.04) to switch to a text console and try to log in there. You can return to the GUI with Ctrl-Alt-F7.

  • If you can't log in as yourself, there is probably a problem with the network, zebra, or the local LDAP config (your user account is stored on the LDAP server, which is zebra). Try logging in as the local user "ubuntu" to continue debugging.

  • If you can log in, check whether the shared data drives are available (ls /auto/fhome/, ls /auto/zdrive/, ls /auto/tdrive/). If they aren't there (especially /auto/fhome/), follow the troubleshooting steps in I can't see the shared data drives.

  • If you can log in and the shared drives are there, the problem may be with the display manager. This is likely caused by software that was installed at some point; it also occurred for me when upgrading Ubuntu from 16.04 to 18 and 20. Try switching the display manager to see if that fixes the problem. cat /etc/X11/default-display-manager shows your current display manager, and sudo dpkg-reconfigure gdm3 opens a window with the option of switching the default display manager. If you just upgraded to 20.04, switching to lightdm may help.

    • The login loop issue on corvus was solved by switching to lightdm (sudo dpkg-reconfigure lightdm) and restarting.
  • I tried this too:

# sudo apt-get install xserver-xorg-legacy
# Edit /etc/X11/Xwrapper.config
allowed_users=anybody   # used to be console?
needs_root_rights=yes
  • Finally, you could try these solutions.

  • Make sure that ~/.dmrc contains Session=ubuntu (it is unclear why, but this fixed a login loop).

I can't see the shared data drives

Check if network or nashome is down

Check whether the computer is connected to the network with ping zebra. Then try ping nashome (the system that holds our home directories) to see if it is accessible. If it is not, check whether it is accessible from any of the other workstations.
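The two ping checks can be wrapped in a small helper so the result is easy to read from a text console. This is only a sketch: the function name check_hosts is made up for illustration, and the -W timeout flag assumes the Linux (iputils) version of ping.

```shell
# Report whether each named host answers a single ping.
# check_hosts is a hypothetical helper name, not an existing lab script.
check_hosts() {
    local host status
    status=0
    for host in "$@"; do
        # -c 1: send one packet; -W 2: wait at most 2 seconds (Linux ping)
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "$host: reachable"
        else
            echo "$host: NOT reachable"
            status=1
        fi
    done
    # Non-zero exit status if any host failed to answer
    return "$status"
}
```

For example, check_hosts zebra nashome prints one line per host and returns a non-zero status if either one did not answer.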

Retry mounting

There is sometimes a timing bug where the computer tries to mount fhome before the network is accessible. Running sudo service autofs restart should retry the automounting. Check ls /auto/fhome and ls /auto/tdrive and ls /auto/zdrive to see if they are visible. If they are, hit Ctrl-Alt-F7 and attempt to log in again.
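A sketch of the "restart, then check" step is below. The helper name check_dirs is hypothetical, and it assumes that a directory which is missing or empty means the drive did not mount; a drive that is mounted but genuinely empty would be reported as missing.

```shell
# Report which of the given directories exist and contain at least one entry.
# Listing an autofs-managed path is also what triggers the automount.
# check_dirs is a hypothetical helper name.
check_dirs() {
    local dir status
    status=0
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "$dir: visible"
        else
            echo "$dir: missing or empty"
            status=1
        fi
    done
    return "$status"
}
```

After sudo service autofs restart, running check_dirs /auto/fhome /auto/tdrive /auto/zdrive gives a one-line verdict for each drive.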

Still doesn't work

If /auto/fhome is still not visible, see if you can locate the activate-netlogin.sh script on the system (it might be in the root directory / from when the computer was first set up). If you find it, re-run it (the important line of the script is pam-auth-update). Then press Ctrl-Alt-F7 and attempt to log in again.

Just manually mount the drives

If none of the above worked (or you didn't find the activate-netlogin.sh script), you can mount the remote drives manually. If the mount points don't exist yet, create them first (sudo mkdir /auto/zdrive /auto/tdrive /auto/fhome), then run these commands:

# Create the mount points if zdrive, tdrive, and fhome do not exist yet
sudo mkdir -p /auto/zdrive /auto/tdrive /auto/fhome
sudo mount -t nfs zdrive:/volume1/zdrive /auto/zdrive
sudo mount -t nfs tdrive:/mnt/NAS/tdrive/tdrive /auto/tdrive
sudo mount -t nfs nashome:/raid/data/_NAS_NFS_Exports_/homes/fethome /auto/fhome

I can't SSH to zebra and/or finch

  • First, can you ping zebra / ping finch? If you can't, there is probably an issue with the network itself.

  • You can't SSH to zebra (or finch) normally with your username and password; your SSH public key must be added to the host's list of authorized keys before it will accept your SSH connection.

  • While you can connect to zebra as any user, finch only has two users: root and ubuntu. Make sure you are connecting with ssh ubuntu@finch, not just ssh finch.

  • Try connecting to zebra with its external IP address. Same for finch. If you can't, the server you are connecting to might be down. If you can connect with the external IP but not the internal IP / hostname, then the problem is probably in the network. In particular, the link between zebra and finch may be down (see next section).
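The first checks in the list above can be combined into one sketch that separates "host unreachable" from "SSH refused". The function name diagnose_ssh is hypothetical. It relies on the standard OpenSSH options BatchMode=yes (never prompt for a password, so only key login is tested) and ConnectTimeout, and on the Linux ping -W flag.

```shell
# Distinguish a network problem from an SSH key problem for one target.
# diagnose_ssh is a hypothetical helper; pass "host" or "user@host".
diagnose_ssh() {
    local target host
    target=$1
    host=${target#*@}    # strip an optional "user@" prefix
    if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host: no ping reply (network or host down?)"
        return 1
    fi
    # BatchMode=yes makes ssh fail instead of asking for a password
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true 2>/dev/null; then
        echo "$target: SSH key login works"
    else
        echo "$target: ping ok but SSH failed (is your public key authorized?)"
        return 2
    fi
}
```

For example, diagnose_ssh ubuntu@finch or diagnose_ssh zebra. The same function can be pointed at a server's external IP address to compare against the hostname result.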

I can't connect to zebra from finch (or vice versa)

Zebra and finch are connected by an IPsec tunnel. If this connection is broken, users will not be able to log in (since zebra is in charge of authentication, and the workstations are connected to finch).

To check this, connect to finch as user ubuntu. (Note: as of Oct 2020, I'm not sure how to ssh into finch since my connections get rejected - Kevin). While connected, check the ipsec connection with sudo ipsec status.

You should see something like this:

Security Associations (1 up, 0 connecting):
   tlab-lab[1]: ESTABLISHED 2 minutes ago, 169.229.146.178[169.229.146.178]...169.229.219.169[169.229.219.169]  
   tlab-lab{1}:  INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: c2eb3265_i c82afb2f_o  
   tlab-lab{1}:   10.0.2.0/24 === 10.0.1.0/24

If you instead see

Security Associations (0 up, 0 connecting):    
   none

you should restart the tunnel using the command sudo ipsec restart. The other end should automatically reconnect.
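The check-and-restart step could be scripted roughly as follows. This is a sketch that assumes the first line of sudo ipsec status always has the "Security Associations (N up, M connecting):" form shown above; the function names sa_up_count and restart_if_down are made up for illustration.

```shell
# Read `ipsec status` output on stdin and print the number of
# security associations reported as up.
sa_up_count() {
    sed -n 's/^Security Associations (\([0-9][0-9]*\) up,.*/\1/p'
}

# Restart the tunnel only when no security association is up.
restart_if_down() {
    local up
    up=$(sudo ipsec status | sa_up_count)
    case "$up" in
        0)  echo "no tunnel up, restarting ipsec"
            sudo ipsec restart ;;
        '') echo "could not parse ipsec status output" ;;
        *)  echo "$up tunnel(s) up" ;;
    esac
}
```

Run restart_if_down on finch as user ubuntu. If the output format differs from the sample above, the function reports that it could not parse the status rather than restarting anything.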

I can't read or save to files (or directories) on zdrive or tdrive that I should be able to

If there are issues with reading or writing files that other users (or yourself) have created on the shared data drives (in /auto/zdrive, /auto/tdrive, or any subdirectories), it is possible that permissions on the NAS are not configured correctly.

A temporary fix is to add the correct permissions with chmod (for example chmod 777 filename.txt to give all permissions to everyone on filename.txt).

However, a permanent fix may involve fixing the configuration on the NAS system. Connect to zdrive's Synology Assistant software in the browser. The configuration settings we have changed are documented in this wiki - if you solve this or change any other configuration, make sure to make a note of it there.
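Before applying the temporary chmod fix, it can help to see which files are actually affected. The sketch below lists regular files that are not readable and writable by everyone. It assumes that fully open permissions are what is wanted, as the chmod 777 example implies, and the function name find_restricted is hypothetical.

```shell
# List regular files under a directory that are missing at least one of the
# read/write bits for owner, group, or others (i.e. not at least mode 666).
# find_restricted is a hypothetical helper name.
find_restricted() {
    find "$1" -type f ! -perm -666 -print
}
```

For example, find_restricted /auto/zdrive/some_project prints the files a collaborator may be unable to read or modify (some_project is a placeholder directory name).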

I can't create new files

If you get errors when trying to create new files, or if a script inexplicably fails partway through writing a new file, the disk you are writing to may be running out of space. This is most likely to happen on fhome, which has 1 TB of space shared across all users. This is why all data should be saved on zdrive or tdrive.

List the size of our shared drives with

df | (read header; echo "$header"; grep /auto)
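To flag a nearly full drive automatically, the df output can be filtered by usage percentage. This sketch assumes the POSIX df -P column layout (capacity in column 5, mount point in column 6) and filesystem names without spaces; the function name over_threshold and the 90 percent limit are arbitrary choices for illustration.

```shell
# Read `df -P` output on stdin and print "mountpoint use%" for every
# filesystem whose usage is above the given percentage.
# over_threshold is a hypothetical helper name.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 > limit + 0) print $6 " " $5
        }'
}
```

For example, df -P /auto/fhome /auto/zdrive /auto/tdrive | over_threshold 90 prints only the drives that are more than 90 percent full, and prints nothing if all of them have room.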