Known issues - til-ai/til-25 GitHub Wiki
Known issues and how to work around them.
Template Code
"ImportError: attempted relative import with no known parent package" when running CV container
Line 12 of til-25/cv/cv_server.py reads:
from .cv_manager import CVManager
This relative import works when the package is run as a module, but it fails out of the box when the server is started with uvicorn. To get a working version, replace that line with an absolute import:
from cv_manager import CVManager
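If you would rather keep both launch styles working, one option is to try the relative import first and fall back to the absolute one. The sketch below is illustrative only: it writes two stand-in files to a temporary directory (their contents are hypothetical, not the real template code) and imports cv_server as a top-level module. This assumes that top-level import is what happens under uvicorn, which the error message suggests but which is not confirmed here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for cv_manager.py; the real file does much more.
MANAGER_SRC = "class CVManager:\n    pass\n"

# Stand-in for the import line in cv_server.py, with a fallback added.
SERVER_SRC = (
    "try:\n"
    "    from .cv_manager import CVManager  # works when run as part of a package\n"
    "except ImportError:\n"
    "    from cv_manager import CVManager  # works when imported as a top-level module\n"
)


def import_as_top_level(src_dir):
    """Import cv_server with no parent package, via a temporary sys.path entry."""
    sys.path.insert(0, src_dir)
    try:
        importlib.invalidate_caches()
        return importlib.import_module("cv_server")
    finally:
        sys.path.remove(src_dir)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cv_manager.py").write_text(MANAGER_SRC)
    Path(tmp, "cv_server.py").write_text(SERVER_SRC)
    server = import_as_top_level(tmp)
    print(server.CVManager.__name__)  # prints "CVManager"
```

The relative import raises ImportError because a top-level module has no parent package, and the except branch then resolves cv_manager through sys.path.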
Vertex AI Workbench
Data/team directories are missing on startup
On startup, the directories in /home/jupyter are sometimes missing. Don't panic, your data is not lost! The data in those directories isn't stored on your local filesystem. The directories are Google Cloud Storage buckets mounted using gcsfuse, so they just need to be recreated and remounted:
mkdir $TEAM_TRACK
mkdir $TEAM_NAME
sudo mount $TEAM_TRACK
sudo mount $TEAM_NAME
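If this happens often, the same steps can be scripted. The following is a minimal sketch, assuming TEAM_TRACK and TEAM_NAME are set in your environment as in the commands above and that the script is run from the same directory you would run those commands in. The function names are made up for illustration, and it uses mkdir -p (rather than plain mkdir) so that an existing directory is not an error. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess


def remount_commands(mount_points):
    """Build the mkdir/mount commands for every path that is not already mounted."""
    commands = []
    for path in mount_points:
        if os.path.ismount(path):
            continue  # already mounted, nothing to do
        commands.append(["mkdir", "-p", path])
        commands.append(["sudo", "mount", path])
    return commands


def remount(mount_points, dry_run=True):
    """Print each command, and run it only when dry_run is False."""
    for cmd in remount_commands(mount_points):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    points = [os.environ.get("TEAM_TRACK"), os.environ.get("TEAM_NAME")]
    # Skip variables that are unset; switch dry_run to False to actually remount.
    remount([p for p in points if p], dry_run=True)
```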
Unresponsive or 502 Error on JupyterLab instance
Sometimes, when you click on Open JupyterLab, the JupyterLab environment fails to load and instead displays "502. That's an error. That's all we know." At other times, a running JupyterLab instance becomes unresponsive. Both are generally caused by a transient network issue and are usually fixed by a hard refresh of the browser page (Ctrl-Shift-R on Windows or Cmd-Shift-R on Mac).
NVIDIA driver not recognized
Sometimes, the NVIDIA drivers stop being recognized on the instance. You can see this when you run the nvidia-smi command.
Try re-installing the CUDA drivers. First, remove the existing NVIDIA packages:
sudo apt-get purge 'nvidia-*'
sudo apt-get update
sudo apt-get autoremove
Stop and start the instance, then download and install CUDA 12.4.1 along with its drivers:
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-debian10-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo dpkg -i cuda-repo-debian10-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo cp /var/cuda-repo-debian10-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
sudo apt-get install -y cuda-drivers
At the last step (re-installing CUDA drivers), select "yes" for both prompts.
If the issue persists, ping @tech from your team's private Discord channel.
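To check from a notebook or script whether the driver is visible again after the steps above, a small helper along these lines may be useful. It is a sketch only: the function name is made up, and it assumes that nvidia-smi exits with a non-zero status when it cannot talk to the driver.

```python
import shutil
import subprocess


def nvidia_driver_visible(command="nvidia-smi"):
    """Return True if `command` is on PATH and exits with status 0."""
    if shutil.which(command) is None:
        return False  # the tool itself is missing
    result = subprocess.run([command], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    if nvidia_driver_visible():
        print("NVIDIA driver is visible")
    else:
        print("NVIDIA driver is NOT visible")
```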
See also:
- https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/A100-GPU-VM-on-GCP-NVIDIA-SMI-has-failed-because-it-couldn-t/m-p/480629
- https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=10&target_type=deb_local
Dependencies
transformers incompatibility with torch_xla
Versions of the Hugging Face transformers library newer than 4.37.0 may not work on GCP instances because of an incompatibility with the torch_xla library. If you intend to use transformers and have no specific need for a newer version, we recommend pinning transformers==4.37.0.
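To see whether your environment is on a version newer than the recommended one, you could compare the installed version against 4.37.0. The sketch below uses only the standard library, and the helper names are illustrative. It compares just the numeric major.minor.patch part, so pre-release suffixes such as rc1 are ignored.

```python
import re
from importlib import metadata

RECOMMENDED = (4, 37, 0)  # from the recommendation above


def release_tuple(version):
    """Turn '4.38.1' or '4.40.0.dev0' into a (major, minor, patch) tuple."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def newer_than_recommended(version):
    """True if `version` is strictly newer than the recommended release."""
    return release_tuple(version) > RECOMMENDED


def check_installed(package="transformers"):
    """Describe the installed version of `package` relative to the recommendation."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if newer_than_recommended(installed):
        return f"{package} {installed} is newer than 4.37.0 and may clash with torch_xla"
    return f"{package} {installed} is at or below 4.37.0"


if __name__ == "__main__":
    print(check_installed())
```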
pettingzoo incompatibility with python>=3.13
We pin PettingZoo to version 1.25.0 in til-25-environment, which is the newest PettingZoo release. However, that version is not formally supported on python>=3.13 and will likely throw an error during installation. We therefore advise using Python 3.12 for your RL training and inference.
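A quick guard at the top of a training or inference script can catch an unsupported interpreter before the installation or import fails. This is a minimal sketch, and the function name is made up for illustration.

```python
import sys


def python_ok_for_pettingzoo(version_info=sys.version_info):
    """True if the interpreter is older than 3.13, per the note above."""
    return tuple(version_info[:2]) < (3, 13)


if __name__ == "__main__":
    if not python_ok_for_pettingzoo():
        major, minor = sys.version_info[:2]
        print(f"Python {major}.{minor} detected; PettingZoo 1.25.0 may not install. Use Python 3.12.")
```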