The task is registered and started by the agent, and the env seems to be installed correctly, but then it fails with: /home/ubuntu/.clearml/venvs-builds/3.8/bin/python: can't open file 'fastai_classifier.py': [Errno 2] No such file or directory
Do you have an idea of what could be wrong? Does the agent launch the script in the wrong working dir? Is the repo not copied? (This script is inside a private git repo, which clearml detects correctly.)
I also tried launching the script from the root of th...
I think I found the problem: if the file is untracked by git, it is not saved by clearml.
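A quick way to check for this before launching is to ask git which files it does not track. This is only a minimal sketch: the helper names `parse_untracked` and `untracked_files` are made up for illustration, and it assumes git is on the PATH and that paths contain no spaces (git quotes those in porcelain output).

```python
import subprocess
from typing import List


def parse_untracked(porcelain_output: str) -> List[str]:
    """Return the paths marked '??' (untracked) in `git status --porcelain` output."""
    untracked = []
    for line in porcelain_output.splitlines():
        if line.startswith("?? "):
            untracked.append(line[3:].strip())
    return untracked


def untracked_files(repo_dir: str = ".") -> List[str]:
    """Ask git which files under repo_dir are untracked (hypothetical helper)."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--untracked-files=all"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_untracked(result.stdout)
```

If the script you want to run shows up in `untracked_files()`, adding and committing it (or at least `git add`-ing it) before creating the task is the obvious thing to try.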
However, I have another problem: my git repo is installed with pip install -e . and I import it in my script, but in a task executed by a clearml-agent the module appears not to be installed?
I think I didn't understand: if I'm not at the root of the repo, do I have to specify the working dir?
Hmm, apparently if I launch the script from the root of the repo (CWD: myrepo, command: python train/classif-custom/train.py) it works, but from its own dir (CWD: myrepo/train/classif-custom, command: python train.py) it doesn't.
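To make the difference between the two launches concrete, here is a small sketch of how the same file ends up with a different working dir and entry point depending on where it is started. The function name `split_entry_point` and the /home/me/myrepo path are hypothetical, and this is not clearml's actual detection code, just the path arithmetic.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_entry_point(repo_root: str, cwd: str, script: str) -> Tuple[str, str]:
    """Return (working dir relative to the repo root, script relative to the working dir)."""
    root = PurePosixPath(repo_root)
    work = PurePosixPath(cwd)
    script_path = work / script  # an absolute `script` replaces `work` here
    working_dir = work.relative_to(root)
    entry_point = script_path.relative_to(work)
    return str(working_dir), str(entry_point)


# Launched from the repo root:
print(split_entry_point("/home/me/myrepo", "/home/me/myrepo",
                        "train/classif-custom/train.py"))
# Launched from the script's own directory:
print(split_entry_point("/home/me/myrepo", "/home/me/myrepo/train/classif-custom",
                        "train.py"))
```

Anything in the script that depends on the current directory (relative imports of sibling folders, relative data paths) will resolve differently in the two cases, which would explain why only one of them works.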
Okay, so we found that for kubernetes, if we allow only TLS v1.3 on the ingress controller, clearml-inits breaks with:
2022-03-04 10:32:02,814 - clearml.session - WARNING - SSLError Retrying HTTPSConnectionPool(host='api.clear-ml.dev.monk.ai', port=443): Max retries exceeded with url: /auth.login (Caused by SSLError(SSLError(1, '[SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:1129)')))
or sometimes just "could not verify credentials".
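To narrow this down from the client side, it can help to check whether the Python/OpenSSL build in the container can speak TLS 1.3 at all, and which version a handshake actually negotiates. A minimal sketch using only the standard `ssl` module; `pinned_context` and `negotiated_version` are names I made up, and the host to probe is whatever your API server is.

```python
import socket
import ssl
from typing import Optional


def pinned_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Build a client context that offers exactly one TLS version."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def negotiated_version(host: str, port: int, version: ssl.TLSVersion,
                       timeout: float = 5.0) -> Optional[str]:
    """Try a handshake pinned to `version`; raises ssl.SSLError if the server refuses."""
    ctx = pinned_context(version)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()


# What this interpreter was built against:
print(ssl.OPENSSL_VERSION, "TLS 1.3 supported:", ssl.HAS_TLSv1_3)
```

If `ssl.HAS_TLSv1_3` is False inside the failing container, the client cannot satisfy a TLS 1.3 only ingress, which would be consistent with the protocol version alert. Calling `negotiated_version(host, 443, ssl.TLSVersion.TLSv1_3)` from that container then shows whether a pinned handshake succeeds.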
OK, so I reproduced it with this. It happens when I have colors (I first got the error with an exception printed with stackprinter):
import sys
import colorama
from clearml import Task

Task.init(project_name="test", task_name="test", reuse_last_task_id=False)
print("this is a test <hello world> rest of the text")
print("this is a test <hello world> rest of the text", file=sys.stderr)
print(colorama.Fore.RED + "this is a test <hello world> rest of the text" + colorama.Style.RESET_ALL)
I previously used scripts like https://github.com/allegroai/clearml-server/issues/83 for images, but they don't migrate artifact URLs.
quick video of the search not working
We have the same issue for hyperparameters even with only ~100 keys, where the UI likes to lazy load and remove scrolled elements so it breaks browser search, and integrated search works like 15% of the time…
Hello, sorry the second is for models and not images
Is there a way to check how clearml gets the installed packages of the current env?
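For comparison, this is a small sketch of what the interpreter itself reports as installed, roughly what `pip freeze` would show. It is not clearml's detection logic (I don't know exactly what that does), just a baseline to diff against the INSTALLED PACKAGES section of a task. The function name `installed_requirements` is made up.

```python
from importlib import metadata
from typing import List


def installed_requirements() -> List[str]:
    """List installed distributions as 'name==version', similar to `pip freeze`."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for line in installed_requirements():
    print(line)
```

A package that appears here but not in the task's package list is a candidate for something the detection missed.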
It works with post_packages
Yes I think it needs pytorch, but pytorch failed to install previously ?
Hmm, it's both better and worse: it does detect pyfunctional now (in INSTALLED PACKAGES, and I can see it installed in the console logs), but it fails on import torch with ModuleNotFoundError: No module named 'torch'
In the logs:
` Found PyTorch version torch==1.7.1 matching CUDA version 110
2021-04-21 15:15:11
Found PyTorch version torchvision==0.8.2 matching CUDA version 110
Collecting torch==1.7.1+cu110
File was already downloaded /home/ubuntu/.clearml/pip-download-cache/cu110/torch-1.7.1+cu110...
Ok, btw I used https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_agent_install_configure.html, which was not updated, so I didn't know there were priority_packages and post_packages options.
Yes the setup.py imports torch unfortunately https://github.com/mapillary/inplace_abn/blob/master/setup.py
WebApp: 1.2.0-153 • Server: 1.2.0-153 • API: 2.16
I'm not using clearml-agent here, I use clearml.Task.init.
The exit(1) (or raised exception) is from a subprocess.
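As a stand-in for the real setup, this is the kind of minimal parent/child pattern I mean, written with the standard `subprocess` module rather than torch.multiprocessing (so it is an assumption that the behaviour carries over). `run_child` is a hypothetical helper.

```python
import subprocess
import sys


def run_child(code: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # keep the child's traceback out of the parent's output
    )
    return completed.returncode


# Both an explicit exit(1) and an uncaught exception give a non-zero exit code,
# but the parent only fails if it checks the code and reacts to it.
print(run_child("import sys; sys.exit(1)"))
print(run_child("raise RuntimeError('boom')"))
```

The parent process keeps running after the child dies unless it inspects the return code itself, which is the situation I'm describing.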
clearml==1.1.3
torch==1.9.0+cu111, torchvision==0.10, lightning not installed
python3.8
debian 10
I will try reproducing with a smaller script. It was a training run with detectron2, which uses torch.multiprocessing.spawn and torch.distributed.init_process_group
https://github.com/facebookresearch/detectron2/blob/c47167e4ac236a36895c294735a908b75f659f96/tools/train_net.py#L163
https...