
I'm not using clearml-agent here, I use clearml.Task.init.
The exit(1) (or raised exception) is from a subprocess.
clearml==1.1.3
torch==1.9.0+cu111, torchvision==0.10, lightning not installed
python3.8
debian 10
I will try reproducing with smaller code. It was a training run with detectron2, which uses torch.multiprocessing.spawn and torch.distributed.init_process_group
https://github.com/facebookresearch/detectron2/blob/c47167e4ac236a36895c294735a908b75f659f96/tools/train_net.py#L163
https...
Hello AgitatedDove14, it does not throw an exception, but in the UI the link is broken, so the image does not show
WebApp: 1.2.0-153 • Server: 1.2.0-153 • API: 2.16
Is there a way to check how clearml gets the installed packages of the current env ?
Hmm it's both better and worse, it does detect pyfunctional now (in INSTALLED PACKAGES and I can see it installed in the console logs) but it fails on import torch: ModuleNotFoundError: No module named 'torch'
In the logs:
Found PyTorch version torch==1.7.1 matching CUDA version 110
2021-04-21 15:15:11
Found PyTorch version torchvision==0.8.2 matching CUDA version 110
Collecting torch==1.7.1+cu110
File was already downloaded /home/ubuntu/.clearml/pip-download-cache/cu110/torch-1.7.1+cu110...
Made a PR to help a bit with loading console logs
logs can be huge but are loaded 7kB at a time currently
100+ parameters is quite a lot indeed but very quickly achieved when using frameworks like detectron2, where you configure the model in the configuration (+dataloader, datasets, evaluators, augmentation, optimizer, lr_scheduling). anyway the search is broken as soon as one line you search is not currently visible, so already with 20+ ...
We have the same issue for hyperparameters even with only ~100 keys, where the UI likes to lazy load and remove scrolled elements so it breaks browser search, and integrated search works like 15% of the time…
and ctrl-f (of the browser) doesn't work because the lines below are not loaded (even when you scroll, it removes the lines that are no longer visible, so you can't ctrl-f them)
It works with post_packages
quick video of the search not working
We managed to upgrade it, but the volume claims changed somehow and it created new disks. I will back up from the old disks and upload to the new ones to migrate, but the backup procedure is not detailed for Kubernetes. Do you have info for this?
should i only do mongodb?
Yes I think it needs pytorch, but pytorch failed to install previously ?
Ok, btw I used https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_agent_install_configure.html which was not updated so I didn't know there was a priority_packages and post_packages
Ok thanks I will check first for permission issues
Managed a workaround thanks to the API doc, if someone encounters the same bug:
tasks = []
page = 0
while True:
    page_tasks = Task._query_tasks(project_name=project, system_tags=[] if archived else ['-archived'], page=page, page_size=500)
    tasks += page_tasks
    page += 1
    if len(page_tasks) < 500:
        break
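The same paging loop can be sketched without a server, which makes the stop condition easier to check. This is only an illustration: `fetch_all`, `fetch_page` and `fake_fetch` are hypothetical names, not part of clearml. In the workaround, the fetcher would be the `Task._query_tasks` call.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_all(fetch_page: Callable[[int, int], List[T]], page_size: int = 500) -> List[T]:
    """Collect every item by requesting pages until a short page comes back."""
    items: List[T] = []
    page = 0
    while True:
        batch = fetch_page(page, page_size)
        items += batch
        page += 1
        # A page shorter than page_size (including an empty one) is the last page.
        if len(batch) < page_size:
            break
    return items


# Hypothetical in-memory "server" holding 1203 fake task ids.
_FAKE_TASKS = [f"task-{i}" for i in range(1203)]


def fake_fetch(page: int, page_size: int) -> List[str]:
    """Return one slice of the fake task list, like a paged API would."""
    start = page * page_size
    return _FAKE_TASKS[start:start + page_size]


all_tasks = fetch_all(fake_fetch, page_size=500)
print(len(all_tasks))
```

When the total is an exact multiple of the page size, the loop makes one extra request that returns an empty page and then stops.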
Hello, sorry the second is for models and not images
The task is registered and is started by the agent, the env seems to be installed well, but then it fails on /home/ubuntu/.clearml/venvs-builds/3.8/bin/python: can't open file 'fastai_classifier.py': [Errno 2] No such file or directory
Do you have an idea of what could be wrong? Does the agent launch the script in the wrong working dir? Is the repo not copied? (This script is inside a private git repo, which clearml detects correctly.)
I also tried launching the script from the root of th...
Ok so I reproduced it with this. It happens when I have colors (I first got the error with an exception printed with stackprinter):
import sys
import colorama
from clearml import Task

Task.init(project_name="test", task_name="test", reuse_last_task_id=False)
print("this is a test <hello world> rest of the text")
print("this is a test <hello world> rest of the text", file=sys.stderr)
print(colorama.Fore.RED + "this is a test <hello world> rest of the text" + colorama.Style.RESET_ALL)
![i...
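Since the problem only shows up with colored output, one way to narrow it down is to strip the color codes before printing and see whether the console still breaks. A minimal sketch, assuming the colors are plain ANSI SGR escape sequences (which is what colorama.Fore.RED and colorama.Style.RESET_ALL emit). `strip_ansi_colors` is a hypothetical helper, not a clearml or colorama function.

```python
import re

# Matches SGR sequences such as "\x1b[31m" (red) and "\x1b[0m" (reset).
_SGR_RE = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi_colors(text: str) -> str:
    """Return text with ANSI color/style escape sequences removed."""
    return _SGR_RE.sub("", text)


# Same string as in the repro, with the escape codes written out by hand.
colored = "\x1b[31m" + "this is a test <hello world> rest of the text" + "\x1b[0m"
plain = strip_ansi_colors(colored)
print(plain)
```

This only covers color/style codes. Other escape sequences (cursor movement, for instance) are left untouched.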
I think I found the problem, if the file is untracked by git, it is not saved by clearml
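A quick way to check for this before launching is to list what git considers untracked. A minimal sketch, assuming `git` is on the PATH. `parse_untracked` and `untracked_files` are hypothetical helper names, and the file names in `sample` are made up. This only reports what git sees, not what clearml actually stores.

```python
import subprocess
from typing import List


def parse_untracked(porcelain_output: str) -> List[str]:
    """Extract untracked paths from `git status --porcelain` output."""
    # Untracked entries are the lines prefixed with "?? ".
    return [line[3:] for line in porcelain_output.splitlines() if line.startswith("?? ")]


def untracked_files(repo_dir: str = ".") -> List[str]:
    """Run git in repo_dir and return the paths it reports as untracked."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_untracked(result.stdout)


# Made-up porcelain output: one modified file, two untracked ones.
sample = " M train.py\n?? new_script.py\n?? notes/todo.txt\n"
print(parse_untracked(sample))
```

Calling `untracked_files()` from the repo root before `Task.init` would show whether the script being launched is among the untracked files.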
Does clearml-agent install the repo with pip install -e . if it should be? (i.e. my local repo is installed with pip install -e . where I launch my script, which calls Task.init and .execute_remotely()).
The script is inside a git repo (and it's the one I launch; I would get an ImportError if something else was missing)
We tried with a docker-compose on a GCE VM + load balancers, and then in kube, and we get the same error: clearml-init returns Error: could not verify credentials: key=241... secret=NhC...