Yes, thanks! In my case, I was actually using TrainsSaver from pytorch-ignite with a local path, then I understood, looking at the code, that under the hood it actually changed the output_uri of the current task. That's why my previous_task.output_uri = "s3://my_bucket" had no effect (it was placed BEFORE the training).
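To make the root cause concrete, a minimal sketch of what I mean, assuming the standard Task.init API (project/task names and the bucket are placeholders; the TrainsSaver keyword is based on my reading of the ignite handler and may differ by version):
from clearml import Task  # `trains` at the time, same interface
from ignite.contrib.handlers import TrainsSaver  # renamed ClearMLSaver in newer ignite versions

# Set the destination on the task itself, before any checkpoint handler is created,
# so the handler inherits it instead of silently overriding it with a local path
task = Task.init(project_name="my_project", task_name="resume_training",
                 output_uri="s3://my_bucket")

# Passing the bucket here too avoids the handler resetting task.output_uri to a local dir
saver = TrainsSaver(output_uri="s3://my_bucket")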
I finally found a workaround using cache, will detail the solution in the issue 👍
Oops, I spoke too fast, the json is actually not saved in S3
If I want to resume a training on multi-GPU, I will need to call this function in each process to send the weights to each GPU
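Something like this rough sketch is what I have in mind, assuming a DDP setup (the helper name and checkpoint path are made up):
import torch
import torch.distributed as dist

def load_checkpoint_on_each_rank(model, checkpoint_path):
    # hypothetical helper: every DDP process calls this so the weights land on its own GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    state_dict = torch.load(checkpoint_path, map_location=f"cuda:{local_rank}")
    model.load_state_dict(state_dict)
    dist.barrier()  # wait until every rank has loaded before resuming training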
Oh, actually this was already raised here
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some info is logged (in the __call__ function), and I would like to have this info logged by clearml
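Something along these lines is what I'm after — just a sketch, subclassing EarlyStopping and pushing its internal state through the ClearML logger (the scalar title/series are arbitrary, and I'm assuming the counter/best_score attributes from the linked source):
from clearml import Logger
from ignite.handlers import EarlyStopping

class ClearMLEarlyStopping(EarlyStopping):
    def __call__(self, engine):
        super().__call__(engine)
        # report the state that EarlyStopping normally only writes to its own python logger
        logger = Logger.current_logger()
        logger.report_scalar("early_stopping", "counter", value=self.counter,
                             iteration=engine.state.epoch)
        if self.best_score is not None:
            logger.report_scalar("early_stopping", "best_score", value=self.best_score,
                                 iteration=engine.state.epoch)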
So the last version of the agent that works for me is 1.9.3
Try to spin up the instance of that type manually in that region to see if it is available
So most likely trains was masking the original error, it might be worth investigating to help other users in the future
Thanks for the explanations,
Yes, that was the case. This is also what I would think, although I double-checked yesterday:
- I create a task on my local machine with trains 0.16.2rc0
- This task calls task.execute_remotely()
- The task is sent to an agent running with 0.16
- The agent installs trains 0.16.2rc0
- The agent runs the task, clones it and enqueues the cloned task
- The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
- When I clone the task manually usin...
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. This is still relevant and I could collect the following on an ec2 instance running a clearml-agent running a stuck task:
- There seems to be a problem with multiprocessing: Although I stopped the task, there are still so many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
Should I open an issue in github clearml-agent repo?
I hit enter too fast ^^
Installing them globally via $ pip install numpy opencv torch will install locally with the warning: Defaulting to user installation because normal site-packages is not writeable, therefore the installation will take place in ~/.local/lib/python3.6/site-packages instead of the default one. Will this still be considered as global site-packages and still be included in experiment envs? From what I tested, it does
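For reference, a quick way to check where the user install actually went and whether the interpreter will pick it up (plain stdlib, nothing clearml-specific):
import site
import sys

print(site.getusersitepackages())   # ~/.local/lib/python3.6/site-packages for a user install
print(site.getsitepackages())       # the "real" global site-packages
print([p for p in sys.path if "site-packages" in p])  # what will actually be importable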
You already fixed the problem with pyjwt in the newest version of clearml/clearml-agents, so all good 😄
Nevermind, I just saw report_matplotlib_figure 🎉
Here is the console with some errors
yes, in the code, I do:
task._wait_for_repo_detection()
REQS_TASK = ["torch==1.3.1", "pytorch-ignite @ git+ ", "."]
task._update_requirements(REQS_TASK)
task.execute_remotely(queue_name=args.queue, clone=False, exit_process=True)
Notice the last line should not have
--docker
Did you mean --detached ?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR 🙂
AgitatedDove14 The first time it installs and creates the cache for the env, the second time it fails with:
Applying uncommitted changes
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
clearml_agent: ERROR: Command '['/home/user/.clearml/venvs-builds.1/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsmncaxx45.txt']' returned non-zero exit status 1.
There is an example in the https://github.com/allegroai/clearml/blob/master/docs/datasets.md#workflow section of the link I shared above
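The gist of that workflow section, as a short sketch (dataset/project names and the folder are placeholders, and the Dataset API may have evolved since):
from clearml import Dataset

# producer side: create a dataset version, add files and publish it
ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files("data/")
ds.upload()
ds.finalize()

# consumer side: fetch a local copy wherever the experiment runs
local_copy = Dataset.get(dataset_name="my_dataset",
                         dataset_project="my_project").get_local_copy()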
That gave me
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3
Building Task 94jfk2479851047c18f1fa60c1364b871 inside docker: ubuntu:18.04
Starting docker build
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
Now it starts, I’ll see if this solves the issue
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and logs indices? Having more would allow scaling and later moving them to separate nodes if needed; with the default heap size of 2GB it should be possible, no?
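For context, what I have in mind is something like raising the shard count through an index template before the indices are created — a rough sketch with the elasticsearch Python client, where the index pattern is only a guess at how the clearml-server event indices are named:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# guessed pattern for the metrics/logs indices; adjust to the real index names
es.indices.put_template(
    name="events_more_shards",
    body={
        "index_patterns": ["events-*"],
        "settings": {"index.number_of_shards": 2},  # instead of the default 1
    },
)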