Hi CostlyOstrich36! No, I am running in venv mode
extra_configurations = {"SubnetId": "<subnet-id>"}
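If I understand correctly, whatever is in extra_configurations gets merged into the EC2 run_instances call the autoscaler makes, so the effect should be roughly this (all values below are placeholders):
```python
import boto3

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="<ami-id>",
    InstanceType="<instance-type>",
    MinCount=1,
    MaxCount=1,
    SubnetId="<subnet-id>",  # the key added via extra_configurations
)
```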
That fixed it 🙂
Trying your code now… should take a couple of mins
Oh, the object is actually available in previous_task.artifacts
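For anyone else hitting this, a minimal sketch of reading it back (task id and artifact name are placeholders):
```python
from clearml import Task

previous_task = Task.get_task(task_id="<previous-task-id>")
obj = previous_task.artifacts["my_artifact"].get()  # .get_local_copy() would download the file instead
```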
Yes, I stayed with an older version for a compatibility reason I cannot remember now 🙂 - just tested with 1.1.2 and it's the same
I tried specifying the bucket directly in my clearml.conf, same problem. I guess clearml just reads from the env vars first
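To be explicit about what I tried: the bucket entry was in the standard sdk.aws.s3 section of clearml.conf, something like this (values are placeholders):
```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    bucket: "my_bucket"
                    key: "<access-key>"
                    secret: "<secret-key>"
                }
            ]
        }
    }
}
```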
Yes, thanks! In my case, I was actually using TrainsSaver from pytorch-ignite with a local path. Looking at the code, I then understood that under the hood it actually changes the output_uri of the current task; that's why my previous_task.output_uri = "s3://my_bucket" had no effect (it was placed BEFORE the training)
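For context, a minimal sketch of how the saver takes the bucket, assuming the older ignite trains_logger API (project, task and bucket names are placeholders):
```python
from trains import Task
from ignite.contrib.handlers.trains_logger import TrainsSaver

task = Task.init(project_name="my_project", task_name="training")
saver = TrainsSaver(output_uri="s3://my_bucket")  # this is what overrides the current task's output_uri under the hood
```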
I finally found a workaround using cache, will detail the solution in the issue 🙂
Would adding an ILM (index lifecycle management) policy be an appropriate solution?
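Something like this is what I had in mind - a minimal sketch of creating an ILM policy on the ClearML Elasticsearch, assuming it is reachable on localhost:9200 without auth, and with arbitrary retention values and policy name:
```python
import requests

policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_age": "30d", "max_size": "50gb"}}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}
requests.put("http://localhost:9200/_ilm/policy/clearml-events-retention", json=policy)
```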
Oops, I spoke too fast, the JSON is actually not saved in S3
If I want to resume a training on multiple GPUs, I will need to call this function in each process to send the weights to each GPU
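i.e. something along these lines, assuming one process per GPU and a plain state-dict checkpoint (the path is a placeholder):
```python
import torch

def load_on_rank(model: torch.nn.Module, ckpt_path: str, local_rank: int) -> torch.nn.Module:
    # each process maps the checkpoint onto its own GPU before (re)wrapping with DDP
    state = torch.load(ckpt_path, map_location=f"cuda:{local_rank}")
    model.load_state_dict(state)
    return model
```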
AgitatedDove14 Should I create an issue for this to keep track of it?
Oh, actually this was already raised here
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some info is logged (in the __call__ function), and I would like to have this info logged by clearml
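A minimal sketch of what I mean - forwarding the handler's python logging output to the ClearML console (the logger name is my assumption based on how ignite names it):
```python
import logging
from clearml import Logger

class ClearMLForwardHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        # push the formatted record into the ClearML console log
        Logger.current_logger().report_text(self.format(record))

logging.getLogger("ignite.handlers.early_stopping.EarlyStopping").addHandler(ClearMLForwardHandler())
```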
So last version of the agent working for me is 1.9.3
Try to spin up the instance of that type manually in that region to see if it is available
So most likely trains was masking the original error; it might be worth investigating to help other users in the future
Thanks for the explanations,
Yes, that was the case. This is also what I would think, although I double-checked yesterday:
- I create a task on my local machine with trains 0.16.2rc0
- This task calls task.execute_remotely()
- The task is sent to an agent running with 0.16
- The agent installs trains 0.16.2rc0
- The agent runs the task, clones it and enqueues the cloned task
- The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
- When I clone the task manually usin...
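For context, the local side of that flow is essentially this (project/task/queue names are placeholders):
```python
from trains import Task

task = Task.init(project_name="my_project", task_name="my_task")
task.execute_remotely(queue_name="default")  # stops the local run and enqueues the task for the agent
```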
Ah sorry, it's actually the number of shards that increased
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. It is still relevant, and I could collect the following on an EC2 instance where a clearml-agent is running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
Should I open an issue in the GitHub clearml-agent repo?
I hit enter too fast ^^
Installing them globally via $ pip install numpy opencv torch will install them locally with the warning: Defaulting to user installation because normal site-packages is not writeable, therefore the installation will take place in ~/.local/lib/python3.6/site-packages instead of the default location. Will this still be considered as global site-packages and still be included in the experiments' envs? From what I tested, it does.
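A quick way to check where they actually landed, for anyone wondering (pure stdlib):
```python
import site
import numpy

print(numpy.__file__)              # where the package was actually installed
print(site.getusersitepackages())  # e.g. ~/.local/lib/python3.6/site-packages
print(site.getsitepackages())      # the system-wide site-packages directories
```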
You already fixed the problem with pyjwt in the newest version of clearml/clearml-agent, so all good 🙂
ok, so even if that guy is attached, it doesn't report the scalars
SmugDolphin23 Actually adding agent.python_binary didn't work, it was not read by the clearml agent (in the logs dumped by the agent, agent.python_binary = (no value))
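For reference, this is the setting I mean, in the agent section of clearml.conf (the interpreter path is just an example):
```
agent {
    python_binary: "/usr/bin/python3.10"
}
```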
Nevermind, I just saw report_matplotlib_figure 🙂
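For anyone searching later, a minimal usage sketch (title/series/iteration values are arbitrary):
```python
import matplotlib.pyplot as plt
from clearml import Logger

fig = plt.figure()
plt.plot([1, 2, 3], [4, 5, 6])
Logger.current_logger().report_matplotlib_figure(
    title="my plot", series="series A", figure=fig, iteration=0
)
```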
