I will try to isolate the bug; if I can, I will open an issue in trains-agent 🙂
I actually need to be able to overwrite files, so in my case it makes sense to grant the DeleteObject permission in S3. But for other cases, why not simply catch this error, display a warning to the user, and store internally that delete is not possible?
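A minimal sketch of that "catch, warn, remember" idea. The names DeleteAwareStorage and ReadOnlyBackend are hypothetical, and PermissionError stands in for whatever access-denied error the real storage client raises. This is not how ClearML implements it:

```python
import warnings


class DeleteAwareStorage:
    """Hypothetical wrapper that remembers when the backend refuses deletes."""

    def __init__(self, backend):
        self._backend = backend
        self.can_delete = True  # assume deletes work until proven otherwise

    def delete(self, key):
        """Try to delete `key`; return True on success, False if not permitted."""
        if not self.can_delete:
            # A previous attempt was denied, so skip the call entirely.
            return False
        try:
            self._backend.delete(key)
            return True
        except PermissionError as exc:
            # Warn once, then store internally that delete is not possible.
            warnings.warn(f"Delete not permitted, skipping further deletes: {exc}")
            self.can_delete = False
            return False


class ReadOnlyBackend:
    """Hypothetical backend that rejects every delete, like a bucket without DeleteObject."""

    def __init__(self):
        self.calls = 0

    def delete(self, key):
        self.calls += 1
        raise PermissionError(f"AccessDenied for {key}")
```

With this, the first denied delete produces one warning and later calls return False without hitting the backend again.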
it would be nice if Task.connect_configuration could support custom yaml file readers for me
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
Are you planning to add a server-backup service task in the near future?
In my CI tests, I want to reproduce a run in an agent, because the environment differs and some things break in agents but not locally
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
AgitatedDove14 This looks awesome! Unfortunately this would require a lot of changes in my current code, for that project I found a workaround 🙂 But I will surely use it for the next pipelines I will build!
I am using clearml_agent v1.0.0 and clearml 0.17.5 btw
AgitatedDove14 I see https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21= that a key pair is hardcoded in the repo. Is it being used to ssh to the instance?
with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader)
AgitatedDove14 I made some progress:
In clearml.conf of the agent, I set sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn’t taken into account). I’ve also set task.set_initial_iteration(0). Now I was able to get the following graph after resuming:
This https://discuss.elastic.co/t/index-size-explodes-after-split/150692 seems to say that with the _split API such a situation happens and resolves itself after a couple of days. Maybe it is the same case for me?
SuccessfulKoala55 , This is not the exact corresponding request (I refreshed the tab since then), but the request is an events.get_task_logs , with the following content:
I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
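One way to check that suspicion is a weak reference combined with gc.get_referrers. A minimal sketch, using a hypothetical Figure stand-in class and a hypothetical cache list instead of real matplotlib objects, so it only illustrates the technique and not where the actual leak is:

```python
import gc
import weakref


class Figure:
    """Hypothetical stand-in for a matplotlib figure."""


def is_alive_after_gc(ref):
    """Return True if the object behind the weak reference survives a full collection."""
    gc.collect()
    return ref() is not None


cache = []                # hypothetical container that keeps figures alive by accident
fig = Figure()
ref = weakref.ref(fig)    # a weak reference does not keep the figure alive
cache.append(fig)         # the stray strong reference
del fig

leaked = is_alive_after_gc(ref)

# Ask the garbage collector which tracked objects still point at the figure.
holders = [r for r in gc.get_referrers(ref()) if r is cache]

cache.clear()             # drop the stray reference
freed = not is_alive_after_gc(ref)
```

In a real run, printing the types of everything returned by gc.get_referrers for the figure would be the starting point for finding the holder.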
I also don't understand what you mean by unless the domain is different... The same way ssh keys are global, I would have expected the git creds to be used for any git operation
Ha I just saw in the logs:
WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars using multi processes, because if my ignite callback is the following:
` def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
AgitatedDove14 ok, but this happens in my local machine, not in the agent
So previous_task actually ignored the output_uri
The cleanup service is awesome, but it would require another agent running in services mode on the same machine, which I would rather avoid
Note: Could be related to https://github.com/allegroai/clearml/issues/790 , not sure
CostlyOstrich36 I don’t see such a number; could you please share a screenshot of where to look?
Thanks, the message is not logged in GCloud instances logs when using startup scripts, this is why I did not see it. 👍
CostlyOstrich36 , actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other