I think that somewhere a reference to the figure is still alive, so plt.close("all") and gc cannot free it and the figures end up accumulating. I don't know where yet
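To illustrate what I mean, here is a minimal standalone sketch (not my actual training code) of how a lingering reference defeats plt.close("all") and gc.collect():
```
import gc
import matplotlib
matplotlib.use("Agg")  # headless backend, just for the sketch
import matplotlib.pyplot as plt

_kept = []  # hypothetical lingering reference (a cache, a callback, a logger...)

def plot_step():
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    _kept.append(fig)  # this strong reference keeps the figure alive

for _ in range(5):
    plot_step()
    plt.close("all")  # closes the pyplot figure managers...
    gc.collect()      # ...but cannot free figures that are still referenced

print(len(_kept), "figure objects still alive")  # -> 5, so memory keeps accumulating
```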
I also don't understand what you mean by "unless the domain is different"... The same way SSH keys are global, I would have expected the git credentials to be used for any git operation
Ha I just saw in the logs:
WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at
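For reference, a quick standalone check (a sketch, not part of the original logs) of what the installed wheel was compiled for versus what the GPU reports:
```
import torch

# architectures the installed PyTorch wheel was compiled for,
# e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70'] in the warning above
print("compiled for:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"device capability: sm_{major}{minor}")  # sm_86 for an A10G
```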
Hi AgitatedDove14, I investigated further and got rid of a separate bug. I was able to get ignite's events fired, but still no scalars logged
There is definitely something wrong going on with the reporting of scalars using multi processes, because if my ignite callback is the following:
` def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
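For what it's worth, a common workaround sketch (my assumption, not a confirmed fix) is to report only from rank 0 so that a single process talks to the ClearML logger:
```
import ignite.distributed as idist
from clearml import Task

def log_loss_rank0(engine):
    idist.barrier()               # make sure every process reached this point
    if idist.get_rank() != 0:
        return                    # only rank 0 reports to ClearML
    logger = Task.current_task().get_logger()
    logger.report_scalar(
        "train", "loss",
        value=engine.state.metrics["loss"],
        iteration=engine.state.iteration,
    )
```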
AgitatedDove14 ok, but this happens in my local machine, not in the agent
So previous_task actually ignored the output_uri
Note: could be related to https://github.com/allegroai/clearml/issues/790, not sure
CostlyOstrich36 I don't see such a number, can you please share a screenshot of where to look?
Thanks, the message is not logged in the GCloud instance logs when using startup scripts, which is why I did not see it.
CostlyOstrich36, actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
but according to the disk graphs, the OS disk is being used, but not the data disk
I think we should switch back and have a configuration to control which mechanism the agent uses, wdyt?
That sounds great!
I get the following error:
AnxiousSeal95 The main reason for me not to use clearml-serving triton is the lack of documentation, tbh. I am not sure how to make my PyTorch model run there
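For context, Triton's PyTorch backend serves TorchScript files, so the export step would look roughly like this (a sketch with a placeholder model, not my actual network):
```
import torch

class MyModel(torch.nn.Module):      # placeholder for the real network
    def forward(self, x):
        return x * 2

model = MyModel().eval()
example = torch.randn(1, 3)          # example input with the expected shape
traced = torch.jit.trace(model, example)   # or torch.jit.script(model)
traced.save("model.pt")              # the file the Triton PyTorch backend loads
```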
I am not sure I can do both operations at the same time (migration + splitting). Do you think it's better to do the splitting first or the migration first?
So that I don't lose what I worked on when stopping the session, and if I need to, I can SSH to the machine and directly access the content inside the user folder
automatically promote models to be served from within clearml
Yes!
AgitatedDove14 That's a good point: the experiment failing with this error does show the correct AWS key: ... sdk.aws.s3.key = ***** sdk.aws.s3.region = ...
even if I explicitly use previous_task.output_uri = "s3://my_bucket", it is ignored and the json file is still saved locally
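For reference, what I would have expected to work (a sketch assuming the standard Task.init API, with placeholder names):
```
from clearml import Task

task = Task.init(
    project_name="my_project",     # placeholder names
    task_name="my_task",
    output_uri="s3://my_bucket",   # default upload destination for models/artifacts
)
task.upload_artifact(name="results", artifact_object={"foo": 1})
```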
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
Ok, so after updating to trains==0.16.2rc0, my problem is different: when I clone a task, update its script and enqueue it, it does not have any Hyper-parameters/argv section in the UI
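For context, the clone/enqueue flow I am describing looks roughly like this (a sketch with placeholder IDs and queue names):
```
from clearml import Task

original = Task.get_task(task_id="<original-task-id>")         # placeholder id
cloned = Task.clone(source_task=original, name="cloned copy")
# ... the script is edited here (in the UI in my case) ...
Task.enqueue(cloned, queue_name="default")                     # placeholder queue
```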
` ssh my-instance
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:O2++ST5lAGVoredT1hqlAyTowgNwlnNRJrwE8cbM...
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
Also, this is maybe a separate issue but could be linked: if I add Task.current_task().get_logger().flush(wait=True) like this:
` def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.itera...