Note: Could be related to https://github.com/allegroai/clearml/issues/790, not sure
CostlyOstrich36 I don't see such a number, can you please share a screenshot of where to look?
Thanks, the message is not logged in the GCloud instance logs when using startup scripts, which is why I did not see it. 👍
CostlyOstrich36, actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
but according to the disk graphs, the OS disk is being used, but not the data disk
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt?
That sounds great!
I get the following error:
AnxiousSeal95 The main reason for me not to use clearml-serving triton is the lack of documentation tbh 😄 I am not sure how to make my PyTorch model run there
I am not sure I can do both operations at the same time (migration + splitting), do you think it’s better to do splitting first or migration first?
So that I don't lose what I worked on when stopping the session, and if I need to, I can ssh to the machine and directly access the content inside the user folder
automatically promote models to be served from within clearml
Yes!
AgitatedDove14 That's a good point: The experiment failing with this error does show the correct aws key:
` ...
sdk.aws.s3.key = *****
sdk.aws.s3.region = ... `
even if I explicitly use previous_task.output_uri = "s3://my_bucket" , it is ignored and still saves the json file locally
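For context, a minimal sketch of the pattern being described, assuming the task object comes from Task.get_task (the project/task names and the artifact are placeholders, not from the original message):
` from clearml import Task

# Placeholder names; this only illustrates the reported behaviour.
previous_task = Task.get_task(project_name="my-project", task_name="my-task")
previous_task.output_uri = "s3://my_bucket"  # reportedly ignored at this point
# The JSON artifact is still written to the default (local) output location.
previous_task.upload_artifact("results", artifact_object={"metric": 0.9}) `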
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
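A hedged sketch of how that setting might look in clearml.conf (the 10-second value is illustrative, not taken from the release notes):
` sdk {
    development {
        worker {
            # flush console lines that end with a carriage return every N seconds
            console_cr_flush_period: 10
        }
    }
} `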
Ok, so after updating to trains==0.16.2rc0, my problem is different: when I clone a task, update its script and enqueue it, it does not have any Hyper-parameters/argv section in the UI
` ssh my-instance
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:O2++ST5lAGVoredT1hqlAyTowgNwlnNRJrwE8cbM...
in my clearml.conf, I only have:
` sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3" `
Also, this is maybe a separate issue but could be linked, if I add Task.current_task().get_logger().flush(wait=True) like this:
` def log_loss(engine):
    idist.barrier()
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().flush(wait=True) `
The task requires this service, so the task starts it on the machine. Then I want to make sure the service is closed by the task upon completion/failure/abort
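A minimal sketch of that pattern, assuming the service can be launched as a subprocess from within the task (the command and task logic are placeholders):
` import subprocess

def run_task_logic():
    # placeholder for the actual work the task performs
    pass

# Start the service the task depends on (placeholder command).
service = subprocess.Popen(["my-service", "--serve"])
try:
    run_task_logic()
finally:
    # Runs on normal completion and on raised exceptions; the service is stopped either way.
    service.terminate()
    service.wait(timeout=30) `
If the agent kills the process outright on abort, the finally block may not run, so abort handling might additionally need a signal handler or cleanup hooked elsewhere.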
I found it, the filter actually has to be an iterable: Task.get_tasks(project_name="my-project", task_name="my-task", task_filter=dict(type=["training"]))
The task with id a445e40b53c5417da1a6489aad616fee is not aborted and is still running
no it doesn't! 3. They select any point that is an improvement over time
no, at least not in clearml-server version 1.1.1-135 • 1.1.1 • 2.14