Ah apologies for getting the wrong end of the stick a bit!
Not sure if it helps you or not, but when the link to an artifact didn't work for me it was because the URL being used was internal to the server (I had an agent that had access to internal endpoints). In my case setting the agent fileserver url to the public domain solved my issue.
Will do! What's the process for adding task.reset to the public API, just adding it to the docs?
From my limited understanding, I think it's the client that does the saving and communicating to the fileserver, not the server, whereas deletion is done by the GUI/server, which I guess could have different permissions somehow?
It seems to be an issue that a few people are having problems with: https://github.com/allegroai/clearml-server/issues/112
Tasks are running locally and recording to our self deployed server, no output in my task log that indicates an issue. This is all of the console output:
2023-01-09 12:53:22 ClearML Task: created new task id=7f94e231d8a04a8c9592026dea89463a ClearML results page:
2023-01-09 12:53:24 ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Are there any logs in the server I can check? The server is running v1.3.1 and the issue I'm seeing is with version 1....
Updating the server has solved the issue!
I think a note about the fileserver should be added to the https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_security page!
Is that deletion when deleting a task in the GUI?
That said, maybe connect is not the best solution for a thousand-key dictionary
Seems like it isn't haha!
What is the difference with connect_configuration? The nice thing about it not being an artifact is that we can use the GUI to see which hashes have changed (which, admittedly, is tricky anyway when there are a few thousand)
I realise I made a mistake and hadn't actually used connect_configuration!
I think the issue is the bandwidth yeah, for example when I doubled the number of CPUs (which doubles the allowed egress) the time taken to upload halved. It is puzzling because as you say it's not that much to upload.
For now I've whittled down the number of entries to a more select but useful few and that has solved the issue. If it crops up again I will try connect_configuration properly.
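For anyone reading later, a minimal sketch of the difference as I understand it (assuming a configured ClearML server; the project/task/configuration names are illustrative):

```python
def log_hashes(hashes: dict) -> dict:
    """Hedged sketch: store a large dict as one configuration object."""
    from clearml import Task  # requires clearml installed and a server configured

    task = Task.init(project_name="examples", task_name="config-demo")
    # task.connect(hashes) would log every key as an individual hyperparameter,
    # which lets the GUI diff single entries but scales poorly to thousands of keys.
    # connect_configuration() stores the whole dict as a single configuration object:
    return task.connect_configuration(hashes, name="file-hashes")
```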
Thanks for ...
And regarding the first question - edit your ~/clearml.conf
That would change what file server is used by me locally or an agent yes, but I want to change what is shown by the GUI so that would need to be a setting on the server itself?
When you generate new credentials in the GUI, it comes up with a section to copy and paste into either clearml-init or ~/clearml.conf. I want the files server displayed here to be a GCP address
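For reference, this is roughly the client-side setting involved (values are illustrative; changing this only affects the local machine, not the snippet the server's GUI generates):

```
# ~/clearml.conf (client side)
api {
    # endpoints this client/agent talks to; changing files_server here only
    # affects this machine, not what the server GUI displays
    files_server: "https://files.example.com"
}
sdk {
    development {
        # optionally upload artifacts/models straight to GCS instead
        default_output_uri: "gs://my-bucket/clearml"
    }
}
```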
Could well be the same as https://github.com/allegroai/clearml-server/issues/112 which is also discussed at https://clearml.slack.com/archives/CTK20V944/p1648547056095859
Yes please, that would be great!
I ran into something similar; for me I'd actually cloned the repository using the address without the git@ prefix (somehow it still worked). ClearML read it from the remote repository URL and used it. When I updated the URL of the remote repository in my git client, it then worked.
Maybe it was the load on the server? Meaning that dealing with multiple requests at the same time delayed them?
Possibly, but I think the server was fine as I could run the same task locally and it took a few seconds (rather than 75) to upload. The egress limit on the agent was 32 Gbps, which seems much larger than what I thought I was sending, but I don't have a good idea of what that limit actually means in practice!
And what is the difference in behaviour between Task.init(..., output_uri=True) and Task.init(..., output_uri=None)?
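As I understand it (a sketch only, with illustrative names): output_uri=True uploads model snapshots to the default files server, while output_uri=None, the default, just registers the local path without uploading.

```python
def init_task(upload_models: bool):
    """Hedged sketch of the output_uri difference; requires a ClearML setup."""
    from clearml import Task  # requires clearml installed and a server configured

    return Task.init(
        project_name="examples",      # illustrative names
        task_name="output-uri-demo",
        # True: model checkpoints are uploaded to the default files server.
        # None (the default): models stay wherever the framework saved them,
        # and only the local path is registered with the server.
        output_uri=True if upload_models else None,
    )
```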
CostlyOstrich36 thanks for getting back to me!
yes!
That's great! Please can you let me know how to do it/how to set the default files server?
However, it would be advisable to also add the following argument to your code:
That's useful thanks, I didn't know about this kwarg
I've tracked down our messages when this occurred and I think we had a different error to you, sorry.
In case it helps, our problem was that running the below command in the repository:
$ git remote -v
returned the https address rather than the ssh address. ClearML then tried to convert this to the ssh address, which looked like <org>/<repo>/ rather than <org>/<repo>.git (which is possibly a separate bug?)
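For illustration, a correct https-to-ssh conversion would look something like this (a hypothetical helper, not ClearML's actual code) - the bug described above would instead produce a path ending in "/" without the ".git" suffix:

```python
from urllib.parse import urlparse


def https_to_ssh(url: str) -> str:
    # Hypothetical helper, not ClearML's implementation: convert
    # "https://github.com/<org>/<repo>[.git]" to "git@github.com:<org>/<repo>.git"
    parts = urlparse(url)
    path = parts.path.strip("/")
    if not path.endswith(".git"):
        path += ".git"
    return f"git@{parts.netloc}:{path}"
```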
Another option would be to do task.close() followed by task.reset(), and then execute an agent to pick up that task, but I don't think reset is part of the public API. Is this risky?
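The pattern I mean, as a sketch (treating Task.reset() cautiously, since it isn't prominently documented):

```python
def requeue(task) -> str:
    """Hedged sketch: close a task locally and reset it for re-execution."""
    task.close()   # flush outputs and detach the task from this process
    task.reset()   # clear the previous run's state so an agent can pick it up again
    # the id can then be handed to an agent, e.g.: clearml-agent execute --id <id>
    return task.id
```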
Hi CostlyOstrich36 , thanks for getting back to me!
I want to launch multiple tasks from one python process to be run by multiple agents simultaneously.
My current process for launching one task remotely is to use task.execute_remotely, and then I separately spin up a VM and execute a ClearML agent on that VM with the task ID.
Ideally, I would like to create multiple tasks in this way - so do Task.init(...), set up some configuration, and then task.execute_remotely in a l...
I guess two more straightforward questions:
Could it be made possible for task.execute_remotely(clone=False, exit_process=False) to not raise an exception? I'm happy to work on a PR if this would be possible. Is there any issue with having task.reset() in the public API / are there any potential issues with using it?
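For context, the kind of loop I'm after, sketched with the clone=True workaround (assuming a configured ClearML server and an agent queue; project/queue names are illustrative):

```python
def launch_tasks(n: int, queue: str = "default") -> list:
    """Hedged sketch: launch several tasks for remote execution from one process."""
    from clearml import Task  # requires clearml installed and a server configured

    ids = []
    for i in range(n):
        task = Task.init(project_name="examples", task_name=f"run-{i}",
                         reuse_last_task_id=False)
        task.set_parameter("run/index", i)
        # clone=True sidesteps the exception raised by clone=False with
        # exit_process=False: the current task is cloned, the clone is enqueued,
        # and this process keeps running so the loop can continue.
        task.execute_remotely(queue_name=queue, clone=True, exit_process=False)
        ids.append(task.id)
        task.close()  # close the local task before the next Task.init()
    return ids
```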