Updating the server has solved the issue 👍
Thanks @SuccessfulKoala55 , I’ve taken a look — is this the force merging you’re referring to? Do you know how often ES is configured to merge in the ClearML server?
I guess two more straightforward questions:
Could it be made possible for task.execute_remotely(clone=False, exit_process=False) to not raise an exception? I’m happy to work on a PR if this would be possible.
Is there any issue with adding task.reset() to the public API, or any potential issues with using it?
And here is a PR for the other part.
Hi CostlyOstrich36 , thanks for getting back to me!
I want to launch multiple tasks from one python process to be run by multiple agents simultaneously.
My current process for launching one task remotely is to use task.execute_remotely , and then I separately spin up a VM and execute a ClearML agent on that VM with the task ID.
Ideally, I would like to create multiple tasks in this way - so do Task.init(…) , set up some configuration, and then task.execute_remotely in a l...
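To show the launch-loop pattern being described, here is a minimal sketch. Since the real Task.init / execute_remotely calls need a running ClearML server, a StubTask class stands in for clearml.Task; the project/task names, the stub, and launch_many are all illustrative, not ClearML code.

```python
# Sketch of launching several tasks from one Python process, using a stub
# in place of clearml.Task (the real calls need a running server).
class StubTask:
    _counter = 0

    def __init__(self, project_name, task_name):
        # real code would be: task = Task.init(project_name=..., task_name=...)
        StubTask._counter += 1
        self.id = f"task-{StubTask._counter}"
        self.project_name = project_name
        self.task_name = task_name

    def execute_remotely(self, clone=False, exit_process=False):
        # real code would be: task.execute_remotely(clone=False, exit_process=False)
        # exit_process=False is what would keep this loop alive between tasks
        return self.id


def launch_many(n):
    """Create n tasks and mark each for remote execution by an agent."""
    ids = []
    for i in range(n):
        task = StubTask(project_name="demo", task_name=f"run-{i}")
        ids.append(task.execute_remotely(clone=False, exit_process=False))
    return ids


print(launch_many(3))  # three distinct task IDs
```

Each iteration would hand its task ID to a separate agent, so the tasks run simultaneously while the launching process keeps going.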
Will do! What’s the process for adding task.reset to the public API, just adding it to the docs?
It might be an issue in the UI due to this unconventional address or network settings
I think this is related to https://github.com/allegroai/clearml-server/issues/112#issue-1149080358 , which seems to be a recurring issue across many different setups
Another option would be to do task.close() then task.reset() , and then execute an agent to pick up that task, but I don’t think reset is part of the public API. Is this risky?
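As a sketch of the close-then-reset flow being proposed: a StubTask stands in for clearml.Task here, since task.reset() needs a live server and isn't in the documented public API; the state names below are illustrative, not ClearML's actual status values.

```python
# Stub illustrating the proposed close() -> reset() -> agent-pickup flow.
class StubTask:
    def __init__(self):
        self.status = "running"

    def close(self):
        # real code would be: task.close()
        self.status = "closed"

    def reset(self):
        # real code would be: task.reset() - clears previous outputs and
        # returns the task to a draft state an agent can then pick up by ID
        if self.status != "closed":
            raise RuntimeError("close the task before resetting it")
        self.status = "draft"


task = StubTask()
task.close()
task.reset()
print(task.status)  # draft
```

The guard in reset() reflects the ordering in the message above (close first, then reset); whether the real API enforces this is part of what the question is asking.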
OK that's great, thanks for the info SuccessfulKoala55 👍
Awesome, thanks SuccessfulKoala55 .
Hi CostlyOstrich36 thanks for the response and makes sense.
What sort of problems could happen - would it just be corruption of the data being written, or could it break more than that?
For context, I’m currently backing up the server (spinning it down) every night, but now need to run tasks overnight and don’t want to have any missed logs/artifacts while the server is shut down.
From my limited understanding of it, it’s the client that does the saving and communicating with the fileserver, not the server, whereas deletion is done by the GUI/server - which I guess could have different permissions somehow?
It seems to be an issue that a few people are having problems with: https://github.com/allegroai/clearml-server/issues/112
Shards that I can see are using a lot of disk space are - And then various
Ah right, nice! I didn’t think it was, as I couldn’t see it in the Task reference - should it be there too?
I realise I made a mistake and hadn't actually used
I think the issue is the bandwidth, yeah - for example, when I doubled the number of CPUs (which doubles the allowed egress), the time taken to upload halved. It is puzzling because, as you say, it’s not that much to upload.
For now I've whittled down the number of entries to a more select but useful few and that has solved the issue. If it crops up again I will try
Thanks for ...
I think a note about the fileserver should be added to the https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_security page!
Yep can do 👍
CostlyOstrich36 thanks for getting back to me!
That's great! Please can you let me know how to do it/how to set the default files server?
However, it would be advisable to also add the following argument to your code:
That's useful thanks, I didn't know about this kwarg
And regarding the first question - Edit your
That would change what file server is used by me locally or by an agent, yes, but I want to change what is shown by the GUI, so that would need to be a setting on the server itself?
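For reference, the client-side setting being discussed presumably lives in the ClearML configuration file (assuming clearml.conf is the file meant by "Edit your ..." above); a hedged sketch of the relevant fragment, with a placeholder address:

```
# Hypothetical clearml.conf fragment - points the SDK (and agents using
# this config) at a specific files server; the URL is a placeholder.
api {
    files_server: "http://my-clearml-server:8081"
}
```

As noted in the message above, this only affects clients reading that config file, not what the server's web UI displays.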
connect_configuration seems to take about the same amount of time unfortunately!
Maybe it was the load on the server? Meaning that dealing with multiple requests at the same time delayed the requests?!
Possibly, but I think the server was fine, as I could run the same task locally and it took a few seconds (rather than 75) to upload. The egress limit on the agent was 32 Gbps, which seems much larger than what I thought I was sending, but I don't have a good idea of what that limit actually means in practice!
Yeah it's strange isn't it!
CumbersomeCormorant74 just to confirm, in my case the files aren't actually deleted - I have to manually delete them from the fileserver via a terminal
CostlyOstrich36 I use the GCP disk image to launch a Compute Engine instance which sits behind an HTTP load balancer
Is the GCP disk image released for it? I get access denied with this link: https://storage.googleapis.com/allegro-files/clearml-server/clearml-server-1-3-0.tar.gz
No worries, thanks for sorting it! 🙂
That's great, thank you very much. Will give it a whirl today or tomorrow