SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
Thanks for the explanations!
Yes, that was the case. This is also what I would think, although I double-checked yesterday:
- I create a task on my local machine with trains 0.16.2rc0
- This task calls task.execute_remotely()
- The task is sent to an agent running with 0.16
- The agent installs trains 0.16.2rc0
- The agent runs the task, clones it and enqueues the cloned task
- The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
- When I clone the task manually usin...
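For reference, roughly the code path on my side (a minimal sketch; the project, task and queue names here are placeholders, not the real ones):

```python
# Minimal sketch of the flow described above; project, task and queue names
# are placeholders, not the actual values I use.
from clearml import Task

task = Task.init(project_name="debug", task_name="remote-clone-test")
params = {"lr": 0.01, "batch_size": 32}
task.connect(params)  # this is what should populate the hyper-parameters/args section

# stop local execution and send the task to the agent's queue
task.execute_remotely(queue_name="default", exit_process=True)
```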
My use case is: on a spot instance marked by AWS for termination in 2 minutes, I want to close the task and prevent the clearml-agent from picking up a new task afterwards.
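Roughly what I have in mind for that (a sketch only; the spot metadata endpoint is standard AWS, but stopping the daemon with clearml-agent daemon --stop is an assumption on my side):

```python
# Rough sketch: poll the EC2 spot termination notice, then close the task and
# stop the agent. "clearml-agent daemon --stop" is assumed, not verified.
import subprocess
import time

import requests
from clearml import Task

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def shutdown_on_spot_termination(poll_seconds: int = 5) -> None:
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
        except requests.RequestException:
            resp = None
        if resp is not None and resp.status_code == 200:
            # AWS has issued the ~2 minute termination notice
            task = Task.current_task()
            if task is not None:
                task.mark_stopped()  # mark the task as stopped on the server
                task.close()         # flush loggers and close the task
            # ask the agent daemon to stop so it won't pick up a new task (assumed flag)
            subprocess.run(["clearml-agent", "daemon", "--stop"], check=False)
            return
        time.sleep(poll_seconds)
```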
Any chance this is reproducible?
Unfortunately not at the moment, I couldn't find a reproducible scenario. If I clone a task that was stuck and start it, it might not be stuck
How many processes do you see running (i.e. ps -Af | grep python)?
I will check that when the next one will be blocked 👍
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it a Linux OS? Is it running inside a specific container?
I train with p...
I mean, when the clearml-agent sends data, does it block the training while sending metrics, or is it done in parallel to the main thread?
There is no error on this side. I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won’t start because the userdata script fails.
I get the following error:
For some reason the configuration object gets updated at runtime to:
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
mmmh probably yes, I can’t say for sure (because I don’t remember precisely when I upgraded to 0.17) but it looks like that
This works well when I run the agent in virtualenv mode (removing --docker)
Now it starts, I’ll see if this solves the issue
Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot SSH into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
In all the steps I want to store them as artifacts in S3 because it’s very convenient.
The last step should merge them all, i.e. it needs to know all the artifacts of the previous steps.
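Something like this is what I mean for the merge step (just a sketch; the step task IDs and the artifact name "chunk" are placeholders):

```python
# Sketch of the merge step: pull the artifacts uploaded by the previous steps.
# The task IDs and the artifact name "chunk" are placeholders.
from clearml import Task


def merge_previous_artifacts(step_task_ids, artifact_name="chunk"):
    merged = []
    for task_id in step_task_ids:
        step_task = Task.get_task(task_id=task_id)
        # .get() downloads the artifact (e.g. from S3) and deserializes it
        merged.append(step_task.artifacts[artifact_name].get())
    return merged
```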
DeterminedCrab71 This is the behaviour of holding Shift while selecting in Gmail; if ClearML could reproduce this, that would be perfect!
Is there one?
No, I rather wanted to understand how it worked behind the scene 🙂
The latest RC (0.17.5rc6) moved all logging into a separate subprocess to improve speed with PyTorch DataLoaders
That’s awesome!
I have the same problem, and not only with subprojects: for all the projects I get this blank overview tab, as shown in the screenshot. It only worked for one project, one that I created one or two weeks ago under 0.17.
Ok, I guess I’ll just delete the whole loss series. Thanks!
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
but according to the disk graphs, the OS disk is being used, not the data disk
Hi CostlyOstrich36 , this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
I ended up dropping omegaconf altogether
So two possible cases for trains-agent-1, either:
- It picks a new experiment -> the "workers" tab randomly shows one of the two experiments
- There is no new experiment in the default queue to start -> the tab randomly shows either no experiment or the one that it is running
I also tried task.set_initial_iteration(-task.data.last_iteration), hoping it would counteract the bug, but it didn’t work
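For context, this is the kind of thing I tried (sketch only; the project/task names are placeholders, and resetting the offset to 0 is just another variant I could test):

```python
# Sketch of the workaround attempt; project/task names are placeholders.
from clearml import Task

task = Task.init(project_name="example", task_name="continued-run")

# what I tried: shift the reported iterations back by the last stored iteration
task.set_initial_iteration(-task.data.last_iteration)

# another variant would be forcing the offset back to zero
# task.set_initial_iteration(0)
```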
Actually I think I am approaching the problem from the wrong angle
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats 🚀 🙌 Was this bug fixed with this new version?