Hey @<1523701205467926528:profile|AgitatedDove14> , actually I just realised that I was confused by the fact that when the task is reset, it disappears because of the sorting, making it seem like it was deleted. I think it's a UX issue: when I click on reset,
- The popup shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted
Basically what I did is:
```
if task_name is not None:
    project_name = parent_task.get_project_name()
    # use keyword arguments: the first positional argument of Task.get_task is task_id
    task = Task.get_task(project_name=project_name, task_name=task_name)
    if task is not None:
        return task
```
Otherwise here I create the Task
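For reference, a minimal sketch of how the "otherwise" branch could create the task under the parent's project (the helper name get_or_create_task is hypothetical, and Task.create is just one possible way to do it):
```python
from clearml import Task

def get_or_create_task(parent_task, task_name):
    """Return an existing task with this name in the parent's project, or create one."""
    project_name = parent_task.get_project_name()
    if task_name is not None:
        task = Task.get_task(project_name=project_name, task_name=task_name)
        if task is not None:
            return task
    # otherwise create a fresh (draft) task in the same project
    return Task.create(project_name=project_name, task_name=task_name)
```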
--- /data ----------
   48.4 GiB [##########] /elastic_7
    1.8 GiB [          ] /shared
  879.1 MiB [          ] /fileserver
. 163.5 MiB [          ] /clearml_cache
.  38.6 MiB [          ] /mongo
    8.0 KiB [          ] /redis
I see 3 agents in the "Workers" tab
On clearml or clearml-server?
the deep learning AMI from nvidia (Ubuntu 18.04)
So in my use case each step would create a folder (potentially big) and store it as an artifact. The last step should “merge” all the previous folders. The idea is to split the work among multiple machines (in parallel). I would like to avoid these potentially big folder artifacts also being stored in the pipeline task, because that one will be running on the services queue on the clearml-server instance, which will definitely not have enough space to handle all of them
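A minimal sketch of what one such step could look like, assuming each step is a plain task that uploads its folder as an artifact and that uploads are pointed at S3 via output_uri (the project, bucket, and artifact names are illustrative):
```python
from clearml import Task

def run_step(step_name, folder_path):
    # output_uri sends artifact uploads to S3 instead of the clearml-server file server
    task = Task.init(
        project_name="my_pipeline",
        task_name=step_name,
        output_uri="s3://my-bucket/pipeline-artifacts",
    )
    # ... produce the (potentially big) folder at folder_path ...
    # a folder artifact is zipped and uploaded as a single object
    task.upload_artifact(name="step_output", artifact_object=folder_path)
    return task.id
```
The pipeline task itself then only needs the step task IDs, not the folders themselves.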
yes but they are in plain text and I would like to avoid that
Any chance this is reproducible ?
Unfortunately not at the moment, I couldn't find a reproducible scenario. If I clone a task that was stuck and start it, it might not be stuck
How many processes do you see running (i.e. ps -Af | grep python) ?
I will check that when the next one will be blocked 👍
What is the training framework? is it multiprocess ? how are you launching the process itself? is it Linux OS? is it running inside a specific container ?
I train with p...
I mean when sending data from the clearml-agents, does it block the training while sending metrics or is it done in parallel from the main thread?
there is no error from this side, I think the aws autoscaler just waits for the agent to connect, which will never happen since the agent won’t start because the userdata script fails
I get the following error:
For some reason the configuration object gets updated at runtime to:
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
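For comparison, a rough sketch of what those fields would normally hold before being wiped; all values below are illustrative, and the keys inside each resource entry are an assumption that may differ between aws_autoscaler versions:
```python
# illustrative only; resource entry keys are assumed, not taken from my actual config
configurations = {
    "resource_configurations": {
        "gpu_machine": {                         # arbitrary resource name
            "instance_type": "g4dn.xlarge",
            "is_spot": False,
            "availability_zone": "eu-west-1a",
            "ami_id": "ami-0123456789abcdef0",   # placeholder AMI
        },
    },
    # queue name -> list of (resource name, max number of instances)
    "queues": {"aws_gpu": [("gpu_machine", 2)]},
    "extra_trains_conf": "",
    "extra_vm_bash_script": "",
}
```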
mmmh probably yes, I can’t say for sure (because I don’t remember precisely when I upgraded to 0.17) but it looks like that
Now it starts, I’ll see if this solves the issue
Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
SuccessfulKoala55 For the past 2 hours I have been getting 504 errors and I cannot SSH into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
In all the steps I want to store them as artifacts in S3 because it’s very convenient.
The last step should merge them all, i.e. it needs to know all the artifacts of the previous steps
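A minimal sketch of how the last step could collect the previous steps’ artifacts, assuming it is given their task IDs and that each step uploaded a folder artifact named "step_output" (both assumptions, matching the sketch above):
```python
from clearml import Task

def merge_step(previous_task_ids):
    local_folders = []
    for task_id in previous_task_ids:
        step_task = Task.get_task(task_id=task_id)
        # downloads the artifact from S3 (and extracts it) into the local cache
        local_folders.append(step_task.artifacts["step_output"].get_local_copy())
    # ... merge the contents of local_folders here ...
    return local_folders
```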
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail, if ClearML could reproduce this, that would be perfect!
Is there one?
No, I rather wanted to understand how it worked behind the scene 🙂
The latest RC (0.17.5rc6) moved all logs into separate subprocess to improve speed with pytorch dataloaders
That’s awesome!
I have the same problem, but not only with subprojects: for all the projects I get this blank overview tab, as shown in the screenshot. It only worked for one project, which I created one or two weeks ago under 0.17
Ok, I guess I’ll just delete the whole loss series. Thanks!
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
but according to the disk graphs, the OS disk is being used, but not the data disk
Hi CostlyOstrich36 , this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
- apiserver/bll/event/log_events_iterator.py
- apiserver/bll/event/events_iterator.py
- apiserver/config/default/services/_mongo.conf
- apiserver/database/model/base.py
- apiserver/services/events.py
I suspect that one of these changes might be responsible ...