Hi CostlyOstrich36, I mean insert temporary access keys
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
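For reference, this is roughly what I tried (a minimal sketch; the loop just stands in for my training code that creates figures):
```python
import matplotlib
# Select the non-interactive Agg backend before pyplot is imported anywhere
matplotlib.use("agg")

import matplotlib.pyplot as plt
import numpy as np

for i in range(100):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(100))
    plt.close(fig)  # close explicitly so the figure's memory can be released
```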
This is what I get with mprof on this snippet above (I killed the program after the bar reaches 100%, otherwise it hangs trying to upload all the figures)
as for disk space: I have 21GB available (8GB used), and the /opt/trains/data folder is about 600MB
I think the best-case scenario would be for ClearML to maintain a GitHub Action that spins up a dummy clearml-server, so that anyone could use it as a basis for their tests: you would only have to point the server URL at the local instance started by the action and could then test all your code seamlessly. Wdyt?
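Concretely, a test would then only need to point the SDK at the local server started by the action, something like this (a sketch; the ports are the clearml-server defaults, and the credentials, project and task names are placeholders):
```python
import os

# Point the ClearML SDK at the dummy server started by the GitHub Action
os.environ["CLEARML_API_HOST"] = "http://localhost:8008"
os.environ["CLEARML_WEB_HOST"] = "http://localhost:8080"
os.environ["CLEARML_FILES_HOST"] = "http://localhost:8081"
os.environ["CLEARML_API_ACCESS_KEY"] = "dummy-access-key"
os.environ["CLEARML_API_SECRET_KEY"] = "dummy-secret-key"

from clearml import Task

# The test code itself stays unchanged
task = Task.init(project_name="ci-tests", task_name="smoke-test")
```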
I execute the clearml-agent this way:
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
Hi CumbersomeCormorant74, yes, this is almost the scenario: I have a dozen projects. In one of them, I have ~20 archived experiments in different states (draft, failed, aborted, completed). I went to this archive, selected all of them and deleted them using the bulk delete operation. I got several failed-delete popups, so I tried again with smaller batches (like 5 experiments at a time) to isolate the experiments causing the error. I could delete most of them. At some point, all ...
I also tried setting ebs_device_name = "/dev/sdf" - didn't work
It seems that around here, a Task created with Task.init in the main process of a remotely executed run gets its output_uri parameter ignored
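To illustrate what I mean (a sketch; the project name, task name and bucket URL are placeholders):
```python
from clearml import Task

# When this runs remotely under an agent, the output_uri below seems to be
# ignored and artifacts go to the default files server instead
task = Task.init(
    project_name="my-project",    # placeholder
    task_name="training",         # placeholder
    output_uri="s3://my-bucket/", # placeholder bucket
)
```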
But I see in the agent logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
As a quick fix, can you test with auto refresh (see the top-right button with the pause sign in your video)?
That doesn’t work unfortunately
CostlyOstrich36 good enough, I will fall back to sorting by updated, thanks!
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task
DeterminedCrab71 This is the behaviour when holding shift while selecting in Gmail; if ClearML could reproduce this, that would be perfect!
If I manually call report_matplotlib_figure, yes. If I don't (I just create the figure), there is no mem leak
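For context, the manual call I am talking about looks roughly like this (a sketch of the reporting loop; project and task names are placeholders):
```python
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task

task = Task.init(project_name="debug", task_name="matplotlib-mem-leak")
logger = task.get_logger()

for i in range(100):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(100))
    # Memory only grows when the figure is explicitly reported
    logger.report_matplotlib_figure(title="debug", series="loss", iteration=i, figure=fig)
    plt.close(fig)
```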
that would work for pytorch and clearml yes, but what about my local package?
Something like that?
` curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{
"match": {
"variant": "loss_model"
}
},
{
"match": {
"task": "8f88e4b8cff84f23bde74ed4b7213ec6"
}
}
]
}
},
"aggs": {
"series": {
"terms": { "field": "iter" }
}
}
}...
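Or the same query sent from Python, if that is easier to tweak (a sketch using requests against the same local Elasticsearch; the index name and task id are the ones from the curl above):
```python
import requests

# Same query as the curl above, against the Elasticsearch used by the server
index = "events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b"
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"variant": "loss_model"}},
                {"match": {"task": "8f88e4b8cff84f23bde74ed4b7213ec6"}},
            ]
        }
    },
    "aggs": {"series": {"terms": {"field": "iter"}}},
}

resp = requests.get(f"http://localhost:9200/{index}/_search?pretty", json=query)
print(resp.json())
```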
trains-elastic container fails with the following error:
And so in the UI, in the Workers & Queues tab, I randomly see one of the two experiments for the worker that is running both experiments
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now 😄
How exactly is the clearml-agent killing the task?
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
I think waiting for the apt locks to be released with something like this would work:
startup_bash_script = [
    "#!/bin/bash",
    "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done",
    "sudo apt-get update",
    ...
Weirdly this throws an error in the autoscaler:
` Spinning new instance type=v100_spot
Error: Failed to start new instance, unexpected '{' in field...
Hi AgitatedDove14, initially I was doing this, but then I realised that with the approach you suggest, all the packages of the local environment also end up in the “installed packages”, while in reality I only need the dependencies of the local package. That’s why I use _update_requirements: with this approach, only the required packages will be installed by the agent
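For reference, this is roughly how I use it (a sketch; _update_requirements is a private method, and the requirement strings below are just examples standing in for my local package's dependencies):
```python
from clearml import Task

task = Task.init(project_name="my-project", task_name="training")  # placeholders

# Replace the auto-detected "installed packages" with only what the local
# package really needs (example requirement strings)
task._update_requirements([
    "numpy>=1.19",
    "torch>=1.8",
])
```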