From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch
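(For reference, a quick way to double-check what a standard pip-installed torch wheel ships, as a minimal sanity-check sketch:)
```python
import torch

# The wheel bundles its own CUDA/cuDNN runtime; only the NVIDIA driver is needed on the host.
print(torch.__version__)               # CUDA builds usually carry a "+cuXXX" suffix
print(torch.version.cuda)              # CUDA runtime version bundled with the wheel
print(torch.backends.cudnn.version())  # cuDNN version bundled with the wheel
print(torch.cuda.is_available())       # True if the driver/GPU are visible
```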
Yes 🙂 Thanks!
That's why I suspected trains was installing a different version than the one I expected
I was rather wondering why clearml was taking space while I configured it to use the /data volume. But as you described, AgitatedDove14, it looks like an edge case, so I don't mind 🙂
I asked this question some time ago; I think this is just not implemented, but it shouldn't be difficult to add? I am also interested in such a feature!
Mmmh, unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task
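A rough sketch of what I have in mind (hypothetical usage; the task id is a placeholder):
```python
from clearml import Task

# Today: Task.init creates/reuses a task and makes it the current one.
# task = Task.init(project_name="my_project", task_name="my_experiment")

# What I'd like for local debugging: attach to the already-created task instead,
# so the rest of the code transparently reports to it.
task = Task.get_task(task_id="<existing_task_id>")  # placeholder id
# ...and ideally Task.current_task() would then return this very same task.
```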
I reindexed only the logs to a new index afterwards; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
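For the record, roughly how I ran the reindex (a sketch using the Python Elasticsearch client; the index names are placeholders):
```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Copy the documents into a new index that has the corrected mappings.
es.reindex(
    body={
        "source": {"index": "<source_index>"},
        "dest": {"index": "<dest_index_with_fixed_mappings>"},
    },
    wait_for_completion=False,  # let Elasticsearch run it as a background task
)
```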
Ok, deleting the installed packages list worked for the first task
Hi CostlyOstrich36, I mean insert temporary access keys
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures)
As for disk space: I have 21GB available (8GB used), and the /opt/trains/data folder is about 600MB
I think the best-case scenario would be for ClearML to maintain a GitHub action that sets up a dummy clearml-server, so that anyone could use it as a basis to run their tests: they would just have to change the URL of the server to the local one started by the GitHub action, and they could test all their code seamlessly. Wdyt?
I execute the clearml-agent this way:
```
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
```
Hi CumbersomeCormorant74, yes, this is almost the scenario: I have a dozen projects. In one of them, I have ~20 archived experiments in different states (draft, failed, aborted, completed). I went to this archive, selected all of them and deleted them using the bulk delete operation. I got several failed-delete popups, so I tried again with smaller bulks (like 5 experiments at a time) to isolate the experiments causing the error. I could delete most of them. At some point, all ...
I also tried setting ebs_device_name = "/dev/sdf" - didn't work
It seems that around here, a Task that is created with Task.init remotely, in the main process, gets its output_uri parameter ignored
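To make it concrete, a minimal sketch of the call I mean (project/bucket names are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="my_experiment",
    output_uri="s3://my-bucket/clearml-artifacts",  # placeholder destination
)
# Locally this output_uri is honoured; when the same script is re-launched remotely
# by the agent, the parameter seems to be ignored and the server default is used.
```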
But I see in the agent logs:
```
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
```
As a quick fix, can you test with auto refresh (see the top-right button with the pause sign you have on the video)?
That doesn't work unfortunately
CostlyOstrich36 good enough, I will fall back to sorting by updated, thanks!
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail; if ClearML could reproduce this, that would be perfect!
If I manually call report_matplotlib_figure, yes. If I don't (just create the figure), there is no mem leak
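Roughly the kind of loop I am testing with (a minimal sketch, forcing the agg backend; project/title/series names are placeholders):
```python
import matplotlib
matplotlib.use("agg")  # must run before the first pyplot import
import matplotlib.pyplot as plt
from clearml import Task, Logger

task = Task.init(project_name="debug", task_name="matplotlib-mem-leak")

for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot(range(100))

    # With this call, memory keeps growing; without it (just creating and
    # closing the figure), memory stays flat.
    Logger.current_logger().report_matplotlib_figure(
        title="debug", series="fig", iteration=i, figure=fig
    )

    plt.close(fig)
```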
That would work for pytorch and clearml, yes, but what about my local package?
Something like that?
```
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}'
```
trains-elastic container fails with the following error:
And so in the UI, in the Workers & Queues tab, I randomly see one of the two experiments for the worker that is running both experiments
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now 🙂