I specified torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version: 1.6.0
I also discovered https://h2oai.github.io/wave/ last week; it would be awesome to be able to deploy it in the same manner
Also, from https://lambdalabs.com/blog/install-tensorflow-and-pytorch-on-rtx-30-series/ :
As of 11/6/2020, you can't pip/conda install a TensorFlow or PyTorch version that runs on NVIDIA's RTX 30 series GPUs (Ampere). These GPUs require CUDA 11.1, and the current TensorFlow/PyTorch releases aren't built against CUDA 11.1. Right now, getting these libraries to work with 30XX GPUs requires manual compilation or NVIDIA docker containers.
But what wheel is trains downloading in that case?
Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
Restarting the server (docker-compose down then docker-compose up) solved the problem 😌 All experiments are back
yes but they are in plain text and I would like to avoid that
I checked the server code diffs between 1.1.0 (when it was working) and 1.2.0 (when the bug appeared) and I saw many relevant changes that could have introduced this bug > https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0
Ok, no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
With a large enough number of iterations in the for loop, you should see the memory grow over time
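To make that concrete, here is a minimal sketch of the kind of loop I mean (my own repro attempt, not a verified test case; project/task names, the iteration count and the psutil check are arbitrary choices of mine):
import os
import psutil
import matplotlib
matplotlib.use("Agg")  # headless backend, the agent machine has no display
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task

task = Task.init(project_name="debug", task_name="figure-leak-repro")
logger = task.get_logger()
process = psutil.Process(os.getpid())

for i in range(10000):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(100))
    # report the figure to the server, then close it
    logger.report_matplotlib_figure(title="debug", series="random", iteration=i, figure=fig)
    plt.close(fig)
    if i % 500 == 0:
        # with the leak, RSS keeps climbing even though every figure is closed
        print(f"iter {i}: RSS = {process.memory_info().rss / 1e6:.1f} MB")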
I think my problem is that I am launching an experiment with python3.9 and I expect it to run in the agent with python3.8. The inconsistency is on my side; I should fix it and create the task with python3.8 with:
task.data.script.binary = "python3.8"
task._update_script(convert_task.data.script)
Or use python:3.9 when starting the agent
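For completeness, this is roughly how I would wire the first option in (a sketch under my assumptions: convert_task is the cloned task object, and template_task_id plus the queue name are placeholders):
from clearml import Task

# clone the template task, force the interpreter the agent should use, then enqueue it
convert_task = Task.clone(source_task=template_task_id, name="convert with py3.8")
convert_task.data.script.binary = "python3.8"
convert_task._update_script(convert_task.data.script)  # push the modified script section back to the server
Task.enqueue(convert_task, queue_name="default")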
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
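For reference, this is how I am passing it (just the Task.init call; the project and task names are mine):
from clearml import Task

task = Task.init(
    project_name="debug",
    task_name="run-without-matplotlib-joblib-binding",
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)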
Yes, that was my assumption as well; there could be several causes to be honest, now that I see that matplotlib itself is also leaking 😄
Any chance this is reproducible?
Unfortunately not at the moment, I could not find a reproducible scenario. If I clone a task that was stuck and start it, it might not get stuck
How many processes do you see running (i.e. ps -Af | grep python)?
I will check that when the next one gets blocked 👍
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it a Linux OS? Is it running inside a specific container?
I train with p...
So there will be no concurrent access to cached files in the cache dir?
Hi CostlyOstrich36, there has been no DB migration necessary since 1.6, right?
I am confused now because I see in the master branch that the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
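So if I read it correctly, enabling it should look something like this in clearml.conf (a sketch based on my reading of the sample config; the surrounding section names are my assumption):
sdk {
    aws {
        s3 {
            # let Boto3 resolve credentials itself: env vars, credential file, or IAM role via the metadata service
            use_credentials_chain: true
        }
    }
}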
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd -> wrong numpy version
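In case someone else hits the same trace: the mismatch means the installed numpy is older than the one the extension module was built against, so checking and upgrading numpy in that environment is the usual fix (plain Python, nothing clearml-specific):
import numpy
# the extension was compiled against C API 0xe, while this install only provides 0xd
print(numpy.__version__)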
