Thanks! With this I'll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
Restarting the server ( docker-compose down then docker-compose up ) solved the problem 🙂 All experiments are back
yes but they are in plain text and I would like to avoid that
I checked the server code diffs between 1.1.0 (when it was working) and 1.2.0 (when the bug appeared) and I saw many relevant changes that could have introduced this bug > https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0
Ok no, it only helps as long as I don't log the figures. If I log the figures, I will still run into the same problem
With a large enough number of iterations in the for loop, you should see the memory grow over time
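For reference, here is a minimal sketch of the kind of loop I mean, assuming one matplotlib figure per iteration; the psutil RSS check, the project/task names and the explicit report_matplotlib_figure call are illustrative assumptions, not my actual training code:

import os
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import psutil
from clearml import Task

task = Task.init(project_name="debug", task_name="figure-leak-repro")
logger = task.get_logger()
process = psutil.Process(os.getpid())

for i in range(10000):
    # create and report one figure per iteration
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(100))
    logger.report_matplotlib_figure(title="debug", series="random", figure=fig, iteration=i)
    plt.close(fig)
    if i % 500 == 0:
        # resident memory should keep growing if figures are being retained somewhere
        print("iter %d: RSS = %.1f MB" % (i, process.memory_info().rss / 1e6))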
I think my problem is that I am launching the experiment with python3.9 while expecting it to run in the agent with python3.8. The inconsistency is on my side; I should fix it and create the task with python3.8 with:
task.data.script.binary = "python3.8"
task._update_script(task.data.script)
Or use a python:3.9 image when starting the agent
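Putting it together, this is roughly what I have in mind, assuming a clone-and-enqueue flow; the project/task/queue names are placeholders and _update_script() is the private API mentioned above, so it may change between versions:

from clearml import Task

template = Task.get_task(project_name="my-project", task_name="my-experiment")
cloned = Task.clone(source_task=template, name="my-experiment (py3.8)")

script = cloned.data.script
script.binary = "python3.8"      # interpreter the agent should use when building the venv
cloned._update_script(script)    # push the modified script section back to the server

Task.enqueue(cloned, queue_name="default")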
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
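For completeness, this is how I pass it, with placeholder project/task names:

from clearml import Task

task = Task.init(
    project_name="my-project",
    task_name="training-run",
    # disable automatic binding of the frameworks that seem to be leaking
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)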
Yes, that was my assumption as well. There could be several causes to be honest, now that I see that matplotlib itself is also leaking 🙂
Any chance this is reproducible ?
Unfortunately not at the moment, I could not find a reproducible scenario. If I clone a task that was stuck and start it, it might not be stuck
How many processes do you see running (i.e. ps -Af | grep python) ?
I will check that when the next one gets blocked 🙂
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it Linux OS? Is it running inside a specific container?
I train with p...
So there will be no concurrent cached files access in the cache dir?
Hi CostlyOstrich36 , there was no DB migration necessary since 1.6, right?
I am confused now because I see that in the master branch, the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd -> wrong numpy version
Done! Also I tried to use git cache ( https://git-scm.com/docs/git-credential-cache ) as a workaround (hoping that the first time it clones the experiment repo, it caches the creds for the next times), but I then get a different error: fatal: unable to find a suitable socket path; use --socket
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
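Something like the following is what I had in mind, assuming the installed clearml version supports wait_on_upload and flush(wait_for_uploads=True); the artifact name and file are placeholders:

from clearml import Task

task = Task.current_task()

# block until this artifact is actually stored instead of uploading in the background
task.upload_artifact(name="predictions", artifact_object="predictions.csv", wait_on_upload=True)

# wait for any remaining background uploads before the task completes
task.flush(wait_for_uploads=True)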
nvm, the bug might be on my side. I will open an issue if I find an easily reproducible example
Just tried, still the same issue
This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached
Sure, just sent you a screenshot in PM
