I specified torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version: 1.6.0
I also discovered https://h2oai.github.io/wave/ last week; it would be awesome to be able to deploy it in the same manner
Also, from https://lambdalabs.com/blog/install-tensorflow-and-pytorch-on-rtx-30-series/ :
As of 11/6/2020, you can't pip/conda install a TensorFlow or PyTorch version that runs on NVIDIA's RTX 30 series GPUs (Ampere). These GPUs require CUDA 11.1, and the current TensorFlow/PyTorch releases aren't built against CUDA 11.1. Right now, getting these libraries to work with 30XX GPUs requires manual compilation or NVIDIA docker containers.
But what wheel is trains downloading in that case?
Thanks! With this I’ll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
Restarting the server (docker-compose down then docker-compose up) solved the problem 😌 All experiments are back
yes but they are in plain text and I would like to avoid that
I checked the server code diffs between 1.1.0 (when it was working) and 1.2.0 (when the bug appeared) and I saw many relevant changes that could have introduced this bug > https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0
Ok, no, it only helps as long as I don't log the figures. If I log the figures, I still run into the same problem
With a large enough number of iterations in the for loop, you should see the memory grow over time
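To make that concrete, here is a minimal sketch of the kind of loop I mean (my own repro attempt, not a verified test case; project/task names, the iteration count and the psutil check are arbitrary choices of mine):
import os
import psutil
import matplotlib
matplotlib.use("Agg")  # headless backend, the agent machine has no display
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task

task = Task.init(project_name="debug", task_name="figure-leak-repro")
logger = task.get_logger()
process = psutil.Process(os.getpid())

for i in range(10000):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(100))
    # report the figure to the server, then close it
    logger.report_matplotlib_figure(title="debug", series="random", iteration=i, figure=fig)
    plt.close(fig)
    if i % 500 == 0:
        # with the leak, RSS keeps climbing even though every figure is closed
        print(f"iter {i}: RSS = {process.memory_info().rss / 1e6:.1f} MB")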
I think my problem is that I am launching an experiment with python3.9 and I expect it to run in the agent with python3.8. The inconsistency is on my side; I should fix it and create the task with python3.8 with:
task.data.script.binary = "python3.8"
task._update_script(convert_task.data.script)
Or use python:3.9 when starting the agent
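For completeness, this is roughly how I would wire the first option in (a sketch under my assumptions: convert_task is the cloned task object, and template_task_id plus the queue name are placeholders):
from clearml import Task

# clone the template task, force the interpreter the agent should use, then enqueue it
convert_task = Task.clone(source_task=template_task_id, name="convert with py3.8")
convert_task.data.script.binary = "python3.8"
convert_task._update_script(convert_task.data.script)  # push the modified script section back to the server
Task.enqueue(convert_task, queue_name="default")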
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
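For reference, this is how I am passing it (just the Task.init call; the project and task names are mine):
from clearml import Task

task = Task.init(
    project_name="debug",
    task_name="run-without-matplotlib-joblib-binding",
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)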
Yes, that was my assumption as well; there could be several causes to be honest, now that I see that matplotlib itself is also leaking 😄
Any chance this is reproducible?
Unfortunately not at the moment, I could not find a reproducible scenario. If I clone a task that was stuck and start it, it might not get stuck
How many processes do you see running (i.e. ps -Af | grep python)?
I will check that when the next one gets blocked 👍
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it a Linux OS? Is it running inside a specific container?
I train with p...
So there will be no concurrent access to cached files in the cache dir?
Hi CostlyOstrich36, there has been no DB migration necessary since 1.6, right?
I am confused now because I see in the master branch that the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
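So if I read it correctly, enabling it should look something like this in clearml.conf (a sketch based on my reading of the sample config; the surrounding section names are my assumption):
sdk {
    aws {
        s3 {
            # let Boto3 resolve credentials itself: env vars, credential file, or IAM role via the metadata service
            use_credentials_chain: true
        }
    }
}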
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd -> wrong numpy version
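In case someone else hits the same trace: the mismatch means the installed numpy is older than the one the extension module was built against, so checking and upgrading numpy in that environment is the usual fix (plain Python, nothing clearml-specific):
import numpy
# the extension was compiled against C API 0xe, while this install only provides 0xd
print(numpy.__version__)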
