
I execute the clearml-agent this way:
```
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
```
Sorry both of you, my problem was actually lying somewhere else (both buckets are in the same region) - thanks for your time!
with my hack yes, without, no
The cleanup service is awesome, but it would require having another agent running in services mode on the same machine, which I would rather avoid
Indeed, I actually had the old configuration that was not JSON - I converted it to JSON, now it works 🙂
Not really, because this is difficult to control: I use the AWS autoscaler with an Ubuntu AMI, and when an instance is created, packages are updated and I don't know which Python version I will get. Plus, changing the Python version of the OS is not really recommended
Never mind, the nvidia-smi command fails in that instance; the problem lies somewhere else
AgitatedDove14 If I explicitly call `task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0)`, this logs one value per process as expected, so reporting works
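For context, a minimal sketch of that per-process call, assuming a DDP-style script where `--local_rank` is passed by the launcher (the argument parsing and project/task names below are illustrative, not the original script):
```python
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parse_args = parser.parse_args()

# Get/create the task in every process (in spawned worker processes this
# typically resolves to the main process's task)
task = Task.init(project_name="debug", task_name="per-process reporting")

# One scalar series per process: the series name is the local rank,
# so each process shows up as its own curve under the "test" title
task.get_logger().report_scalar(
    title="test",
    series=str(parse_args.local_rank),
    value=1.0,
    iteration=0,
)
```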
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified `number_of_shards: 4` for the `events-log-d1bd92a3b039400cbafc60a7a5b1e52b` index. I then copied the documents from the old ES using the `_reindex` API. The index is 7.5GB on one shard.
Now I see that this index on the new ES cluster is ~19.4GB 🤔 The index is divided into the 4 shards, but each shard is between 4.7GB and 5GB!
I was expecting to have the same index size as in the previous e...
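For reference, roughly what that recreate-and-reindex sequence looks like against the REST API; the index name and new-cluster host are from this thread, while the old-cluster host and the use of the Python `requests` client are just illustrative:
```python
import requests

NEW_ES = "http://internal-aws-host-name:9200"
OLD_ES = "http://old-es-host:9200"  # placeholder for the old cluster
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Create the destination index with 4 primary shards and no replicas
requests.put(
    f"{NEW_ES}/{INDEX}",
    json={"settings": {"number_of_shards": 4, "number_of_replicas": 0}},
).raise_for_status()

# Copy the documents from the old cluster with the remote reindex API
resp = requests.post(
    f"{NEW_ES}/_reindex?wait_for_completion=false",
    json={
        "source": {"remote": {"host": OLD_ES}, "index": INDEX},
        "dest": {"index": INDEX},
    },
)
print(resp.json())  # returns a task id that can be polled for progress
```
Note that remote reindex also requires the old host to be whitelisted in `reindex.remote.whitelist` on the new cluster.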
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats 🎉 🎉 Was this bug fixed in this new version?
the api-server shows when starting:
```
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic host
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic port 9200
...
clearml-apiserver | [2021-07-13 11:09:38,407] [9] [WARNING] [clearml.initialize] Could not connect to ElasticSearch Service. Retry 1 of 4. Waiting for 30sec
clearml-apiserver | [2021-07-13 11:10:08,414] [9] [WARNING] [clearml.initia...
```
and with this setup I can use the GPU without any problem, meaning that the wheel does contain the CUDA runtime
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
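If the ClearMLLogger in question is pytorch-ignite's, a minimal sketch of the rank-0-only variant could look like this (the trainer engine, project and task names are placeholders, and this is just one possible pattern, not a recommendation from the thread):
```python
import torch.distributed as dist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger, OutputHandler
from ignite.engine import Events

def attach_clearml_logger(trainer):
    # Only the main process (rank 0) creates the ClearML task and attaches
    # the logger; the other ranks skip reporting entirely.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return None

    clearml_logger = ClearMLLogger(project_name="my_project", task_name="ddp_run")
    clearml_logger.attach(
        trainer,
        log_handler=OutputHandler(
            tag="training",
            output_transform=lambda loss: {"loss": loss},
        ),
        event_name=Events.ITERATION_COMPLETED,
    )
    return clearml_logger
```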
The host is accessible, I can ping it and even run `curl "http://internal-aws-host-name:9200/_cat/shards"` and get results from the local machine
The number of documents in the old and the new env is the same though 🤔 I really don't understand where this extra used space comes from
I made sure before deleting the old index that the number of docs matched
I will let the team answer you on that one 🙂
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version is used by the agent to decide which torch wheel to download
Regarding the CUDA check with `nvcc`, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the `nvidia-smi` interface, worth checking though ...
Ok, but when nvcc is not ava...
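For the record, a rough sketch of the two checks being compared here, parsing the CUDA version from `nvidia-smi` output versus from `nvcc --version`; the parsing below is illustrative and not ClearML's actual detection code (note the two can legitimately differ, since nvidia-smi reports the driver's supported CUDA version while nvcc reports the installed toolkit):
```python
import re
import subprocess
from typing import Optional

def cuda_version_from_nvidia_smi() -> Optional[str]:
    # nvidia-smi prints a header line such as "CUDA Version: 11.4"
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"CUDA Version:\s*([\d.]+)", out)
    return match.group(1) if match else None

def cuda_version_from_nvcc() -> Optional[str]:
    # nvcc prints e.g. "Cuda compilation tools, release 11.1, V11.1.105"
    try:
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"release\s*([\d.]+)", out)
    return match.group(1) if match else None

print(cuda_version_from_nvidia_smi(), cuda_version_from_nvcc())
```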
But I can do:
```
$ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.backends.cudnn.version()
8005
```
the first problem I had, which didn't give useful info, was that docker was not installed on the agent machine x)
ha nice, where can I find the mapping template of the original clearml so that I can copy and adapt it?
You mean it will resolve by itself in the following days or should I do something? Or there is nothing to do and it will stay this way?
I did change the replica setting on the same index, yes; I reverted it back from 1 to 0 afterwards
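That toggle is just the standard index-settings call; a sketch, with the host and index name taken from this thread and the Python `requests` client used purely for illustration:
```python
import requests

ES = "http://internal-aws-host-name:9200"
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Temporarily enable one replica per shard ...
requests.put(f"{ES}/{INDEX}/_settings", json={"index": {"number_of_replicas": 1}})

# ... and revert back to zero replicas afterwards
requests.put(f"{ES}/{INDEX}/_settings", json={"index": {"number_of_replicas": 0}})
```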
That said, you might have accessed the artifacts before any of them were registered
I called `task.wait_for_status()` to make sure the task is done
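A minimal sketch of that wait-then-read pattern (the task ID and artifact name are placeholders):
```python
from clearml import Task

task = Task.get_task(task_id="<producer-task-id>")

# Block until the producing task has reached a final state, so that all
# artifacts are registered before we try to read them
task.wait_for_status(status=(Task.TaskStatusEnum.completed,))
task.reload()  # refresh the cached task data

artifact = task.artifacts["my_artifact"].get()
print(artifact)
```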
I also did run `sudo apt install nvidia-cuda-toolkit`