Here is the console with some errors
Yes, I set:
```
auth {
  cookies {
    httponly: true
    secure: true
    domain: ".clearml.xyz.com"
    max_age: 99999999999
  }
}
```
It always worked for me this way
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn't update the apiserver.auth.cookies.domain setting - I did it, restarted and now it works 🙂 Thanks!
with what I shared above, I now get: `docker: Error response from daemon: network 'host' not found.`
mmh it looks like what I was looking for, I will give it a try 🙂
Hi TimelyPenguin76 , I guess it tries to spin them down a second time, hence the double print
with the CLI, on a conda env located in /data
Ok, but that means this cleanup code should live somewhere else than inside the task itself right? Otherwise it won't be executed since the task will be killed
Hi AgitatedDove14 , I don't see any in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could overwrite it and add one?
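The idea of subclassing a handler to add a stop hook could look like the sketch below. `SimpleEarlyStopper` is a hypothetical stand-in for ignite's `EarlyStopping` (the real class takes a `score_function` and a trainer engine); the `on_stop` callback is the addition being discussed, not part of ignite's API.

```python
class SimpleEarlyStopper:
    """Hypothetical early-stopping sketch with a user-supplied stop hook."""

    def __init__(self, patience, on_stop=None):
        self.patience = patience      # rounds without improvement before stopping
        self.on_stop = on_stop        # hypothetical cleanup hook, not in ignite
        self.best = None
        self.counter = 0
        self.stopped = False

    def step(self, score):
        # higher score is better; reset the counter on any improvement
        if self.best is None or score > self.best:
            self.best = score
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.stopped = True
                if self.on_stop is not None:
                    self.on_stop()  # e.g. clean up artifacts, stop services
```

With ignite's real `EarlyStopping`, the equivalent would be overriding `__call__` in a subclass and invoking the hook just before the stop is triggered.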
The parent task is a data_processing task, therefore I retrieve it so that I can then `data_processed = parent_task.artifacts["data_processed"]`
I am looking for a way to gracefully stop the task (clean up artifacts, shutdown backend service) on the agent
GrumpyPenguin23 yes, it is the latest
AgitatedDove14 , what I was looking for was: `parent_task = Task.get_task(task.parent)`
(I use trains-agent 0.16.1 and trains 0.16.2)
I am now trying with `agent.extra_docker_arguments: ["--network='host'", ]`
instead of what I shared above
AgitatedDove14 I finally solved it: the problem was `--network='host'`, which should be `--network=host`
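For reference, a sketch of how the corrected flag could sit in clearml.conf (assuming the standard `agent` section layout; the surrounding file may differ):

```
agent {
  # pass extra arguments to `docker run`; note the unquoted value
  extra_docker_arguments: ["--network=host"]
}
```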
I am doing:
```
try:
    score = get_score_for_task(subtask)
except Exception:
    score = pd.NA
finally:
    df_scores = df_scores.append(
        dict(task=subtask.id, score=score), ignore_index=True
    )
task.upload_artifact("metric_summary", df_scores)
```
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
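Killing a detached subprocess when the task process exits could be sketched with the standard library alone. This is a minimal sketch, assuming the agent delivers SIGTERM on abort (the sleeping child below stands in for the backend service):

```python
import atexit
import signal
import subprocess
import sys

# stand-in for the detached backend service started by the task
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def cleanup():
    # terminate the child only if it is still running
    if proc.poll() is None:
        proc.terminate()
        proc.wait(timeout=10)

# run cleanup on normal interpreter exit
atexit.register(cleanup)

def handle_sigterm(signum, frame):
    # assumption: the agent sends SIGTERM when aborting the task
    cleanup()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```

The caveat from the message stands: if the agent kills the task with SIGKILL rather than SIGTERM, neither hook runs, so the cleanup would indeed need to live outside the task itself.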
So I changed `ebs_device_name = "/dev/sda1"`, and now I correctly get the 100GB EBS volume mounted on `/`. All good 🙂
Interesting! Something like that would be cool, yes! I just realized that custom plugins in Mattermost are written in Go - could be a good hackday for me 🙂 to learn Go
Ok, I got the following error when uploading the table as an artifact: `ValueError('Task object can only be updated if created or in_progress')`
Yes AgitatedDove14 🙂
Thanks for your input TenseOstrich47 , I was considering using a secret manager now, I guess that's the best option. I can move the secrets wherever I need them to be to make it work 🙂
AMI `ami-08e9a0e4210f38cb6`, region: eu-west-1a - the deep learning AMI from nvidia (Ubuntu 18.04)
so what worked for me was the following startup userscript:
```
#!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
```
there is no error from this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the userdata script fails