Ok no, it only helps as long as I don't log the figures. If I log the figures, I will still run into the same problem.
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak.
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (a rough sketch of this watcher follows below).
3. Have a service that periodically pulls failed experiments from the queue with the TO_RESUME tag, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param.
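A rough sketch of the watcher thread in step 2, assuming an AWS EC2 spot instance (the metadata URL is the standard spot interruption endpoint, TO_RESUME is the tag from the list above, everything else is illustrative):

import threading
import time

import requests
from clearml import Task

# AWS EC2 exposes a spot interruption notice here; it returns 404 until the instance is marked for stop.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(poll_seconds=30):
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # Tag the running experiment so the external "resume" service can find it later
                Task.current_task().add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass  # metadata endpoint not reachable, retry on the next poll
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_interruption, daemon=True).start()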
Yes, that was my assumption as well. There could be several causes to be honest, now that I see that matplotlib itself is also leaking.
RuntimeError: CUDA error: no kernel image is available for execution on the device
For me it is definitely reproducible, but the codebase is quite large and I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm

task = Task.init("Debug memory leak", "reproduce")

def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
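The elided loop body is where each figure gets reported, which is the step where the problem shows up according to the earlier message. A hypothetical version of that loop (title and series names are made up, not from the original code) would look something like:

# Hypothetical variant of the loop above with the figure actually reported to ClearML
logger = task.get_logger()
for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    logger.report_matplotlib_figure(title="debug", series="leak-repro", figure=fig, iteration=i)
    plt.close(fig)  # close the figure so matplotlib itself does not accumulate open figures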
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit.
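For context, that argument goes into Task.init; a minimal sketch of the setup being tested (project and task names here are illustrative):

from clearml import Task

# Disable the matplotlib and joblib bindings while chasing the leak
task = Task.init(
    project_name="Debug memory leak",       # same project as the repro above
    task_name="no-framework-autoconnect",   # illustrative name
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)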
DeterminedCrab71 This is the behaviour of holding shift while selecting in Gmail, if ClearML could reproduce this, that would be perfect!
oh, seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 on x86, so they use the default pip one
Yes, so indeed they don't provide support for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117 - that is what the clearml-agent was doing before.
With a large enough number of iterations in the for loop, you should see the memory grow over time
Adding back ClearML logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious.
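For reference, this is the backend switch meant above - a minimal sketch; the backend is conventionally selected before pyplot is imported:

import matplotlib
matplotlib.use("agg")  # non-interactive backend, no GUI event loop

import matplotlib.pyplot as plt  # imported after the backend is set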
automatically promote models to be served from within clearml
Yes!
So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
Hi AgitatedDove14, I investigated further and got rid of a separate bug. I was able to get ignite's events fired, but still no scalars logged.
There is definitely something wrong going on with the reporting of scalars when using multiple processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
I fixed it, will push a fix to pytorch-ignite.
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
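For context, the only hook I'm aware of would be pinning the requirement before Task.init - a sketch assuming Task.add_requirements is the right mechanism here (not confirmed), with the version spec mirroring the cu115 wheel above:

from clearml import Task

# Assumption: forcing the exact version spec makes the agent resolve the +cu115 wheel.
# Must be called before Task.init so it ends up in the recorded requirements.
Task.add_requirements("torch", "==1.11.0+cu115")

task = Task.init(project_name="my-project", task_name="pin-torch-wheel")  # illustrative names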
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
Thanks AgitatedDove14 !
Could we add this task.refresh() to the docs? Might be helpful for other users as well.
OK! Maybe there is a middle ground: for artifacts already registered, simply return the entry; for artifacts that don't exist yet, contact the server to retrieve them.
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object.
That sounds awesome! It will definitely fix my problem.
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output'
... Was that normal? Will it work if I replace the second line with task.refresh()?
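For completeness, the variant I have in mind is below - a sketch that assumes task.reload() is the call meant by "refresh" above and that it picks up artifacts registered server-side, which is exactly the open question:

task.wait_for_status()   # block until the task reaches a final state
task.reload()            # refresh the cached task data from the server
artifact = task.artifacts["output"]
local_path = artifact.get_local_copy()  # or artifact.get() to deserialize the object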
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1. I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0.
mmmh good point actually, I didn't think about it
Because it lives behind a VPN and GitHub workers don't have access to it
I'd like to move to a setup where I don't need these tricks