and if you add --skip-task-init ?
I think what happens is that clearml-task adds a Task.init call without the output_uri, which is called before "your" Task.init, and this is what causes it to be ignored. Could that be the case?
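For reference, this is the kind of explicit output_uri call I mean in "your" code (just a sketch, the bucket URL is a placeholder):
from clearml import Task

# your own Task.init call, with the output destination set explicitly
# (the s3 URL below is only a placeholder, replace it with your storage)
task = Task.init(
    project_name="examples",
    task_name="my experiment",
    output_uri="s3://my-bucket/models",  # where models / artifacts are uploaded
)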
WackyRabbit7 my apologies for the lack of background in my answer
Let me start from the top, one of the goals of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is proba...
Hi SoreDragonfly16
The warning you mention means that someone changed the state of the experiment to "aborted", which in turn will actually kill the process.
What do you mean by "If I disable the logger," ?
Okay, so the idea behind the new decorator is not to group all the defined steps under the same script so that they share the same environment, but rather to simplify the process of creating scripts for each step and avoid manually calling Task.init on those scripts.
Correct, and allow users to more easily create Tasks from code.
Regarding virtual environment creation from caching, I will keep running benchmarks (from what you say it might be due to high workload ...
No worries, just wanted to make sure it doesn't slip away
Hi @<1578555761724755968:profile|GrievingKoala83>
mount s3 as a cache folder
I'm not sure that would be fast enough for cache ...
How to override the /root/.cache/pip path?
In your clearml.conf file, override the pip cache folder option, then point it to your PV.
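Something along these lines (a sketch; docker_pip_cache is the option I have in mind, and the mount path is a placeholder):
# clearml.conf (agent section) - sketch, the path is a placeholder for your PV mount
agent {
    # host folder the agent maps into the container as /root/.cache/pip
    docker_pip_cache: "/mnt/my-pv/pip-cache"
}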
ClumsyElephant70 the odd thing is the error here:
docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash)?
Hi ContemplativeCockroach39
Seems like you are running the exact code as in the git repo:
Basically it points you to the exact repository https://github.com/allegroai/clearml and the script examples/reporting/pandas_reporting.py
Specifically:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/reporting/pandas_reporting.py
DeliciousSeal67 the agent will use the "installed packages" section in order to install packages for the code. If you clear the entire section (you can do that in the UI or programmatically) then it will revert to requirements.txt
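If you want to do it from code, a rough sketch (assuming Task.set_packages accepts an empty list to clear the section; the task id is a placeholder):
from clearml import Task

# sketch: clear the "installed packages" section of an existing task so the
# agent falls back to requirements.txt (assumes an empty list clears the section)
task = Task.get_task(task_id="<your-task-id>")
task.set_packages([])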
Make sense ?
VexedCat68 actually a few users already suggested we auto log the dataset ID used as an additional configuration section, wdyt?
I think your use case is the original idea behind "use_current_task" option, it was basically designed to connect code that creates the Dataset together with the dataset itself.
I think the only caveat in the current implementation is that it should "move" the current Task into the dataset project / set the name. wdyt?
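Something along these lines is what I have in mind (just a sketch, project / file names are placeholders):
from clearml import Task, Dataset

# sketch: create the dataset from the same code that generates it, so the
# dataset reuses the current Task instead of spawning a new one
task = Task.init(project_name="data", task_name="build dataset")
dataset = Dataset.create(
    dataset_name="my dataset",
    dataset_project="data",
    use_current_task=True,  # connect the dataset to this Task
)
dataset.add_files("./data_folder")
dataset.upload()
dataset.finalize()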
try to break it into parts and understand what produces the error
for example:
increase(test12_model_custom:Glucose_bucket[1m])
increase(test12_model_custom:Glucose_sum[1m])
increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m])
and so on
The problem is of course filling in all the configuration details, so that they are viewable.
Other than that, check out:
https://allegro.ai/docs/task.html#trains.task.Task.export_task
https://allegro.ai/docs/task.html#trains.task.Task.import_task
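Rough idea of the flow (just a sketch, the task id and the overridden field are placeholders):
from clearml import Task

# export an existing task definition as a dict, tweak it, then import it back
source_task = Task.get_task(task_id="<source-task-id>")
task_data = source_task.export_task()

# fill in / override whatever configuration details you need to be viewable
task_data["name"] = "imported copy"

# create a new task in the system from the exported definition
new_task = Task.import_task(task_data)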
Sounds good ?
Hmm can you run the agent in debug mode, and check the specific console log?
clearml-agent --debug daemon --foreground ...
But it should work out of the box ...
Yes it should ....
The user and personal access token are used as-is, and they propagate down to submodules, since those are simply another git repository.
Can you manually run the following successfully:
git clone --recursive https://user:token@github.com/company/repo_with_submodules
And are you seeing a bunch of the GS SSL errors?
Hi SkinnyPanda43
Let's say that I install the shared libs with pip in editable mode on my development environment, how will the clearml-agent handle those libraries if I submit a job
So installing packages from local folders with "-e" is in general ill-advised.
But using a full git path should work out of the box. For example, if you run pip install git+https://github.com/user/repo/repo.git then the agent will be able to install it on the remote machine. The main challenge...
but why is it mounted only once?
Are you saying the second time this line is missing? this is very strange...
Can you send the full Task log?
Why can we even change the pip version in the clearml.conf?
LOL mistakes learned the hard way
Basically too many times in the past pip versions were a bit broken, which is fine if they are used manually and users can reinstall a different version, but horrible when you have an automated process like the agent, so we added a "freeze version" option, only with greater control. Make sense ?
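For reference, this is the conf option I'm referring to (a sketch; adjust the version spec to whatever you need):
# clearml.conf - pin / freeze the pip version the agent uses
agent {
    package_manager {
        # e.g. keep pip below a release known to be broken
        pip_version: "<20.2"
    }
}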
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
JumpyPig73 quick question, does the state of the Task change immediately when it crashes? Are you running it with an agent (that hydra triggers)?
If this is vanilla clearml with Hydra runners, what I suspect happens is that Hydra is overriding the signal callback clearml adds (like hydra, clearml needs to figure out if the process crashed), then what happens is that clearml's callback is never cal...
Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)
Was trying to figure out how the method knows that the docker image ID belongs to ECR. Do you have any insight into that?
Basically you should have the docker service login before running the agent, then the agent uses docker to run the image from the ECR.
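For example (standard AWS CLI v2 login; the account id and region are placeholders), run this before starting the agent:
# log docker into ECR first (account id / region are placeholders)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# then start the agent in docker mode as usual
clearml-agent daemon --queue default --docker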
Make sense ?
Hi TenseOstrich47, what's the matplotlib version and clearml version you are using?
Hi HealthyStarfish45
Funny just today I had a similar discussion on slurm:
https://allegroai-trains.slack.com/archives/CTK20V944/p1603794531453000
Anyhow, when you say "[scale up agents]" are you referring to a machine constantly running an agent pulling jobs from the queue, where the machine itself (aka the resource) is managed as a slurm job?
Thanks a lot. I meant running a bash script after cloning the repository and setting the environment
Hmm that is currently not supported
The main issue in adding support is where to store this bash script...
Perhaps somewhere inside clear ml there is an order of actions for starting that can be changed?
Not that I can think of,
but let's assume you could have such a thing, what would you have put in the bash script (basically I want to see maybe there is a worka...
(without having to execute it first on Machine C)
Someone somewhere has to create the definition of the environment...
The easiest way to go about it is to execute it once.
You can add the following line to your code:
task.execute_remotely(queue_name='default')
This will cause your code to stop running and enqueue itself on a specific queue.
Quite useful if you want to make sure everything works, (like run a single step) then continue on another machine.
Notice that switching between cpu...
hmm... try to run the trains-agent from the ml environment with "system_site_packages: true", it might do the trick (see the sketch below). Anyhow please let me know if it worked
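i.e. in the agent's conf file, something like (a sketch):
# trains.conf / clearml.conf - let the agent's virtualenv see the conda env's packages
agent {
    package_manager {
        system_site_packages: true
    }
}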