Okay, verified: it won't work with the demo server. Give me a minute 🙂
GrittyKangaroo27 any chance you can open a GitHub issue so this is not forgotten ?
(btw: I think 1.1.6 is going to be released later today, then we will have a few RCs with improvements on the pipeline; I will make sure we add that as well)
Hi @<1547028074090991616:profile|ShaggySwan64>
I'm guessing just copying the data folder with rsync is not the most robust way to do that since there can be writes into mongodb etc.
Yep
Does anyone have experience with something like that?
Basically you should just back up the three DBs (MongoDB, Redis, Elasticsearch), each one based on its own backup workflow. Then just rsync the fileserver data & configuration.
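As a rough ops sketch of the above (all paths, hosts, and the snapshot repository name are assumptions about your deployment; it obviously needs the live services to actually run):

```shell
#!/bin/sh
# Sketch of a ClearML server backup: dump each DB with its own native tool,
# then rsync the fileserver data and configuration.
BACKUP_DIR=/backup/clearml/$(date +%F)
mkdir -p "$BACKUP_DIR"

# MongoDB: consistent dump via mongodump
mongodump --host localhost --port 27017 --out "$BACKUP_DIR/mongo"

# Redis: trigger a snapshot, then copy the RDB file
redis-cli SAVE
cp /var/lib/redis/dump.rdb "$BACKUP_DIR/redis-dump.rdb"

# Elasticsearch: use the snapshot API (the repo "backup_repo" must be
# registered beforehand)
curl -X PUT "localhost:9200/_snapshot/backup_repo/snap_$(date +%s)?wait_for_completion=true"

# Fileserver data + configuration
rsync -a /opt/clearml/data/fileserver/ "$BACKUP_DIR/fileserver/"
rsync -a /opt/clearml/config/ "$BACKUP_DIR/config/"
```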
How about this one:
None
Copy-paste the trains.conf from any machine; it just needs the definition of the trains-server address.
Specifically if you run in offline mode, there is no need for the trains.conf and you can just copy the one on GitHub
BTW, this one seems to work ....
```python
from time import sleep
from clearml import Task

Task.set_offline(True)
task = Task.init(project_name="debug", task_name="offline test")
print("starting")
for i in range(300):
    print(f"{i}")
    sleep(1)
print("done")
```
@<1524922424720625664:profile|TartLeopard58> @<1545216070686609408:profile|EnthusiasticCow4>
Notice that when you are spinning up multiple agents on the same GPU, the Tasks should request the "correct" fractional GPU container, i.e. if they pick a "regular" container there will be no memory limit.
So something like
```
CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0 clearml/fractional-gpu:u22-cu12.3-2gb
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0 clearml/fractional-gpu:u22-cu12.3-2gb
...
```
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
Okay, let me see...
Hi @<1729309120315527168:profile|ShallowLion60>
ClearML in our case is installed on k8s using the Helm chart (version 7.11.0)
It should be done "automatically", I think there is a configuration var in the helm chart to configure that.
What URLs are you seeing now, and what should be there?
was thinking that would delete the old weights from the file server once they get updated,
If you are uploading it to the same Task, make sure the model name and the filename are the same, and it will override it (think filesystem filenames)
but they are still there, consuming space. Is this the expected behavior? How can I get rid of those old files?
you can also programmatically remove (delete) models None
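The "think filesystem filenames" point above can be illustrated with a plain-Python toy (no ClearML involved, all names made up): reusing one fixed filename overwrites the previous file, while unique names accumulate and keep consuming space.

```python
import os
import tempfile

# Toy illustration: saving under one fixed name overwrites the old file,
# while unique (e.g. step-suffixed) names pile up on disk.
d = tempfile.mkdtemp()
for step in (1, 2, 3):
    with open(os.path.join(d, "model.pt"), "w") as f:   # same name every time
        f.write(f"weights@{step}")
    with open(os.path.join(d, f"model_{step}.pt"), "w") as f:  # unique names
        f.write(f"weights@{step}")

print(sorted(os.listdir(d)))
# ['model.pt', 'model_1.pt', 'model_2.pt', 'model_3.pt']
```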
with conda ?!
Done HandsomeCrow5 +1 added 🙂
btw: if you feel you can share what your reports look like (a screenshot is great), this will greatly help in supporting this feature, thanks
…every user in the server has the same credentials, and they don’t need to know them... makes sense?
Make sense, single credentials for everyone, without the need to distribute
Is that correct?
Not really 😞
Everyone can do everything, the idea is sharability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO etc., but unfortunately it's way too complicated for the open-source.
Basically what I'm saying is trust your fellow colleagues 🙂
but now since
Task.current_task()
doesn't work on the pipeline object we have a serious problem
How is that possible ?
Is there a small toy code that can reproduce it ?
If you create an initial code base maybe we can merge it?
works seamlessly throughout and in our current on premise servers...
I'm assuming via something close to what I suggested above with .netrc ?
Sigint (ctrl c) only
Because flushing the state (i.e. sending requests) might take time, we only do that when users interactively hit Ctrl-C. Make sense?
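A generic sketch of that pattern (not ClearML internals; the names are made up): register a SIGINT handler that takes the time to flush pending state, on the assumption that an interactive Ctrl-C means the user can wait a moment.

```python
import os
import signal

state = {"flushed": False}

def flush_pending_requests():
    # Stand-in for the potentially slow "send everything to the server" step.
    state["flushed"] = True

def on_sigint(signum, frame):
    # Only on an interactive Ctrl-C (SIGINT) do we take the time to flush;
    # other termination paths would skip this and exit quickly.
    flush_pending_requests()

signal.signal(signal.SIGINT, on_sigint)

# Simulate the user hitting Ctrl-C:
os.kill(os.getpid(), signal.SIGINT)
print(state["flushed"])
```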
Hi @<1546303293918023680:profile|MiniatureRobin9> could it be the pipeline logic is created via the clearml-task CLI? If this is the case, I think this is an edge case we should fix. Basically it creates a Task instead of a pipeline, which in essence only affects the UI. To solve it, just run the pipeline locally; notice that by default when you start it, it will actually stop the local run and relaunch itself on an agent.
Also, could you open a GitHub issue so we add a flag for it?
Hi @<1689446563463565312:profile|SmallTurkey79>
This call is to set an existing (already created) Task's requirements. Since it was just created, it waits for the automatic package detection before overriding it.
What you want is Task.force_requirements_env_freeze (notice it is class level, so it needs to be called before Task.init):
```python
Task.force_requirements_env_freeze(requirements_file="requirements.txt")
task = Task.init(...)
```
PlainSquid19 I will also look into it as well.
maybe for some reason model.keras_model.save_weights is not caught ...
SweetGiraffe8 no need to import it, any report to TB is automatically logged by ClearML 🙂
Hi @<1523708920831414272:profile|SuperficialDolphin93>
The error seems like NVML fails to initialize inside the container; you can test it with nvidia-smi and check if that works
Regarding the CUDA version, ClearML Serving inherits from the Triton container. Could you try to build a new one with the latest Triton container (I think 25)? The docker compose is in the clearml-serving git repo. wdyt?
Where would I put these credentials? I don't want to expose them in the logs as environmental variable or hard code them.
Hi GleamingGrasshopper63
So basically you need a vault, to store those credentials...
Unfortunately the open-source version does not contain vault support, but the paid tiers scale/enterprise do.
There you can have an environment variable defined in the vault, that each time the agent runs your code, it will pull it from the vault and set it on your process. wdyt ?
os.system
Yes, that's the culprit. It actually runs a new process, and clearml assumes there are no other scripts in the repository that are used, so it does not analyze them
A few options:
1. Manually add the missing requirement with Task.add_requirements('package_name') ; make sure you call it before the Task.init
2. Import the second script from the first script. This will tell clearml to analyze it as well.
3. Force clearml to analyze the whole repository: https://g...