Oh that is odd. Is this reproducible? @<1533620191232004096:profile|NuttyLobster9> what was the flow that required another Task.init?
I still don't get resource logging when I run in an agent.
@<1533620191232004096:profile|NuttyLobster9> there should be no difference ... are we still talking about <30 sec? or a sleep test? (no resource logging at all?)
I have a separate task that is logging metrics with TensorBoard. When running locally, I see the metrics appear in the "Scalars" tab in ClearML, but when running in an agent, nothing. Any suggestions on where to look?
This is odd and somewhat consistent with actu...
And after having called Task.init() the second time, the automatic logging of resources and tensorboard plots works as well. I would recommend adding an explanation to the docs for
Oh yeah! You always need to call Task.init first; Task.current_task can be called from anywhere you like, but only after Task.init was called.
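For example, a minimal sketch (project/task names are placeholders):
from clearml import Task

# Task.init must be called once, early in the entry script
task = Task.init(project_name="examples", task_name="init-order-check")

def some_helper():
    # safe to call anywhere in the same process, as long as Task.init already ran
    current = Task.current_task()
    current.get_logger().report_text("Task.current_task() is not None here")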
Hi @<1533620191232004096:profile|NuttyLobster9>
... but no system stats ...
If the job is too short (I think 30 seconds), it doesn't have enough time to collect stats (basically it collects them over a 30 sec window, but the task ends before it sends them)
does that make sense ?
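If you want to verify this on a very short job, a minimal sketch (the ~30 sec window is an assumption based on the default reporting interval) is to keep the process alive a bit longer at the end:
import time
from clearml import Task

task = Task.init(project_name="examples", task_name="short-job")
# ... the actual (short) work ...

# keep the process alive past the resource-monitor window so at least one
# machine/GPU sample is collected and sent before the task ends
time.sleep(60)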
there is a bug wherein both Task.current_task() and Logger.current_logger() return None.
This is not a bug; it means something broke. The environment variable CLEARML_TASK_ID has to be set inside the agent's process.
How are you running it? (also log 🙂 , you can DM so it is not public here)
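A quick way to check from inside the running job (a minimal sketch; the variable name is the one mentioned above):
import os

# when a task is executed by a clearml-agent, the agent exposes the task id
# to the child process through this environment variable
print("CLEARML_TASK_ID =", os.environ.get("CLEARML_TASK_ID"))  # None means it was not set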
@<1523701868901961728:profile|ReassuredTiger98> thank you so much for testing it!
and this?
avg(100*increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m]))
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing CUDA on this machine, but idk) and there is no PyTorch build for 11.2, so maybe it falls back to CPU?
For some reason it detects CUDA 11.1 (I assume this is what you have installed; the driver's CUDA version is the highest it will support, not necessarily what you have installed).
I'm assuming the reason it fails is that the docker network is only available inside that specific docker-compose. This means that when you spin up another docker-compose they do not share the same names; just replace the name with a host name or IP and it should work. Notice this has nothing to do with ClearML or Serving, these are docker network configurations.
I think that what you need is the triggers, check this one:
https://clear.ml/docs/latest/docs/references/sdk/trigger
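A rough sketch of what that could look like (hedged: the TriggerScheduler arguments below are from memory of the clearml.automation API, so double-check them against the docs linked above; the task id and names are placeholders):
from clearml.automation import TriggerScheduler

# poll the backend periodically and fire when a matching event happens
scheduler = TriggerScheduler(pooling_frequency_minutes=3)

# e.g. clone-and-enqueue a template task whenever a model in the project is published
scheduler.add_model_trigger(
    schedule_task_id="<template-task-id>",
    schedule_queue="default",
    trigger_project="examples",
    trigger_on_publish=True,
)

scheduler.start()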
I tested and I have no more warning messages
if self._active_gpus and i not in self._active_gpus: continue
This solved it?
If so, PR pretty please 🙂
BoredGoat1
Hmm, that means it should have worked with Trains as well.
Could you run the attached script, see if it works?
Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check that you have libnvidia-ml.so.1 inside the container?
For example in /usr/lib/nvidia-XYZ/
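A quick way to check from Python inside the container (a minimal sketch; ctypes raises OSError if the library cannot be found or loaded):
import ctypes

try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1 is loadable")
except OSError as e:
    print("libnvidia-ml.so.1 not found:", e)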
BoredGoat1 where exactly do you think that happens?
https://github.com/allegroai/trains/blob/master/trains/utilities/gpu/gpustat.py#L316
?
https://github.com/allegroai/trains/blob/master/trains/utilities/gpu/gpustat.py#L202
Yes, that's the part that is supposed to only pull the GPU usage for your process (and sub-processes) instead of globally on the entire system.
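For reference, a minimal sketch of per-process GPU memory with pynvml (the underlying library; the actual filtering in gpustat.py may differ):
import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# list compute processes on GPU 0 and keep only the current process
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    if proc.pid == os.getpid():
        print("this process uses", proc.usedGpuMemory, "bytes on GPU 0")

pynvml.nvmlShutdown()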
Hi BoredGoat1
from this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems Trains failed to load the NVIDIA .so library that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
Basically saying, if you can run nvidia-smi from inside the container, it should work.
Okay could you test with export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/.singularity.d/libs/
Maybe permissions?!
you can test it manually by installing pynvml
and running:
from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
LOL 🙂
Make sure that when you train the model, or create it manually, you set the default "output_uri":
task = Task.init(..., output_uri=True)
or
task = Task.init(..., output_uri="s3://...")
each epoch runs about 55 minutes, and that screenshot I posted earlier kind of shows the logs for the rest of the info being output, if you wanted to check that out
I thought you disabled the stdout log. no?
Maybe ClearML is using tensorboard in ways that I can fine-tune? I
You can open your TB and see: every report there is logged into ClearML.
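For example, a minimal sketch (assuming PyTorch's SummaryWriter; any TensorBoard writer behaves the same): once Task.init has run, scalars written to TensorBoard are picked up automatically and show up in the Scalars tab.
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="examples", task_name="tb-autolog")
writer = SummaryWriter(log_dir="./tb_logs")

for step in range(100):
    # this scalar appears both in TensorBoard and in the ClearML Scalars tab
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)

writer.close()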