Sometimes it works fine, but sometimes I get this error message
@<1523704461418041344:profile|EnormousCormorant39> can I assume there is a gateway at --remote-gateway <internal-ip>
?
Could it be that this gateway has some network firewall blocking some of the traffic ?
If this is all local network, why do you need to pass --remote-gateway ?
ProudMosquito87 I think this is what you are looking for: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L101
DistressedGoat23
We are running hyperparameter tuning (using some CV) which might take a long time and might even be aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, but the one controlling it all) there is the global progress of the optimization metric, is this what you are looking for? Am I missing something?
I'm not sure TB supports confusion matrices regardless. From anywhere in your code you can do:
from trains import Task
Task.current_task().get_logger().report_confusion_matrix(...)
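For reference, a minimal sketch of reporting a confusion matrix through the logger (the matrix values and labels below are made up for illustration; the argument names follow my understanding of Logger.report_confusion_matrix):
```
import numpy as np
from trains import Task  # on newer versions: from clearml import Task

task = Task.init(project_name="examples", task_name="confusion matrix demo")

# Made-up 3x3 confusion matrix, purely for illustration
matrix = np.array([
    [50, 2, 1],
    [3, 45, 4],
    [0, 5, 48],
])

task.get_logger().report_confusion_matrix(
    title="validation confusion matrix",
    series="epoch 1",
    matrix=matrix,
    iteration=1,
    xlabels=["cat", "dog", "bird"],
    ylabels=["cat", "dog", "bird"],
)
```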
Hi CleanPigeon16
I was wondering how (or if) you handle interruptions.
Good question. Basically (and I might be missing a few details, but I think that's the general gist):
A new instance will be spun up (spot/regular based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked by Amazon (i.e. is a spot instance), another one will be spun up instead, as long as there is a job in the queue. That means that what is probably missing in you...
Oh yes, you probably have a sort or filter applied there :)
Oh that makes sense. This depends on how you set up the clearml k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limit per pod; then k8s will know to limit the number of pods on the same GPU machine.
wdyt?
Yes that makes total sense to me. How about a GitHub issue on the clearml-docs ?
The pod has an annotation with an AWS role which has write access to the s3 bucket.
So assuming the boto environment variables are configured to use the IAM role, it should be transparent, no? (I can't remember what the exact envs are, but google will probably solve it 🙂)
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN. I was expecting clearml to pick them up by default from the environment.
Yes it should, the OS env will always override the configuration file sect...
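As a rough example (a sketch, assuming the standard boto credential environment variables are what the pod exposes; the bucket URL is a placeholder):
```
import os

# Standard AWS credential env vars picked up by boto3 (and therefore by ClearML's S3 access).
# In a real pod these come from the IAM role / injected environment, not hard-coded values.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "<key>")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "<secret>")
os.environ.setdefault("AWS_SESSION_TOKEN", "<token>")

from clearml import StorageManager

# Anything under s3:// should now use the credentials above,
# overriding whatever is set in the clearml.conf aws section
local_copy = StorageManager.get_local_copy("s3://my-bucket/models/model.pkl")
print(local_copy)
```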
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 or above, because with 0.16 we added sections and the compatibility is within the same version. Meaning if you have tasks generated with trains 0.16 you need trains 0.16 to clone them from code (so you can properly control the arguments).
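As a rough illustration of cloning from code and overriding the argparse-backed arguments (the task ID, parameter names and queue name are placeholders):
```
from trains import Task  # on newer versions: from clearml import Task

# The task doing the cloning; the template task was created with trains >= 0.16
template = Task.get_task(task_id="<template_task_id>")

cloned = Task.clone(source_task=template, name="cloned run")

# With 0.16+ parameters live in sections, e.g. "Args" for argparse values
cloned.set_parameters({"Args/learning_rate": 0.001, "Args/epochs": 20})

Task.enqueue(cloned, queue_name="default")
```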
The problem is that even when I mount the SSH key into the root home directory (e.g.,
/root/.ssh/id_rsa
with the correct permissions set to 400) I still encounter the same error.
The agent automatically mounts the .ssh folder from the host into the container, making sure all the permissions are set.
how can I run
pip install -e .
In general the agent will add the "working" dir to the PYTHONPATH, so you should not have to manually do "-e ."
Tha...
but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
How so?
because a step can be constructed with multiple
sub-components
but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node we want that to also appear in the UI graph (and have its own Task / metrics etc)
correct?
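For comparison, ClearML's existing pipeline decorators already behave per-component, with each decorated function running as its own Task and appearing as a node in the UI graph. A minimal sketch (function names and values are made up):
```
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=["processed"], cache=False)
def preprocess(raw):
    # Runs as its own Task, so it shows up as a separate node in the pipeline UI graph
    return [x * 2 for x in raw]


@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def pipeline_logic():
    print(preprocess([1, 2, 3]))


if __name__ == "__main__":
    PipelineDecorator.run_locally()
    pipeline_logic()
```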
So this is an additional config file with enterprise?
Extension to the "clearml.conf" capabilities
Is this new config file deployable via helm charts?
Yes, you can also set it company/user wide using the clearml Vault feature (again enterprise, sorry 😞 )
It seems to fail when trying to download the model:
local_download = StorageManager.get_local_copy(uri, extract_archive=False)
  File "/opt/venv/lib/python3.7/site-packages/clearml/storage/manager.py", line 47, in get_local_copy
    cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
  File "/opt/venv/lib/python3.7/site-packages/clearml/storage/cache.py", line 55, in get_local_copy
    if helper.base_url == "file://":
And based on the error I suspect the...
Thanks CleanPigeon16
Could you verify Task "d1d361d1059c4f0981200f59d7683773" exists (and not archived)?
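A quick way to check from code (a sketch; the archived check via system tags reflects my understanding of how archiving is marked):
```
from clearml import Task

task = Task.get_task(task_id="d1d361d1059c4f0981200f59d7683773")
print(task.name, task.status)

# Archived tasks carry the "archived" system tag, to the best of my knowledge
print("archived" in (task.get_system_tags() or []))
```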
. I can't find any actual model files on the server though.
What do you mean? Do you see the specific models in the web UI? is the link valid ?
So dynamic and static are basically the same thing, except that with dynamic I can edit the artifact while running the experiment?
Correct
Second, why would it be overwritten if I run a different run of the same experiment?
Sorry, I meant within the same run: if you reuse the artifact name you will be overwriting it. Obviously different runs, different artifacts :)
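To illustrate the two flavors (a minimal sketch; names are made up): register_artifact keeps a "dynamic" artifact in sync while the task runs, while upload_artifact uploads a one-off "static" object, and reusing the same name within the same run overwrites it.
```
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="artifacts demo")

# Dynamic artifact: the DataFrame is monitored and re-uploaded as it changes during the run
log_df = pd.DataFrame({"epoch": [], "loss": []})
task.register_artifact(name="training_log", artifact=log_df)

# Static artifact: uploaded once; uploading again with the same name in this run overwrites it
task.upload_artifact(name="config", artifact_object={"lr": 0.001, "epochs": 20})
```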
RoughTiger69 how did you end up with a Task with just "origin" in the repo field ?
JitteryCoyote63 what's the clearml
version ?
Are you always seeing the "model uploaded completed" message ?
What's the python version you are using?
Hi @<1535069219354316800:profile|PerplexedRaccoon19>
What do you mean by simulate?
You can manually set up and run a Task if you need:
'clearml-agent execute --id task_id' (add --docker for docker mode).
This will set up the env and run the task.
Is gpu_0_utilization also in % then?
Correct 🙂
I was trying to find what the min and max values are for the above metrics.
Oh that makes sense. Notice that you can get the values over time, so you can track the usage over the experiment's lifetime (you can of course see it in the Scalars tab of the experiment).
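For example, a sketch of pulling the reported scalars back from a task programmatically (the task ID is a placeholder, and the "monitor:gpu" / "gpu_0_utilization" metric names are an assumption about how the resource monitor labels them):
```
from clearml import Task

task = Task.get_task(task_id="<task_id>")

# Nested dict along the lines of {title: {series: {"x": [...], "y": [...]}}}
scalars = task.get_reported_scalars()

gpu_util = scalars.get("monitor:gpu", {}).get("gpu_0_utilization", {})
values = gpu_util.get("y", [])
if values:
    print("min %.1f%%  max %.1f%%" % (min(values), max(values)))
```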
Not sure why, but for some reason it seems it is failing to analyze the code, hence the warning and no packages...
Any other hints about your setup that might help to better understand the root cause? Maybe a home folder with unicode characters? Python installed in a specific way?
HealthyStarfish45 what exactly did you have in mind, in terms of the widget ?
So it was definitely related to the symlinks in some form
could it be it actually deleted the cache? How many agents are running on the same machine ?
My understanding is that on remote execution Task.init is supposed to be a no-op right?
Not really a no-op; it would sync argparse and the like, start background reporting services, etc.
This is so odd! literally nothing printed
Can you tell me something about the node "mrl-plswh100:0" ?
Is this like a SageMaker node? We have seen similar things where Python threads / subprocesses are not supported, and instead of Python crashing it just hangs there.