Hi @<1634001100262608896:profile|LazyAlligator31>
Is this because the code repo is being recreated in this directory?
Yes, this is correct.
Basically the entire code base + venv is installed there, to make sure it does not interfere with the "system" preinstalled environment
(it also allows for caching on the host machine)
Hi OutrageousSheep60
Do you mean something like:
https://github.com/allegroai/clearml/tree/master/examples/datasets
?
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Every clearml-serving session (you can have multiple different "sessions") is assumed to be homogeneous, meaning it will serve the same models on as many nodes as possible, supporting multiple models per pod.
In your example I think the easiest is to create two serving sessions: one with a node selector for the 24GB node and another for the 16GB node, wdyt?
FreshReindeer51
Could you provide some logs ?
Is the clearml server a worker I can serve models on?
The serving is done by one of the clearml-agents.
Basically you spin an agent, then this agent is spinning the model serving engine container (fully managed).
(1) install and run clearml-agent, (2) run the clearml-session CLI to configure and spin up the serving engine
Correct, the serving Task ID is the clearml-serving session. It is the instance that holds all the information of this specific setup and its models.
Hi @<1523711619815706624:profile|StrangePelican34>
Hmm, I think this is missing from the docs, let me ping the guys about that
OutrageousSheep60
I found the task in the UI - and in the UNCOMMITTED CHANGES execution section there is "No changes logged"
This is the issue.
and then run the session via docker:
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 \
    --packages "clearml" "tensorflow>=2.2" "keras" \
    --queue MY_QUEUE \
    --verbose
Are you running clearml-session from your machine (i.e. not from inside a docker)?...
clearml_agent: ERROR: Can not run task without repository or literal script in script.diff
This is odd ...
OutrageousSheep60 when you launch clearml-session it tells you the session ID (which is also a Task ID). Can you look for it in the UI and check there is something in the repo / uncommitted-changes section?
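If it is easier to check programmatically, here is a minimal sketch (the task ID is a placeholder, and the assumption is that the exported task dict exposes the same script/diff section the UI shows):
from clearml import Task

# placeholder: use the session ID printed by clearml-session
session_task = Task.get_task(task_id="<session_task_id>")

# assumption: export_task() returns a dict whose "script" section holds the
# repository URL and the uncommitted changes ("diff"), mirroring the UI
script_section = session_task.export_task().get("script", {})
print("repository:", script_section.get("repository"))
print("has uncommitted changes:", bool(script_section.get("diff")))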
OK - the issue was the firewall rules that we had.
Nice!
But now there is an issue with the "Setting up connection to remote session" step
OutrageousSheep60 this is just a warning, basically saying we are using the default signed SSH server key (has nothing to do with the random password, just the identifying key being used for the remote ssh session)
Bottom line, I think you have everything working
Good question
https://clear.ml/docs/latest/docs/clearml_agent#dynamic-gpu-allocation
The latest updated help will always be here as well:
clearml-agent daemon --help
But I am considering just failing the task.
This will of course work: just raise an exception in the Task itself, and protect the call in the pipeline logic function with try/except.
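For example, a minimal sketch assuming a decorator-based pipeline (all names here are made up for illustration):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def risky_step(x):
    # failing the component Task is just raising inside it
    if x < 0:
        raise ValueError("negative input, failing this step")
    return x * 2

@PipelineDecorator.pipeline(name="guarded pipeline", project="examples", version="1.0")
def pipeline_logic():
    try:
        result = risky_step(-1)
    except Exception as ex:
        # protect the pipeline logic from the failed component
        print("step failed, falling back:", ex)
        result = 0
    return result

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run pipeline + components locally for testing
    pipeline_logic()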
Regarding the second option, try to nullify the hash on the Component Task:
# running inside the component Task:
# clear the cache hash so other pipeline runs will not reuse this execution
Task.current_task()._set_runtime_properties({"pipeline_job_hash": None})
try to break it into parts and understand what produces the error
for example:
increase(test12_model_custom:Glucose_bucket[1m])
increase(test12_model_custom:Glucose_sum[1m])
increase(test12_model_custom:Glucose_bucket[1m]) / increase(test12_model_custom:Glucose_sum[1m])
and so on
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent, but to the fact that you have 4 processes trying to simultaneously access the same resource)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support
CloudyHamster42
RC probably in a few days, but notice that it will just remove the warnings, I still can't reproduce the double axis issue.
It would be helpful if you could send a small script that reproduces the problem.
Maybe this example code can help ? https://github.com/allegroai/trains/blob/master/examples/manual_reporting.py
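For reference, a minimal manual-reporting sketch in that spirit (using the current clearml package name rather than the old trains one; project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="scalar reporting repro")
logger = task.get_logger()

for i in range(10):
    # two series reported under the same title should share a single plot
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)
    logger.report_scalar(title="loss", series="validation", value=1.2 / (i + 1), iteration=i)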
should I update nodejs in centos image ?
I think so, it might have been forgotten
Thanks for answering, Yes, this is exactly what I wanted
Hmm, should be possible. How slow is the update that we want to save time on?
Hi OutrageousGrasshopper93
I think what you are looking for is Task.import_task and Task.export_task
https://allegro.ai/docs/task.html#trains.task.Task.import_task
https://allegro.ai/docs/task.html#trains.task.Task.export_task
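A minimal sketch of how the two calls could be combined (shown with the current clearml package name; the linked docs use the older trains name but the calls are the same; the task ID is a placeholder):
from clearml import Task

# export the full definition of an existing task as a plain dict
exported = Task.get_task(task_id="<source_task_id>").export_task()

# ... optionally tweak the exported definition here ...

# recreate a task from the exported definition (e.g. on another server)
new_task = Task.import_task(exported)
print("new task id:", new_task.id)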
Do you mean to spin a pod with the agent inside it (daemon in services mode)?
Or connect the services queue to the k8s cluster (i.e. define a pod template that uses CPU with not a lot of RAM)?
Hi ColossalDeer61 ,
Xxx is the module where my main experiment script resides.
So I think there are two options,
1. Assuming you have a similar folder structure:
- main_folder
-- package_folder
-- script_folder
--- script.py
Then if you set the "working directory" in the execution section to "." and the entry point to "script_folder/script.py", your code could do (see the sketch after this list):
from package_folder import ABC
2. After cloning the original experiment, you can edit the "installed packages", and ad...
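For illustration of option 1, a minimal script_folder/script.py matching the layout above (ABC is just a placeholder name from the example):
# script_folder/script.py
# assumes the working directory is the repo root (".") so that
# package_folder/ is importable as a top-level package
from package_folder import ABC  # placeholder object from the example above

if __name__ == "__main__":
    print("imported:", ABC)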
I want to use services queue for running services, and I want to do it on k8s
So yes, as a standalone pod with the agent in venv mode (as opposed to docker mode)
Does that make sense to you?
I guess it won't due to the nature of services?
Correct, the k8s glue works differently. That said, I would actually use the helm chart to spin a pod with the agent in services mode and venv mode.
Makes sense to add it to docker run by default if GPUs are mentioned in agent.
I think this is an Arch thing; --privileged is not needed on the Ubuntu flavor. That said, you can always have it if you add it here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
clearml-agent daemon --gpus 0 --queue default --docker
But docker still sees all GPUs.
Yes, --gpus should be enough. Are you sure regarding the --privileged flag?
SmugLizard25 are you saying that with the latest version it does not work?
one can containerise the whole pipeline and run it pretty much anywhere.
Does that mean the entire pipeline will be running on the instance spinning the container ?
From here: this is what I understand:
https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
My thinking was I can use one command and run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml such that I could also then later go into the interface and clone an...