😞 anything that can be done?
Okay, so the way it works is that it moves all the logging to a background process. But if you have a lot of data, actually pushing the data between Python processes is not very efficient. This line basically tells it to use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support PyTorch workers that spin up a lot of subprocesses, and we do not want to add a thread per process and in...
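For reference, a minimal sketch of what that toggle looks like in clearml.conf, assuming the relevant setting is report_use_subprocess (my assumption; double-check against your SDK version):
sdk {
  development {
    # false = send reports via a background thread instead of a background process
    report_use_subprocess: false
  }
}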
do I need to create a brand new dataset with a new name that inherits from the original?
Yes, you just create a new version, specify the parent one, add the changes and close it.
If you later need to, you can squash a version (same idea as git squash). Make sense?
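Something along these lines (a minimal sketch; the project/dataset names are placeholders):
from clearml import Dataset

# create a new version that inherits from the parent dataset
parent = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
child = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[parent.id],
)
child.add_files(path="./changed_files")  # add the changes
child.upload()
child.finalize()  # "close" the version

# later: squash versions into a single standalone dataset (same idea as git squash)
Dataset.squash(dataset_name="my_dataset_squashed", dataset_ids=[parent.id, child.id])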
Hi FiercePenguin76
It seems it fails to detect the notebook server and thinks it is running as a script.
What is exactly your setup?
docker image ?
jupyter-lab version ?
clearml version?
Also, are you getting any warnings when calling Task.init?
Hi VirtuousFish83
Apologies for the confusing documentation 🙂 It sounds complicated, but it should actually be relatively simple. Based on what I understand, you already have the server set up and your code integrated. The question is "can you see an experiment in the UI?" If you do, then you can right-click it, clone the experiment, edit parameters and send it for execution (enqueue). If the experiment is not in the UI you can either (1) run the code with the Task.init call, it will automatica...
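And the programmatic equivalent of the clone/edit/enqueue flow, if that helps (a minimal sketch; all names are placeholders):
from clearml import Task

# clone an existing experiment, tweak a parameter, and send it for execution
template = Task.get_task(project_name="my_project", task_name="my_experiment")
cloned = Task.clone(source_task=template, name="my_experiment clone")
cloned.set_parameter("General/learning_rate", 0.001)
Task.enqueue(cloned, queue_name="default")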
Hi JealousParrot68
do tasks that are created through create_function_task run the entry_script again instead of just the pure function
Basically they will run the code until the create_function_task call, but never past it. We are working on adding a decorator for a function, making it a "standalone" script, is this what you actually need?
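To illustrate the current behavior (a minimal sketch; names are placeholders):
from clearml import Task

def process(multiplier=2):
    # the function task executes only this function
    return 21 * multiplier

task = Task.init(project_name="examples", task_name="main")
# when the created task runs remotely, it re-executes the entry script
# up to this call and then runs only process() with the given arguments
fn_task = task.create_function_task(
    func=process, func_name="process", task_name="process step", multiplier=2
)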
Hi @<1784754456546512896:profile|ConfusedSealion46>
clear ml server took so much memory usage, especially for elastic search
Yeah, that depends on how many metrics/logs you have there, but you really need at least 8GB of RAM.
Maybe delete old experiments?
Hi JitteryCoyote63
Is it possible to rollback from 1.2.0 to 1.1.0?
Not really; there was a DB migration, so an out-of-the-box downgrade is not supported.
That said, v1.3.1 is already out, with what seems like a fix:
As a quick fix, can you test with auto-refresh (see the top-right button with the pause sign you have in the video)?
Can you try to manually install it and see what you are getting?
python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Do we launch multiple groups of these in different projects?
Actually, Triton can serve multiple models, and the endpoints/models are controlled from clearml-serving.
The only issue is adding a load balancer in front of multiple nodes to balance the requests between them. wdyt?
Hi RoughTiger69
How about using the pipeline decorator as a way to run this logic?
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
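A bare-bones sketch of that example (names are placeholders):
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["data"])
def load_data():
    return [1, 2, 3]

@PipelineDecorator.pipeline(name="my pipeline", project="examples", version="1.0")
def main_logic():
    data = load_data()
    print(data)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug locally; remove for remote execution
    main_logic()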
I think I'm missing the context of where the code is executed....
btw: you can now set the configuration_objects directly when calling add_step 🙂
https://clearml.slack.com/archives/CTK20V944/p1633355990256600?thread_ts=1633344527.224300&cid=CTK20V944
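Roughly like this (assuming the argument is configuration_overrides; double-check the exact name against your clearml version):
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train task",
    # override a named configuration object on the cloned step
    configuration_overrides={"My Config": "key: value"},
)
pipe.start()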
WickedGoat98
The trains-agent-services docker is always CPU; the idea is to put long-lasting services there (like the auto cleanup, Slack integration, HPO, etc.).
To spin up an agent with a GPU on any machine (regardless of where the trains-server is), check the trains-agent readme:
https://github.com/allegroai/trains-agent#running-the-trains-agent
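In short (the queue name here is just an example):
# one-time setup: install and configure server credentials
pip install trains-agent
trains-agent init
# spin an agent bound to GPU 0, pulling jobs from the "default" queue
trains-agent daemon --gpus 0 --queue default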
Let me check what the subsampling threshold is.
making me realize that this may have been optional
I think it is optional, and this is why it was not entered in the first place.
Could you double-check and just remove it from your manual pbtxt?
Hi RipeGoose2
There is no need for any TrainsLogger in PyTorch Lightning, as they switched to TensorBoard logging by default, and we automagically catch everything passed there.
What do you think is missing? or can be improved ?
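In other words, this should be enough (a minimal sketch):
from clearml import Task
import pytorch_lightning as pl

task = Task.init(project_name="examples", task_name="lightning run")
# Trainer uses the TensorBoard logger by default;
# its scalars/plots are automatically captured by clearml
trainer = pl.Trainer(max_epochs=5)
# trainer.fit(model, datamodule)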
ahh, because task_id is the "real" id of a task
Yes, the ID is a global, system-wide unique ID (regardless of the project etc.).
Maybe we will call tasks as
slug_yyyymmdd
Notice that you can just copy-paste the link from the address bar; it will bring you to the exact same view, meaning it is easily shared among users 🙂 You can, but I would actually use the Task ID. This also means that programmatically you can do task = Task.get_task(task_id_here)
and interact and query a...
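For example (minimal sketch; the ID is a placeholder copied from the address bar):
from clearml import Task

task = Task.get_task(task_id="<task-id-from-the-ui>")
print(task.name, task.get_parameters())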
Hmm, good question. I'm actually not sure if you can pass 24GB (this is not a limit on the GPU memory; it affects the memblock size, I think).
maybe this can cause the issue?
Not likely.
In the original pipeline (the one executed from PyCharm), do you see the "Pipeline" section under Configuration -> "Config objects" in the UI?
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
Hi CleanPigeon16
I was wondering how (or if) you handle interruptions.
Good question. Basically (and I might be missing a few details, but I think that's the general gist):
A new instance will be spun up (spot/regular based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked by Amazon (i.e. a spot instance), another one will be spun up instead, as long as there is a job in the queue. That means that what is probably missing in you...
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
Good point!
. We need to temporarily pause(kill or something else) running HPO task and reassign the resource for other needs.
Oh I see now....
Later, when more important experiments has been completed, we can continue HPO task from the same state.
Quick question: when you say the HPO Task, do you mean the HPO controller logic Task...
Ohh, the controller task itself holds the artifacts ?
hmmm I see...
It seems to miss the fact that your process does use the GPU.
Maybe the GPU is only used later in the run?
Does that make sense ?
Hi @<1715175986749771776:profile|FuzzySeaanemone21>
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin up a container with GPU access; usually this means you are missing the NVIDIA docker runtime component.
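A quick sanity check on the machine itself (assuming an NVIDIA host; the CUDA image tag is just an example):
# should print the GPU table; if it fails, install/configure nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi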
ThickDove42 Windows conda python3.6 was exactly what I was using,
started the jupyter with:
"python -m jupyter notebook"
Then opened / created a new notebook, everything worked.
Tested on latest clearml 0.17.2
Maybe it's something with the path to the repo that breaks it? Because obviously the issue is that it is looking in the wrong folder.
Well, it should fail, but I think the error message should be fixed 🙂
maybe: ValueError: dataset 'tmp_datset' not found in project 'lavi-testing'
wdyt?