SmarmyDolphin68 sadly if this was not executed with trains (i.e. the offline option of trains), this is not really doable (I mean it is, if you write some code and parse the TB 😉 but let's assume this is way too much work)
A few options:
On the next run, use the clearml OFFLINE option (i.e. in your code call Task.set_offline(), or set the env variable CLEARML_OFFLINE_MODE=1). You can compress and upload the checkpoint folder manually by passing the checkpoint folder, see https://github.com...
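A minimal sketch of what that could look like (assuming the current clearml Python API; project name and checkpoint path are placeholders):

from clearml import Task

# Enable offline mode before the Task is created
# (alternatively: export CLEARML_OFFLINE_MODE=1 in the environment)
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline run")

# Hypothetical checkpoint folder: passing a folder path should store it
# (zipped) as an artifact inside the offline session
task.upload_artifact(name="checkpoints", artifact_object="/path/to/checkpoint_folder")

task.close()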
Notice that the actual configuration that is used is the one defined here: https://github.com/allegroai/clearml/blob/b21e93272682af99fffc861224f38d65b42c2354/clearml/backend_config/bucket_config.py#L23
But it is created here:
https://github.com/allegroai/clearml/blob/b21e93272682af99fffc861224f38d65b42c2354/clearml/backend_config/bucket_config.py#L199
ClumsyElephant70 the odd thing is the error here:
docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also, can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash)?
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
Looks great, let me see if I can understand what's missing, because it should have worked ...
Hi ColossalAnt7
Following on SuccessfulKoala55 answer
I saw that there is a config file where you can specify specific users and passwords, but it currently requires mounting the configuration file (the one holding the user/pass) into the pod from a persistent volume.
I think the k8s way to do this would be to use mounted config maps and secrets.
You can use ConfigMaps to make sure the routing is always correct, then add a load-balancer (a.k.a a fixed IP) for the users a...
the Task scheduler itself is a Task. What we did is we added a new parameter section on the Task (the task.connect call), so that we can later clone and modify it and use the new value at runtime
(Task.connect will put the data from the Task/UI back into the dict when the agent is running the Scheduler)
Does that make sense?
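A rough sketch of that pattern (the section name and values here are illustrative, not the actual scheduler code):

from clearml import Task

task = Task.init(project_name="services", task_name="scheduler")

# expose the scheduling configuration as an editable parameter section
schedule_config = {"queue": "default", "cron": "0 3 * * *"}
schedule_config = task.connect(schedule_config, name="schedule")

# when an agent runs a cloned copy, the values edited in the UI are
# put back into the dict here, and the scheduler uses them at runtime
print(schedule_config["queue"])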
ReassuredTiger98
Can you explain what you meant by
entry point file?
There is no need to specify entry point file.
It is automatically detected when you run the code manually on your machine.
My assumption was that the file "src/run_task.py" (based on your log) is just a test file, and hence was not added to the repository. So the agent failed to actually restore it from the git (files that are not added are not considered part of the git diff, this is usually git behavio...
Any idea why the Pipeline Controller is Running despite the task passing?
What do you mean by "the task passing"?
okay that's good, that means the agent could run it.
Now it is a matter of matching the TF version with CUDA (and there is no easy solution for that). Basically I think that what you need is "nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04"
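For reference, a minimal way to pin that image on the Task from code (assuming the clearml Task API; project/task names are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="tf training")

# request this docker image when an agent executes the Task in docker mode
task.set_base_docker("nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04")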
... Would not work for huge llm style models.
yes I agree... but then if the model is small enough then you can just keep it in memory ...
Hi @<1546303293918023680:profile|MiniatureRobin9>
I'm not sure I understand the difference between a worker and an agent.
hmm we should probably make that clearer 🙂
agent = the clearml-agent instance running on the machine
worker = the system term representing the instance of the agent
You can have one machine with multiple agents (i.e. multiple workers) running on it.
Does that make sense ?
I have installed a python environment with the virtualenv tool, let's say
/home/frank/env
and python is
/home/frank/env/bin/python3.
How can I reuse this virtualenv by configuring the clearml agent?
So the agent is already caching the entire venv for you, nothing to worry about, just make sure you have this line in your clearml.conf:
https://github.com/allegroai/clearml-agent/blob/249b51a31bee97d63f41c6d5542e657962008b68/docs/clearml.conf#L131
No need to provide it an existing...
Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)
copy paste the trains.conf from any machine, it just needs the definition of the trains-server address.
Specifically if you run in offline mode, there is no need for the trains.conf and you can just copy the one on GitHub
hmm that would explain it failing
Hi WickedStarfish97
As a result, I don’t want the Agent to parse what imports are being used / install dependencies whatsoever
Nothing to worry about here: even if the agent detects the python packages, they are installed on top of the preexisting packages inside the docker. That said, if you want to override it, you can also pass packages=[]
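For example, a sketch assuming you are creating the Task programmatically with Task.create (the repo/script values are hypothetical; the pipeline component decorator accepts the same packages argument):

from clearml import Task

# an empty packages list tells the agent not to build a requirements list
# from the detected imports - it relies on what is already inside the docker
task = Task.create(
    project_name="examples",
    task_name="run inside preinstalled docker",
    repo="https://github.com/me/my_repo.git",  # hypothetical repo
    script="train.py",                         # hypothetical entry point
    packages=[],
)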
Hi @<1541954607595393024:profile|BattyCrocodile47>
But the files API is still open to the world, right?
No, of course not 🙂 (i.e. API is authenticated with JWT header, this is why you need to generate the secret/key in the UI)
That said, the login process itself is user/pass stored on the server, but other than that the web/api are secured. The file server on the other hand is plain http storage and does not verify the connection like the API does. So if you are going the self-ho...
One example is a node that resizes the images; this node receives a Dataset as input, iterates over each image, resizes it and outputs a new Dataset, which is used in the next node downstream in the Pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
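That said, if you did want to express that resize step with ClearML Datasets, a rough sketch could look like this (project/dataset names, paths and image size are placeholders):

from pathlib import Path
from PIL import Image
from clearml import Dataset

# fetch the input dataset and get a local (read-only) copy of it
input_ds = Dataset.get(dataset_project="examples", dataset_name="raw_images")
input_dir = Path(input_ds.get_local_copy())

# resize every image into a new folder
output_dir = Path("/tmp/resized")
output_dir.mkdir(parents=True, exist_ok=True)
for img_path in input_dir.glob("*.jpg"):
    Image.open(img_path).resize((224, 224)).save(output_dir / img_path.name)

# register the result as a new dataset, linked to its parent
output_ds = Dataset.create(
    dataset_project="examples",
    dataset_name="resized_images",
    parent_datasets=[input_ds.id],
)
output_ds.add_files(str(output_dir))
output_ds.upload()
output_ds.finalize()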
organization structure
and see for yourself (this pipeline has two nodes, train_model and predict)
Interesting! let me dive into that and ...
In the documentation it warns about
.close()
"Only call Task.close if you are certain the Task is not needed."
Maybe this is not clear enough: it means you should only call it when you no longer need to automatically Add/Log/Track things into the Task from the current process.
This does Not mean you cannot access the Task or its artifacts
Mark closed means to externally (i.e. not from the process that created the Task, maybe even from a different machine) close and mark the task as completed (this...
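To make the distinction concrete, a small sketch (the task ID is a placeholder):

from clearml import Task

# inside the running process: stop logging/tracking into the Task;
# the Task object and its artifacts are still accessible afterwards
task = Task.init(project_name="examples", task_name="demo")
task.close()

# from anywhere else (even another machine): fetch an existing Task
# and mark it as completed, e.g. after the original process crashed
other = Task.get_task(task_id="aabbccdd")  # placeholder ID
other.mark_completed()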
RobustRat47 I think you have to use the latest clearml package for that (1.6.0)
Ohh I see now the force SSH did not replace the user in the SSH link (only if the original was http), right ?
Yes that was a tricky one, basically always blame pip 🙂
One machine (the original parent) has:
agent.package_manager.type = pip
agent.package_manager.pip_version =
which would not upgrade pip and would use the preinstalled one: Unpacking python-pip-whl (20.0.2-5ubuntu1.10)
The other one has:
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >= '3.10'
and it instal...
Copy paste it here 🙂
as I also noticed that uploads are sometimes slow, and I see here max_connections=2
Makes sense to me, please go ahead and add that as well (basically the same thing on _AzureBlobServiceStorageDriver.upload_object, and an additional variable on the AzureContainerConfigurations class).
Could you PR a tested draft ? we will be able to take from there
Hi HappyDove3
Are you passing it this way?
task.upload_artifact(name="my artifact", artifact_object=np.eye(3,3))
https://github.com/allegroai/clearml/blob/5953dc6eefadcdfcc2bdbb6a0da32be58823a5af/examples/reporting/artifacts.py#L38
AbruptHedgehog21 could it be the console log itself is huge ?
print(requests.get(url='