How can you be Snyk-clean and still be lower than 0.96?
Yep Snyk
auto "patching" is great 馃檪
As I mentioned, wait for the GH sync tomorrow, a few more things are missing there.
In the meantime you can just do ">= 0.109.1"
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive", and that should verify that the serving control plane is alive.
Could it be no requests are being served?
Are you building your containers off these two, or are you building directly from code?
no requests are being served as in there is no traffic indeed
It might be that it only pings when requests are served
What is actually setting the task status to Aborted?
The server watchdog, basically saying: no one is pinging "I'm alive" on this Task, so I should abort it.
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug,...
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced
yeah I tend to agree... keep me posted when you find the root cause
"regular" worker will run one job at a time, services worker will spin multiple tasks at the same time But their setup (i.e. before running the actual task) is one at a time..
what if the preexisting venv is just the system python? my base image is python:3.10.10 and I just pip install all requirements in that image. Does that not avoid the venv still?
it will basically create a new venv inside the container, inheriting the existing preinstalled packages (i.e. the new venv already has everything the system python has preinstalled)
then it will call "pip install" on all the "installed packages" of the Task,
which should just check that everything is there and install nothing...
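Conceptually it is something like this (just an illustration of the behavior described above, not the agent's exact internals; the paths are made up):
python3 -m venv --system-site-packages /root/.clearml/venv      # new venv that still sees everything preinstalled in the image
/root/.clearml/venv/bin/pip install -r installed_packages.txt   # mostly a no-op if the image already has it all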
would those containers best be started from something in services mode?
Yes, as long as the machine has enough CPU/RAM
Notice that services mode will start a second parallel Task after the first one is done setting up the env. If running with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL, with containers that have git/python/clearml-agent preinstalled, the overhead should be minimal.
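A hedged example of combining the two (queue name and image are placeholders; depending on whether the agent runs in docker mode you may need to inject the variable into the container via the docker args instead):
CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 clearml-agent daemon --services-mode --queue services --docker python:3.10.10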
or is it possible to get no-overhead with my approach of worker-inside-docker?
No, do not do that, see above e...
Hi Guys, just curious here, what was the final issue?
Also out of curiosity, what does that mean? "1.12.2 because some bug that make fastai lag 2x"?
- try with the latest RC
1.8.1rc2, it feels like after git clone, it spends minutes without outputting anything
yeah that is odd, can you run the agent with --debug (add it before the daemon command), and then at the end of the command add --foreground
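For example (the queue name is just a placeholder):
clearml-agent --debug daemon --queue default --foreground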
Now launch the same task on that queue, you will have a verbose log in the console.
Let us know what you see
StickyBlackbird93 the agent is supposed to resolve the correct version of PyTorch based on the CUDA version in the container. Sounds like for some reason it fails? Can you provide the log of the Task that failed? Are you running the agent in docker mode, or inside a docker container?
I'm running agent inside docker.
So this means venv mode...
Unfortunately, right now I can not attach the logs, I will attach them a little later.
No worries, feel free to DM them if you feel this is too much to post here
Hi StickyBlackbird93
Yes, this agent version is rather old (clearml_agent v1.0.0)
it had a bug where the pytorch aarch64 wheel broke the agent (by default the agent in docker mode will use the latest stable version, but not in venv mode)
Basically, upgrade to the latest clearml-agent version, it should solve the issue: pip3 install -U clearml-agent==1.2.3
BTW for future debugging, this is the interesting part of the log (Notice it is looking for the correct pytorch based on the auto de...
Hi @<1533620191232004096:profile|NuttyLobster9>
base_task_factory is a function that gets the node definition and returns a Task to be enqueued,
pseudo code looks like:
def my_node_task_factory(node: PipelineController.Node) -> Task:
    task = Task.create(...)
    return task
Make sense?
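For completeness, a minimal hedged sketch of the above (the project name and the Task.create() arguments are placeholders, adjust to whatever the node actually needs):
from clearml import PipelineController, Task

def my_node_task_factory(node: PipelineController.Node) -> Task:
    # build a Task from the node definition; the pipeline takes care of enqueueing it
    task = Task.create(project_name="examples", task_name=node.name)
    return task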
AFAIK that's the only way right now (see my comment here - https://clearml.slack.com/archives/CTK20V944/p1657720159903739?thread_ts=1657699287.630779&cid=CTK20V944 )
Or then if you have the ClearML paid service, I believe there is a "vaults" service, right AgitatedDove14 ?
Yep UnevenDolphin73 :)
Oh sorry, from the docstring, this will work:
` :param bool continue_last_task: Continue the execution of a previously executed Task (experiment)
.. note::
When continuing the executing of a previously executed Task,
all previous artifacts / models / logs are intact.
New logs will continue iteration/step based on the previous-execution maximum iteration value.
For example:
The last train/loss scalar reported was iteration 100, the next report will b...
Hi VivaciousWalrus21, I tested the sample code, and the gap was evident in Tensorboard as well. This is not clearml generating this jump; it is internal (like the auto de/serialization and continuation of the code base)
Hi VivaciousWalrus21
After restarting training, huge gaps appear in the iteration axis (see the screenshot).
The Task.init actually tries to understand what the last reported iteration was and continue from that iteration. I'm assuming that your code does that also, which creates a "double shift" that you see as the jump. I think the next version will try to be "smarter" about it, and detect this double gap.
In the meantime, you can do:
` task = Task.init(...)...
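(the snippet above got cut off; a hedged guess at the kind of workaround meant here, using Task APIs that do exist, though the exact lines may have differed:)
task = Task.init(project_name="my_project", task_name="train", continue_last_task=True)
task.set_initial_iteration(0)  # reset the iteration offset so new reports don't jump ahead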
Expected behaviour is that it reads the last iteration correctly. At least that is what the docs state.
This is exactly what should happen, are you saying that for some reason it fails?
SoggyFrog26 you'll have it in the next RC 🙂
Not sure what the plan is, I know one should be out today/tomorrow, worst case on the next one 🙂
Hi SoggyFrog26
Yes, it is stored at ~/.clearml_data.json
Notice you can always change it by passing --id dataset_id
SoggyFrog26 there is a full pythonic interface, why don't you use this one instead, much cleaner 🙂
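For example, a minimal sketch of the pythonic route (the project/dataset names are placeholders):
from clearml import Dataset

ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
print(ds.id)                      # the dataset id, no need to read ~/.clearml_data.json
local_copy = ds.get_local_copy()  # cached local copy of the dataset files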
I think it would be nicer if the CLI had a subcommand to show the content of ~/.clearml_data.json.
Actually, it only stores the last dataset id at the moment, so not much there 🙂
But maybe we should have a command line that just outputs the current dataset id, this would make it easier to grab and pipe.
WDYT?
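In the meantime, a crude way to grab it for piping (this just dumps whatever is in the file; the exact JSON layout inside isn't documented here, so check it before scripting against a specific key):
python3 -c "import json, os; print(json.load(open(os.path.expanduser('~/.clearml_data.json'))))"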
Not really 😞
Everyone can do everything, the idea is sharability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO, etc., but unfortunately it's way too complicated for the open-source version.
Basically what I'm saying is trust your fellow colleagues 🙂