For me it sounds like the starting of the service is completed, but I don't really see if the autoscaler is actually running. Also, I don't see any output in the console of the autoscaler.
Do notice the autoscaler code itself needs to run somewhere; by default it will be running on your machine, or on a remote agent.
So sorry for the delay here! totally lost track of this thread
Should I be deploying the entire Dockerfile for every model, with the updated requirements?
It's for every "environment", i.e. if models need the same set of python packages, you can share it.
CharmingStarfish14 can you check something from code, just to see if this would solve the issue?
Hi @<1556812486840160256:profile|SuccessfulRaven86>
I'm assuming this relates to the SaaS service.
API calls are a way to measure usage; basically, metric reports are bunched into a single call, agent pings / queries are API calls, and so on and so forth.
How many hours did you have training tasks reporting data? How many agents were running, and so on?
MelancholyChicken65 found it! Thank you for finding this issue.
I'm hoping to get an update soon 🙂
I had a misconception that the conf comes from the machine triggering the pipeline.
Sorry, this one :)
How do you currently report images, with the Logger, Tensorboard, or Matplotlib?
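If it helps, here is a minimal sketch of reporting an image directly through the ClearML Logger (the project / task names are placeholders); TensorBoard and Matplotlib output is picked up automatically once Task.init is called:

```python
# minimal sketch: report a numpy array as a debug image (placeholders throughout)
import numpy as np
from clearml import Task, Logger

task = Task.init(project_name="examples", task_name="image reporting")

image = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
Logger.current_logger().report_image(
    title="debug samples",
    series="random",
    iteration=0,
    image=image,  # a local_path to an image file also works
)
```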
Guys I think I lost context here 🙂 what are we talking about? Can I help in anyway ?
Okay, let me check it, but I suspect the issue is running over SSH; to overcome these issues with PyCharm we have a specific plugin to pass the git info to the remote machine. Let me check what we can do here.
FiercePenguin76 BTW, you can do the following to add / update packages on the remote session:
clearml-session --packages "new_package>x.y" "jupyterlab>6"
EnviousStarfish54
Can you check with the latest clearml from github?
pip install git+
If the only issue is this line:
task.execute_remotely(..., exit_process=True)
It has to finish the static analysis of the entire repository (which usually happens in the background but now we have to wait for it). If the repo is large this could actually take 20sec (depending on CPU/drive of the machine itself)
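For context, a minimal sketch of the pattern (the queue name is a placeholder):

```python
# minimal sketch: enqueue this script for remote execution
from clearml import Task

task = Task.init(project_name="examples", task_name="remote execution")

# With exit_process=True the call blocks until the repository / package
# analysis finishes, then enqueues the Task and exits the local process.
task.execute_remotely(queue_name="default", exit_process=True)

# Code below this point only runs when an agent executes the Task.
```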
Hmm I would recommend passing it as an artifact, or returning its value from the decorated pipeline function. Wdyt?
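A rough sketch of both options with a decorator-based pipeline (names and values are illustrative, not from your code):

```python
# rough sketch: return a value from a component, or store it as an artifact
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def compute_step(x):
    from clearml import Task
    result = x * 2
    # option 1: store it as an artifact on the step's own Task
    Task.current_task().upload_artifact(name="result", artifact_object=result)
    # option 2: return it, and the pipeline passes it to downstream steps / the caller
    return result

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def pipeline_logic():
    value = compute_step(x=21)
    print("step returned:", value)

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    pipeline_logic()
```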
Wait @<1523701066867150848:profile|JitteryCoyote63>
If you reset the Task you would have lost the artifacts anyhow, how is that different?
clearml sdk (i.e. python client)
The issue is that the Task.create did not add the repo link (again, as mentioned above, you need to pass the local folder or repo link to the repo argument of the Task.create function). I "think" it could automatically deduce the repo from the script entry point, but I'm not sure, hence my question on the clearml package version.
Oh I see, what you need is to pass '--script script.py' as the entry point and '--cwd folder' as the working dir
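Roughly the same thing with Task.create from code (the repo URL, project and queue names are placeholders):

```python
# minimal sketch: create a Task from a repo + script and enqueue it
from clearml import Task

task = Task.create(
    project_name="examples",
    task_name="created from repo",
    repo="https://github.com/user/project.git",  # or a local folder path
    script="script.py",            # entry point, i.e. the '--script' part
    working_directory="folder",    # i.e. the '--cwd' part
)
Task.enqueue(task, queue_name="default")
```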
Ohh, then use the AWS autoscaler, it's basically what you want: it spins up an EC2 instance and sets up an agent there; then if the EC2 goes down (for example, if this is a spot instance), it will spin it up again automatically with the running Task on it.
wdyt?
The imports inside the functions are because the function itself becomes a stand-alone job running on a remote machine, not the entire pipeline code. This also automatically picks packages to be installed on the remote machine. Make sense?
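For example, a small sketch of how a component's internal imports drive the package detection (the package and function names are just examples):

```python
# small sketch: pandas is imported inside the component, so it gets detected
# and installed on the remote machine running this step
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["frame"])
def load_data(csv_path):
    import pandas as pd  # detected from here, not from the pipeline script
    frame = pd.read_csv(csv_path)
    return frame
```

You can also pin extra requirements explicitly with the component's packages argument if the auto-detection misses something.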
What do you mean by "tag" / "sub-tags"?
SmarmySeaurchin8 check the logs, maybe you can find something there
Hi @<1523702000586330112:profile|FierceHamster54>
I think I'm missing a few details on what is logged, and the ref to the git repo?
Now I'm curious what's the workaround ?
Sure, in that case, wait until tomorrow, when the github repo is fully synced
Do you build your containers off these two? Or are you building directly from code?
No requests are being served, as in there is indeed no traffic
It might be that it only pings when requests are served
What is actually setting the task status to Aborted?
The server watchdog, basically saying: no one is pinging "I'm alive" on this Task, so I should abort it
My understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug,...
It should move you directly into the queue pages.
Let me double check (working on the community server)
you could also use:
https://github.com/allegroai/clearml/blob/ce7e77a00e869a2690f31cbc578636ce88bc4613/docs/clearml.conf#L188
and set up the clearml.conf on the user's machine to automatically log the environment variables at run time (stored under the Configuration tab).
Then the agent will pull these same variables at execution time and set them
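A minimal sketch of what that clearml.conf section could look like, assuming the linked line refers to the environment-variable logging setting (the variable patterns are just examples):

```
# clearml.conf on the user's machine (sketch, patterns are examples)
sdk {
    development {
        # log matching OS environment variables with the Task;
        # they show up under the Task's Configuration tab
        log_os_environments: ["AWS_*", "MY_APP_*"]
    }
}
```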
You are doing great 🙂 don't worry about it
This looks like a 'feast' error, could it be a missing configuration?