EmbarrassedSpider34 I can update you that an RC with a fix should be out later today
I think this issue was fixed in clearml-server 1.3.0 (released after the weekend),
Let me check
VirtuousFish83 I can confirm clearml-server 1.3 solves the issue.
The second run prints out the same (non) "random" numbers as the first run
ClearML sets the initial random seed for you, basically trying to help with reproducibility. That said, inside the function you can always do:
```
import random
import time

random.seed(time.time())
```
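As a minimal sketch of that idea (the project/task names and the function are just placeholders for illustration):
```
import random
import time

from clearml import Task

# Task.init is where ClearML applies its deterministic seeding
task = Task.init(project_name="examples", task_name="reseed demo")

def sample_values(n=5):
    # override the fixed seed with a time-based one inside the function
    random.seed(time.time())
    return [random.random() for _ in range(n)]

print(sample_values())
```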
CleanWhale17 per your request :)
An automated ML Pipeline
Automated Data Source Integration
Data Pooling and Web Interface for Manual Annotation of Images (Seg. / Classif.) [Allegro Enterprise] or users integrate with open-source
Storage of Annotation output files (versioned JSON)
Online-Training Support (for Dataset Shifts) [Not sure what you mean]
Data Pre-processing (filter/augment) [Allegro Enterprise] or users integrate with open-source
Data-set visualization (stats...
ShallowGoldfish8 how did you get this error?
```
self.Node(**eager_node_def)
TypeError: __init__() got an unexpected keyword argument 'job_id'
```
In our case, we have a custom YAML instruction !include, i.e.
Hmm interesting, in theory this might work, since configuration encoding (when passing dicts) is handled with HOCON, which does support referencing.
That said, currently it is not aware of "remote configurations", only ENV variables and local files.
It would be cool to add. Do we have a GitHub issue on that? (Would you like to see if you can PR such a thing?)
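For reference, a small sketch of the kind of referencing HOCON already supports (using pyhocon, the library that handles the dict encoding; the config keys here are made up for illustration):
```
from pyhocon import ConfigFactory

# HOCON substitution: train_path is resolved from base_path
conf = ConfigFactory.parse_string("""
base_path = "/data/my_dataset"
train_path = ${base_path}"/train"
""")
print(conf["train_path"])  # -> /data/my_dataset/train
```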
Seems like settings on the clearml-server disappeared (specifically default queue tag?!)
It seems like you are correct, everything should just work. Are you still getting the error? What's the clearml agent version?
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running ?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive", and that should verify the serving control plane is alive.
Could it be no requests are being served ?
is there a way to assign a job to a specific worker? or does it only work on the queue level
Only on a queue level, but you can have as many queues as you like and spin agents on them (notice a single agent can pull from multiple queues, based on priority/order).
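A rough sketch of that flow (the queue and task names are just examples):
```
from clearml import Task

# clone an existing task and push the clone into a specific queue;
# whichever agent is listening on that queue will pick it up
template = Task.get_task(project_name="examples", task_name="train model")
cloned = Task.clone(source_task=template, name="train model (gpu run)")
Task.enqueue(cloned, queue_name="gpu_queue")

# on the worker side, a single agent can serve several queues in priority order, e.g.:
#   clearml-agent daemon --queue gpu_high gpu_low --docker
```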
Actually we just added venv support as well. The reasoning is/was that inside a docker it is easier to separate the running processes; with venv we had to support multiple venvs running at the same time and reuse of those venvs (just a bit more logic). Anyhow, this is now supported :)
is how you would create different queues,
SarcasticSquirrel56 you can create them from the UI, when the server is already running
(if you are asking how to create them in the first installation, then yes, you are correct, this is possible in the helm chart, I think)
no requests are being served, as in there is no traffic indeed
It might be that it only pings when requests are served
what is actually setting the task status to Aborted?
The server watchdog, basically saying: no one is pinging "I'm alive" on this Task, so I should abort it.
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug,...
DrabCockroach54 that is quite cool!
Basically here is what I would do:
Query Tasks that are both Running and do not have the system tag "development" (that means they are running on agents), and filter only tasks that started, say, 10 minutes ago.
Go over the list and see if (1) they have a GPU scalar reported (meaning the GPU is accessible) and (2) the min/max/value of GPU utilization is under 5%.
Roughly like the sketch below. wdyt?
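A rough sketch of those two steps (the filter keys and thresholds are assumptions to illustrate the idea; the 10-minute window is left out for brevity):
```
from clearml import Task

# find tasks running on agents and flag the ones with idle GPUs
tasks = Task.get_tasks(
    task_filter={
        "status": ["in_progress"],
        "system_tags": ["-development"],  # exclude tasks running locally in development mode
    }
)

for task in tasks:
    metrics = task.get_last_scalar_metrics()  # {title: {series: {"last": ..., "min": ..., "max": ...}}}
    gpu = metrics.get(":monitor:gpu", {})
    if not gpu:
        print(f"{task.id}: no GPU scalars reported (GPU not accessible?)")
        continue
    util = gpu.get("gpu_0_utilization", {})
    if util and util.get("max", 0) < 5:
        print(f"{task.id}: GPU utilization under 5%")
```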
This would be a good example?
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
Do you build your containers off these two, or are you building directly from code?
Sure, in that case, wait until tomorrow, when the github repo is fully synced
You should have the metric :monitor:gpu with the variant gpu_0_utilization
Since I see you have none of those, that points to no GPU driver ...
Could that be ?
how can you be snyk and lower than 0.96
Yep, Snyk auto "patching" is great
as I mentioned, wait for the GH sync tomorrow; a few more things are missing there
In the meantime you can just do ">= 0.109.1"
CooperativeFox72 please see if you can send a code snippet to reproduce the issue. I'd be happy to solve it ...
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:
```
Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
```
But torchvision is downloaded from the cuda 11 folder...
I...
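(If you need a stop-gap while this is sorted out, one hedged option is to pin a matching pair explicitly before Task.init; the versions below are only an example of a torch/torchvision pair that ships CUDA 11 wheels and may need adjusting for your setup:)
```
from clearml import Task

# force the agent to resolve a consistent torch/torchvision pair
# (example versions only; pick the pair matching your CUDA version)
Task.add_requirements("torch", "1.7.0")
Task.add_requirements("torchvision", "0.8.1")
task = Task.init(project_name="examples", task_name="cuda pinning demo")
```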
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced
however, this will also turn off metrics
For the sake of future readers, let me clarify this one: turning it off with auto_connect_frameworks={'pytorch': False} only affects the auto logging of torch.save/load
(side note: the reason is that pytorch does not have built-in metric reporting, i.e. it is usually done manually, and these days most probably with TensorBoard; for example, Lightning / Ignite will use TensorBoard as the default metric reporting.)
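For context, roughly what that looks like at Task.init (the project/task names are placeholders):
```
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="no torch auto-logging",
    auto_connect_frameworks={"pytorch": False},  # only disables auto logging of torch.save/load
)

# metric reporting is unaffected: report explicitly, or let the TensorBoard auto-logging pick it up
task.get_logger().report_scalar(title="loss", series="train", value=0.1, iteration=0)
```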
Woot woot, great to hear