I think a prefix would be great. It would also make reporting scalars easier in general.
Actually those are "supposed" to be collected automatically by PyTorch and reported by the master node.
Currently we need a barrier to sync all nodes before reporting a scalar, which makes it slower.
Also, this "should" be part of PyTorch DDP.
It's launched with torchrun
I know there is an effort to integrate with torchrun (the under-the-hood infrastructure), but I'm not sure where it stands...
Hi @<1686547344096497664:profile|ContemplativeArcticwolf43>
In the 2nd 'Getting Started' tutorial,
Could you send a link to the specific notebook?
But whenever a task is picked, it fails with the following
You mean after the `Task.init` call?
Well, it should work out of the box as long as you have the full path, i.e. `Section/param`.
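For example, a minimal sketch of connecting parameters under a section so the full `Section/param` path exists (all names here are made up):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="params-demo")  # placeholder names

params = {"batch_size": 32, "lr": 0.001}
# Connecting with a name puts the values under that section, so they are
# addressable as "Args/batch_size" and "Args/lr"
task.connect(params, name="Args")
```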
Also, what's the additional `p` doing on the last line of the screenshot?
Yes you can 🙂 (though not on the open-source version)
GloriousPenguin2 could you open a GitHub issue on it? Just making sure this will actually get fixed 🙂
Hi GiddyTurkey39
Is the config file connected to the Task via `Task.connect_configuration`?
Could you see if that makes a difference?
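For reference, a minimal sketch of connecting a config file (file name is a placeholder):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config-demo")  # placeholder names

# Registers the file with the Task; use the returned path to read the config,
# so a remote run picks up the stored configuration instead of the local file
config_path = task.connect_configuration("config.yaml", name="config")
```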
But this config should almost never need to change!
Exactly the idea 🙂
Notice the password (initially random) is also fixed on your local machine, for the exact same reason.
Exactly! It is very cool to see it in action, and it really works very well. Kudos to these guys!
Hi SmallDeer34
I need some help: what is the difference between the manual one and the automatic one?
From your previous log, this is the bash command executed inside the container. Can you go through it step by step and try to catch who/what is messing it up?
` docker run -it --gpus "device=1" -e CLEARML_WORKER_ID=Gandalf:gpu1 -e CLEARML_DOCKER_IMAGE=nvidia/cuda:11.4.0-devel-ubuntu18.04 -v /home/dwhitena/.git-credentials:/root/.git-credentials -v /home/dwhitena/.gitconfig:/root/.gitconfig -v /tmp/...
Because a step can be constructed from multiple sub-components, but not all of them might be added to the UI graph.
Just to make sure I fully understand: when we decorate with `@sub_node`, we want that to also appear in the UI graph (and have its own Task / metrics etc.), correct?
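If it helps, in ClearML terms this maps roughly to a pipeline component; a minimal sketch (names are made up, and `@sub_node` itself is from the discussion above, not the SDK):
```python
from clearml.automation.controller import PipelineDecorator

# Each component becomes its own node in the UI graph, with its own Task/metrics
@PipelineDecorator.component(return_values=["result"])
def sub_step(x):
    return x * 2

@PipelineDecorator.pipeline(name="demo-pipeline", project="examples", version="0.1")
def main(x=1):
    return sub_step(x)
```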
Hi @<1570220858075516928:profile|SlipperySheep79>
I think this is more complicated than one would expect. But as a rule of thumb, console logs and metrics are the main ones. I hope that helps. Maybe sort by number of iterations in the experiment table?
BTW: probably better to ask in the channel.
Hi SmallDeer34
On the SaaS you can right-click on an experiment and publish it 🙂
This will make the link available for everyone, would that help?
If the load balancer, i.e. the gateway, can do the computation and leverage caching,
Oh, that's true. But unfortunately it's out of scope for the open-source version (well, at the end of the day someone needs to pay our salaries 🙂).
I'd prefer not to have our EC2 instance directly exposed to the public Internet.
Yep, I tend to agree 🙂
PompousBeetle71, these are CUDA versions; I'm looking for the NVIDIA driver version, for example 440.xx or 418.xx.
The reason is that we set an OS environment variable for the driver, and I remember that old drivers did not support it. Basically they do not support `NVIDIA_VISIBLE_DEVICES=all`, so I'm trying to see if that's the case; if so, we could add a fix.
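A quick way to check the driver version (a sketch; assumes `nvidia-smi` is on the PATH):
```python
import subprocess

# Prints the installed NVIDIA driver version, e.g. "440.33.01"
driver_version = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip()
print(driver_version)
```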
I'm guessing this is done through code-server?
correct
I'm currently running a JupyterHub instance (multi-user, with code-server inside) on the same machine as clearml-server. That's where tasks are executed etc., so it's all a browser dev environment.
Yeah, the idea with clearml-session is that each user can self-serve the container that works best for them. With JupyterHub they start to step on each other's toes very quickly...
It runs into the above error when I clone the task or reset it.
from here:
`AssertionError: ERROR: --resume checkpoint does not exist`
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change, in other words, why would it be looking for the file only when cloning? Could it be you put the state on the Task, then you clone it (i.e. clone the exact same dict), and now the newly cloned Task "thinks" it is resuming?!
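To illustrate the suspicion (names and paths are made up): if the resume path is part of the connected state, a clone copies it verbatim:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="train")  # placeholder names

state = {"resume": ""}  # empty on the first run
task.connect(state)

# ...later the training code updates it:
state["resume"] = "checkpoints/last.pt"  # hypothetical path

# Cloning copies the connected dict as-is, so the clone starts with
# resume="checkpoints/last.pt" and asserts that the checkpoint exists
```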
Hi PlainSquid19
Did you check the website https://allegro.ai ?
If you need more info, I would just fill in the contact form; I'm sure the sales guys will get back to you soon 🙂
SmarmySeaurchin8 what do you think?
https://github.com/allegroai/trains/issues/265#issuecomment-748543102
`task.connect_configuration`
Hi GiddyTurkey39
Are you referring to an already executed Task or the current running one?
(Also, what is the use case here? Is it because the "installed packages" are inaccurate?)
Depends on what you want to do. What do you want to do?
WickedGoat98
I will try to collect the installation steps in a document and share it with the community once ready.
Thank you! This will be awesome!
We're here if you need anything 🙂
Hi @<1545216070686609408:profile|EnthusiasticCow4> let me know if this one solves the issue
`pip install clearml==1.14.2rc0`
Hi SmugTurtle78
Unfortunately there is no actual filtering for these logs, because they are so important for debugging and visibility. I have to ask, what's the use case for removing some of them?
Hi @<1523701260895653888:profile|QuaintJellyfish58>
You mean some "daemon service" aborting Tasks that do not end after X hours? Or is it based on CPU/GPU utilization?
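If it's the time-based variant, a rough sketch of such a daemon (the threshold, filter, and loop are assumptions, not a built-in feature):
```python
from datetime import datetime, timedelta, timezone
from clearml import Task

MAX_HOURS = 12  # hypothetical threshold

# Find currently running tasks and abort any that exceed the threshold
for task in Task.get_tasks(task_filter={"status": ["in_progress"]}):
    last_update = task.data.last_update
    if datetime.now(timezone.utc) - last_update > timedelta(hours=MAX_HOURS):
        task.mark_stopped()
```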