JitteryCoyote63 I think I failed to explain myself.
- I think the problem with the controller is that you are interacting (i.e. changing hyperparameters) with a Task created using a new SDK version, from an older SDK version. Specifically, we added section names to the hyperparameters, and only the new version of the SDK is aware of them.
Make sense? - Regarding the actual problem: it seems like this is somehow related to the first one, the task at runtime is using an older SDK version, and I t...
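For reference, a minimal sketch of where the section names come from in the newer SDK (the section name "General" here is illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="hyperparam sections")

params = {"lr": 0.001, "batch_size": 32}
# Newer SDK versions store these under a named section via the `name` argument;
# older SDK versions don't know how to read the sectioned parameters back.
task.connect(params, name="General")
```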
UpsetTurkey67 my apologies, I just noticed the message
I would recommend reading this blog post, it should give you a glimpse of what can be built 🙂
https://medium.com/pytorch/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail-b583d25d0dd
GentleSwallow91 what you are looking for is here 🙂
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
Sure thing, any specific reason for asking about multiple pods per GPU?
Is this for remote development process ?
BTW: the funny thing is, on bare metal machines multi-GPU works out of the box, and deploying it with bare metal clearml-agents is very simple
Hi CloudySwallow27
Is there a way to still use the auto_connect but limit the amount of debug imgs?
Basically you can set the number of images it will store for you (per title/series combination). The way it works, it rotates the image names, essentially overriding old images (the UI is aware of this and will only show the last X of them)
See here on setting it:
https://github.com/allegroai/clearml/blob/81de18dbce08229834d9bb0676446a151046e6a7/docs/clearml.conf#L32
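For example, a quick sketch of the rotation in action (I believe report_image also exposes a max_image_history argument to override the limit per call, but double-check against your SDK version):
```python
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="debug images demo")
logger = task.get_logger()

for i in range(100):
    img = (np.random.rand(64, 64, 3) * 255).astype("uint8")
    # only the last N samples per (title, series) pair are kept; the default N
    # comes from sdk.metrics.file_history_size in clearml.conf
    logger.report_image(title="validation", series="sample",
                        iteration=i, image=img, max_image_history=5)
```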
EnviousStarfish54 you can also run the docker-compose on one of the machines on your local LAN, but then you will not be able to access it from home 🙂
(I'll make sure it is added to the docstring because apparently it was not there)
Apparently it ignores it and replaces everything...
Hmmm, can you view the settings? That's the only thing I can think of at the moment that would be different between your setup and the working one...
Also, is there a way for you to have the trains-server behind https (on your GCP)?
Oh, you can achieve exactly the same with plotly and the REST API / Python interface.
Basically pull data from tasks, create a visualization, and log it on one of the Tasks or on a new one.
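Something along these lines (a rough sketch; the task id and the "loss" title are placeholders, and the exact layout returned by get_reported_scalars may differ between SDK versions):
```python
import plotly.graph_objects as go
from clearml import Task

# pull reported scalars from an existing task
source = Task.get_task(task_id="<source_task_id>")
scalars = source.get_reported_scalars()  # {title: {series: {"x": [...], "y": [...]}}}

fig = go.Figure()
for series, data in scalars.get("loss", {}).items():
    fig.add_trace(go.Scatter(x=data["x"], y=data["y"], name=series))

# log the figure on a new task (or use source.get_logger() to attach it there)
viz = Task.init(project_name="examples", task_name="aggregated viz")
viz.get_logger().report_plotly(title="loss comparison", series="all",
                               iteration=0, figure=fig)
```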
Hi CleanWhale17 , at least for the moment, the code although open ( https://github.com/allegroai/trains-web ) has no external theme/customization interface.
That said, we do have some thoughts on it... What did you have in mind?
Yes JitteryCoyote63 I think you are correct, this is currently the easiest way to do it. PompousParrot44 notice that you should have a "services" queue with a trains-agent in "services mode" running to enqueue those types of mostly-sleeping services 🙂
I was thinking we can quickly create a service that does that, maybe leverage one of these ?
https://github.com/mehrdadmhd/scheduler-py
https://github.com/dbader/schedule
WDYT?
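e.g. with the schedule package, something like this quick sketch (the queue name and template task id are placeholders):
```python
import time

import schedule  # https://github.com/dbader/schedule
from clearml import Task

def relaunch():
    # clone a template task and push the copy into an execution queue
    template = Task.get_task(task_id="<template_task_id>")
    cloned = Task.clone(source_task=template)
    Task.enqueue(cloned, queue_name="default")

schedule.every().day.at("02:00").do(relaunch)

while True:
    schedule.run_pending()
    time.sleep(60)
```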
That's the entire repo link? Not something like https://github.com/ ... ?
Yeah I can write a script to transfer it over, I was just wondering if there was a built in feature.
unfortunately no 😞
Maybe if you have a script we can put it somewhere?
If you use this one for example, will the component have pandas as part of its requirements?
```python
def step_two(...):
    import pandas as pd
    # do stuff
```
If so (and it should), what's the difference? How is "internal.repo" different from pandas?
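For reference, the explicit alternative would be something like this sketch (assuming the decorator-based pipeline; the packages argument pins the requirement instead of relying on import auto-detection):
```python
from clearml.automation.controller import PipelineDecorator

# explicitly list pandas instead of relying on auto-detection of the import
@PipelineDecorator.component(packages=["pandas"])
def step_two(data):
    import pandas as pd
    return pd.DataFrame(data)
```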
Hi BattyLizard6
does clearml orchestration have the ability to break gpu devices into virtual ones?
So this is fully supported on A100 with MIG slices. That said, dynamic multi-tenant GPU on Kubernetes is a Kubernetes issue... We do support multiple agents on the same GPU on bare metal, or over shared GPU instances on k8s with:
https://github.com/nano-gpu/nano-gpu-agent
https://github.com/intel/intel-device-plugins-for-kubernetes/tree/main/cmd/gpu_plugin#fractional-resources
http...
Thank you MuddyCrab47 !
Regarding model versioning:
All models are logged automatically by trains (no need to specify it, as long as you are using one of the automagically connected frameworks: PyTorch/Keras/TF/SKlearn)
You can see how it looks on the demoapp:
https://demoapp.trains.allegro.ai/projects/5371015f43f043b1b4ad7203c1ff4a95/models
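For example, with PyTorch nothing beyond Task.init is needed, a minimal sketch:
```python
import torch
from clearml import Task

task = Task.init(project_name="examples", task_name="model autolog")

model = torch.nn.Linear(4, 2)
# the save call is intercepted by the framework integration, and the
# checkpoint is registered as an output model of the task
torch.save(model.state_dict(), "model.pt")
```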
Regarding dataset management, we have a simple workflow demonstrated below, basically using artifacts as dataset storage, with very easy int...
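The artifact-based flow is roughly this sketch (project/task/artifact names are placeholders):
```python
import pandas as pd
from clearml import Task

# producer side: store the dataset as an artifact on the task
producer = Task.init(project_name="examples", task_name="create dataset")
df = pd.DataFrame({"a": [1, 2, 3]})
producer.upload_artifact(name="dataset", artifact_object=df)

# consumer side: locate the producer task and pull the artifact back
source = Task.get_task(project_name="examples", task_name="create dataset")
dataset = source.artifacts["dataset"].get()
```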
TenseOstrich47 this sounds like a good idea.
When you have a script, please feel free to share, I think it will be useful for other users as well 🙂
Okay, so you want to take the jupyter notebook (aka colab) and have that experiment show on Trains, then use the Trains UI to launch it remotely on one of the machines running the trains-agent. Is that correct?
okay this points to an issue with the k8s glue, I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue ?
Would be cool to let it go untracked as well, especially if we want that as an option
How would you decide what should be tracked?
RoughTiger69 yes I think "Scale" tier covers it 😉
but why is it mounted only once?
Are you saying the second time this line is missing? this is very strange...
Can you send the full Task log?
Hi WickedBee96
Queue1 will take 3GPUs, Queue2 will take another 3GPUs, so in Queue3 can I put 2-4 GPUs??
Yes exactly !
if there are idle GPUs so take them to process the task?
Correct, basically you are saying, this queue needs a minimum of 2 GPUs, but if you have more, allocate them to the Task it pulled (with a maximum of 4 GPUs)
Make sense ?
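If it helps, on the agent side that looks roughly like clearml-agent daemon --dynamic-gpus --gpus 0-7 --queue gpu_queue=2-4 (a sketch; the queue name is a placeholder, and I believe the min-max range syntax depends on the agent version, so double-check against the docs)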
I think this is the main issue, is this reproducible ? How can we test that?
Okay let me check if I can test on this git version.
I am struggling with configuring ssh authentication in docker mode
GentleSwallow91 Basically the agent will automatically mount the .ssh into the container, just make sure you set the following in the clearml.conf:
force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L30