Sorry, I mean the point where you select the interpreter for PyCharm
Oh I see...
Ohh "~/trains.conf" is root probably
Hi IrritableGiraffe81
PipelineDecorator.debug_pipeline() runs everything as regular Python functions, while "PipelineDecorator.run_locally()" actually simulates all the steps on the same local machine (so that it is easier to debug the "real" pipeline that runs on multiple machines)
What I think is happening is that the casting of the arguments passed to the component fails.
Basically the type hints are currently ignored (we are working on using them for casting in the next version)
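For context, a minimal sketch of such a component (the function names and values here are just illustrative, not from your code):
```python
from clearml import PipelineDecorator


@PipelineDecorator.component(return_values=["doubled"])
def double(value: int):
    # Type hints such as `int` are currently not used for casting,
    # so when this runs as a real pipeline step `value` may arrive as a string
    return int(value) * 2


@PipelineDecorator.pipeline(name="demo pipeline", project="examples", version="1.0.0")
def pipeline_logic(value=3):
    return double(value)


if __name__ == "__main__":
    # Run every step as a plain Python function - easiest for debugging
    PipelineDecorator.debug_pipeline()
    pipeline_logic(value=3)
```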
but righ...
If you have an idea of where to start looking for a quick win, I'm open to suggestions 🙂
just to check. Does the k8s glue install torch by default?
SubstantialElk6 what do you mean the glue installs torch ?
The glue will take a Task from the queue and create a k8s job (basically using the same docker, and inside the docker it runs the agent to execute the requested Task). Where would "torch" come into play?
Hi GiganticTurtle0
ClearML will only list the directly imported packages (not their requirements), meaning in your case it will only list "tf_funcs" (which you imported).
But I do not think there is a package named "tf_funcs", right?
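For example (a sketch, assuming tf_funcs is your local helper module; Task.add_requirements is the usual way to add a package manually if auto-detection misses it):
```python
from clearml import Task

import tf_funcs  # local helper module - this is what gets listed, not its internal pip deps

# If an actual pip package is missing from "Installed Packages",
# it can be added explicitly (must be called before Task.init)
Task.add_requirements("tensorflow")

task = Task.init(project_name="examples", task_name="requirements demo")
```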
Oh yes, you probably have a sort or filter applied there :)
WickedGoat98 sorry, I missed the thread...
that the trains.conf has to be located on the node running the trains-agent.
Correct 🙂
The easiest way to check is to see if you can curl to the ip:port from the docker.
If that fails, it is probably the wrong IP.
The IP you need to use is the IP of the machine running the docker-compose (not the IP of the docker container inside that machine).
Make sense ?
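If it helps, the same check can be done from Python inside the container (a sketch; the IP is a placeholder, 8008 is the default ClearML API port):
```python
import socket

# Use the IP of the machine running docker-compose, not the container's own IP
HOST, PORT = "192.168.1.10", 8008

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"{HOST}:{PORT} is reachable - the IP looks correct")
except OSError as err:
    print(f"Cannot reach {HOST}:{PORT} - probably the wrong IP ({err})")
```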
VexedCat68 actually a few users already suggested we auto log the dataset ID used as an additional configuration section, wdyt?
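Something along the lines of what can already be logged manually today (a sketch of the idea; the dataset name/project are placeholders):
```python
from clearml import Task, Dataset

task = Task.init(project_name="examples", task_name="train with dataset")

# Record which dataset version was actually used as a configuration section
dataset = Dataset.get(dataset_name="my_dataset", dataset_project="examples")
task.connect_configuration(
    {"dataset_id": dataset.id, "dataset_name": dataset.name},
    name="Datasets",
)

local_copy = dataset.get_local_copy()
```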
Hi AgitatedTurtle16
You can find documentation here:
https://github.com/allegroai/clearml-session
Basically it uses the clearml-agents to launch a session on one of the machines in the cluster.
In the remote session itself it installs JupyterLab + vscode-server, then it automatically connects to the remote session (running on the agent's machine) over SSH and creates tunnels to these services.
Hi @<1523706645840924672:profile|VirtuousFish83>
could it be you have some permission issues ?
: Forbidden: updates to statefulset spec for fields other than 'replicas',
It might be that you will need to take it down and restart it, rather than apply the change while it is running.
(do make sure you backup your server 🙂 )
So what will you query ?
Hmm, in the credentials popup there should be a "secure connect" checkbox, it tells it to use https instead of http. Can you verify?
Hi @<1523702000586330112:profile|FierceHamster54>
I think I'm missing a few details on what is logged, and a reference to the git repo?
GrumpyPenguin23 could you help and point us to an overview/getting-started video?
from task pick-up to "git clone" is now ~30s, much better.
This is "spent" calling apt update && update install && pip install clearml-agent
if you have those preinstalled it should be quick
though as far as I understand, the recommendation is still to not run workers-in-docker like this:
If you do not want it to install anything and just want it to use the existing venv (leaving the venv as is), and if something is missing then so be it, then yes, sure, that's the way to go.
So is there any tutorial on this topic?
Dude, we just invented it 🙂
Any chance you feel like writing something in a github issue, so other users know how to do this ?
Guess I’ll need to implement job schedule myself
You already have a scheduler: the agent will pull jobs from the queue in order, then run them one after the other (one at a time)
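For example (a sketch; the project, task and queue names are placeholders), pushing several cloned tasks into one queue means the single agent serving it runs them sequentially:
```python
from clearml import Task

# Clone a "template" task a few times and push the clones into one queue -
# the agent serving that queue will execute them one at a time, in order
template = Task.get_task(project_name="examples", task_name="train template")

for lr in (0.1, 0.01, 0.001):
    job = Task.clone(source_task=template, name=f"train lr={lr}")
    job.set_parameter("Args/learning_rate", lr)
    Task.enqueue(job, queue_name="default")
```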
You mean to add these two to the model when deploying?
│ ├── model_NVIDIA_GeForce_RTX_3080.plan
│ └── model_Tesla_T4.plan
Notice the preprocess.py is not running on the GPU instance; it is running on a CPU instance (technically not the same machine)
could you send the entire log here?
i.e. from the "docker-compose" command line and onward
What do you mean? Every Model has a unique ID; what do you consider a version?
What do you have under the "installed packages" section? Also you can configure the agent to use poetry to restore the environment (instead of pip)
Hi ExcitedFish86
Of course, this is what it was designed for. Notice in the UI under Execution you can edit this section (Setup Shell Script). You can also set it via task.set_base_docker
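Roughly like this (a sketch; the image and the bash snippet are placeholders, and the exact set_base_docker arguments depend on your SDK version):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="custom docker setup")

# Equivalent to editing "Setup Shell Script" under the Execution tab in the UI:
# the agent runs this bash snippet inside the container before the task starts
task.set_base_docker(
    docker_image="nvidia/cuda:11.1.1-runtime-ubuntu20.04",
    docker_setup_bash_script="apt-get update && apt-get install -y build-essential",
)
```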
I'm a bit confused between the distinction / how to use these appropriately -- Task.init does not have repo / branch args to set what code the task should be running.
It detects it automatically at run time 🙂 based on what is actually being used
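To illustrate the difference (a sketch; the project, repo and script values are placeholders):
```python
from clearml import Task

# Task.init: repo, branch and uncommitted changes are auto-detected
# from the environment the script is actually running in
task = Task.init(project_name="examples", task_name="auto detected code")

# Task.create: the code reference is given explicitly, nothing runs locally
template = Task.create(
    project_name="examples",
    task_name="explicit code reference",
    repo="https://github.com/allegroai/clearml.git",
    branch="master",
    script="examples/frameworks/pytorch/pytorch_mnist.py",
)
```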
My ideal is that I do exactly what Task.create does, but the task only goes into the pipeline section rather than making a new one in the experiments section.
Do y...
Hi VivaciousPenguin66
Seems like a CUDA/CUDNN issue.
Your agent is configured to work in venv mode, which means it will pull the correct PyTorch version based on the detected CUDA driver support. Specifically, you can see in the log "agent.cuda_version = 111" which means CUDA 11.1, and from the log it found the correct PyTorch version:
Torch CUDA 111 download page found
Found PyTorch version torch==1.8.1 matching CUDA version 111
Found PyTorch version torchvision==0.9.1 matching CUDA version 1...
Thanks MagnificentPig49 !
That might be me, let me check...
It fails during the add_step stage for the very first step, because task_overrides contains invalid keys
I see, yes I guess it makes sense to mark the pipeline as Failed 🙂
Could you add a GitHub issue on this behavior, so we do not miss it ?
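For reference, the call in question looks roughly like this (a sketch; the names and the override key are placeholders, and an invalid key in task_overrides is what makes the very first step fail):
```python
from clearml import PipelineController

pipe = PipelineController(name="demo pipeline", project="examples", version="1.0.0")

pipe.add_step(
    name="stage_train",
    base_task_project="examples",
    base_task_name="train template",
    # Keys must map to real Task fields, e.g. "script.branch";
    # an unknown key here is rejected during add_step
    task_overrides={"script.branch": "main"},
)

pipe.start_locally(run_pipeline_steps_locally=True)
```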
Hi DeliciousBluewhale87
This sounds like a great workflow to implement.
I guess my first question is how do you imagine the manager/director interacting with the system? What will they be shown, to allow them to approve/decline the model promotion ?
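One way we could imagine it (just a sketch of the idea, not an existing feature; the tag names are arbitrary and the tags argument of Task.get_tasks may depend on the SDK version):
```python
from clearml import Task

# Training side: mark the task that produced the candidate model
candidate = Task.get_task(project_name="examples", task_name="train candidate")
candidate.add_tags(["pending-approval"])

# Manager side: list everything waiting for a decision
pending = Task.get_tasks(project_name="examples", tags=["pending-approval"])
for t in pending:
    print(t.name, t.id)
    # approving could be as simple as swapping the tag (in code or in the UI)
    # t.set_tags(["approved"])
```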