How can I specify the agent to use a specific conda environment inside the docker?
Hi CrookedWalrus33
By default it will pick the highest python in the PATH.
Then, if you have a python version (in PATH) that matches the one requested on the Task, it will use it.
Do you want to limit it to a specific python binary?
Hi AdorableFrog70
I assume so, there's an API for everything so you can always get the data. wdyt?
ContemplativeGoat37 I think there was an issue just like the one you described and it was solved in later versions; upgrade to the latest clearml package version and you should be fine 🙂
Hi CrookedWalrus33
docker_setup_bash_script=["export PATH=/workspace/miniconda/bin:$PATH"]
Oh I think you are correct, this should do the trick:
docker_setup_bash_script=["export PATH=/workspace/miniconda/bin:$PATH", "export LOCAL_PYTHON=/workspace/miniconda/bin/python3"]
This will make sure both agent and script execute on the same python
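For context, a minimal sketch of how this could be set from the task side (assuming Task.set_base_docker accepts docker_setup_bash_script in your clearml version; the image name and miniconda paths are placeholders, adjust to your container):

from clearml import Task

task = Task.init(project_name="examples", task_name="conda in docker")
# Prepend the conda python to PATH inside the container, and point the
# agent's LOCAL_PYTHON at the same interpreter
task.set_base_docker(
    docker_image="my_org/my_conda_image:latest",
    docker_setup_bash_script=[
        "export PATH=/workspace/miniconda/bin:$PATH",
        "export LOCAL_PYTHON=/workspace/miniconda/bin/python3",
    ],
)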
but to run a script inside a docker which already has the environment built in.
If this is already activated, the latest agent w...
Hi CrookedWalrus33
the python version is auto-detected and registered at "manual execution" time (i.e. when you run your code on your machine).
That said, this is only a suggestion for the agent: if it can actually find the matching Python version it will use it, otherwise it will use whatever is available (i.e. look through the PATH environment variable for a matching pythonX.Y executable).
The easiest way to support this would be to just make sure the python binary's path is added to the PATH environment variable.
Does...
The agent is using Bash (but when you add a command line to the docker run, .bashrc is not executed, hence no conda in PATH)
Maybe add the full path to the conda executable:
docker_setup_bash_script=["export PATH=/workspace/miniconda/bin:$PATH", "export LOCAL_PYTHON=/workspace/miniconda/bin/python3", "/workspace/miniconda/bin/conda activate /PATH_GOES_HERE"]
CrookedWalrus33 can you send the entire log? (you can DM it to me)
I want to download an exact folder/batch of the dataset to my local machine to check data out without downloading whole dataset.
TeenyBeetle18 the closest you can get is to download only one part of the dataset, if this is a multi-part dataset (i.e. the dataset version is larger than the default 500MB, so you have multiple zip files, and you just want to download one of them, not all of them).
This can actually be achieved with:
Dataset.get_local_copy(..., part=0)
https://githu...
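For example, a minimal sketch (the dataset id here is a placeholder):

from clearml import Dataset

# Fetch only the first zip part of a multi-part dataset version
ds = Dataset.get(dataset_id="<your_dataset_id>")
local_path = ds.get_local_copy(part=0)
print(local_path)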
Actually you cannot breakpoint at "atexit" calls (or at least it doesn't work with my gdb)
But I would add a few prints here:
https://github.com/allegroai/clearml/blob/aa4e5ea7454e8f15b99bb2c77c4599fac2373c9d/clearml/task.py#L3166
is there a way for me to get a link to the task execution? I want to write a message to slack, containing the URL so collaborators can click and see the progress
WackyRabbit7 Nice!
basically you can use this one:
task.get_output_log_web_page()
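For example, a quick sketch of posting that link to Slack (the webhook URL is a placeholder for whatever Slack integration you use):

import requests
from clearml import Task

task = Task.current_task()
url = task.get_output_log_web_page()
requests.post(
    "https://hooks.slack.com/services/<your_webhook>",
    json={"text": f"Training started, follow the progress here: {url}"},
)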
CheerfulGorilla72 could it be the server address has changed when migrating ?
HugeArcticwolf77 you can add --services-mode
to the agent, and it will basically keep on spinning Tasks in parallel (unfortunately the open source version does not include a way to limit it to a maximum of concurrent Tasks)
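For example (assuming the agent runs in docker mode; the queue name is just an illustration):
clearml-agent daemon --queue services --services-mode --docker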
I'm running agent inside docker.
So this means venv mode...
Unfortunately, right now I can not attach the logs, I will attach them a little later.
No worries, feel free to DM them if you feel this is too much to post here
can we somehow choose the pool of ports that clearml-session uses?
Yes, I think you can.
How do you spin the worker nodes? Is it Kubernetes?
GorgeousSeagull44 I think this should have worked (basically replacing all the links on the mongo DB with the new IP)
Long story short, not any longer (in previous versions of k8s it was possible, but after the runtime container change it is not supported)
Hi StickyBlackbird93
Yes, this agent version is rather old (clearml_agent v1.0.0);
it had a bug where the pytorch aarch64 wheel broke the agent (by default the agent in docker mode will use the latest stable version, but not in venv mode).
Basically upgrade to the latest clearml-agent version, it should solve the issue:
pip3 install -U clearml-agent==1.2.3
BTW for future debugging, this is the interesting part of the log (Notice it is looking for the correct pytorch based on the auto de...
Oh sorry, from the docstring, this will work:
:param bool continue_last_task: Continue the execution of a previously executed Task (experiment)
    .. note::
        When continuing the execution of a previously executed Task,
        all previous artifacts / models / logs are intact.
        New logs will continue iteration/step based on the previous-execution maximum iteration value.
        For example:
        The last train/loss scalar reported was iteration 100, the next report will b...
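A minimal sketch of how this is used (project/task names are placeholders):

from clearml import Task

# Continue the previous execution of this task; new reports keep
# counting from the last reported iteration
task = Task.init(
    project_name="examples",
    task_name="train",
    continue_last_task=True,
)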
worker nodes are bare metal and they are not in k8s yet
By default the agent will use 10022 as an initial starting port for running the sshd that will be mapped into the container. This has nothing to do with the Host machine's sshd. (I'm assuming agent running in docker mode)
Expected behaviour is that it reads the last iteration correctly. At least that is what is stated in the docs.
This is exactly what should happen, are you saying that for some reason it fails?
VivaciousWalrus21 I took a look at your example from the github issue:
https://github.com/allegroai/clearml/issues/762#issuecomment-1237353476
It seems to do exactly what you expect. and stores its own last iteration as part of the checkpoint. When running the example with continue_last_task=int(0)
you get exactly what you expect
(Do notice that TB visualizes these graphs in a very odd way, and it took me a few clicks to verify it...)
Hi VivaciousWalrus21 I tested the sample code, and the gap was evident in Tensorboard as well. This is not clearml generating this jump, it is internal (like the auto de/serialization and continuation of the code base)
Hmm so the concept of "company" wide configuration is supported in the enterprise version.
I'm trying to think of a "hack" to just pass these env/conf ...
How are you spinning the agent machines?
it would be clearml-server's job to distribute to each user internally?
So you mean the user will never know their own S3 access credentials?
Are those credentials unique per user, or one "hidden" set for all of them?
I think it's because the proxy env vars are not passed to the container ...
Yes this seems correct, the errors point to a network issues, i.e. the container does not seem to be able to connect to the clearml-server
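If that is the case, one way around it (assuming the agent runs in docker mode; the proxy address is a placeholder) is passing the variables to every container via clearml.conf on the agent machine:

agent {
    extra_docker_arguments: ["-e", "HTTP_PROXY=http://proxy:3128", "-e", "HTTPS_PROXY=http://proxy:3128"]
}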
I think they (DevOps) said something about next week, internal roll-out is this week (I think)
Hi PanickyAnt52
hi, is there a way to get back the pipeline object when given a pipeline id?
Yes basically this is a specific type of Task, anything you stored on it can be accessed via the Task object, i.e.:
pipeline_task = Task.get_task(pipeline_id)
I'm curious, how would you use it?
BTW: since a pipeline is also a Task you can have a pipeline launch a step that is a pipeline of its own
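For example, a small sketch of poking at a pipeline through its Task object (the pipeline id is a placeholder):

from clearml import Task

pipeline_task = Task.get_task(task_id="<pipeline_id>")
print(pipeline_task.get_status())      # e.g. "completed"
print(list(pipeline_task.artifacts))   # anything the pipeline stored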
Are there any references (vlog/blog) on deploying a real-time model and doing the continuous training pipeline in clear-ml?
Something along the lines of this one ?
https://clear.ml/blog/creating-a-fully-automatic-retraining-loop-using-clearml-data/
Or this one?
https://www.youtube.com/watch?v=uNB6FKIi8Wg
Hmm interesting, I guess once you are able to connect it with ClearML you can just clone / modify / enqueue and let users train models directly from the UI on any hardware, is that the plan ?
SarcasticSquirrel56
if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist?
Basically the agents register themselves on your clearml-server, and they register which Queue(s) they listen to. In other words, the interface for choosing between the different types of machines/gpus is enqueuing the Task to different queues.
For example: Queue(1): "CUDA11_GPUx1" , Queue(2): "CUDA10_GPUx1"
Make sense?
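For example, each type of machine would run its own agent, e.g.:
clearml-agent daemon --queue CUDA11_GPUx1 --docker
clearml-agent daemon --queue CUDA10_GPUx1 --docker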
EDIT:
I guess to achieve what I w...