AdventurousButterfly15 this one is quite self container:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
So I guess pip install finished working
But the task is evidently not being executed.
This is very odd ... you can run the agent with debugging with --debug --foreground to see all the outputs and logs
Yes it seems so ๐
Hi EnchantingOstrich20
You how doe s clearml get it there?
In runtime it analyzes the code you are running looking for imports then checks the version you have actively used (i.e. active venv / python) and lists it there.
You can also override those in code, or edit them after you clone the ask and before you enqueue it for remote execution
Now Iโm just wondering if I could remove the PIP install at the very beginning, so it starts straightaway
AbruptCow41 CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
does exactly that ๐ BTW, I would just set the venv cache and this means it will just be able to restore the entire thing (even if you have changed the requirements
https://github.com/allegroai/clearml-agent/blob/077148be00ead21084d63a14bf89d13d049cf7db/docs/clearml.conf#L115
Hi JitteryCoyote63
So the main issue is backing up the elastic & mongo DB while they are running, once they are backed/restored, the server will spin as is. (Let me check regrading the reddis, it might be that since it is used for caching there is no need to actually backup the content only the configuration)
EnviousPanda91 this seems like a specific issue with the clearml-task
cli, could that be ?
Can you send a full clearml-task command-line to test ?
I'm wondering what happens if i were to host the instance and one of these were to go down from time to time in production, as the deployments provided by the helm chart are not redundant.
Long story short, it will break the clearml-server, please do not take them down, if you do need to do that, also take down the clearml-server. The python clients will wait until it is up again, so no session would be destroyed
Hi ConvincingSwan15
For the train.py do I need a setup.py file in my repo to work corerctly with the agent ? For now it is just the path to train,py
I'm assuming the train.py is part of the repository, no?
If it is, how come the agent after cloning the repository cannot find it ?
Could it be it was accidentally not added to the git repo ?
just to check. Does the k8s glue install torch by default?
SubstantialElk6 what do you mean the glue installs torch ?
The glue will take a Task from the queue create a k8s job (basically use the same docker and inside the docker run get the agent to execute the requested Task). Where would the "torch" come into play?
SubstantialElk6
Hmm do you have torch in the "installed packages" section of the Task ?
(This what the agent is using to setup the environment inside the docker, running as a pod)
CourageousKoala93 when you call Task.close() it will mark the task as completed, there is no need to do that manually. The idea with mark_completed is that you can forcefully change the state if needed, or externally stop the task and mark it completed. Make sense?
SubstantialElk6 "Execution Tab" scroll down you should have "Installed Packages" section, what do you have there?
Nice SubstantialElk6 !
BTW: you can configure your cleaml client to store the changes from the latest Pushed commit (and not the default which is latest local commit)
see store_code_diff_from_remote:
in clearml.conf:
https://github.com/allegroai/clearml/blob/9b962bae4b1ccc448e1807e1688fe193454c1da1/docs/clearml.conf#L150
That didnโt gave useful infos, was that docker was not installed in the agent machine x)
JitteryCoyote63 you mean "docker" was not installed and it did not throw an error ?
Let's assume the host has a folder for all users for persistence storage, for example '/mnt/user_data/and you have a user named 'myuser' and a matching subfolder '/mnt/user_data/myuser
Then we can do:clearml-session ... --docker "my_docker_image -v /mnt/user_data/:/host_mount/" --user-folder "/host_mount/myuser"
BTW: The next time you call clearml-session
these will become the default parameters, so no need to change anything ๐
Alright I have a followup question then: I used the param --user-folder โ~/projects/my-projectโ, but any change I do is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to my the folder on the machine. Is it possible to do so?
Yes you must make sure the docker can mount a persistent folder for you to work on.
Let me check what's the easiest way to do that
Yes docker was not installed in the machine
Okay make sense, we should definitely check that you have docker before starting the daemon ๐
Ok, it would be nice to have a --user-folder-mounted that do the linking automatically
It might be misleading if you are running on k8s cluster, where one cannot just -v mount
volume...
What do you think?
I still see things being installed when the experiment starts. Why does that happen?
This only means no new venv is created, it basically means install in "default" python env (usually whatever is preset inside the docker)
Make sense ?
Why would you skip the entire python env setup ? Did you turn on venvs cache ? (basically caching the entire venv, even if running inside a container)
Try this one ๐HyperParameterOptimizer.start_locally(...)
https://clear.ml/docs/latest/docs/references/sdk/hpo_optimization_hyperparameteroptimizer#start_locally
I'm trying to achieve a workflow similar to the one
You mean running everything on a single machine (manually)?
I want to be able to install the venv in multiple servers and start the "simple" agents in each one on them. You can think of it as some kind of one-off agent for a specific (distributed) hyperparameter search task
ExcitedFish86 Oh if this is the case:
in your cleaml.conf:agent.package_manager.type: conda agent.package_manager.conda_env_as_base_docker: true
https://github.com/allegroai/clearml-agent/blob/36073ad488fc141353a077a48651ab3fabb3d794/docs/clearml.conf#L60
https://git...
the hack doesn't work if conda is not installedย
Of course conda needs to be installed, it is using a pre-existing conda env, no?! what am I missing
Ideally it would just pull an experiment from a dedicated HPO queue and run it inplace
And the assumption is the code is also there ?
in which I can just spawn an ad-hoc worker
Can you elaborate on what you would do with it? Like an OS environment disable the entire setup itself ? will it clone the code base ?
Oh if this is the case you can probably do
` import os
import subprocess
from clearml import Task
from clearml.backend_api.session.client import APIClient
client = APIClient()
queue_ids = client.queues.get_all(name="queue_name_here")
while True:
result = client.queues.get_next_task(queue=queue_ids[0].id)
if not result or not result.entry:
sleep(5)
continue
task_id = result.entry.task
client.tasks.started(task=task_id)
env = dict(**os.environ)
env['CLEARML_TASK_ID'] = ta...
ExcitedFish86 this is a general "dummy agent" that tasks and executes them (no env created, no code cloned, as you suggested)
hows does this work with HPO?
The HPO clones Tasks, changes arguments, push them into a queue, and monitors the metrics in real time. The missing part (from my understanding) was the the execution of the Tasks themselves required setup, and that you wanted multiple machine support, in order to overcome it, I post a dummy agent that just runs the Tasks.
(Notice...
That depends on the HPO algorithm, basically the will be pushed based on the limit of "concurrent jobs", so you do not end up exploding the queue. It also might be a Bayesian process, i.e. based on previous set of parameters and runs, like how hyper-band works (optuna/hpbandster)
Make sense ?
Question - why is this the expected behavior?
It is ๐ I mean the original python version is stored, but pip does not support replacing python version. It is doable with conda, but than you have to use conda for everything...
Hi ScantChimpanzee51
btw: this seems like an S3 internal error
https://github.com/boto/s3transfer/issues/197
So it's seemingly not the image, but maybe something to do with how Studio runs it as a kernel.
Yeah I think that for some reason it fails detecting this is actually jupyter noteboko (not really sure why), Thank you for double checking on the container !!