EnviousStarfish54
Can you check with the latest clearml from github?
`pip install git+`
If the only issue is this line:
`task.execute_remotely(..., exit_process=True)`
It has to finish the static analysis of the entire repository (which usually happens in the background, but now we have to wait for it). If the repo is large this could actually take 20 sec (depending on the CPU/drive of the machine itself)
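For context, a minimal sketch of the call in question (project and queue names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# with exit_process=True this call blocks until the repository's static
# analysis (package detection) completes, then terminates the local process
task.execute_remotely(queue_name="default", exit_process=True)
```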
Hmm I would recommend passing it as an artifact, or returning its value from the decorated pipeline function. Wdyt?
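A rough sketch of both options, assuming the pipeline decorator interface (all names and values here are illustrative):
```python
from clearml import PipelineDecorator, Task

@PipelineDecorator.component(return_values=["model_path"])
def train_step():
    # option 1: store the object as an artifact on this step's Task
    Task.current_task().upload_artifact("results", artifact_object={"acc": 0.9})
    # option 2: return the value, so it flows through the pipeline
    return "model.pkl"

@PipelineDecorator.pipeline(name="demo", project="examples")
def my_pipeline():
    # returning it from the decorated pipeline function makes it
    # available as the pipeline's output
    return train_step()
```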
Wait @<1523701066867150848:profile|JitteryCoyote63>
If you reset the Task you would have lost the artifacts anyhow, how is that different?
clearml sdk (i.e. python client)
The issue is that `Task.create` did not add the repo link (again, as mentioned above, you need to pass the local folder or repo link to the `repo` argument of the `Task.create` function). I "think" it could automatically deduce the repo from the script entry point, but I'm not sure, hence my question on the clearml package version
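For reference, a sketch of what I mean (the repo URL, names, and branch here are placeholders):
```python
from clearml import Task

# pass the repo link (or a local folder) explicitly via the repo argument,
# together with the script entry point inside that repo
task = Task.create(
    project_name="examples",
    task_name="remote script",
    repo="https://github.com/user/project.git",  # placeholder URL
    branch="main",
    script="src/train.py",
)
```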
Oh I see, what you need is to pass `--script script.py` as the entry point and `--cwd folder` as the working dir
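i.e. something along these lines (project and task names are placeholders):
```
clearml-task --project examples --name my_job --script script.py --cwd folder
```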
Ohh then use the AWS autoscaler, it's basically what you want: it spins up an EC2 instance and sets an agent there, and if the EC2 goes down (for example if it is a spot instance), it will spin it up again automatically with the running Task on it.
wdyt?
The imports inside the functions are there because each function becomes a stand-alone job running on a remote machine, not the entire pipeline code. This also lets clearml automatically pick the packages to be installed on the remote machine. Make sense?
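For illustration, a rough sketch of such a component (the names and the pandas dependency are just examples):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["df"])
def load_data(csv_path):
    # imported inside the function on purpose: only this function's code
    # runs on the remote machine, and clearml scans these imports to
    # decide which packages to install there (pandas in this case)
    import pandas as pd
    return pd.read_csv(csv_path)
```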
What do you mean by "tag" / "sub-tags"?
SmarmySeaurchin8 check the logs, maybe you can find something there
Hi @<1523702000586330112:profile|FierceHamster54>
I think I'm missing a few details on what is logged, and a ref to the git repo?
Now I'm curious what's the workaround ?
Sure, in that case, wait until tomorrow, when the github repo is fully synced
build your containers off these two? or are you building directly from code?
no requests are being served, as in, there is no traffic indeed
It might be that it only pings when requests are served
what is actually setting the task status to `Aborted`?
server watchdog, basically saying: no one is pinging "I'm alive" on this Task, so I should abort it
my understanding was that the daemon thread was deserializing the task of the control plane every 300 seconds by default
Yeah.. let me check that
Basically this sounds like a sort of a bug,...
It should move you directly into the queue pages.
Let me double check (working on the community server)
you could also use:
https://github.com/allegroai/clearml/blob/ce7e77a00e869a2690f31cbc578636ce88bc4613/docs/clearml.conf#L188
and set up the clearml.conf on the user's machine to automatically log the environment variables at run time (stored under the Configuration tab).
Then the agent will pull these same variables at execution time and set them
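For example, in clearml.conf (this assumes the log_os_environments option that the link above points to; the variable names are placeholders):
```
sdk {
    development {
        # names (wildcards allowed) of environment variables to log at
        # run time; they show up under the Task's Configuration tab and
        # the agent re-applies them at execution time
        log_os_environments: ["AWS_*", "MY_TOKEN"]
    }
}
```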
You are doing great 🙂 don't worry about it
This looks like a 'feast' error, could it be a missing configuration?
... indicate the job needs to be run remotely? I'm imagining something like
`clearml-task`, and you need to specify the queue to push your Task into.
See here: https://clear.ml/docs/latest/docs/apps/clearml_task
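A minimal usage sketch based on the flags in those docs (all names and URLs are placeholders):
```
clearml-task --project examples --name my_job \
    --repo https://github.com/user/project.git \
    --script train.py --queue default
```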
If you could provide the specific task ID then it could fetch the training data and study from the previous task and continue with the specified number of trainings.
Yes exactly, and also all the definitions for the HPO process (variable space, study, etc.)
The reason that being able to continue from a past study would be useful is that the study provides a base for pruning and optimization of the task. The task would be stopped by aborting when the GPU rig that it is using is needed...
Actually it doesn't matter (systemd and init.d are different ways to spin up services on different linux distros); you can pick whichever seems more convenient for you, and whichever is supported by the linux you are running (in most cases both are) 🙂
Any chance your code needs more than the main script, but it is Not in a git repo? Because the agent supports either single script file, or a git repo with multiple files
`New python executable in /home/smjahad/.clearml/venvs-builds/3.6/bin/python2`
This is the output of the venv creation, and it is odd: the 3.6 venv is pointing at a python2 executable.
Could it be that by accident you did `pip install clearml-agent` and not `pip3 install clearml-agent`, and now it is running on python2 (which would explain the error)?
I would uninstall/reinstall on python3 to verify
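e.g. something like this (assuming pip maps to python2 and pip3 to python3 on that machine):
```
pip uninstall clearml-agent     # remove the python2 install
pip3 install clearml-agent     # reinstall under python3
pip3 show clearml-agent        # verify it landed under python3
```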
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server):
```
echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
```
Did you try the above, and are you still getting the same error?
Hi LazyLeopard18 ,
See details below, are you using the win10 docker-compose yaml?
https://github.com/allegroai/trains-server/blob/master/docs/install_win.md
yup, it's there in draft mode so I can get the latest git commit when it's used as a base task
Yes that seems to be the problem, if it is in draft mode, you have no outputs...