The script is intended to be used something like this:
script.py train my_model --steps 10000 --checkpoint-every 10000
or
script.py test my_model --steps 1000
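For context, a minimal argparse sketch that would produce this kind of CLI; only the train/test commands, the model argument, --steps and --checkpoint-every come from the usage above, the rest (subparser layout, defaults) is an assumption:
```python
# Sketch of the CLI described above; layout and defaults are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Train or test a model")
subparsers = parser.add_subparsers(dest="command", required=True)

train = subparsers.add_parser("train")
train.add_argument("model")
train.add_argument("--steps", type=int, default=10000)
train.add_argument("--checkpoint-every", type=int, default=10000)

test = subparsers.add_parser("test")
test.add_argument("model")
test.add_argument("--steps", type=int, default=1000)

args = parser.parse_args()
print(args)
```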
When you say it is an SDK parameter, this means that I only have to specify it on the computer where I start the task from, right? So a clearml-agent would read this parameter from the task itself.
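For anyone reading along, this is roughly where an SDK-level setting would live; a minimal sketch assuming the parameter in question is something like sdk.development.default_output_uri (the actual key was not named in this thread):
```
# clearml.conf on the machine that creates the task (sketch; key and value are examples)
sdk {
    development {
        default_output_uri: "s3://my-bucket/artifacts"
    }
}
```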
Python 3.8.8, clearml 1.0.2
```
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=conda_forge
_openmp_mutex=4.5=1_llvm
absl-py=0.12.0=pypi_0
aiostream=0.4.2=pypi_0
attrs=20.3.0=pypi_0
blas=1.0=mkl
bzip2=1.0.8=h7b6447c_0
ca-certificates=2020.10.14=0
cached-property=1.5.2=pypi_0
cachetools=4.2.1=pypi_0
certifi=2020.6.20=py37_0
chardet=4.0.0=pypi_0
clearml=0.17.4=pypi_0
cloudpickle=1.6.0=py_0
cudatoolkit=11.1.1=h6406543_8
cycler...
```
Okay, I will increase it and try again.
Unfortunately, not. Quick question: Is there caching happening somewhere besides .clearml? Does the boto3 driver create a cache?
I used the wrong docker container. The docker container I used had CUDA 11.4. Interestingly, the override from clearml.conf and the CUDA_VERSION env variable did not work there.
With the correct docker container everything works fine. Shame on me.
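For reference, these are the two override mechanisms I meant; a sketch with example values (not the ones from my setup):
```
# clearml.conf on the agent machine (sketch; values are examples)
agent {
    cuda_version: "11.1"
    cudnn_version: "8.0"
}
```
The environment-variable route is setting CUDA_VERSION before starting the agent, e.g. CUDA_VERSION=11.1.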
Ah, it actually is also a string with remote_execution, but still not what it should be.
Alright, thank you. I will try to debug further
Can you tell me which Python version is running on the agent/docker and which docker image is used?
Perfect, thanks! The only issue that is left is that it seems like .ssh is used even when I provide SSH_AUTH_SOCK. I created an issue here: https://github.com/allegroai/clearml-agent/issues/45
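For context, a sketch of what providing SSH_AUTH_SOCK to the agent could look like; whether the agent actually uses the socket instead of mounting ~/.ssh is exactly what the linked issue is about (socket path and queue name are examples):
```bash
# Sketch: start the agent with an SSH agent socket available (example values)
export SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
clearml-agent daemon --queue default --docker
```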
Okay, I didn't know that. I just saw that VSCode seems to use a similar setup for their docker devcontainers.
Is there a way for me to configure/add the run arguments for the docker run call?
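To make the question concrete, this is the kind of thing I'm after; a sketch assuming the agent's extra_docker_arguments setting is the right place for it (values are examples):
```
# clearml.conf on the agent machine (sketch; values are examples)
agent {
    extra_docker_arguments: ["--ipc=host", "-v", "/mnt/data:/data"]
}
```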
What exactly do you mean by docker run permissions?
Yes, but this seems pretty reasonable to assume imo.
I am using https://hub.docker.com/layers/nvidia/cuda/11.8.0-base-ubuntu22.04/images/sha256-88b85c6edd089acdf0cb7f3be020a1e812b009bafaf92c1715ab6677bd997ef1?context=explore
which has python 3.10.6 if I remember correctly.
Hi TimelyMouse69, thank you for your answer.
I use 3.10.8 locally and 3.10.6 remotely. Everything is run in a docker container, locally and remotely on the docker-agent (exactly the same docker image).
Thank you for looking into the disappearing dev. It seems like this should be the reason for pip trying to install a stable version of 1.14, which only exists as a nightly.
Bonus question: Is there some clearml-agent mode that does not do "some magic" and instead just installs exactly what is shown in the "INSTALLED PACKAGES" editor in the web UI?
Here is some code that shows exactly what goes wrong. I do local execution only. It does not seem to be related to remote execution, as I thought, but rather to clearml.Task:
```python
args = parser.parse_args()
print(args)  # FIRST OUTPUT
command = args.command
enqueue = args.enqueue
track_remote = args.track_remote
preset_name = args.preset
type_name = args.type
environment_name = args.environment
nvidia_docker = args.nvidia_docker
# Initialize ClearML Tas...
```
Tested with clearml-agent 1.0.1rc4/1.2.2 and clearml 1.3.2
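A stripped-down sketch of the same pattern, for anyone who wants to reproduce it; argument names and project/task names here are placeholders, only the parse-before-Task.init order mirrors the code above:
```python
# Hedged reproduction sketch: parse args first, then let Task.init() hook into argparse,
# and compare the values/types before and after (placeholder names throughout).
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, default=1000)
parser.add_argument("--enqueue", action="store_true")
args = parser.parse_args()
print("before Task.init:", args)   # FIRST OUTPUT

task = Task.init(project_name="debug", task_name="argparse-repro")
print("after Task.init:", args)    # check whether any values/types were changed here
```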
I just want to avoid ClearML leaving files lingering around. Btw: a better default behavior in my opinion would be to delete tasks only after their files have been deleted, and to delete the task anyway only with the force option!
I will read up on the services documentation then. Thank you very much for the help 🙂
No. Here is a better example. I have two types of workstations: type X can execute tasks of type A and B, and type Y can execute tasks of type B only. This could be the case if, for example, type X workstations have more VRAM, newer drivers, etc.
I have two queues. Queue A and Queue B. I submit tasks of type A to queue A and tasks of type B to queue B.
Here is what can happen:
Enqueue the first task of type B. Workstations of type X will run this task. Enqueue the second task of type A. Workstation ...
I see. Thank you very much. For my current problem, giving priority according to queue priority would kinda solve it. For experimentation I will sometimes enqueue a task and then later enqueue another one of a different kind, and what happens is that even though this could be trivially solved, I have to wait for the first one to finish. I guess this is only a problem for people with small "clusters" where SLURM does not make sense, but no scheduling at all is also suboptimal.
However, I...
Makes sense, but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
To summarize: The scheduler should first assign tasks to the agent that gives the queue the highest priority.
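A sketch of the multi-queue agent setup this refers to; as far as I understand, the order of the queues on the command line acts as the priority order (queue names are examples):
```bash
# Sketch: one agent polling two queues, checking queue_a before queue_b (example names)
clearml-agent daemon --queue queue_a queue_b --docker
```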
@<1523701435869433856:profile|SmugDolphin23> Good catch. I have a good but unsatisfying message for you guys: I restarted the whole machine (server and agent) and now it works fine ...