Thank you for the quick reply. Does anyone know whether there is an option to let docker delete images after the container exits?
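For reference, the closest I have found so far only removes the container, not the image; a rough sketch of what I mean (the image name is just a placeholder):
```
# Remove the container automatically when it exits (the image itself stays on disk)
docker run --rm my-image:latest

# Clean up dangling images afterwards, e.g. from a cron job (add --all for any unused image)
docker image prune --force
```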
clearml will register conda packages that cannot be installed if clearml-agent is configured to use pip. So although it is nice that a complete package list is tracked, it makes it cumbersome to rerun the experiment.
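A rough sketch of the agent setting I mean, assuming the standard clearml.conf layout (switching the agent to conda is just one way around it):
```
# clearml.conf on the agent machine (sketch)
agent {
    package_manager {
        # pip cannot resolve conda-only packages such as ruamel_yaml_conda,
        # so let the agent use conda to match the registered environment
        type: conda
    }
}
```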
But this seems like something that is not related to clearml 🙂 Anyways, thanks again for the explanations!
Yea, I am still trying to get docker to work with clearml. I do not have much experience with docker besides creating Dockerfiles, and it seems like the ~/.ssh/config ownership is broken when it is mounted into the container on my workstations.
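What I have been checking on the workstation so far, as a rough sketch (the container name is a placeholder):
```
# On the host: ssh files should be readable only by the owner
chmod 700 ~/.ssh
chmod 600 ~/.ssh/config

# Inside the container the mounted files keep the host's numeric UID/GID,
# so look at what the container actually sees
docker exec my-container ls -ln /root/.ssh
```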
Okay, thanks for explaining!
Thank you, perfect! I did not try yet, but will do now.
Yes, I do not want to rely on the clearml-agent. Afaik the clearml-sdk inside the container does the downloading, and since a host directory is mounted, everything is mirrored there. If it were possible to not mount the host directory, everything would be contained 🙂
agent-forwarding is working just like you described here: https://github.com/allegroai/clearml-agent/issues/45 Looking forward to not having to use the absolute path in the future 🙂
Yea, is there a guarantee that the clearml-agent will not crash because it did not clean the cache in time?
The docker run command of the agent includes '-v', '/tmp/clearml_agent.ssh.8owl7uf2:/root/.ssh', and the file permissions are exactly the same.
Anyways, from my google search it seems that this is not something that is intuitive to fix.
Is there any progress on this: https://github.com/allegroai/clearml-agent/issues/45 ? This works on all my machines 🙂
Yea. Not using the config file does not seem like a good long-term solution for me. However, I still have no idea why this error happens. But enough for today. Thank you a lot for your help!
As in if it was not empty it would work?
Maybe the problem is that I do not start my docker containers as the root user, so 1001 is a mapping inside the docker container to my actual user. Could it be that on the host the owner of your .ssh files is root?
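A quick way to check the numeric owner I mean, assuming your image and mounts look roughly like mine (the image name is a placeholder):
```
# On the host: my user is 1001, and the ssh files are owned by that UID
id -u
ls -ln ~/.ssh

# Inside the container the same files keep the numeric owner 1001,
# which only maps to a real user if the image defines one
docker run --rm -v ~/.ssh:/root/.ssh my-image ls -ln /root/.ssh
```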
First one is the original, second one the clone
I am currently on the move, but it was something like 'upstream server not found' in /etc/nginx/nginx.conf, if I remember correctly at line 88.
Makes sense, but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
I can put anything there: s3://my_minio_instance:9000/bucket_that_does_not_exist and it will work.
I see. Thank you very much. For my current problem, assigning tasks according to queue priority would kinda solve it. For experimentation I will sometimes enqueue a task and then later enqueue another one of a different kind, but even though this could be trivially scheduled, I have to wait for the first one to finish. I guess this is only a problem for people with small "clusters" where SLURM does not make sense, but no scheduling at all is also suboptimal.
However, I...
I just updated my server to 1.0 and now the services agent is stuck in restarting:
To summarize: the scheduler should first assign tasks to the agent that gives the queue the highest priority.
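As far as I understand, the closest existing mechanism is the queue order passed to the agent, which sets the polling priority; a sketch of what I mean (queue names are placeholders):
```
# The agent polls queue_a first and only pulls from queue_b
# when queue_a is empty (command-line order = priority)
clearml-agent daemon --queue queue_a queue_b
```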
I was wrong: I think it uses the agent.cuda_version, not the local env cuda version.
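For reference, the setting I mean, as a sketch of the relevant clearml.conf section (the version numbers are just examples):
```
# clearml.conf on the agent machine (sketch)
agent {
    # as far as I can tell, the agent uses these when resolving e.g. torch wheels,
    # rather than the CUDA version detected in the local environment
    cuda_version: 11.1
    cudnn_version: 8.0
}
```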
@SuccessfulKoala55 I just did the following (everything locally, not with clearml-agent):
- Set my credentials and S3 endpoint to A
- Run a task with Task.init() and save a debug sample to S3
- Abort the task
- Change my credentials and S3 endpoint to B
- Restart the task
The result is lingering files in A that seem not to be associated with the task. I would expect ClearML to instead error the task or to track the lingering files somewhere, so they can ma...
No. Here is a better example. I have two types of workstations: Type X can execute tasks of type A and B. Type Y can execute tasks of type B. This could be the case if type X workstations have for example more VRAM, newer drivers, etc...
I have two queues. Queue A and Queue B. I submit tasks of type A to queue A and tasks of type B to queue B.
Here is what can happen:
Enqueue the first task of type B. Workstations of type X will run this task. Enqueue the second task of type A. Workstation ...
Alright, thanks. Would be a nice feature 🙂
For example I get the following error if I simply clone and rerun:
ERROR: Could not find a version that satisfies the requirement ruamel_yaml_conda>=0.11.14 (from conda==4.10.1->-r /tmp/cached-reqs6wtc73be.txt (line 28)) (from versions: none)
ERROR: No matching distribution found for ruamel_yaml_conda>=0.11.14 (from conda==4.10.1->-r /tmp/cached-reqs6wtc73be.txt (line 28))
It is weird though. The task is submitted by the original user and then run on the agent, but it is still registered to the original user, since that is who created it.
Wouldn't it make more sense to inherit the user from the task than from the agent?