@<1523701994743664640:profile|AppetizingMouse58> Thank you very much. I forgot the volume mapping.
So can I just add the config to the async_delete container and mirror the directory structure from github?
volumes:
- /opt/clearml/config:/opt/clearml/config
- /opt/clearml/logs:/var/log/clearml
Hi @<1523701087100473344:profile|SuccessfulKoala55> Thank you very much.
Is there some way to verify that the server uses the correct configuration files (e.g. see it in the logs/web UI)? I just tried it and it does not work.
At least I can see the async_delete service complains about a missing secret, so I can start debugging there. I am using the same config as for my agents, but somehow for async_delete it does not work...
And the files that I see on github are the default configuration of the server, even if I do not have these files in my installation, right?
This seems like a bug to me, or is something like this to be expected? There shouldn't be files that are not shown in the WebUI, should there?
Based on https://github.com/lanpa/tensorboardX/blob/34d1616c035faaa0f3f7c9d19cb8bb4425f19939/tensorboardX/summary.py#L355 I would guess that it is already encoded before being added to the tensorboard summary.
I am still trying to solve the add_requirements + importlib combo. If I use detect_with_freeze I cannot use add_requirements, and if I use automatic code analysis it will not find all packages because of importlib.
For now I have come to the conclusion that keeping a requirements.txt and making clearml parse the requirements from there should be the most robust solution. Unfortunately, there seems to be no way to do this with Task.init.
Alternatively, I just saw that Task.create takes a requirements.txt as an argument. This would also be fine for me; however, I am not sure whether I should use Task.create?
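For reference, a minimal sketch of that route, assuming the file is passed via a requirements_file parameter of Task.create (project and task names here are placeholders; check the parameter name against the installed clearml version):

from clearml import Task

# Sketch: create the task explicitly and point it at an existing requirements.txt
# instead of relying on automatic package analysis.
task = Task.create(
    project_name="my_project",            # placeholder
    task_name="requirements_from_file",   # placeholder
    script="run_experiment.py",           # placeholder entrypoint script
    requirements_file="requirements.txt",
)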
AlertBlackbird30 Thanks for asking. Just take everything I say with a grain of salt, because I am also not sure whether I do machine learning the correct way 😄
I think you got the right idea. I actually do reinforcement learning (RL), so I have multiple RL environments and RL agents. However, while the code differs between the agents, the glue code is the same. So what I do is call python run_experiment.py --agent myproject.agents.my_agent --environm...
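To illustrate (a purely hypothetical sketch, not the actual run_experiment.py): the agent and environment are selected by module path and imported dynamically, which is exactly the kind of importlib usage that automatic code analysis cannot follow.

# run_experiment.py -- hypothetical sketch of the shared glue code
import argparse
import importlib

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--agent", required=True,
                        help="module path, e.g. myproject.agents.my_agent")
    parser.add_argument("--environment", required=True,
                        help="module path of the RL environment")
    args = parser.parse_args()

    # Dynamic imports like these are invisible to static requirements analysis.
    agent_module = importlib.import_module(args.agent)
    env_module = importlib.import_module(args.environment)

    agent = agent_module.build_agent()        # hypothetical factory functions
    environment = env_module.build_environment()
    agent.train(environment)                  # hypothetical training entry point

if __name__ == "__main__":
    main()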
But you can manually add them with Task.add_requirements, no?
In my opinion that is an ugly solution. I would have to keep track of which requirements are missing; then I would rather just add all requirements manually.
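For completeness, a sketch of what that manual bookkeeping would look like (package names are placeholders); every package the analysis misses has to be listed by hand before Task.init:

from clearml import Task

# Placeholder package names -- this list has to be kept in sync with the code.
for package in ("gym", "stable-baselines3", "my-internal-lib"):
    Task.add_requirements(package)

task = Task.init(project_name="my_project", task_name="manual_requirements")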
name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff...
drwxr-xr-x 10 root root 4096 Jul 31 2020 .
drwxr-xr-x 14 root root 4096 Jul 31 2020 ..
drwxr-xr-x 2 root root 4096 Feb 4 13:52 bin
drwxr-xr-x 2 root root 4096 Jul 31 2020 etc
drwxr-xr-x 2 root root 4096 Jul 31 2020 games
drwxr-xr-x 2 root root 4096 Jul 31 2020 include
drwxr-xr-x 4 root root 4096 Feb 3 13:40 lib
lrwxrwxrwx 1 root root 9 Dez 10 14:29 man -> share/man
drwxr-xr-x 2 root root 4096 Jul 31 2020 sbin
drwxr-xr-x 7 root root 4096 Jul 31 2020 share
drwxr-xr-x ...
One question: does clearml resolve the CUDA version from the driver or from conda?
Okay. It works now. I don't know what went wrong before. Probably a user error 😅
When experimenting we use an entrypoint script to which we pass the specific experiment.
Quick question: where does clearml place the venv again? I want to take a look at it after the task has failed.
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing CUDA on this machine, but idk), and there is no pytorch build for 11.2, so maybe it falls back to CPU?
I installed my local conda environment from an environment.yml without issues, so maybe clearml makes some changes that lead to conflicts, which finally leads to the CPU-version install.
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit==11.1.1
- pytorch==1.8.0
Gives CPU version
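A quick, generic way (not clearml-specific) to confirm whether the resolved build is CPU-only:

import torch

# A CPU-only build reports torch.version.cuda as None and sees no CUDA device,
# even if nvidia-smi on the machine reports CUDA 11.2.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())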
Interesting. It will probably only matter for very small experiments or experiments where validation is run very infrequently.
What's the reason for the shift?
Hi KindChimpanzee37, I was asking more about the general idea of making these settings task-specific, but thank you for the suggestion anyway; I will definitely apply it.