So only a short update for today: I did not yet start a run with conda 4.7.12.
But one question: conda cannot actually be at fault here, right? I can install pytorch just fine locally on the agent machine when I do not use clearml(-agent).
Or there should be an early error for trying to run conda-based tasks on pip agents.
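For context, as far as I understand the agent picks its package manager from clearml.conf, so a mismatch like this would come from that setting rather than from conda itself. Roughly (excerpt, key names from memory and may differ by version):
```
agent {
    package_manager {
        # "pip" is the default; a conda-based task landing on a pip agent
        # would be exactly the mismatch described above
        type: conda
    }
}
```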
Can you ping me when it is updated in None so I can update my installation?
Would it help you diagnose this problem if I ran `conda env create --file=environment.yml` and checked whether it works?
From the logs when run with `--foreground`, I do not see any `conda create` command.
This is my environment, installed from the env file. Training works just fine here:
But I do not have anything linked correctly, since I rely on conda installing cuda/cudnn for me.
Thank you! I agree with CostlyOstrich36, that is what I meant by a false sense of security 🙂
Thank you SuccessfulKoala55, so actually only the file-server needs to be secured.
Perfect! That sounds like a good solution for me.
Thank you. Seems like this is not the best solution: https://serverfault.com/questions/132970/can-i-automatically-add-a-new-host-to-known-hosts#comment622492_132973
AgitatedDove14 I have the problem that "debug samples" are not shown anymore after running many iterations. What's appropriate to use here? A colleague told me increasing `task_log_buffer_capacity` worked. Is this the right way? What is the difference to `file_history_size`?
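For reference, I think both settings live in clearml.conf, something like this (excerpt; the default values are from memory and may differ by version):
```
sdk {
    development {
        # size of the in-memory buffer for log events before they are flushed
        task_log_buffer_capacity: 66
    }
    metrics {
        # how many debug-sample files are kept per metric/series before
        # older ones start getting overwritten
        file_history_size: 100
    }
}
```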
Thanks for answering. I don't quite get your explanation. You mean if I have 100 experiments and I start up another one (experiment "101"), then experiment "0"'s logs will get replaced?
MortifiedDove27 Sure did, but I do not understand it very well. Otherwise I would not be asking here for an intuitive explanation 🙂 Maybe you can explain it to me?
I have a related question: I read here that 4GB is an HTTP limitation and that ClearML will not chunk single files. I take from that that ClearML did not want, or did not need, to implement its own solution so far. But what about models that are larger than 4GB?
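For what it's worth, one way around the fileserver limit might be to point output storage at S3 directly, since large files then go through multipart upload instead of a single HTTP PUT. A minimal sketch (the project and bucket names here are made up):
```python
from clearml import Task

# Route output models/artifacts to S3 instead of the ClearML fileserver;
# uploads then use multipart transfers, so the 4GB HTTP limit of the
# fileserver should not apply.
task = Task.init(
    project_name="reinforcement-learning/example",  # hypothetical project
    task_name="large-model-upload",
    output_uri="s3://my-bucket/clearml-models",  # hypothetical bucket
)
```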
Yea, something like this seems to be the best solution.
Let me try it another time. Maybe something else went wrong.
```
name: core
channels:
  - pytorch
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - blas=1.0
  - bzip2=1.0.8
  - ca-certificates=2020.10.14
  - certifi=2020.6.20
  - cloudpickle=1.6.0
  - cudatoolkit=11.1.1
  - cycler=0.10.0
  - cytoolz=0.11.0
  - dask-core=2021.2.0
  - decorator=4.4.2
  - ffmpeg=4.3
  - freetype=2.10.4
  - gmp=6.2.1
  - gnutls=3.6.13
  - imageio=2.9.0
  - jpeg=9b
  - kiwisolver=1.3.1
  - lame=3.100
  - lcms2=2.11
  - ...
```
```
args = parser.parse_args()
print(args)  # args PRINTED HERE ON LOCAL

command = args.command
enqueue = args.enqueue
track_remote = args.track_remote
preset_name = args.preset
type_name = args.type
environment_name = args.environment
nvidia_docker = args.nvidia_docker

# Initialize ClearML Task
task = (
    Task.init(
        project_name="reinforcement-learning/" + type_name,
        task_name=args.name or preset_name,
        tags=...
```
Good, at least now I know it is not a user error 😄
So missing args that are not specified are not `None` as intended, but simply do not exist in `args`. And `command` is a list instead of a single str.
If you compare the two outputs I put at the top of this thread, one being the output when executed locally and the other being the output when executed remotely, it seems like `command` is different and wrong on remote.
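Until the cause is found, a defensive workaround on the script side might look like this (a minimal sketch; the `--command` declaration with `nargs` is my assumption to reproduce the list behaviour):
```python
import argparse

parser = argparse.ArgumentParser()
# With nargs, argparse returns a list rather than a single str
parser.add_argument("--command", nargs="*")
parser.add_argument("--preset", default=None)

args = parser.parse_args()

# Guard against attributes that may be missing entirely on the namespace
# when the remote run reconstructs the parser differently:
command = getattr(args, "command", None)

# Normalize a list back to a single string if downstream code expects one:
if isinstance(command, list):
    command = " ".join(command)
```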