I mean if I do CLEARML_DOCKER_IMAGE=my_image clearml-task something something it will not work, right?
My clearml-server crashed for some reason, so I won't be able to verify until tomorrow.
Oh, you are right. I did not think this through... To implement this properly it gets too enterprisey for me, so I'll just leave it for now :D
I was wondering whether some solution is built into clearml, so I do not have to configure each server manually. However, from your answer I take it that this is not the case.
Yea, something like this seems to be the best solution.
I have no idea myself, but what the serverfault thread says about man-in-the-middle makes sense. However this also prohibits an automatic solution except for a shared known_hosts file I guess.
Latest version for everything. I will message you again, if I encounter this problem again.
It is not explained there, but do you mean CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-} and CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}?
If you compare the two outputs I put at the top of this thread, one being the output when executed locally and the other the output when executed remotely, it seems like the command is different and wrong on remote.
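For reference, this is how I understand the ${VAR:-} syntax in a compose file: the value of the variable is substituted, and whatever follows :- (here nothing, so an empty string) is used when the variable is unset or empty. A minimal Python sketch of that rule follows. The function name expand_with_default is made up for illustration and is not part of docker-compose or clearml.

```python
import os


def expand_with_default(name, default="", env=None):
    """Mimic ${NAME:-default}: use the default when NAME is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    return value if value != "" else default


# With no key exported, the entry expands to an empty string instead of failing.
print(repr(expand_with_default("CLEARML_API_ACCESS_KEY", env={})))
```

So with ${CLEARML_API_ACCESS_KEY:-} the container simply gets an empty value unless the key is exported on the host.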
docker-compose ps
Name                     Command                          State        Ports
clearml-agent-services   /usr/agent/entrypoint.sh         Restarting
clearml-apiserver        /opt/clearml/wrapper.sh ap ...   Up           0.0.0.0:8008->8008/tcp, 8080/tcp, 8081/tcp ...
AgitatedDove14 Thank you, that explains it.
I mean, could my hard drive not become full at some point? Can clearml-agent currently detect this?
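Until that is answered, a check like the one below could be run by hand or from cron on each workstation. It is only a sketch using Python's shutil.disk_usage. The function names and the 10% threshold are my own assumptions, not something clearml-agent provides.

```python
import shutil


def free_fraction(path="."):
    """Return the fraction of the disk holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path=".", min_free_fraction=0.10):
    """True when less than `min_free_fraction` of the disk is free (threshold is arbitrary)."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    print(f"free: {free_fraction('.'):.1%}, low on space: {is_low_on_space('.')}")
```

Point the path at wherever the agent keeps its caches and virtual environments to watch the right disk.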
For me this does not work (at least with nested tqdm bars, did not try single ones yet).
When I passed specific arguments (for example --steps) it ignored them...
I am not sure what you mean by this. It should not ignore anything.
No. Here is a better example. I have two types of workstations: Type X can execute tasks of type A and B. Type Y can execute tasks of type B. This could be the case if type X workstations have for example more VRAM, newer drivers, etc...
I have two queues. Queue A and Queue B. I submit tasks of type A to queue A and tasks of type B to queue B.
Here is what can happen:
Enqueue the first task of type B. Workstations of type X will run this task. Enqueue the second task of type A. Workstation ...
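To make the scenario concrete, here is a toy simulation. It is not how ClearML schedules anything; it only assumes that type X workstations listen to queues A and B, type Y workstations listen to queue B only, and each task goes to the first idle workstation that listens to its queue. All names (LISTENS, assign, x1, y1) are hypothetical.

```python
# Which queues each workstation type listens to (assumption taken from the example above).
LISTENS = {"X": ["A", "B"], "Y": ["B"]}


def assign(tasks, workers):
    """Hand each task, in enqueue order, to the first idle worker listening to its queue.

    tasks:   queue names in enqueue order, e.g. ["B", "A"]
    workers: (name, type) pairs in the order they happen to poll
    Returns (queue, worker_name or None) pairs; None means the task keeps waiting.
    """
    idle = list(workers)
    result = []
    for queue in tasks:
        chosen = next((w for w in idle if queue in LISTENS[w[1]]), None)
        if chosen is not None:
            idle.remove(chosen)
        result.append((queue, chosen[0] if chosen else None))
    return result


# The type X workstation polls first and takes the B task, so the A task has no one to run it.
print(assign(["B", "A"], [("x1", "X"), ("y1", "Y")]))
# If the type Y workstation takes the B task instead, both tasks run.
print(assign(["B", "A"], [("y1", "Y"), ("x1", "X")]))
```

In the first call the type Y workstation stays idle while the type A task waits, which is the kind of situation I am worried about.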
I got some warnings about broken packages. I cleaned the conda cache with `conda clean -a` and now it installed fine!
Is there a way to see the contents of /tmp/conda_envaz1ne897.yml? It seems to be deleted after the task is finished.
Here it is
This is the error I get from setting the logger upload destination: botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
Is there a clearml.conf for this agent somewhere?
Okay, great! I just want to run the cleanup services, however I am running into ssh issues so I wanted to restart it to try to debug.
I have set default_output_uri to s3://my_minio_instance:9000/clearml
If I set files_server to s3://my_minio_instance:9000/bucket_that_does_not_exist it fails at uploading metrics, but model upload still works:
WARNING - Failed uploading to s3://my_minio_instance:9000/bucket_that_does_not_exist ('NoneType' object has no attribute 'upload')
clearml.Task - INFO - Completed model upload to s3://my_minio_instance:9000/clearml
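To be clear about how I read these URIs: for a MinIO target the part after s3:// is host:port, and the first path segment is the bucket. A small sketch of that split using only the standard library follows. The helper name split_minio_uri is hypothetical, and this is not how clearml parses it internally.

```python
from urllib.parse import urlsplit


def split_minio_uri(uri):
    """Split s3://host:port/bucket/key into (host, port, bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return parts.hostname, parts.port, bucket, key


print(split_minio_uri("s3://my_minio_instance:9000/clearml"))
print(split_minio_uri("s3://my_minio_instance:9000/bucket_that_does_not_exist"))
```

So the first URI points at the existing bucket clearml and the second at a bucket that was never created on the same MinIO instance.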
What is ` default_out...
Or maybe a different question: What is not "Artifacts and Models. debug samples (or anything else the Logger class creates)"?
Also, is it not possible to use multiple file servers? E.g. to log tasks to different S3 buckets without changing clearml.conf?