I have set default_output_uri to s3://my_minio_instance:9000/clearml.
If I set files_server to s3://my_minio_instance:9000/bucket_that_does_not_exist, it fails at uploading metrics, but model upload still works:
WARNING - Failed uploading to s3://my_minio_instance:9000/bucket_that_does_not_exist ('NoneType' object has no attribute 'upload')
clearml.Task - INFO - Completed model upload to s3://my_minio_instance:9000/clearml
What is `default_out...
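For what it's worth, the same destination can also be set per task via the output_uri argument of Task.init, which as far as I understand is the per-task counterpart of default_output_uri. A minimal sketch (project and task names are placeholders, and it assumes the MinIO credentials are already configured in clearml.conf):

```python
from clearml import Task

# Sketch only: assumes credentials for my_minio_instance:9000 are already set up
# in clearml.conf (sdk.aws.s3.credentials), so uploads to this endpoint can work.
task = Task.init(
    project_name="examples",        # placeholder project name
    task_name="minio-output-test",  # placeholder task name
    output_uri="s3://my_minio_instance:9000/clearml",  # per-task override of default_output_uri
)
```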
It is not explained there, but do you mean CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-} and CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}?
How can I get the agent log?
Now the pip packages seem to ship with CUDA, so this does not seem to be a problem anymore.
I was wrong: I think it uses agent.cuda_version, not the CUDA version of the local environment.
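To double-check which CUDA build a pip-installed PyTorch wheel actually carries (as opposed to what the agent or the host has installed), something like this can be run inside the task environment (illustrative only):

```python
import torch

# The wheel version suffix and torch.version.cuda show the CUDA toolkit the pip
# wheel was built against, regardless of what is installed on the host.
print(torch.__version__)          # e.g. "1.10.0+cu113"
print(torch.version.cuda)         # CUDA version bundled with the wheel, e.g. "11.3"
print(torch.cuda.is_available())  # True only if the host driver is compatible
```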
Do you mean venv_update?
In my case I use the conda freeze option and do not even have CUDA installed on the agents.
I just updated my server to 1.0 and now the services agent is stuck in restarting:
Ah, very cool! Then I will try this, too.
I see. Thanks a lot!
I see. I was just wondering what the general approach is. I think PyTorch used to ship the pip package without CUDA packaged into it. So with conda it was nice to only install CUDA in the environment and not the host. But with pip, you had to use the host version as far as I know.
I got the idea from an error that occurred when the agent was configured to use pip and tried to install BLAS (for PyTorch, I guess).
However, to use conda as the package manager, I need a Docker image that provides conda.
I use fixed users!
docker-compose ps
          Name                     Command                  State                      Ports
clearml-agent-services    /usr/agent/entrypoint.sh        Restarting
clearml-apiserver         /opt/clearml/wrapper.sh ap ...  Up          0.0.0.0:8008->8008/tcp, 8080/tcp, 8081/tcp ...
Ah, perfect. Did not know this. Will try! Thanks again! 🙂
Okay, it seems like it just takes some time to delete and to be reflected in the WebUI. So when I try to delete again, a deletion process already seems to be running in the background.
I created a GitHub issue because the problem with the slow deletion still exists: https://github.com/allegroai/clearml/issues/586#issue-1142916619
I restarted it after I got the errors, because as everyone knows, turning it off and on usually works 😄
Here is a part of the cleanup service log. Unfortunately, I cannot even download the full log currently, because the clearml-server will just throw errors for everything.
Maybe deletion happens "async" and is not reflected in parts of clearml? It seems that if I try to delete often enough, at some point it is successful.
It could be that the clearml-server misbehaves while the cleanup is ongoing, or even after it finishes.
SuccessfulKoala55 So what happens is that whenever (or right after) the cleanup_service runs, clearml throws these kinds of errors.
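For context, my mental model of that cleanup flow is roughly the loop below (a hand-written sketch using the public SDK, not the actual cleanup_service code; the project name and filter are placeholders). Deleting artifacts and models per task is presumably what makes it slow and makes it look asynchronous in the WebUI:

```python
from clearml import Task

# Illustrative sketch of a cleanup loop, NOT the real cleanup_service implementation.
old_tasks = Task.get_tasks(
    project_name="examples",                # placeholder project
    task_filter={"status": ["completed"]},  # only finished tasks
)
for t in old_tasks:
    # Deletes the task and its artifacts/models; this per-task work can take a while.
    t.delete(delete_artifacts_and_models=True, raise_on_error=False)
```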