Hi GreasyRaven35
You should set the output_uri in Task.init, it will auto-upload the model and register the remote location URL:
task = Task.init(..., output_uri=True)
You can also specify a target bucket, if you have configured credentials (e.g. output_uri="s3://bucket")
BoredHedgehog47
is this ( https://clearml.slack.com/archives/CTK20V944/p1665426268897429?thread_ts=1665422655.799449&cid=CTK20V944 ) the same issue (or solution)?
LazyTurkey38
The last part makes sense. Not sure I get the "if clone": we are calling execute_remotely, so I'm assuming we do not need to clone ourselves, but send the current Task.
Other than that yes, makes sense (BTW, assuming you have upgraded the server to >=1.0 you can just call mark_stopped, no need to reset)
See if this helps
Hi FierceHamster54
I'm sure this is solvable, get in touch with them either via the contact form on the website or by emailing support@clear.ml , it should not be complicated to fix 🙂
Oh that is odd... let me check something
I think this all ties into the non-standard git repo definition. I cannot find any other reason for it. Is it actually stuck for 5 min at the end of the process, waiting for the repo detection?
task.models["outputs"][-1].tags
(plural, a list of strings) and yes I mean the UI 🙂
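A quick sketch of checking those tags programmatically (the task ID is a placeholder, and the "outputs" key follows the snippet above):
from clearml import Task

# Fetch an existing task and inspect the tags of its latest output model
task = Task.get_task(task_id="<your-task-id>")  # placeholder task ID
print(task.models["outputs"][-1].tags)  # a list of strings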
I get the n_saved part.
What's missing for me is how you would tell the TrainsLogger/Trains that the current one is the best. Or are we assuming the last saved model is always the best? (in that case there is no need for a tag, you just take the last in the list)
If we are going with: "I'm only saving the model if it is better than the previous checkpoint", then just always use the same file name, i.e. " http:/...
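As a minimal sketch of that "same file name" approach (PyTorch assumed, and the metric here is a stand-in for a real validation score):
import torch
import torch.nn as nn
from clearml import Task

task = Task.init(project_name="examples", task_name="best-checkpoint", output_uri=True)
model = nn.Linear(4, 2)

best = float("inf")
for epoch in range(3):
    metric = 1.0 / (epoch + 1)  # stand-in for a real validation metric
    if metric < best:
        best = metric
        # Same file name every time -> ClearML keeps updating a single "best" output model
        torch.save(model.state_dict(), "best_model.pt")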
CostlyOstrich36 did you manage to reproduce it?
I tried conda w/ python3.9 on a clean Windows VM, and it worked as expected...
Okay, some progress, so what is the difference ?
Any chance the issue can be reproduced with a small toy code?
Can you run the tqdm loop inside the code that exhibits the CR issue? (maybe some initialization thing is causing it to ignore the value?!)
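Something like this minimal loop, dropped into the same process (a sketch):
from time import sleep
from tqdm import tqdm

# If this bar renders correctly here but not in your code, the problem is
# likely some earlier initialization rather than tqdm itself
for _ in tqdm(range(100)):
    sleep(0.01)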
DefiantHippopotamus88
HTTPConnectionPool(host='localhost', port=8081):
This will not work, because inside the container of the second docker-compose "fileserver" is not defined:
CLEARML_FILES_HOST="..."
You have two options:
1. Configure the docker-compose to use the host network on all containers (as opposed to the isolated mode they are running in now).
2. Configure all of the CLEARML_* variables to point to the host IP address (e.g. 192.168.1.55), then rerun the entire thing.
No, I mean actually compare using the UI, maybe the arguments are different or the "installed packages"
DefeatedOstrich93 many thanks I was able to reproduce it (basically newly added files caused git apply to fail)
Fix will be part of the next clearml-agent RC
Thanks DefeatedOstrich93
Let me check if I can reproduce it.
PungentLouse55 hmmm
Do you have an idea on how we could quickly reproduce it?
Hi @<1541954607595393024:profile|BattyCrocodile47>
I do have the SSH key placed at /root/.ssh/id_rsa on the machine,
Notice that the .ssh folder is mounted from the host (EC2 / GCP) into the container,
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh'
This is odd; why is it mounting it to /.ssh and not /root/.ssh?
I do not think this is the upload timeout; it makes no sense to me for the GCP package (we do not pass any timeout, it's their internal default for the argument) to include a 60 sec timeout for upload...
I'm also not sure where the timeout originates (I'm assuming the initial GCP handshake connection could not actually time out, as the response should be relatively quick, so 60 sec is more than enough)
Isn't that risky? Not knowing you need a package?
How do you actually install it on the remote machine with the agent?
But I'm sure there is a cleaner way to proceed.
Maybe?!
path = task.get_output_destination().replace('file://', '', 1)
Hi ObnoxiousStork61
but unfortunately I can't fetch them from my local computer,
is this intended?
By default ClearML will only log the weights files.
It can also automatically upload them, if you pass a destination for storage at Task.init.
For example, to store on the files server:
Task.init(..., output_uri=True)
To store on S3 (sub-folders will be created automatically based on the Task ID):
Task.init(..., output_uri='s3://bucket')
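A minimal sketch of the auto-upload behavior (PyTorch assumed here; project/task names are placeholders):
import torch
import torch.nn as nn
from clearml import Task

# With output_uri set, weights files saved by the framework are both logged
# and uploaded automatically
task = Task.init(project_name="examples", task_name="auto-upload-demo", output_uri=True)
model = nn.Linear(8, 2)
torch.save(model.state_dict(), "model.pt")  # captured and uploaded by ClearML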
Hi SmugLizard24
The question is what is the reason of the issue?
That is a good question, could it be out of memory? (trying to compress or send the file in one chunk?)
The idea of queues is, on the one hand, not to let users have too much freedom, and on the other, to allow for maximum flexibility & control.
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example: I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only). The same argument goes for the number of CPUs...
Ch...
TrickySheep9 Yes, let's do that!
How do you PR a change?
Hi MagnificentSeaurchin79
Could you test with the tensorflow toy example?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorboard_toy.py
I want to inject a bash command after the repo has been cloned (and maybe even after the venv has been installed).
LazyTurkey38 the created venv inherits from the system environment, so in theory you can do all the installation on the system python and the created venv will just inherit the packages, no?
(btw: just to clarify, there is only one entry point for the custom bash script and that is before everything, so users can configure the container before the agent starts)
Can you verify it fixes the timeout issue as well? (or some insight on how to reproduce the issue?)
Hi ZippySheep23
Any ideas what might be happening?
I think you passed the upload limit (2.36 GB) 🙂