Hi @<1523711619815706624:profile|StrangePelican34>
Hmm, I think this is missing from the docs, let me ping the guys about that 🙏
Actually what my service does is collect stdout/stderr from the Docker socket
That's exactly how the agent works, it cannot really filter it, it logs everything by default for full visibility ...
Looking at this example here, it looks like it only works with tasks:
Aha! Pipeline is a Task 🙂 (a specific type of Task, nonetheless a Task)
Just use the pipeline ID, and make sure you push it into the services queue, voila
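A minimal sketch of what that could look like from code, assuming you want to launch a copy of the pipeline (cloning is my assumption here, you may prefer to enqueue the original). `launch_pipeline_copy` is a hypothetical helper name; `Task.clone` and `Task.enqueue` are the regular ClearML Task calls, and "services" is the queue mentioned above:

```python
def launch_pipeline_copy(pipeline_task_id, queue_name="services"):
    """Clone the pipeline's controller Task and push the copy into a queue."""
    # Imported lazily so this file can be loaded where clearml is not installed.
    from clearml import Task

    # A pipeline is a Task, so its ID can be used like any other Task ID.
    cloned = Task.clone(source_task=pipeline_task_id)
    # The pipeline logic runs on whichever agent listens to this queue.
    Task.enqueue(task=cloned, queue_name=queue_name)
    return cloned
```

The pipeline ID is passed in as a plain string, exactly like a Task ID.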
Hmm do you host it somewhere? Is it pre-installed on the container?
Yes, but only with git clone 🙂
It is not stored on ClearML, this way you can work with the experiment manager without explicitly giving away all your code 😉
Do you have your Task.init call inside the "train.py" script? (And if you do, what are you getting in the Execution tab of the task?)
Sometimes it is working fine, but sometimes I get this error message
@<1523704461418041344:profile|EnormousCormorant39> can I assume there is a gateway at --remote-gateway <internal-ip>?
Could it be that this gateway has some network firewall blocking some of the traffic?
If this is all local network, why do you need to pass --remote-gateway?
ProudMosquito87 I think this is what you are looking for: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L101
DistressedGoat23
We are running a hyperparameter tuning (using some cv) which might take a long time and might be even aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, but the one controlling it all) there is the global progress of the optimization metric. Is this what you are looking for? Am I missing something?
Hi CleanPigeon16
I was wondering how (or if) you handle interruptions.
Good question, basically (and I might be missing a few details, but I think that's the general gist):
A new instance will be spun up (spot/regular, based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked by Amazon (i.e. it is a spot instance), another one will be spun up instead, as long as there is a job in the queue. That means that what is probably missing in you...
Oh yes, you probably have sorting or a filter applied there :)
Oh that makes sense. This depends on how you set up the clearml k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limitation per pod, then k8s will know to limit the number of pods on the same GPU machine.
wdyt?
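To make the RAM trick concrete, here is a rough back-of-the-envelope sketch. It assumes the scheduler packs pods purely by their memory request (when only a limit is set, k8s uses it as the request as well) and ignores system-reserved memory. The function name and the numbers are made up for illustration:

```python
def max_pods_by_memory(node_allocatable_gib, pod_memory_gib):
    """Upper bound on pods per node when memory is the binding constraint."""
    if pod_memory_gib <= 0:
        raise ValueError("pod_memory_gib must be positive")
    # Only whole pods fit; leftover memory smaller than one pod is unused.
    return int(node_allocatable_gib // pod_memory_gib)


# Hypothetical GPU node with 64 GiB allocatable and 30 GiB requested per pod.
print(max_pods_by_memory(64, 30))
```

So picking a per-pod memory value just above half of the node's RAM would cap it at one pod per machine.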
Yes that makes total sense to me. How about a GitHub issue on the clearml-docs ?
The pod has an annotation with an AWS role which has write access to the S3 bucket.
So assuming the boto environment variables are configured to use the IAM role, it should be transparent, no? (I can't remember what the exact envs are, but google will probably solve it 🙂)
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN. I was expecting clearml to pick them by default from the environment.
Yes it should, the OS env will always override the configuration file sect...
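Just to illustrate the precedence being described (OS environment first, configuration file second), here is a tiny sketch. This is not ClearML's actual lookup code; `resolve_setting` and the sample values are hypothetical:

```python
import os


def resolve_setting(name, config_section, environ=None):
    """Return the env value if set, otherwise fall back to the config file value."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value:
        # A non-empty environment variable wins over the configuration file.
        return value
    return config_section.get(name)


# Hypothetical values, only to show which source wins.
config = {"AWS_ACCESS_KEY_ID": "key-from-conf"}
env = {"AWS_ACCESS_KEY_ID": "key-from-env"}
print(resolve_setting("AWS_ACCESS_KEY_ID", config, environ=env))
```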
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 or above, because with 0.16 we added sections, and compatibility depends on the version. Meaning if you have tasks generated with trains 0.16, you need trains 0.16 to clone them from code (so you can properly control the arguments).
The problem is that even when I mount the SSH key into the root home directory (e.g., /root/.ssh/id_rsa with the correct permissions set to 400) I still encounter the same error.
The agent automatically mounts the .ssh folder from the host into the container, making sure all the permissions are set,
how can I run pip install -e .
In general the agent will add the "working" dir into the PYTHONPATH so that you should not have to manually do "-e ."
Tha...
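A small self-contained sketch of why the PYTHONPATH approach is usually enough: once a directory is on the import path, packages inside it can be imported without installing them. The temp folder stands in for the repo "working" dir, and `my_pkg` and `import_from_working_dir` are made-up names:

```python
import importlib
import os
import sys
import tempfile


def import_from_working_dir(working_dir, module_name):
    """Import a module that lives under working_dir without installing it."""
    if working_dir not in sys.path:
        # Same effect as having the directory listed in PYTHONPATH at startup.
        sys.path.insert(0, working_dir)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


with tempfile.TemporaryDirectory() as repo:
    # Create a throwaway package inside the stand-in working directory.
    pkg_dir = os.path.join(repo, "my_pkg")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("ANSWER = 42\n")
    my_pkg = import_from_working_dir(repo, "my_pkg")
    print(my_pkg.ANSWER)
```

Note this only covers plain imports; anything that relies on the package actually being installed (entry points, for example) would still need an install step.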
but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
How so?
because step can be constructed with multiple sub-components but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node we want that to also appear in the UI graph (and have its own Task / metrics etc.), correct?
So this is an additional config file with enterprise?
Extension to the "clearml.conf" capabilities
Is this new config file deployable via helm charts?
Yes, you can also set it company/user wide using the clearml Vault feature (again enterprise, sorry 😞 )
It seems to fail when trying to download the model:
    local_download = StorageManager.get_local_copy(uri, extract_archive=False)
  File "/opt/venv/lib/python3.7/site-packages/clearml/storage/manager.py", line 47, in get_local_copy
    cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
  File "/opt/venv/lib/python3.7/site-packages/clearml/storage/cache.py", line 55, in get_local_copy
    if helper.base_url == "file://":
And based on the error I suspect the...
Thanks CleanPigeon16
Could you verify Task "d1d361d1059c4f0981200f59d7683773" exists (and not archived)?
I can't find any actual model files on the server though.
What do you mean? Do you see the specific models in the web UI? Is the link valid?
So dynamic or static are basically the same thing, just in dynamic, I can edit the artifact while running the experiment?
Correct
Second, why would it be overwritten if I run a different run of the same experiment?
Sorry, I meant in the same run, if you reuse the artifact name you will be overwriting it. Obviously different runs different artifacts :)
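A quick sketch of the difference inside a single run. `upload_artifact` is the regular Task method; the helper name and artifact names are hypothetical, and `task` is whatever `Task.init` returned in your script:

```python
def save_epoch_stats(task, stats_by_epoch):
    """Upload per-epoch stats under unique names, plus one reused name."""
    for epoch, stats in stats_by_epoch.items():
        # Unique name per epoch: every upload is kept as its own artifact.
        task.upload_artifact(name=f"stats_epoch_{epoch}", artifact_object=stats)
        # Same name every time: each upload overwrites the previous one.
        task.upload_artifact(name="latest_stats", artifact_object=stats)
```

After the loop, "latest_stats" holds only the last epoch, while the per-epoch artifacts are all still there.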
RoughTiger69 how did you end up with a Task with just "origin" in the repo field ?
JitteryCoyote63 what's the clearml version?
Are you always seeing the "model uploaded completed" message?
What's the python version you are using?
Hi @<1535069219354316800:profile|PerplexedRaccoon19>
What do you mean by simulate?
You can manually set up and run a Task if you need:
'clearml-agent execute --id task_id' (add --docker for docker mode).
This will set up the env and run the task.
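If you want to trigger this from a script, one possible way is to build the command line and hand it to subprocess. A sketch only: `build_execute_command` is a made-up helper, "abc123" is a placeholder ID, and the actual run is left commented out since it needs clearml-agent installed and configured:

```python
import subprocess


def build_execute_command(task_id, docker=False):
    """Build the argv list for running a single Task with clearml-agent."""
    cmd = ["clearml-agent", "execute", "--id", task_id]
    if docker:
        # Run the task inside a container instead of directly on the host.
        cmd.append("--docker")
    return cmd


cmd = build_execute_command("abc123", docker=True)  # placeholder task ID
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where clearml-agent is available
```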