Sometimes it works fine, but sometimes I get this error message
@<1523704461418041344:profile|EnormousCormorant39> can I assume there is a gateway at --remote-gateway <internal-ip>
?
Could it be that this gateway has some network firewall blocking some of the traffic ?
If this is all local network, why do you need to pass --remote-gateway ?
ProudMosquito87 I think this is what you are looking for: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L101
DistressedGoat23
We are running hyperparameter tuning (using some CV) which might take a long time and might even be aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, but the one controlling it all) there is the global progress of the optimization metric, is this what you are looking for? Am I missing something?
I'm not sure TB supports confusion matrices regardless. From anywhere in your code you can do:
from trains import Task
Task.current_task().get_logger().report_confusion_matrix(...)
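For reference, a minimal sketch of reporting a confusion matrix through the logger (the matrix values and labels below are made up for illustration; the argument names follow my understanding of Logger.report_confusion_matrix):
```
import numpy as np
from trains import Task  # on newer versions: from clearml import Task

task = Task.init(project_name="examples", task_name="confusion matrix demo")

# Made-up 3x3 confusion matrix, purely for illustration
matrix = np.array([
    [50, 2, 1],
    [3, 45, 4],
    [0, 5, 48],
])

task.get_logger().report_confusion_matrix(
    title="validation confusion matrix",
    series="epoch 1",
    matrix=matrix,
    iteration=1,
    xlabels=["cat", "dog", "bird"],
    ylabels=["cat", "dog", "bird"],
)
```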
Hi CleanPigeon16
I was wondering how (or if) you handle interruptions.
Good question. Basically (and I might be missing a few details, but I think that's the general gist):
A new instance will be spun up (spot/regular based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked by Amazon (i.e. is a spot instance), another one will be spun up instead, as long as there is a job in the queue. That means that what is probably missing in you...
Oh yes, you probably have a sort or filter applied there :)
Oh that makes sense. This depends on how you set up the clearml k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limit per pod; then k8s will know to limit the number of pods on the same GPU machine.
wdyt?
Yes that makes total sense to me. How about a GitHub issue on the clearml-docs ?
The pod has an annotation with an AWS role which has write access to the s3 bucket.
So assuming the boto environment variables are configured to use the IAM role, it should be transparent, no? (I can't remember what the exact envs are, but google will probably solve it 🙂)
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN. I was expecting clearml to pick them up by default from the environment.
Yes it should, the OS env will always override the configuration file sect...
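As a rough example (a sketch, assuming the standard boto credential environment variables are what the pod exposes; the bucket URL is a placeholder):
```
import os

# Standard AWS credential env vars picked up by boto3 (and therefore by ClearML's S3 access).
# In a real pod these come from the IAM role / injected environment, not hard-coded values.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "<key>")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "<secret>")
os.environ.setdefault("AWS_SESSION_TOKEN", "<token>")

from clearml import StorageManager

# Anything under s3:// should now use the credentials above,
# overriding whatever is set in the clearml.conf aws section
local_copy = StorageManager.get_local_copy("s3://my-bucket/models/model.pkl")
print(local_copy)
```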
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 or above, because with 0.16 we added sections and the compatibility is within the same version. Meaning if you have tasks generated with trains 0.16 you need trains 0.16 to clone them from code (so you can properly control the arguments).
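As a rough illustration of cloning from code and overriding the argparse-backed arguments (the task ID, parameter names and queue name are placeholders):
```
from trains import Task  # on newer versions: from clearml import Task

# The task doing the cloning; the template task was created with trains >= 0.16
template = Task.get_task(task_id="<template_task_id>")

cloned = Task.clone(source_task=template, name="cloned run")

# With 0.16+ parameters live in sections, e.g. "Args" for argparse values
cloned.set_parameters({"Args/learning_rate": 0.001, "Args/epochs": 20})

Task.enqueue(cloned, queue_name="default")
```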
The problem is that even when I mount the SSH key into the root home directory (e.g.,
/root/.ssh/id_rsa
with the correct permissions set to 400) I still encounter the same error.
The agent automatically mounts the .ssh folder from the host into the container, making sure all the permissions are set.
how can I run
pip install -e .
In general the agent will add the "working" dir to the PYTHONPATH, so you should not have to manually do "-e ."
Tha...
but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
How so?
because a step can be constructed with multiple
sub-components
but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node we want that to also appear in the UI graph (and have its own Task / metrics etc)
correct?
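For comparison, ClearML's existing pipeline decorators already behave per-component, with each decorated function running as its own Task and appearing as a node in the UI graph. A minimal sketch (function names and values are made up):
```
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=["processed"], cache=False)
def preprocess(raw):
    # Runs as its own Task, so it shows up as a separate node in the pipeline UI graph
    return [x * 2 for x in raw]


@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def pipeline_logic():
    print(preprocess([1, 2, 3]))


if __name__ == "__main__":
    PipelineDecorator.run_locally()
    pipeline_logic()
```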
So this is an additional config file with enterprise?
Extension to the "clearml.conf" capabilities
Is this new config file deployable via helm charts?
Yes, you can also set it company/user wide using the clearml Vault feature (again enterprise, sorry 😞 )
It seems to fail when trying to download the model:
local_download = StorageManager.get_local_copy(uri, extract_archive=False)
  File "/opt/venv/lib/python3.7/site-packages/clearml/storage/manager.py", line 47, in get_local_copy
    cached_file = cache.get_local_copy(remote_url=remote_url, force_download=force_download)
  File "/opt/venv/lib/python3.7/site-packages/clearml/storage/cache.py", line 55, in get_local_copy
    if helper.base_url == "file://":
And based on the error I suspect the...
Thanks CleanPigeon16
Could you verify Task "d1d361d1059c4f0981200f59d7683773" exists (and not archived)?
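A quick way to check from code (a sketch; the archived check via system tags reflects my understanding of how archiving is marked):
```
from clearml import Task

task = Task.get_task(task_id="d1d361d1059c4f0981200f59d7683773")
print(task.name, task.status)

# Archived tasks carry the "archived" system tag, to the best of my knowledge
print("archived" in (task.get_system_tags() or []))
```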
. I can't find any actual model files on the server though.
What do you mean? Do you see the specific models in the web UI? is the link valid ?
So dynamic and static are basically the same thing, except that with dynamic I can edit the artifact while running the experiment?
Correct
Second, why would it be overwritten if I run a different run of the same experiment?
Sorry, I meant within the same run: if you reuse the artifact name you will be overwriting it. Obviously different runs, different artifacts :)
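To illustrate the two flavors (a minimal sketch; names are made up): register_artifact keeps a "dynamic" artifact in sync while the task runs, while upload_artifact uploads a one-off "static" object, and reusing the same name within the same run overwrites it.
```
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="artifacts demo")

# Dynamic artifact: the DataFrame is monitored and re-uploaded as it changes during the run
log_df = pd.DataFrame({"epoch": [], "loss": []})
task.register_artifact(name="training_log", artifact=log_df)

# Static artifact: uploaded once; uploading again with the same name in this run overwrites it
task.upload_artifact(name="config", artifact_object={"lr": 0.001, "epochs": 20})
```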
RoughTiger69 how did you end up with a Task with just "origin" in the repo field ?
JitteryCoyote63 what's the clearml
version ?
Are you always seeing the "model uploaded completed" message ?
What's the python version you are using?
Hi @<1535069219354316800:profile|PerplexedRaccoon19>
What do you mean by simulate?
You can manually set up and run a Task if you need:
'clearml-agent execute --id task_id' (add --docker for docker mode).
This will set up the env and run the task.
Is gpu_0_utilization also in % then?
Correct 🙂
I was trying to find what the min and max values are for the above metrics.
Oh that makes sense. Notice that you can get the values over time, so you can track the usage over the experiment's lifetime (you can of course see it in the Scalars tab of the experiment).
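For example, a sketch of pulling the reported scalars back from a task programmatically (the task ID is a placeholder, and the "monitor:gpu" / "gpu_0_utilization" metric names are an assumption about how the resource monitor labels them):
```
from clearml import Task

task = Task.get_task(task_id="<task_id>")

# Nested dict along the lines of {title: {series: {"x": [...], "y": [...]}}}
scalars = task.get_reported_scalars()

gpu_util = scalars.get("monitor:gpu", {}).get("gpu_0_utilization", {})
values = gpu_util.get("y", [])
if values:
    print("min %.1f%%  max %.1f%%" % (min(values), max(values)))
```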
Not sure why, but for some reason it seems it is failing to analyze the code, hence the warning and no packages...
Any other hints about your setup that might help to better understand the root cause? Maybe a home folder with unicode characters? Python installed in a specific way?
HealthyStarfish45 what exactly did you have in mind, in terms of the widget ?
So it was definitely related to the symlinks in some form
could it be it actually deleted the cache? How many agents are running on the same machine ?
My understanding is that on remote execution Task.init is supposed to be a no-op right?
Not really a no-op; it would sync argparse and the like, start background reporting services, etc.
This is so odd! literally nothing printed
Can you tell me something about the node "mrl-plswh100:0" ?
Is this like a SageMaker node? We have seen similar things where Python threads / subprocesses are not supported, and instead of Python crashing it just hangs there.