Hi ProudMosquito87
My apologies, there is still no concrete ETA ...
That said, I think a good toy example would really help accelerate this process.
How about opening a PR with a nice Hydra example? Then we can start discussing implementation details based on the toy example.
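Something along these lines, for instance (just a sketch; the names and config layout are made up for illustration):
```python
import hydra
from omegaconf import DictConfig, OmegaConf
from clearml import Task


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    task = Task.init(project_name="examples", task_name="hydra toy example")
    # log the fully-resolved hydra configuration so it shows up in the UI
    task.connect(OmegaConf.to_container(cfg, resolve=True))
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```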
Hmm, is this similar to this one: https://allegroai-trains.slack.com/archives/CTK20V944/p1597845996171600?thread_ts=1597845996.171600&cid=CTK20V944
LovelyHamster1 verified, this is a UI bug where an old limitation is still being enforced.
I will make sure they know about it, it should be fixed for the upcoming release 🙂
Nothing, except that Draft makes sense (it feels like the task is being prepped), whereas Aborted feels like something went wrong.
Yes, I guess that if we call execute_remotely without a queue, it makes sense for you to edit it...
Is that the case, TrickySheep9?
If it is, I think we should change it to Draft when it is not queued. Sounds good to you guys?
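For reference, this is the flow I mean (just a sketch; the project/task names are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote toy")
# Without a queue name, execute_remotely() stops the local run and leaves
# the task on the server un-enqueued, so it can still be edited there.
task.execute_remotely(queue_name=None, exit_process=True)
```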
Yes, I basically plan to use ClearML as a user-friendly cluster manager
and it is 🙂
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never otherwise used, and then have the DDP master process push into that high-priority queue, which will ensure these are the next Tasks to be executed (now the only thing that is missing is preemption of running Tasks, but this automation policy is unfortunate...
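Roughly like this from the master process (a sketch, assuming a queue named "high-priority" that the agents listen on; the task IDs are placeholders):
```python
from clearml import Task

# Push the pre-created worker Tasks into the high-priority queue so they
# are the next ones picked up by the agents (IDs here are placeholders).
for task_id in ["<worker-task-id-1>", "<worker-task-id-2>"]:
    worker_task = Task.get_task(task_id=task_id)
    Task.enqueue(worker_task, queue_name="high-priority")
```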
MagnificentSeaurchin79, making sure the basics work:
Can you see the 3D plots under the Plots section?
Regarding the Tensors, could you provide a toy example for us to test?
Okay, let me check it, but I suspect the issue is running over SSH. To overcome these issues with PyCharm we have a specific plugin that passes the git info to the remote machine. Let me check what we can do here.
FiercePenguin76 BTW, you can do the following to add / update packages on the remote session:
clearml-session --packages "newpackage>x.y" "jupyterlab>6"
Assuming you are using docker-compose, the console output is a good start
Where do you store those?
Set it on the PID of the agent process itself (i.e. the clearml-agent python process)
LudicrousParrot69 I would advise the following:
- Put all the experiments in a new project
- Filter based on the HPO tag, and sort the experiments based on the metric we are optimizing (see adding custom columns to the experiment table)
- Select + archive the experiments that are not used (a scripted sketch of this is below)
BTW: I think someone already suggested we do the auto-archiving inside the HPO process itself. Thoughts?
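If you want to script that last part, something along these lines could work (a sketch; the project name, tag, and metric title/series are assumptions for illustration):
```python
from clearml import Task

# Fetch the HPO experiments (project name and tag are illustrative)
tasks = Task.get_tasks(project_name="HPO project", tags=["HPO"])

# Sort by the metric we are optimizing; title/series names are made up
def objective(task):
    metrics = task.get_last_scalar_metrics()
    return metrics.get("validation", {}).get("loss", {}).get("last", float("inf"))

tasks.sort(key=objective)
best, rest = tasks[:10], tasks[10:]  # keep the top ones, archive the rest
```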
If that's the case, check the monitoring section of the experiment; you will find the free space in GB logged there.
PompousBeetle71 If this is argparse and the type is defined, the trains-agent will pass the equivalent in the same type; with str that amounts to ''. Make sense?
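For example (a sketch of the scenario; the argument name is made up):
```python
import argparse

parser = argparse.ArgumentParser()
# The type is defined, so the agent passes the value back as that type;
# for str, an "empty" value comes through as ''
parser.add_argument("--model-name", type=str, default="")
args = parser.parse_args()
print(repr(args.model_name))  # -> '' when nothing is passed
```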
Thanks MuddyCrab47!!!
I found it!
It turns out the artifact upload will always upload from a stream (i.e. no multi-part upload). I will make sure we fix it in the next RC (I think the plan is to have it out this weekend).
Yes, looks like it. Is it possible?
Sounds odd...
What's the exact project/task name?
And what is the output_uri?
Let me know if there is an issue 🙂
JitteryCoyote63 try to add the prefix to the parameter name, e.g. instead of "artifact_name" use "Args/artifact_name"
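For example, when overriding a cloned task's parameters (a sketch; the task ID, value, and queue name are placeholders):
```python
from clearml import Task

# Clone an existing task and override the argparse parameter using the
# "Args/" prefix; the ID, value, and queue here are placeholders.
cloned = Task.clone(source_task="<task-id>", name="clone with override")
cloned.set_parameters({"Args/artifact_name": "my_artifact"})
Task.enqueue(cloned, queue_name="default")
```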
Hi SteadyFox10
Short answer no 😞
Long answer: full permissions are available in the paid tier, alongside a few more advanced features.
Fortunately, in this specific use case the community service allows you to share a single (or multiple) experiments with a read-only link. Would that work?
CooperativeFox72 I would think the easiest would be to configure it globally in the clearml.conf (rather than adding more arguments to the already packed Task.init) 🙂
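Something like this in ~/clearml.conf (a sketch; the exact key depends on what you are setting, default_output_uri is just one example of such a global setting):
```
sdk {
    development {
        # global default instead of passing it to every Task.init call
        default_output_uri: "s3://my-bucket/clearml"
    }
}
```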
I'm with you on 60 messages being way too much...
Could you open a GitHub issue on it, so we do not forget?
OK, but this happens on my local machine, not in the agent
Resource monitoring is always running in the background, even on local machines (of course, you can turn it off).
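For example, to turn it off (a sketch; project/task names are illustrative):
```python
from clearml import Task

# Disable the background resource monitoring for this run
task = Task.init(
    project_name="examples",
    task_name="no monitoring",
    auto_resource_monitoring=False,
)
```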
What's the matplotlib version? And the Python version?
I'm wondering why this is the case, as Docker best practices do indicate we should use a non-root user on production images.
The docker image for the service-agent is not root?
Hi AbruptWorm50
the second "epoch loss" is the scalar for the "validation" process (see "validation: epoch loss" series is actually the TF file/folder prefix automatically added)
Make sense ?
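Roughly what happens (a sketch, assuming TF2 summary writers in sibling folders; the values are made up):
```python
import tensorflow as tf

# Two writers in sibling folders; the folder name becomes the series
# prefix, i.e. "epoch loss" vs. "validation: epoch loss"
train_writer = tf.summary.create_file_writer("logs/train")
val_writer = tf.summary.create_file_writer("logs/validation")

with val_writer.as_default():
    tf.summary.scalar("epoch loss", 0.42, step=1)
```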
I think we were able to fix it, let me check if it was pushed 🙂
BTW: how come torch is missing from the listing? Do you have "import torch" in the code?