PompousParrot44 now that I think about it, you might be able to limit the CPU affinity, would that help?
The use case I have is to allow people on my team to run their workloads on a set of servers without stepping on each other's toes..
So does that mean CPU-only workloads?
Also, are we worried about fairness? (i.e. someone "taking" all the CPU for themselves)
TenseOstrich47 it's based on a free "index", so the first index not in use will be captured. But if you remove agents the order will change, e.g. if you take down worker #1, the next worker you spin up will be #1 because that index is no longer taken.
Maybe failed pipelines with zero steps count as completed
Zero steps count as successful.
That said, how could it have zero steps if one of the steps failed, no?
If it fails during the `add_step` stage for the very first step, because `task_overrides` contains invalid keys.
I see, yes, I guess it makes sense to mark the pipeline as Failed 🙂
Could you add a GitHub issue on this behavior, so we do not miss it?
ClumsyElephant70
Could it be the `virtualenv` package is not installed on the host machine?
(From the log it seems you are running in venv mode, is that correct?)
SmarmySeaurchin8 could you test with the latest RC?
`pip install clearml==0.17.5rc2`
Thanks! a few thoughts below 🙂
- not true — you can specify the image you want for each step
My apologies, looking at the release notes, it was added a while back and I had not noticed 😞
- re: role-based access control - see Outerbounds Platform that provides a layer of security and auth features required by enterprises
Role-based access meaning limiting access in Metaflow, i.e. specific users/groups can only access specific projects etc. ...
An upload of 11GB took around 20 hours which cannot be right.
That is very, very slow: roughly 152 KB/s (11 GB over 20 hours is about 11e9 bytes / 72,000 s) ...
JitteryCoyote63 I think there is a ClearML logger, no?
Hi CheerfulGorilla72
see
Notice all posts on that channel are @channel 🙂
I just set the git credentials in the `clearml.conf` and it works out of the box.
git has issues with passing the user/token from the main repo to the submodules, hence my surprise that it is working out-of-the-box.
Do notice that if you are using an SSH key this is a non-issue.
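For reference, a minimal sketch of the relevant `clearml.conf` entries (the values here are placeholders):
```
agent {
    # HTTPS credentials the agent uses when cloning the repo (and submodules)
    git_user: "my-user"
    git_pass: "my-token"
}
```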
Nope, no `.netrc` defined anywhere, ...
If this is the case, can you try adding the following to your `extra_vm_bash_script`:
` echo machine example.com > ~/.netrc && echo log...
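In case it helps, a generic `.netrc` entry has this shape (the host and credentials below are placeholders, not the elided values above):
```
machine example.com
login my-user
password my-token
```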
Now I'm curious, what's the workaround?
That is odd...
So if you have 3 agents, how many concurrent experiments are they running? (actually running, not registered as running)
What's the matplotlib version? And the Python version?
JitteryCoyote63
Picks a new experiment on top of the long one running
This is very very strange. Is the long running experiment being logged (i.e. do you still see console output in the UI)?
Yup, I just wanted to mark it completed, honestly. But then when I run it, Colab crashes.
`task.close()` will do that
BTW, what's the exception you are getting?
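A minimal sketch of that flow, assuming a standard ClearML setup (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="colab-run")
# ... notebook work ...
task.close()  # flushes outstanding reports and closes the task so it can be marked completed
```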
Hi PleasantGiraffe85
Did you set `git_host` to only point to your host? Do you expect all the git clones to use SSH? What does the requirements.txt git link look like?
https://github.com/allegroai/clearml-agent/blob/bf07b7f76d3236c1118b81730c6d9718705a795a/docs/clearml.conf#L22
NastyOtter17 can you provide some more info?
How and where do I put this configuration?
In your `clearml.conf` on the machine with the agent, just add at the bottom of the file:
`agent.venvs_cache.path=~/.clearml/venvs-cache`
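The same setting written as a config section (in the conf file, the dotted form above and this block form are equivalent):
```
agent {
    venvs_cache {
        # setting a path enables caching of virtualenvs across runs
        path: ~/.clearml/venvs-cache
    }
}
```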
Okay, let me check it, but I suspect the issue is running over SSH; to overcome these issues with PyCharm we have a specific plugin that passes the git info to the remote machine. Let me check what we can do here.
FiercePenguin76 BTW, you can do the following to add / update packages on the remote session:
`clearml-session --packages "newpackage>x.y" "jupyterlab>6"`
Yeah, the ultimate goal I'm trying to achieve is to run tasks flexibly: for example, before running I could declare how many resources I need, and the agent would run the task as soon as it finds there are enough resources.
Check out `Task.execute_remotely()`.
You can put it anywhere in your code; when execution gets to it, if you are running without an agent it will stop the process and re-enqueue the task to be executed remotely. On the remote machine the call itself becomes a no-op,
I...
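A minimal sketch of how that looks (project, task, and queue names here are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote-run")

# Running locally (no agent): execution stops here and the task is
# re-enqueued on the "default" queue to be executed remotely.
# Running under an agent: this call is a no-op and execution continues.
task.execute_remotely(queue_name="default")

# ... heavy work placed here only actually runs on the remote machine ...
```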
EnviousPanda91 notice that when passing these arguments to clearml-agent you are actually passing default args. If you want an additional argument to always be used, set `extra_docker_arguments`
here:
https://github.com/allegroai/clearml-agent/blob/9eee213683252cd0bd19aae3f9b2c65939d75ac3/docs/clearml.conf#L170
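For example, a sketch of the relevant `clearml.conf` section (the `--ipc=host` value is just an illustration):
```
agent {
    # extra arguments appended to every docker run the agent launches
    extra_docker_arguments: ["--ipc=host"]
}
```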
Hi PompousBeetle71
Try this one, let me know if it helped:
`logging.getLogger('trains.frameworks').setLevel(logging.ERROR)`
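As a self-contained version of that snippet:
```python
import logging

# only show errors from the trains framework-binding logger
logging.getLogger('trains.frameworks').setLevel(logging.ERROR)
```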