yes that makes sense, I think what happened is one of the processes completed the Task (i.e. closed it) before the others did, and so they threw an exception.
I switched to having all tasks in separate processes
I think that's probably the best (performance wise as well), nice!
Hm, one of the issues I have with this change is that now every dataset that doesn’t have a semantic version cannot be loaded anymore
Okay we definitely need to solve that.
Any chance I can ask you to open a GitHub issue (just so we don't forget)?
I will pass it along quickly so that we can maybe offer a fix in the next RC
Seems the API has changed quite a bit over the last few versions.
Correct, notice that your old pipeline Tasks use the older package and will still work.
There seems to be no need for controller_task anymore, right?
Correct, you can just call pipeline.start() 🙂
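If it helps, a minimal sketch (project / task / queue names below are placeholders):
```
from clearml import PipelineController

# Build the controller - no separate controller_task needed
pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")

# Each step clones an existing Task (names here are placeholders)
pipe.add_step(
    name="stage_data",
    base_task_project="examples",
    base_task_name="data preparation",
)

# start() enqueues the steps and runs the DAG
pipe.start(queue="services")
```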
The pipeline creates the tasks, but never executes/enqueues them (they are all in Draft mode). No DAG graph appears in the RESULTS/PLOTS tab.
Which vers...
Do we support GPUs in a) docker mode b) k8s glue?
yes on both
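e.g. for docker mode, something along these lines should work (queue name and image are placeholders):
```
clearml-agent daemon --queue gpu_queue --docker nvidia/cuda:11.0-base --gpus all
```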
Is there a good reference to get started with k8s glue?
A few folks here already set it up, do you have a k8s cluster with GPU support?
LazyLeopard18 are you using the StorageManager to access azure:// links?
So how do I solve the problem? Should I just relaunch the agents? Because they can't execute jobs now
Are you running in docker mode?
If so you can actually delete mapped files (they will still be available inside the docker), just make sure you delete them X hours after they were created, and you should be fine.
wdyt?
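e.g. a cleanup sketch (untested; the cache path and the age threshold are placeholders):
```
import time
from pathlib import Path

CACHE_DIR = Path("/opt/clearml/agent/cache")  # placeholder: your mapped cache folder
MAX_AGE_HOURS = 24  # the "X hours" - pick something above your longest job

cutoff = time.time() - MAX_AGE_HOURS * 3600
for f in CACHE_DIR.rglob("*"):
    # Deleting on the host is fine: running dockers keep the mounted copy
    if f.is_file() and f.stat().st_mtime < cutoff:
        f.unlink()
```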
I have the agent configured to force install requirements.txt
what do you mean by that?
If you edit the requirements to have https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
No, I mean actually compare using the UI; maybe the arguments are different, or the "installed packages" are.
Is there a reason clearml will use the demo server when there is no ~/clearml.conf?
It's the default server for an easy getting-started journey, e.g. you run some sample code and it works, with zero configuration.
That said, you can set an environment flag to disable the default server behavior: CLEARML_NO_DEFAULT_SERVER=1
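e.g. (sketch - the flag has to be set before clearml is imported):
```
import os

# Opt out of the public demo-server fallback when no ~/clearml.conf exists
os.environ["CLEARML_NO_DEFAULT_SERVER"] = "1"

from clearml import Task  # import only after the flag is set
```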
ReassuredTiger98
wdyt?
BTW:
it will push potentially proprietary data to the public demo server.
The server if su...
I was just able to reproduce with "localhost"
But this is not a copy, this is a mount; your log showed cp failing
Can you also make sure you did not check "Disable local machine git detection" in the ClearML PyCharm plugin?
If nothing specific comes to mind i can try to create some reproducible demo code (after holiday vacation)
Yes please! 🙏
In the meantime, see if the workaround is a valid one
AstonishingSeaturtle47 , makes sense?
Hi @<1661542579272945664:profile|SaltySpider22> I'm not sure I understand the answer to my parallel question
I have no idea what string reference could be used when steps come from a Task?
Oh I see, you are correct: when it comes to Tasks, the assumption is you are passing strings (with selectors on the strings, i.e. the curly brackets), but there is no fancy serialization/deserialization as you have with pipelines from decorators / functions. The reason for that is that the Task itself is standalone, there is no way for the pipeline logic to actually "pull data" from it and "pass" it to the o...
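For reference, a sketch of the string-selector style (step / parameter names are placeholders; pipe is a PipelineController as in the earlier sketch):
```
# Wire a Task-based step's output into the next step via string selectors
pipe.add_step(
    name="stage_train",
    parents=["stage_data"],
    base_task_project="examples",
    base_task_name="training",
    parameter_override={
        # The curly-bracket selector is resolved at runtime from the parent step
        "General/dataset_url": "${stage_data.artifacts.dataset.url}",
    },
)
```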
When looking at the worker details, it says "No queues currently assigned to this worker"
Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing there. It is "listening" to queues and launching machines; those machines will be listening to the queue. I wonder if it is just easier to also make sure it is listed as "assigned" to those queues. wdyt?
Hi @<1576381444509405184:profile|ManiacalLizard2>
If you make sure all server access is via a host name (i.e. instead of IP:port, use host_address:port), you should be able to replace it with cloud host on the same port
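e.g. in clearml.conf (hostnames are placeholders, ports are the defaults):
```
api {
    # host names instead of raw IPs keep the server relocatable
    web_server: http://clearml.example.com:8080
    api_server: http://clearml.example.com:8008
    files_server: http://clearml.example.com:8081
}
```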
FreshParrot56 we could add this capability, but the main caveat is that if your version depends on multiple parent versions, you still need to download and extract all the parent versions, which means that when you clear them you might hurt later performance. Does that make sense? What is the use-case / scenario for you?
That would be great! Might have to use 2>/dev/null in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting. We have set up sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
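i.e. something along the lines of (queue/image are placeholders):
```
clearml-session --queue default --docker nvidia/cuda:11.0-base
```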
GitLab has support for S3-based cache, btw.
This might still be considered "slow" compared to local-dist/cluster mount
Would adding support for some sort of post task script help? Is something already there?
Interesting, can you expand on the use case? (currently there is only pre-task script, for setup)
Basically if I pass an arg with a default value of False, which is a bool, it'll run fine originally, since it just accepted the default value.
I think this is the nargs="?", is that right?
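For context, the usual pattern and its gotcha, as a sketch:
```
import argparse

parser = argparse.ArgumentParser()
# nargs="?": a bare --flag uses `const`, omitting --flag uses `default`
parser.add_argument("--flag", type=bool, nargs="?", const=True, default=False)

print(parser.parse_args([]).flag)                   # False (default)
print(parser.parse_args(["--flag"]).flag)           # True  (const)
# Gotcha: bool("False") is True, so explicit string values mislead
print(parser.parse_args(["--flag", "False"]).flag)  # True!
```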
So the original looks good, could it be you tried to clone a Task that was executed with an agent with pip, and then pushed into an agent running conda?
Sure thing!
BTW: not sure if it helps, but the SaaS version integrates with Genesis Cloud; I know they provide cheap GPUs, might be worth checking
See here: https://download.pytorch.org/whl/torch_stable.html
cu110/* has no torch 1.3.1, only 1.7.0
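i.e. if moving to 1.7.0 is an option, something like this should resolve (sketch):
```
pip install torch==1.7.0+cu110 -f https://download.pytorch.org/whl/torch_stable.html
```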