yes, looks like. Is it possible?
Sounds odd...
What's the exact project/task name?
And what is the output_uri?
SubstantialElk6 it seems the automatic PyTorch CUDA version resolution failed,
What do you have in the "installed packages" section?
Ohh SubstantialElk6 please use agent RC3 (the latest RC is somewhat broken, sorry, we will pull it out)
Oh that's definitely off 🙂
Can you send a quick toy snippet to reproduce it?
Hi ProudChicken98
How about saving it as a local YAML and uploading the file itself as an artifact?
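For example, a minimal sketch of that idea (the project/task names and the config contents here are just placeholders):
```python
import yaml
from clearml import Task

task = Task.init(project_name="examples", task_name="config as artifact")

# Hypothetical config, replace with whatever you are trying to store
config = {"lr": 0.001, "batch_size": 32}

# Save it locally as YAML, then upload the file itself as an artifact
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)

task.upload_artifact(name="config_yaml", artifact_object="config.yaml")
```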
Hi @<1532532498972545024:profile|LittleReindeer37>
Does Hydra support notebooks? If it does, can you point to an example?
Hi @<1533620191232004096:profile|NuttyLobster9>
base_task_factory is a function that gets the node definition and returns a Task to be enqueued,
pseudo code looks like:
def my_node_task_factory(node: PipelineController.Node) -> Task:
    task = Task.create(...)
    return task
Make sense?
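For anyone reading along, a hedged sketch of plugging such a factory into a pipeline step (the project, script and queue names are placeholders):
```python
from clearml import Task
from clearml.automation import PipelineController


def my_node_task_factory(node: PipelineController.Node) -> Task:
    # Build the Task this node should enqueue, based on the node definition
    return Task.create(
        project_name="examples",              # placeholder project
        task_name="step_{}".format(node.name),
        script="my_step.py",                  # placeholder script
    )


pipe = PipelineController(name="factory-demo", project="examples", version="1.0.0")
pipe.add_step(name="step_one", base_task_factory=my_node_task_factory)
pipe.start(queue="default")
```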
Actually, dumb question: how do I set the setup script for a task?
When you clone/edit the Task in the UI, under Execution / Container you should have it
After you edit it, just push it into the execution with the autoscaler and wait 🙂
LudicrousDeer3 when using the Logger you can provide the 'iteration' argument, is this what you are looking for?
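For example (the title/series/values are just for illustration):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="logger iteration demo")
logger = task.get_logger()

for i in range(10):
    # 'iteration' controls where on the x-axis the scalar is reported
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)
```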
Hi JitteryCoyote63
So that I could simply do
task._update_requirements(".[train]")
but when I do this, the clearml agent (latest version) does not try to grab the matching cuda version, it only takes the cpu version. Is it a known bug?
The easiest way to go about it is to add:
Task.add_requirements("torch", "==1.11.0")
task = Task.init(...)
Then it will auto detect your custom package, and will always add the torch version. The main issue with relying on the package...
Sure, set the OS environment variable CLEARML_NO_DEFAULT_SERVER=1
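e.g. a minimal sketch (assuming the variable has to be set before clearml loads its configuration, for instance at the very top of the script or exported in the shell):
```python
import os

# Assumption: set this before clearml is imported / Task.init is called
os.environ["CLEARML_NO_DEFAULT_SERVER"] = "1"
```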
UnsightlyShark53 Awesome, the RC is still not available on pip, but we should have it in a few days.
I'll keep you posted here :)
and do you have import tensorflow in your code?
We actually added a specific call to stop the local execution and continue remotely, see it here: https://github.com/allegroai/trains/blob/master/trains/task.py#L2409
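Roughly like this (shown with the current clearml package; the project, task and queue names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote execution demo")

# Everything up to this call runs locally (so the repo/packages get captured),
# then the task is enqueued and the local process exits
task.execute_remotely(queue_name="default", exit_process=True)

# From here on, the code only runs on the agent that pulled the task
print("running on the remote agent")
```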
I looked at your task log on the GitHub issue. It seems the main issue is that your notebook is not stored as Python code. Are you running it in a Jupyter notebook, or in IPython? Is this reproducible? If so, what are the Jupyter, Python and OS versions?
Hi MelancholyChicken65
I'm not sure you can control it; the UI deduces the URLs based on the address you are browsing to. So if you go to http://app.clearml.example.com you will get the correct ones, but you have to put them on the right subdomains:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#subdomain-configuration
Yes, actually that might be it. Here is how it works:
It launches a thread in the background to do all the analysis of the repository, extracting all the packages.
If the process ends (for any reason), it will give the background thread 10 seconds to finish and then it will give up. If the repository is big, the analysis can take longer, and it will quit.
If the problem persists (i.e. trains failing to detect packages), please open a GitHub issue so the bug will not get lost 🙂
Do you have any experience and things to watch out for?
Yes, for testing start with cheap node instances 🙂
If I remember correctly, everything is preconfigured to support GPU instances (aka the nvidia runtime).
You can take one of the templates from here as a starting point:
https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/
the storage configuration appears to have changed quite a bit.
Yes, I think this is part of the cloud-ready effort.
I think you can find the definitions here:
https://artifacthub.io/packages/helm/allegroai/clearml
EnviousStarfish54 good news, this is fully reproducible
(BTW: for some reason this call will pop the logger handler clearml installs, hence the lost console output)
cuda 10.1, I guess this is because no wheel exists for torch==1.3.1 and cuda 11.0
Correct
how can I enforce a specific wheel to be installed?
You mean like a specific CUDA wheel?
You can simply put the HTTP link to the wheel in the "installed packages" section, it should work
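i.e. an entry in the "installed packages" section can be a direct link instead of a pinned version, something along these lines (the URL below is only a placeholder pattern, use the exact wheel matching your Python/CUDA versions):
```
https://download.pytorch.org/whl/<cuda-tag>/torch-<version>-<python-tag>-linux_x86_64.whl
```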
You will be able to set it.
You will just not see the output in the console log, but everything is running and being executed
CooperativeFox72
Could you try to run the docker container, and then inside the docker try to do:
su root
whoami
BTW: if you need to set env variables you can also add -e PYTHONPATH=/new/path
to the docker args
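If you'd rather set those from code instead of the UI, a hedged sketch (assuming a recent clearml version that exposes set_base_docker with docker_arguments; the image name is a placeholder):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker args demo")

# Assumption: docker_arguments accepts extra docker flags such as -e env vars
task.set_base_docker(
    docker_image="my-docker-image:latest",
    docker_arguments="-e PYTHONPATH=/new/path",
)
```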
Hi SmilingFrog76
Great question, sadly multi-node is never simple 🙂
Let's start with the basic, let's assume one worker is available and the other is not, what would you want to happen? (p.s. I'm not aware of flexible multi-node training frameworks, i.e. a framework that can detect another node is available and connect with it mid training, that said, it might exist 🙂 )
Correct (if this is running on k8s it is most likely passed via env variables, CLEARML_WEB_HOST etc.)
Interesting, if this is the issue, a simple sleep after reporting should prove it. Wdyt?
BTW are you using the latest package? What's your OS?