BitingKangaroo95 nice work 🙂
I think that what did it was:
changing the sshd_config so that it allows port forwarding, agent forwarding and X11 forwarding.
But just in case, it might be that there was a pre-existing SSH identifier on your machine, hence the error.
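For reference, a minimal sketch of the relevant sshd_config directives (exact values depend on your setup, and you need to restart the sshd service after editing):
```
# /etc/ssh/sshd_config -- enable the forwarding options mentioned above
AllowTcpForwarding yes
AllowAgentForwarding yes
X11Forwarding yes
```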
Clearing known_hosts under ~/.ssh is also something I would try 🙂
Hi @<1573119955400921088:profile|CloudyPelican46>
On what machine is it best practice to run the cleanup service, the local machine or should it be on the clearml server?
The easiest is to run it on the server machine itself. In practice you can put it anywhere, but most of the time this service is sleeping and not using much RAM, so it kind of makes sense to keep it on the server.
It should be under script.diff:
```
'script': {'binary': '', 'repository': '', 'tag': '', 'branch': '', 'version_num': '', 'entry_point': '', 'working_dir': '', 'requirements': {'pip': ''}, 'diff': ''}
```
For some reason this is empty in your case, are you seeing it in the UI?
If you are querying the current task (i.e. running) it might not be there yet.
You can call this internal function, which returns only after the repo detection is done: `task._wait_for_repo_detection()`
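A minimal sketch of how that might be used (project/task names are placeholders, and note this is an internal, undocumented method):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="repo detection")
# blocks until clearml has finished detecting the repo / uncommitted diff
task._wait_for_repo_detection()
print(task.export_task()["script"]["repository"])
```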
but here I can tell them: return a dictionary of what you want to save
If this is the case you have two options: either store the dict as an artifact (this makes sense if it is not a standalone model you would like to use later), or store it as a model.
Artifact example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
getting them back
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts_retrieval.py
Model example:
https:/...
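In case a snippet is easier to follow than the links, here is a minimal sketch of storing and retrieving a dict artifact (names and the task id are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifacts demo")
# store a plain dict as an artifact on the task
task.upload_artifact(name="stats", artifact_object={"accuracy": 0.93})

# later, from another script/process
prev = Task.get_task(task_id="<task-id>")
stats = prev.artifacts["stats"].get()
```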
Hi DullCamel78
Hi everyone! Has anyone tried running
aws_autoscaler.py without docker?
Well, generally, since this is a remote machine the easiest way to control the environment is with containers, hence the default use case. In theory you can change it to use venv, but then of course you are somewhat limited with the different drivers/CUDA/python environments.
performance under docker is 10% lower than on bare metal
add to your extra docker args: `extra_docker_arguments: ["...`
hmm... try to run the trains-agent from the ML environment with "system_site_packages: true", it might do the trick. Anyhow please let me know if it worked 🙂
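If it helps, a minimal sketch of where that setting lives in clearml.conf (assuming the standard agent config layout):
```
agent {
    package_manager {
        # reuse packages already installed in the python environment running the agent
        system_site_packages: true
    }
}
```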
Hi @<1523715429694967808:profile|ThickCrow29>
clearml.automation.auto_scaler.AutoScaler which runs smoothly (kudos!!).
NICE!
The only thing I am missing is in the clearml dashboard/orchestration --> Is there a way to make it
hmm kind of needs backend support for that 🙂
For now, I can just see the log of the clearML task to monitor what's happening
Or is this restricted to Pro users?
Yeah, the GCP and AWS autoscaler dashboards are a paid-tier feature. But...
And If I create myself a Pro account
Then you have the UI and implementation of both AWS & GCP autoscalers, am I missing something?
Hi TenseOstrich47
You can check the new clearml-serving , and the new python interfaces added to the "Model" class.
https://github.com/allegroai/clearml/blob/22d795f68f0175ba9511cabd444ea4dba464f3cd/clearml/model.py#L444
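As a quick illustration of those python interfaces, a hedged sketch (project/model names are placeholders):
```python
from clearml import Model

# query registered models and fetch a local copy of the first match
models = Model.query_models(project_name="examples", model_name="my-model")
local_path = models[0].get_local_copy()
```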
Hi CluelessElephant89
When you edit the args (General section) in the UI, you are editing the args for "remote execution"
(i.e. when executed by the agent, the args dict will get the values from the UI, as opposed to "manual execution" where the UI gets the values from the code)
In order to simulate the "remote execution" inside your development environment
Try:
```python
from clearml import Task

# simulate remote execution of a specific Task instance
Task.debug_simulate_remote_task(task_id='R...
```
so I wanted to keep our βforkβ of the autoscaler but I guess this is not supported.
you are correct 🙂
I wonder, " I customized it a bit to our workflow " what did you add?
Hi FiercePenguin76
So currently the idea is you have full control over per user credentials (i.e. stored locally). Agents (depending on how deployed) can have shared credentials (with AWS the easiest is to push to the OS env)
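For example, shared agent credentials can be pushed via the standard environment variable overrides (key values here are placeholders):
```
# shared agent credentials via environment variables
export CLEARML_API_ACCESS_KEY="<access-key>"
export CLEARML_API_SECRET_KEY="<secret-key>"
```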
Hi GrotesqueMonkey62 any chance you can be a bit more specific? Maybe a screen grab?
Here is how it works: if you look at an individual experiment, scalars are grouped by title (i.e. multiple series on the same graph if they have the same title).
When comparing experiments, any unique combination of title/series will get its own graph, then the different series on the graph are the experiments themselves.
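A minimal sketch of the title/series grouping (names are arbitrary):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="scalars demo")
logger = task.get_logger()
# same title, two series -> two lines on the same graph
logger.report_scalar(title="loss", series="train", value=0.50, iteration=1)
logger.report_scalar(title="loss", series="val", value=0.62, iteration=1)
```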
Where do you think the problem lays ?
I can't think of any actual difference in flow ...
Can you try the following?
```python
task._setup_reporter()
task.set_initial_iteration(0)
```
Hi SarcasticSparrow10
Is it better to post such questions on Stackoverflow so they benefit everybody?
Yes, I think you are correct, it would; please do 🙂
Try `reuse_last_task_id='task_id_here'` to specify the exact Task to continue (click on the ID button next to the task name in the UI).
If this value is true, it will try to continue the last task on the current machine (based on the project/name combination); if the task was executed on another machine, it will just start a ...
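A hedged sketch of what that might look like in code (project/task names and the id are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="training",
    # continue a specific previous task instead of creating a new one
    reuse_last_task_id="<task-id-from-ui>",
    continue_last_task=True,
)
```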
The latest TAO doesn't use python for fine tuning, rather it uses the CLI entirely
It's a good question, but I think the CLI actually just runs python code (the CLI is their interface). Generally speaking I'm pretty sure it will not be complicated to convert the TLT integration to support TAO (Nvidia helps with that, and I think we had a similar process with Nvidia Clara/MONAI)
BTW: how are you using Nvidia TAO ?
orchestration module
When you previously mentioned cloning the Task in the UI and then running it, how do you actually run it?
regarding the exception stack
It's pointing to a stdout that was closed?! How could that be? Any chance you can provide a toy example for us to debug?
Hi CostlyElephant1
What do you mean by "delete raw data"? Data is always fetched to cached folders and clearml takes care of cache cleanup
That said, notice that get_mutable_local_copy takes a target folder you specify; in this case you should definitely delete it after usage. Wdyt?
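A minimal sketch of that pattern (dataset names and the target folder are placeholders):
```python
import shutil
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my-dataset")
# the mutable copy lands in a folder you choose -- cleanup is on you
local = ds.get_mutable_local_copy(target_folder="/tmp/my-dataset-copy")
# ... use the data ...
shutil.rmtree(local)
```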
Is there an easy way to add a docker argument in the python script?
On the task itself in the UI you can edit the docker arguments and add any missing flags
(task.set_base_docker will do the same from code)
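For instance, a sketch of the code-side equivalent (the image and flag are just examples):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker args")
# container image plus extra `docker run` arguments for remote execution
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    docker_arguments="--ipc=host",
)
```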
You can also edit the configuration and always add this flag:
None
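For example, a sketch of the agent-side configuration (the flag itself is just an example):
```
agent {
    # extra arguments passed to every `docker run` launched by this agent
    extra_docker_arguments: ["--ipc=host"]
}
```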
I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.
LOL yeah, btw: this is exactly the reason the enterprise version has a vault feature, so one could edit the base configuration in the UI and it automatically propagates everywhere
but docker_arguments doesn't propagate if I leave docker_image as None
yeah, that's correct, you have to select a container to be used
Failed to initialize NVML: Unknown Error
yeah, this is a driver issue. I think you need to check the VM image to see whether the drivers match the GPU on that machine
I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.
The first line of the Task console log should have the exact docker command that was used, this could be a good start
also check if there is any chance there is another agent listening to this queue, maybe it actually runs somewhere without a gpu at all?
Hi @<1631102016807768064:profile|ZanySealion18>
ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.
what do you mean by "does not pick up"? is it that the container is up but not executed with --gpus, so there is no GPU access?
Hi RobustRat47
My guess is it's something from converting the PyTorch code to TorchScript. I'm getting this error when trying the
I think you are correct see here:
https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/examples/pytorch/train_pytorch_mnist.py#L136
you have to convert the model to TorchScript for Triton to serve it
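A minimal sketch of that conversion (the model here is just a stand-in for the trained network):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in model
model.eval()
# trace with a representative input, then save the TorchScript file for Triton
scripted = torch.jit.trace(model, torch.randn(1, 1, 28, 28))
scripted.save("model.pt")
```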
VexedCat68
a Dataset is published, that activates a Dataset trigger. So if every day I publish one dataset, I activate a Dataset Trigger that day once it's published.
From this description it sounds like you created a trigger cycle, am I missing something ?
Basically you can break the cycle by triggering only on a New Dataset with a specific Tag (or by creating the auto dataset in a different project/sub-project); see the sketch below.
This will stop your automatic dataset creation from triggering the "orig...
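A hedged sketch of the tag-based approach (the project name, tag, polling interval, and callback are all assumptions):
```python
from clearml.automation import TriggerScheduler

scheduler = TriggerScheduler(pooling_frequency_minutes=15)
# fire only on datasets tagged "ready", so auto-created (untagged)
# datasets do not re-trigger the cycle
scheduler.add_dataset_trigger(
    schedule_function=lambda task_id: print(f"new dataset: {task_id}"),
    trigger_project="data/raw",
    trigger_on_tags=["ready"],
)
scheduler.start()
```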
No I was was pointing out the lack of one
Sounds like a great idea, could you open a github issue (if not already opened) ? just so we do not forget
set the pytorch lightning trainer argument `log_every_n_steps` to `1` (default `50`) to prevent the ClearML iteration logger from timing out
Hmm, that should not have an effect on the training time, all logs are sent in the background. That said, checkpoints might slow it a bit (i.e.; i...
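For reference, a minimal sketch of that trainer setting (other arguments omitted):
```python
from pytorch_lightning import Trainer

# log every step so the ClearML iteration logger keeps receiving updates
trainer = Trainer(max_epochs=10, log_every_n_steps=1)
```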
why are there indefinitely growing anonymous tasks, even after I've closed the main schedulers.
The anonymous Tasks are the Dataset you are creating (a Dataset version is also a Task of a certain type with artifacts; the idea is that Datasets are usually created from code, hence the need to combine the two).
Make sense?