Task.init should be called before the PyTorch distributed setup is called; then, in each instance, you need to call Task.current_task() to get the Task instance (and make sure the logs are tracked).
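A minimal sketch of that ordering, assuming the `clearml` package is installed and using the standard-library `multiprocessing` module as a stand-in for the PyTorch distributed launcher. The project/task names, the scalar being reported, and the `get_task` hook (there only so the worker can be exercised without a server) are hypothetical:

```python
import multiprocessing as mp


def worker(rank, get_task=None):
    """Runs in every worker process and fetches the already-created Task."""
    if get_task is None:
        # Imported lazily so the sketch can be loaded without clearml installed.
        from clearml import Task
        get_task = Task.current_task
    task = get_task()
    # Reporting through the shared Task keeps each rank's logs tracked.
    task.get_logger().report_scalar(
        title="loss", series=f"rank {rank}", value=0.0, iteration=0
    )
    return task


def main(world_size=2):
    from clearml import Task

    # Create the Task once, before any worker process is started.
    Task.init(project_name="examples", task_name="distributed sketch")
    procs = [mp.Process(target=worker, args=(rank,)) for rank in range(world_size)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

To try it, call `main()` from under an `if __name__ == "__main__":` guard in your own script.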
Thanks JitteryCoyote63 let me double check if there is a reason for that (there might be one, not sure)
Create one experiment (I guess in the scheduler)
task = Task.init('test', 'one big experiment')
Then make sure the scheduler creates the "main" process as a subprocess (basically the default behavior).
Then the subprocess can call Task.init and it will get the scheduler Task (i.e. it will not create a new task). Just make sure they all call Task.init with the same task name and the same project name.
for example, one notebook will be dedicated to explore columns, spot outliers and create transformations for specific column values.
This actually implies each notebook is a standalone "process", which makes a ton of sense. But this is where notebooks and proper SW design break: in traditional SW the notebooks would actually be Python files, and then of course you could import one from another. Unfortunately this does not work in notebooks...
If you are really keen on using notebooks I wou...
Thanks TroubledJellyfish71, I managed to locate the bug (and indeed it's the new aarch64 package support)
I'll make sure we push an RC in the next few days, until then as a workaround, you can put the full link (http) to the torch wheel
BTW: 1.11 is the first version to support aarch64, if you request a lower torch version, you will not encounter the bug
LOL totally
Hmm interesting, I guess once you are able to connect it with ClearML you can just clone / modify / enqueue and let users train models directly from the UI on any hardware, is that the plan ?
Is there a solution for that?
Hi DisturbedElk70
Well, assuming you mount/sync the "temp" folder of the offline experiment to a storage solution, then have another process (on the other side) syncing these folders, it will work and you will get "real-time" updates.
Offline Folder: get_cache_dir() / 'offline' / task_id
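The path expression above can be sketched with `pathlib`. This is only an illustration: `cache_dir` stands for whatever `get_cache_dir()` returns on your machine, and the example cache path and task ID are hypothetical.

```python
from pathlib import Path


def offline_task_folder(cache_dir, task_id):
    """Build the offline-session folder as <cache_dir>/offline/<task_id>."""
    return Path(cache_dir).expanduser() / "offline" / task_id


# Hypothetical values; replace with your real cache folder and task ID.
folder = offline_task_folder("/tmp/clearml_cache", "abc123")
print(folder)
```

This is the folder you would point your sync tool at on the sending side.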
We should probably have a section on that (i.e. running two agents on the same GPU, then explain how to use it)
Hi TrickyRaccoon92
Yes, please update me once you can. I would love to be able to reproduce the issue so we could fix it for the next RC.
Do you think such a feature exists in ClearML?
Currently this is "fixed" for iterations (which is actually just a monotonic integer value) or the time stamp.
But I cannot see any reason why we could not allow users to control the x-axis title, and to be able to set it in code, I'm assuming this is what you have in mind?
but cant catch that only one way for service queue or I can experiments with that?
UnevenOstrich23 I'm not sure what exactly the question is, but if you are asking whether this is limited, the answer is no, it is not limited to that use case.
Specifically you can run as many agents in "services-mode" pulling from any queue/s that you need, and they can run any Task that is enqueued on those queues. There is no enforced limitation. Did that answer the question ?
does clearml expect them to be actually installed to add them as installed packages for a task?
It should add itself to the list (assuming you will end up calling Task.init in your code)
And can I store models with no attachment to tasks?
Assuming you have the Model ID:
model = InputModel(model_id='aabbcc')
local_file_or_folder = model.get_weights()
Is this what you are looking for?
Hi SuccessfulOtter28
Why would you archive an experiment?
Because you do not want to see it any longer (i.e. not very important) but you do not want to lose the ability to later do some forensics and look into it (meaning you do not want to completely delete it)
does that make sense ?
Seems the apiserver is out of connections, this is odd...
SuccessfulKoala55 do you have an idea ?
because fastai's tensorboard doesn't work in multi gpu
Keep me posted when this is solved, so we can also update the fastai2 interface.
You mean like a name of the artifact ?
PricklyRaven28 basically this is the issue:
python -m fastai.launch <script>
There are multiple copies of the script running, but they are not aware of one another.
Are you getting any reporting from the different GPUs? I'm assuming there is a hidden OS environment variable that signals the "master" node, so all processes can communicate with it. This is what we should automatically capture. There is a workaround for fastai.launch, that is probably similar to this one:
Hi SoggyCow20
I would first like to say how amazing clearml is!
Thank you!
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
Yes, sdk.agent.default_docker.image = python:3.10.0-alpine should be agent.default_docker.image = python:3.10.0-alpine
Notice the scope is agent, not sdk
No, I was pointing out the lack of one
Sounds like a great idea, could you open a GitHub issue (if not already opened)? Just so we do not forget
Set the PyTorch Lightning trainer argument log_every_n_steps to 1 (default 50) to prevent the ClearML iteration logger from timing out.
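As a rough sketch of that setting, assuming the `pytorch_lightning` package is installed: the helper names and the extra `max_epochs` argument are hypothetical, only `log_every_n_steps` comes from the advice above.

```python
def trainer_kwargs(log_every_n_steps=1, **extra):
    """Collect Trainer arguments, logging on every step instead of every 50."""
    kwargs = {"log_every_n_steps": log_every_n_steps}
    kwargs.update(extra)
    return kwargs


def build_trainer(**extra):
    # Imported lazily so the sketch can be loaded without lightning installed.
    import pytorch_lightning as pl

    return pl.Trainer(**trainer_kwargs(**extra))
```

For example, `build_trainer(max_epochs=2)` would create a trainer that logs at every step.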
Hmm, that should not have an effect on the training time, all logs are sent in the background. That said, checkpoints might slow it a bit (i.e.; i...
DefiantHippopotamus88 seems like you are missing the ports
CLEARML_WEB_HOST="
"
CLEARML_API_HOST="
"
CLEARML_FILES_HOST="
"
The confusion matrix shows under debug sample, but the image is empty, is that correct?
I suspect it failed to create one on the host and then mount into the docker
Hi UnevenDolphin73
How can I ensure tasks in a pipeline have the same environment as the pipeline itself?
...
but the tasks (executed remotely) do not use that same environment?
Just verifying, we are talking about pipeline decorators?
We also wanted this, we preferred to create a docker image with all we need, and let the pipeline steps use that docker image
You can specify the docker on the decorator itself:
[None](https://github.com/allegroai...
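A minimal sketch of passing an image to the decorator, assuming `clearml` is installed. The step function, the `build_step` wrapper (used so the import only happens when it is called), and the choice of image are hypothetical; the image name is simply the default agent image quoted elsewhere on this page.

```python
# Arguments passed to the component decorator; "docker" selects the image
# the step will run in when executed remotely.
COMPONENT_KWARGS = {
    "return_values": ["total"],
    "docker": "nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04",
}


def build_step():
    # Imported lazily so the sketch can be loaded without clearml installed.
    from clearml.automation.controller import PipelineDecorator

    @PipelineDecorator.component(**COMPONENT_KWARGS)
    def add(a, b):
        # Trivial placeholder step; a real step would do the actual work.
        return a + b

    return add
```

Using one image for every step is what keeps the environment identical across the pipeline.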
WickedGoat98 this is awesome! Let me know how I could help
BTW: I checked regarding the plot comparison, this is a BE issue due to the size of the plot, I was told a fix will be deployed in a day or two.
So, for example, if there was an idle GPU and Q3 takes it, and then a task comes to Q2, for which we specified 3 GPUs, but Q3 has now taken some of these GPUs, what will happen?
This is a standard "race": the first one to come will "grab" the GPU and the other will wait for it.
I'm pretty sure enterprise edition has preemption support, but this is not currently part of the open source version (btw: also the dynamic GPU allocation, I think, is part of the enterprise tier, in the opensource ...
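The "race" described above can be illustrated with a toy first-come-first-served model. This is not how the ClearML agent is implemented, only a sketch of the outcome; the function name, the GPU counts, and the task labels are hypothetical.

```python
def first_come_first_served(total_gpus, requests):
    """Grant GPUs in arrival order; a request that does not fit has to wait."""
    free = total_gpus
    running, waiting = [], []
    for name, needed in requests:
        if needed <= free:
            free -= needed
            running.append(name)
        else:
            waiting.append(name)
    return running, waiting


# Q3's task arrives first and grabs 2 of 4 GPUs, so Q2's 3-GPU task waits.
running, waiting = first_come_first_served(4, [("Q3 task", 2), ("Q2 task", 3)])
print(running, waiting)
```

Without preemption, the waiting task only starts once enough GPUs are released.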