Starting Task Execution:
usage: train_clearml_pytorch_ignite_caltech_birds.py [-h] [--config FILE]
                                                     [--opts ...]

PyTorch Image Classification Trainer - Ed Morris (c) 2021

optional arguments:
  -h, --help     show this help message and exit
  --config FILE  Path and name of configuration file for training. Should be a
                 .yaml file.
  --opts ...     Modify config options using the command-line 'KEY VALUE'
                 p...
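As an aside, the '--opts ... KEY VALUE' interface in the help text above is the convention used by yacs-style configuration objects. A minimal sketch of how such an interface is typically wired up is below; it assumes a yacs CfgNode, and the TRAIN.* keys are hypothetical rather than taken from this trainer's actual config.

# Sketch only: assumes a yacs-style config; the key names are hypothetical and
# not taken from train_clearml_pytorch_ignite_caltech_birds.py.
from yacs.config import CfgNode as CN

_defaults = CN()
_defaults.TRAIN = CN()
_defaults.TRAIN.BATCH_SIZE = 16
_defaults.TRAIN.LR = 0.001

def load_config(config_file=None, opts=None):
    # Start from defaults, overlay the YAML file, then apply 'KEY VALUE' pairs.
    cfg = _defaults.clone()
    if config_file:
        cfg.merge_from_file(config_file)   # --config path/to/file.yaml
    if opts:
        cfg.merge_from_list(opts)          # --opts TRAIN.BATCH_SIZE 32
    return cfg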
This job did download pre-trained weights, so the only difference between them is the local dataset cache.
SuccessfulKoala55 A second queued job executed on the same node, but this time it didn't need to cache the dataset locally, as that had already been done by the previous experiment, and it hasn't had this issue.
All that being said, apart from the console reporting looking messy, it doesn't appear to have impacted the training, or indeed the metric collection, of the first experiment where it occurred.
Good question, SuccessfulKoala55
My thoughts are orbiting around environment orchestration and having a bit more control over how an environment is created. I understand that the simplest form of configuration is to implement it on the clearml-agent side and run a daemon with the required configuration, whether that be using venvs or docker containers. Of course, this limits the deployment type to the queue that the daemon is listening to.
I was considering whether, by exposing the...
Yes, there's an internal provisioning for it from the Azure VMSS.
However, that would mean passing the hostname back to the AutoScaler class.
Right now, as it's written, the spin_up_worker method doesn't update the class attributes. Following the AWS example, that is also the case there: I can see it merely takes the arguments given, such as the worker id, and constructs a node with those parameters, e.g. hostname etc.
Looking at the supervisor method of the base AutoScaler clas...
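To make the idea concrete, here is a rough sketch of what I mean by passing the hostname back; this is not the real clearml AutoScaler API, and the class, method and attribute names (AzureVMSSAutoScaler, provision_vmss_node, worker_hostnames) are hypothetical.

# Hypothetical sketch only: spin_up_worker records the hostname provisioned by
# the Azure VMSS on the instance, instead of discarding it, so the supervisor
# loop can reconcile workers against hostnames later.
class AzureVMSSAutoScaler:
    def __init__(self):
        self.worker_hostnames = {}   # worker_id -> hostname

    def provision_vmss_node(self, worker_id):
        # Placeholder for the Azure VMSS call that creates the VM and returns
        # the hostname it was provisioned with.
        return "vmss-node-{}".format(worker_id)

    def spin_up_worker(self, worker_id, queue):
        hostname = self.provision_vmss_node(worker_id)
        # Update class state rather than only passing arguments through.
        self.worker_hostnames[worker_id] = hostname
        return hostname

    def supervisor(self):
        # The supervisor loop can now see which hostname belongs to which worker.
        for worker_id, hostname in self.worker_hostnames.items():
            print("worker {} -> {}".format(worker_id, hostname))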
SuccessfulKoala55 However, this was the first time an experiment with this dataset was executed on this compute node. I have been doing a lot of trial and error with this setup to get the models training, so on my first compute node I had the data downloading locally quite early on, and I haven't seen the script have to download a local dataset cache since, as it was already done.
AgitatedDove14
So can you verify it can download the model?
Unfortunately it's still falling over, but then I got the same result for the credentials using both URI strings, the original and the modified version, so it points to something else going on.
I note that the StorageHelper.get() method has a call which modifies the URI before it is passed to the function that gets the storage account and container name. However, when I run this locally, it doesn't seem to do a...
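For illustration only (this is not the clearml StorageHelper code), the kind of parsing I mean looks roughly like the sketch below, assuming a URI of the form azure://<account>.blob.core.windows.net/<container>/<path>; any upstream rewrite of the URI would change what ends up in the account and container fields.

from urllib.parse import urlparse

def split_azure_uri(uri):
    # Illustrative assumption: azure://<account>.blob.core.windows.net/<container>/<path>
    parsed = urlparse(uri)
    account = parsed.netloc.split(".")[0]
    container, _, blob_path = parsed.path.lstrip("/").partition("/")
    return account, container, blob_path

# Hypothetical example:
# split_azure_uri("azure://myaccount.blob.core.windows.net/models/resnet50.pth")
# returns ("myaccount", "models", "resnet50.pth")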
I love the new design of the site.
When is clearml-deploy coming to the open source release?
Or is this a commercial only part?
This appears to confirm it as well.
https://github.com/pytorch/pytorch/issues/1158
Thanks AgitatedDove14 , you're very helpful.
In my case it's a Tesla P40, which has 24 GB VRAM.
Oh, so this applies to VRAM, not RAM?
Does "--ipc=host" make it a dynamic allocation then?
I believe the standard shared memory allocation for a Docker container is 64 MB, which is obviously not enough for training deep learning image classification networks, but I am unsure of the best way to fix the problem.
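For what it's worth, the 64 MB figure is Docker's default /dev/shm, which is host RAM-backed shared memory used by the PyTorch DataLoader workers rather than GPU VRAM. The two usual fixes are raising /dev/shm with --shm-size or sharing the host IPC namespace with --ipc=host; a sketch of both via the Docker SDK for Python follows, where the image name, command and size are only placeholders.

import docker

client = docker.from_env()

# Option 1: raise the container's /dev/shm above the 64 MB default so the
# DataLoader workers have room to exchange batches.
client.containers.run(
    "nvcr.io/nvidia/pytorch:23.10-py3",   # placeholder image
    command="python train.py",            # placeholder command
    shm_size="8g",
    detach=True,
)

# Option 2: share the host's IPC namespace instead, which removes the fixed
# /dev/shm cap entirely (equivalent to docker run --ipc=host).
client.containers.run(
    "nvcr.io/nvidia/pytorch:23.10-py3",
    command="python train.py",
    ipc_mode="host",
    detach=True,
)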
If I did that, I am pretty sure that's the last thing I'd ever do...... 🤣
Pffff security.
Data scientist be like....... 😀
Network infrastructure person be like ...... 😱