ok, what is the 3.8 release? a server release? how does this number relate to the numbers above?
ok, but will it install the engine and its dependencies as expected?
Why would it solve the issue? max_spin_up_time_min should be the param defining how long to wait after starting an instance, not polling_interval_time_min, right?
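For reference, a minimal sketch of how I read these two settings (names taken from the discussion above; the values are purely illustrative):
```python
# hedged sketch -- parameter names as discussed above, values illustrative
autoscaler_hyper_params = {
    # how long to wait for a newly started instance to come up before giving up
    "max_spin_up_time_min": 30,
    # how often the autoscaler polls the queues / instance state
    "polling_interval_time_min": 5,
}
```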
Alright, I have a follow-up question then: I used the param --user-folder "~/projects/my-project", but any change I make is not reflected in this folder. I guess I am inside the docker container and this folder is not linked to the folder on the machine. Is it possible to do so?
self.clearml_task.get_initial_iteration() also gives me the correct number
Hi SuccessfulKoala55, thanks for the idea! The function isn't called with atexit.register() though; maybe the way the agent kills the task is not supported by atexit.
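For context: atexit handlers run on normal interpreter shutdown, but not if the process is killed with SIGKILL. If the agent sends SIGTERM, a signal handler can still route it through a normal exit so atexit fires. A minimal sketch (the cleanup function is hypothetical):
```python
import atexit
import signal
import sys

def cleanup():
    # hypothetical cleanup logic
    print("cleaning up")

atexit.register(cleanup)

def on_sigterm(signum, frame):
    # exit normally so atexit-registered handlers still run;
    # note: nothing can intercept SIGKILL
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
```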
Amazon Linux
so that any error arising from communication with the server can be tested
trains==0.16.4
So either I specify agent.python_binary: python3.8 in the clearml-agent config, as you suggested, or I force the task locally to run with python3.8 using task.data.script.binary
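For illustration, a minimal sketch of the second option (hedged: the project/task names are hypothetical, and whether the change persists to the server may depend on the SDK version):
```python
from clearml import Task

task = Task.init(project_name="my-project", task_name="my-task")
# the attribute mentioned above: the interpreter recorded in the task's
# script metadata, which the agent uses when running the task remotely
task.data.script.binary = "python3.8"
```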
Hi AgitatedDove14, I don't see any in https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could override it and add one?
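A minimal sketch of what that override could look like, assuming the goal is to log the patience counter (EarlyStopping tracks counter and patience internally but exposes no hook):
```python
import logging
from ignite.handlers import EarlyStopping

logger = logging.getLogger(__name__)

class VerboseEarlyStopping(EarlyStopping):
    """EarlyStopping subclass that logs the patience counter on each call."""

    def __call__(self, engine):
        super().__call__(engine)
        # counter/patience are maintained by the parent class
        logger.info("early stopping counter: %d/%d", self.counter, self.patience)
```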
Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
Is there one?
No, rather I wanted to understand how it works behind the scenes 🙂
The latest RC (0.17.5rc6) moved all logging into a separate subprocess to improve speed with pytorch dataloaders
Thatโs awesome!
Yeah, so I assume that training my models using docker will be slightly slower, so I'd like to avoid it. For everything else, using docker is convenient
AgitatedDove14 yes! I now realise that the ignite event callbacks don't seem to be fired (I tried to print a debug message in a custom Events.ITERATION_COMPLETED handler) and I cannot see it logged
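For reference, the kind of debug handler described above (a minimal self-contained sketch with a dummy training step):
```python
from ignite.engine import Engine, Events

def train_step(engine, batch):
    return 0.0  # dummy step for illustration

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED)
def debug_print(engine):
    # this is the message that never shows up in the logs
    print(f"iteration {engine.state.iteration} completed")

trainer.run([0, 1, 2], max_epochs=1)
```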
Unfortunately this is difficult to reproduce... Nevertheless, it is important to me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they already ran (a mechanism similar to what https://luigi.readthedocs.io/en/stable/ offers). That would allow restarting the pipeline and skipping tasks until the point where it previously failed
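A rough sketch of what such a luigi-style skip check could look like with the existing SDK (hedged: the project/step names are hypothetical, and the status-filter syntax may vary by version):
```python
from clearml import Task

def already_completed(project_name, task_name):
    # look for a previously completed task with the same name
    tasks = Task.get_tasks(
        project_name=project_name,
        task_name=task_name,
        task_filter={"status": ["completed"]},
    )
    return len(tasks) > 0

if not already_completed("my-pipeline", "step-3"):
    # (re)run or enqueue the step here
    ...
```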
(I use trains-agent 0.16.1 and trains 0.16.2)
I specified torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version, 1.6.0
I also discovered https://h2oai.github.io/wave/ last week; it would be awesome to be able to deploy it in the same manner
Also, from https://lambdalabs.com/blog/install-tensorflow-and-pytorch-on-rtx-30-series/ :
As of 11/6/2020, you can't pip/conda install a TensorFlow or PyTorch version that runs on NVIDIA's RTX 30 series GPUs (Ampere). These GPUs require CUDA 11.1, and the current TensorFlow/PyTorch releases aren't built against CUDA 11.1. Right now, getting these libraries to work with 30XX GPUs requires manual compilation or NVIDIA docker containers.
But what wheel does trains download in that case?