Still failing with the same error 😞
Hi CostlyOstrich36, one more observation: when I don’t open the experiment in the webUI before it is finished, I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don’t see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or only rarely) afterwards. Is that possible?
but if the task is now running on an agent, isn’t it a possible source of conflict? I would expect that after calling Task.enqueue(exit=True), the local task is closed and no processes related to it are running
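To illustrate, here is a minimal sketch of the pattern I have in mind (assuming the execute_remotely API rather than the enqueue call above; project/task names are placeholders):

from clearml import Task

task = Task.init(project_name="my-project", task_name="example")  # placeholder names
# What I expect: the task is pushed to the "default" queue and the local
# process exits, so nothing local keeps a handle on the task afterwards.
task.execute_remotely(queue_name="default", exit_process=True)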
Thanks a lot AgitatedDove14 !
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with --depth 1 in the agents?
Or more generally, try to catch this error and retry a few times?
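Something like this generic wrapper is what I have in mind (the helper name and the retry counts are hypothetical, just to illustrate the idea):

import time

def call_with_retries(fn, attempts=3, delay_sec=5):
    # Hypothetical helper: call fn() and retry a few times on failure
    # before giving up, e.g. around the git clone step.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay_sec)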
Hoo I found:
user@trains-agent-1: ps -ax
5199 ?  Sl  29:25  python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04  python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
AgitatedDove14 Up 🙂 I would like to know if I should wait for next release of trains or if I can already start implementing azure support
Hi SmugDolphin23, thanks for the input! I will try it now, but that seems hacky: to get it working I have to specify python3.8 in two places:
one in the agent config file (agent.default_python is already python3.8, but it seems to be ignored) + making sure it is available (using the python:3.8 docker image).
Is there a way to prevent this redundancy? I.e. if I want to change the Python version, can I control it from a single place?
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
For some reason the configuration object gets updated at runtime to:
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
ok, what is the 3.8 release? a server release? how does this number relate to the numbers above?
ok, but will it install the engine and its dependencies as expected?
Why would that solve the issue? max_spin_up_time_min should be the param defining how long to wait after starting an instance, not polling_interval_time_min, right?
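For context, this is how I read the two settings (the values are just examples, and the semantics are my understanding, not taken from the docs):

max_spin_up_time_min = 30       # how long to wait for a freshly started instance to come up
polling_interval_time_min = 5   # how often the autoscaler polls the queues / instances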
Alright, I have a follow-up question then: I used the param --user-folder “~/projects/my-project”, but any change I make is not reflected in this folder. I guess I am inside the docker container, and this folder is not linked to the folder on my machine. Is it possible to link them?
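What I was hoping for is something like mounting the host folder into the container, e.g. (assuming the agent picks up extra docker arguments from its config; the paths are illustrative):

agent.extra_docker_arguments: ["-v", "/home/me/projects/my-project:/root/projects/my-project"]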
self.clearml_task.get_initial_iteration() also gives me the correct number
Hi SuccessfulKoala55, thanks for the idea! The function registered with atexit.register() isn’t called though; maybe the way the agent kills the task is not supported by atexit
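Just to be explicit about what I tested (a minimal repro; the cleanup function is a stand-in for my real one):

import atexit

def on_exit():
    # Only runs on normal interpreter shutdown; it will not run if the
    # process is killed with SIGKILL, which might be what the agent does.
    print("cleanup called")

atexit.register(on_exit)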
so that any error that could arise from communication with the server could be tested
trains==0.16.4
So either I specify agent.python_binary: python3.8 in the clearml-agent config, as you suggested, or I enforce locally that the task runs with python3.8 using task.data.script.binary
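i.e. on the agent side it would be something like this in the agent config (paraphrased from what I understood, the value is an example):

agent {
    # force the python interpreter the agent uses when setting up the task environment
    python_binary: "python3.8"
}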
Hi AgitatedDove14 , I don’t see any in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could overwrite it and add one?
Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
Is there one?
No, I rather wanted to understand how it worked behind the scene 🙂
The latest RC (0.17.5rc6) moved all logs into a separate subprocess to improve speed with PyTorch DataLoaders
That’s awesome!
Yea so I assume that training my models using docker will be slightly slower so I'd like to avoid it. For the rest using docker is convenient
AgitatedDove14 yes! I now realise that the ignite event callbacks seem not to be fired (I tried to print a debug message in a custom Events.ITERATION_COMPLETED handler) and I cannot see it logged
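For reference, this is roughly the handler I attached (trainer here is a stand-in for my real ignite Engine):

from ignite.engine import Engine, Events

trainer = Engine(lambda engine, batch: None)  # stand-in for my real training loop

@trainer.on(Events.ITERATION_COMPLETED)
def debug_iteration(engine):
    # this debug print never shows up in the console or in the task log
    print(f"iteration {engine.state.iteration} completed")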