So the migration from one server to another + adding new accounts with password worked, thanks for your help!
Thanks! Unfortunately still not working, here is the log file:
Make sure the cloned task is in Draft mode, if not, reset it
Then in the Execution tab of the task, in the Source Code section (the first one), you can edit the values
AgitatedDove14 Yes, with the command you shared I can now ssh manually to the agent again, but clearml-agent will still raise the same error
ok, thanks SuccessfulKoala55 !
So it looks like the agent, from time to time, thinks it is not running an experiment
hooo now I understand, thanks for clarifying AgitatedDove14 !
- Stopping the server
- Editing the docker-compose.yml file, adding the logging section to all services
- Restarting the server
Docker-compose freed 10 GB of logs
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
it worked for the other folder, so I assume yes --> I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, updated the permissions and now it works
AgitatedDove14 So I copy-pasted locally the https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py from the examples of pytorch-ignite. Then I added a requirements.txt and called clearml-task to run it on one of my agents. I adapted the script a bit (removed python-fire since it's not yet supported by clearml).
and this works. However, without the trick from UnevenDolphin73 , the following won't work (returns None):
```
if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()

from clearml import Task
Task.init()
```
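(For reference, a rough programmatic equivalent of that clearml-task call, sketched with the clearml SDK; the project and queue names are placeholders, and I'm assuming a clearml version whose Task.create accepts script= and requirements_file=:)
```python
# Sketch: create and enqueue the same standalone script from Python instead of
# the clearml-task CLI (project/queue names are placeholders).
from clearml import Task

task = Task.create(
    project_name="examples",
    task_name="cifar10-distributed",
    script="cifar10-distributed.py",
    requirements_file="requirements.txt",
    add_task_init_call=True,  # appends a Task.init() call, like the injected lines shown above
)
Task.enqueue(task, queue_name="default")
```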
So actually I don't need to play with this limit, I am OK with the default for now
it actually looks like I don't need such a high number of files opened at the same time
I will try adding `sudo sh -c "echo '\n* soft nofile 65535\n* hard nofile 65535' >> /etc/security/limits.conf"` to the extra_vm_bash_script, maybe that's enough actually
AgitatedDove14 , my "uncommitted changes" ends with:
```
...

if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()

from clearml import Task
Task.init()
```
mmmh it fails, but if I connect to the instance and execute `ulimit -n`, I do see 65535, while the tasks I send to this agent fail with:
```
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
```
and from the task itself, I run:
```
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
```
which gives me in the logs of the task: `b'1024'`. So nofile is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
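(A possible workaround, sketched here as an assumption rather than something from the thread: raise the soft nofile limit from inside the task process itself with Python's resource module, which works as long as the hard limit on the machine is at least 65535, as in the limits.conf line above:)
```python
# Sketch: bump the soft RLIMIT_NOFILE inside the running task, assuming the
# machine's hard limit is already >= 65535 (e.g. set via /etc/security/limits.conf).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("before:", soft, hard)                          # e.g. (1024, 65535)
resource.setrlimit(resource.RLIMIT_NOFILE, (65535, hard))
print("after:", resource.getrlimit(resource.RLIMIT_NOFILE))
```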
Could also be related to https://allegroai-trains.slack.com/archives/CTK20V944/p1597928652031300
AgitatedDove14 I cannot confirm it 100%, the context is different (see previous messages) but it could be the same bug behind the scenes...
What is weird is:
- Executing the task from an agent: task.get_parameters() returns an empty dict
- Calling task.get_parameters() from a local standalone script returns the correct properties, as shown in the web UI, even if I updated them in the UI.
So I guess the problem comes from trains-agent?
More context:
- trains, trains-agent and trains-server all 0.16
- Session.api_version -> 2.9 (both when executed in trains-agent and in a local script)
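(A minimal sketch of the comparison described above, for reproduction; the task ID is a placeholder, and the comments only restate the behaviour reported in these messages:)
```python
# Sketch of the two calls being compared (task ID is a placeholder).
from trains import Task  # trains 0.16, before the rename to clearml

# 1) Inside the experiment while it is executed by trains-agent:
task = Task.current_task()
print(task.get_parameters())        # reported: empty dict when run by the agent

# 2) From a local standalone script pointing at the same experiment:
same_task = Task.get_task(task_id="abc123")  # placeholder task ID
print(same_task.get_parameters())   # reported: the parameters shown in the web UI
```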
So in my minimal reproducible example, it does work 🤣 very frustrating, I will continue searching for that nasty bug
I just read, I do have trains version 0.16 and the experiment was created with that version
ExcitedFish86 I have several machines with different CUDA driver/runtime versions, that is why you might be confused, as I am referring to one or another
As to why: This is part of the pipeline that I described in a previous message: Task B requires an artifact from task A, so I pass the name of the artifact as a parameter of task B, so that B knows which artifact from A it should retrieve
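(A minimal sketch of that A -> B handoff; the project, task, artifact and parameter names are placeholders, not the actual ones used here:)
```python
# Sketch: task B retrieves an artifact from task A by the name it received
# as a parameter (all names below are placeholders).
from clearml import Task

task_b = Task.init(project_name="examples", task_name="task_b")
params = {"artifact_name": "dataset.csv"}   # can be overridden in the UI when B is enqueued
task_b.connect(params)

task_a = Task.get_task(project_name="examples", task_name="task_a")
local_copy = task_a.artifacts[params["artifact_name"]].get_local_copy()
print("artifact from A downloaded to:", local_copy)
```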
(btw, yes I adapted it to use Task.init(..., output_uri=...))
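(For completeness, a sketch of what that change looks like; the project/task names and the destination URI are placeholders:)
```python
# Sketch of Task.init with output_uri (names and destination are placeholders).
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="cifar10-distributed",
    output_uri="s3://my-bucket/clearml",  # hypothetical default upload destination
)
```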