And I didn't have this problem before, because when cu117 wheels were not available, the agent was trying to get the wheel with the closest cu version and was falling back to 1.11.0+cu115, and this one was working
no it doesn't! 3. They select any point that is an improvement over time
Thanks! 3. I don't know, I never used Highcharts
I am not using hydra, I am reading the conf with:
` config_dict = read_yaml(conf_yaml_path)
config = OmegaConf.create(task.connect_configuration(config_dict)) `
But I am not sure it will connect the parameters properly, I will check now
Doing it the other way around works:
` cfg = OmegaConf.create(read_yaml(conf_yaml_path))
config = task.connect(cfg)
type(config)
<class 'omegaconf.dictconfig.DictConfig'> `
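For reference, a minimal self-contained sketch of that working pattern; `read_yaml` is replaced by plain `yaml.safe_load`, and the project/task names and config path are placeholders:
` # Sketch with placeholder names/paths: load YAML, wrap it in OmegaConf, then connect it.
import yaml
from clearml import Task  # 'trains' on older versions
from omegaconf import OmegaConf

task = Task.init(project_name="demo", task_name="omegaconf-connect")  # placeholder names

with open("conf.yaml") as f:          # placeholder path
    raw_conf = yaml.safe_load(f)

cfg = OmegaConf.create(raw_conf)      # DictConfig
cfg = task.connect(cfg)               # stays a DictConfig, as shown above
print(type(cfg)) `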
but then why do I have to do `task.connect_configuration(read_yaml(conf_path))._to_dict()`?
Why not simply `task.connect_configuration(read_yaml(conf_path))`?
I mean, what is the benefit of returning a `ProxyDictPostWrite` instead of a dict?
Same, it also returns a `ProxyDictPostWrite`, which is not supported by `OmegaConf.create`.
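For what it's worth, a small self-contained sketch of the `._to_dict()` workaround mentioned above (placeholder names/paths, with `yaml.safe_load` standing in for `read_yaml`):
` # Sketch: connect_configuration() returns a dict-like proxy, so convert it to a plain
# dict (via the _to_dict() call used above) before handing it to OmegaConf.create().
import yaml
from clearml import Task  # 'trains' on older versions
from omegaconf import OmegaConf

task = Task.init(project_name="demo", task_name="proxy-to-omegaconf")  # placeholder names
with open("conf.yaml") as f:                      # placeholder path
    raw_conf = yaml.safe_load(f)

proxy = task.connect_configuration(raw_conf)      # ProxyDictPostWrite, not a plain dict
cfg = OmegaConf.create(proxy._to_dict())          # plain dict copy, as in the workaround above `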
I mean, inside a parent, do not show the project [parent] if there is nothing inside
Correct, you could also use `Task.create`, which creates a Task but does not do any automagic.
Yes, I didn't use it so far because I didn't know what to expect since the doc states:
"Create a new, non-reproducible Task (experiment). This is called a sub-task."
Because it lives behind a VPN and GitHub workers don't have access to it
No worries! I asked more to be informed, I don't have a real use case behind it. This means that you guys internally catch the argparse parser object somehow, right? Because you could also simply use `sys.argv` to find the parameters, right?
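For context, a minimal sketch of the automagic capture being discussed, assuming placeholder argument names: once `Task.init` is called, the arguments parsed via argparse are picked up as the task's hyperparameters without any extra code:
` # Sketch with placeholder args: Task.init() hooks argparse, so parsed args are logged
# automatically; sys.argv would carry the same raw strings, but unparsed and untyped.
import argparse
from clearml import Task  # 'trains' on older versions

task = Task.init(project_name="demo", task_name="argparse-capture")  # placeholder names

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=10)
args = parser.parse_args()  # these values show up under the task's hyperparameters `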
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see:
`trains-agent-1 | - | - | - | ...`
(refresh page)
`trains-agent-1 | long-experiment | 12h | 72000 |`
Why is it required in the case where boto3 can figure them out itself within the ec2 instance?
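To illustrate the point, a minimal sketch (the bucket access is an assumption, subject to the instance's IAM role): inside an EC2 instance, boto3 can resolve credentials from the instance metadata service, so no explicit keys are needed:
` # Sketch: no explicit credentials; boto3 falls back to the EC2 instance profile.
import boto3

s3 = boto3.client("s3")                          # credentials resolved from instance metadata
buckets = s3.list_buckets()                      # works only if the attached role allows it
print([b["Name"] for b in buckets["Buckets"]]) `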
it actually looks like I don't need such a high number of files opened at the same time, because at some point it introduces too much overhead I guess
mmmh it fails, but if I connect to the instance and execute `ulimit -n`, I do see `65535`, while the tasks I send to this agent fail with:
`OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'`
and from the task itself, I run:
` import subprocess
print(subprocess.check_output("ulimit -n", shell=True)) `
which gives me in the logs of the task: `b'1024'`
So nofile is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work.
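As a side note, the same check can be done in-process without shelling out, using the standard library (a minimal sketch):
` # Sketch: read the open-file limits of the current process directly.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")  # the soft limit is what triggers 'Too many open files' `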
I will try adding
`sudo sh -c "echo '\n* soft nofile 65535\n* hard nofile 65535' >> /etc/security/limits.conf"`
to the `extra_vm_bash_script`, maybe that's enough actually
So actually I don't need to play with this limit, I am OK with the default for now
Thanks AgitatedDove14 !
What would be the exact content of `NVIDIA_VISIBLE_DEVICES` if I run the following command?
`trains-agent daemon --gpus 0,1 --queue default &`
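For what it's worth, a quick way to check from inside a task running on that agent (a sketch; presumably the value would be the comma-separated index list `0,1` in this case):
` # Sketch: print what the task process actually sees for the GPU visibility variable.
import os

print(os.environ.get("NVIDIA_VISIBLE_DEVICES")) `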
Add carriage return flush support using the `sdk.development.worker.console_cr_flush_period` configuration setting (GitHub trains Issue 181)
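If it helps, this is roughly where that setting would sit in the configuration file (trains.conf / clearml.conf); the nesting is just the dotted setting name expanded, and the value below is only a placeholder, check the docs for the actual default and units:
` # Sketch only: section layout assumed from the dotted setting name above.
sdk {
    development {
        worker {
            console_cr_flush_period: 10  # placeholder value
        }
    }
} `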
I checked the commit date and branch, went to all experiments, and scrolled until I found the experiment
Never mind, I was able to make it work, but no idea how
with 1.1.1 I get:
`User aborted: stopping task (3)`