No worries! I asked more to be informed, I don't have a real use case behind it. This means that you guys internally catch the argparser object somehow, right? Because you could also simply use sys.argv to find the parameters, right?
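Just to illustrate what I mean (a toy sketch, nothing to do with ClearML's actual internals): the same parameters are visible both as the raw sys.argv and through the parser object:

    import argparse
    import sys

    # Toy script: the parameters exist in two places.
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    args = parser.parse_args()

    print(sys.argv)     # raw command line, e.g. ['train.py', '--lr', '0.1']
    print(vars(args))   # structured values from the parser, e.g. {'lr': 0.1}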
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see either
    trains-agent-1 | - | - | - | ...
or, after refreshing the page,
    trains-agent-1 | long-experiment | 12h | 72000 |
Why is it required in the case where boto3 can figure them out itself within the EC2 instance?
it actually looks like I don’t need such a high number of files opened at the same time
because at some point it introduces too much overhead I guess
Mmmh, it fails, but if I connect to the instance and execute ulimit -n, I do see 65535,
while the tasks I send to this agent fail with: OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
    import subprocess
    print(subprocess.check_output("ulimit -n", shell=True))
which gives me in the logs of the task: b'1024'
So nofile is still 1024, the default value, but not when I SSH in, damn. Maybe rebooting would work
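A cleaner way to check this from inside the task (just a sketch of an alternative to shelling out to ulimit -n) is Python's resource module, which reports the limits of the process itself:

    import resource

    # (soft, hard) open-files limits as seen by this very process
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft}, hard={hard}")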
I will try adding
    sudo sh -c "echo '\n* soft nofile 65535\n* hard nofile 65535' >> /etc/security/limits.conf"
to the extra_vm_bash_script, maybe that’s enough actually
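For reference, a per-process alternative (just a sketch, not what I ended up doing) would be to raise the soft limit from inside the task itself; this needs no root as long as the new value stays at or below the hard limit:

    import resource

    # Raise the soft open-files limit up to the hard limit, for this process only
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))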
So actually I don’t need to play with this limit, I am OK with the default for now
Thanks AgitatedDove14!
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command?
    trains-agent daemon --gpus 0,1 --queue default &
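To see what a task actually gets, a quick check from inside a task running on that agent (hypothetical snippet, the exact variables exposed may differ by setup):

    import os

    # What the agent / container exposes for GPU visibility
    print(os.environ.get("NVIDIA_VISIBLE_DEVICES"))
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))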
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
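For reference, I guess this would be set in the conf file roughly like the following (the nesting just follows the dotted setting name; the flush period value here is only an example, the actual default may differ):

    sdk {
      development {
        worker {
          # flush period in seconds for carriage-return (\r) progress lines
          console_cr_flush_period: 10
        }
      }
    }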
I checked the commit date, then went to all experiments and scrolled until I found the experiment
Never mind, I was able to make it work, but no idea how
with 1.1.1 I get: User aborted: stopping task (3)
No, one worker (trains-agent-1) "forgets" from time to time the current experiment it is running and picks up another experiment on top of the one it is currently running
I finally found a workaround using cache, will detail the solution in the issue 👍
AgitatedDove14 I see at https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it being used to SSH into the instance?
Does the agent install the nvidia-container toolkit, so that the GPUs of the instance can be accessed from inside the Docker container running JupyterLab?
That’s why I said “not really” 😄
Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
So this message appears when I try to ssh directly into the instance
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
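Just to illustrate (bucket and key names are made up): with the instance profile attached, boto3 needs no explicit credentials at all, it resolves them from the instance metadata service:

    import boto3

    # No access key / secret passed: credentials come from the EC2
    # instance profile via the metadata service.
    s3 = boto3.client("s3")
    s3.download_file("my-bucket", "images/example.png", "/tmp/example.png")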
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
Yes, but I am not certain how: I just deleted the /data folder and restarted the server