edited the aws_auto_scaler.py, actually I think it's just a typo, I just need to double the brackets
so what worked for me was the following startup userscript:
` #!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential... `
the instances take so much time to start, like 5 mins
the deep learning AMI from nvidia (Ubuntu 18.04)
The task with id a445e40b53c5417da1a6489aad616fee
is not aborted and is still running
thanks for your help!
Yea, again I am trying to understand what I can do with what I have. I would like to be able to export, as an environment variable, the runtime where the agent is installing, so that one app I am using inside the Task can use the python packages installed by the agent and I can control the packages using clearml easily
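A minimal sketch of the idea, assuming the code runs inside the virtualenv the agent created (the variable names are hypothetical):
` import os
import sys
import sysconfig

# sys.executable is the interpreter of the virtualenv the agent built for this task,
# and sysconfig gives the matching site-packages directory where the agent installed packages
os.environ["TASK_PYTHON"] = sys.executable                            # hypothetical name
os.environ["TASK_SITE_PACKAGES"] = sysconfig.get_paths()["purelib"]   # hypothetical name

# any app launched from inside the task (e.g. via subprocess) would inherit these variables `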
Just found yea, very cool! Thanks!
as for disk space: I have 21GB available (8GB used), the /opt/trains/data folder is about 600MB
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see trains-agent-1 | - | - | - | ... and then, after refreshing the page, trains-agent-1 | long-experiment | 12h | 72000 |
It indeed has the old commit, so they match, no problem actually
Yes, thanks! In my case, I was actually using TrainsSaver from pytorch-ignite with a local path, then I understood looking at the code that under the hood it actually changed the output_uri of the current task; that's why my previous_task.output_uri = "s3://my_bucket" had no effect (it was placed BEFORE the training)
basically:
` from trains import Task
task = Task.init("test", "test", "controller")
task.upload_artifact("test-artifact", dict(foo="bar"))
cloned_task = Task.clone(task, name="test", parent=task.task_id)
cloned_task.data.script.entry_point = "test_task_b.py"
cloned_task._update_script(cloned_task.data.script)
cloned_task.set_parameters(**{"artifact_name": "test-artifact"})
Task.enqueue(cloned_task, queue_name="default") `
Setting it after the training correctly updated the task and I was able to store artifacts remotely
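For reference, a minimal sketch of setting the upload destination at init time (the bucket name is just a placeholder; as described above, a TrainsSaver with a local path can still override it during training):
` from trains import Task

# output_uri tells the task where to upload artifacts/models,
# but a TrainsSaver configured with a local path will change it under the hood
task = Task.init("test", "test", output_uri="s3://my_bucket") `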
I have two controller tasks running in parallel in the trains-agent services queue
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
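A minimal sketch of that sequencing (the task ids are placeholders, step1_task / step2_task are hypothetical names):
` from trains import Task

step1_task = Task.get_task(task_id="<step 1 task id>")   # placeholder id
step2_task = Task.get_task(task_id="<step 2 task id>")   # placeholder id

# block until step 1 reaches a final status, then refresh it so its uploaded
# artifacts are visible before step 2 is queued
step1_task.wait_for_status()
step1_task.reload()
Task.enqueue(step2_task, queue_name="default") `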
` trains-agent daemon --gpus 0 --queue default & trains-agent daemon --gpus 1 --queue default & `
I have CUDA 11.0 installed, but on another machine (also with 11.0 installed) trains downloads torch for CUDA 10.1; I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
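A possible sketch, assuming Task.add_requirements is available in this trains version (whether a matching CUDA 11.0 wheel is then found still depends on what is published for that torch version):
` from trains import Task

# assumption: Task.add_requirements exists in this trains version;
# it needs to be called before Task.init() so the pinned version ends up
# in the task's installed packages for the agent to resolve
Task.add_requirements("torch", "1.3.1")
task = Task.init("test", "test") `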
AgitatedDove14 I finally solved it: The problem was --network='host'
should be --network=host
I found it, the filter actually has to be an iterable: `Task.get_tasks(project_name="my-project", task_name="my-task", task_filter=dict(type=["training"]))`
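Spelled out as a runnable call (same filter, just formatted as a block):
` from trains import Task

# the values inside task_filter need to be iterables, e.g. a list of task types
tasks = Task.get_tasks(
    project_name="my-project",
    task_name="my-task",
    task_filter=dict(type=["training"]),
) `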
in the UI the value is the correct one (not empty, a string)
I carry this code from older versions of trains to be honest, I don't remember precisely why I did that
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem
In the meantime I now do:
` task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get() `
But I still get KeyError: 'output'
... Was that normal? Will it work if I replace the second line with task.refresh()?
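For reference, the interim workaround with the reload mentioned above, as a minimal sketch (the task id is a placeholder):
` from trains import Task

task = Task.get_task(task_id="<child task id>")   # placeholder id

task.wait_for_status()   # block until the task reaches a final status
task.reload()            # re-fetch the task data so the registered artifacts show up
output = task.artifacts["output"].get()   # should avoid the KeyError once reloaded `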
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment to the queue at that point (trains-agent-1 running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it because they are also running experiments)