Setting it after the training correctly updated the task and I was able to store artifacts remotely
I have two controller tasks running in parallel in the trains-agent services queue
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
trains-agent daemon --gpus 0 --queue default & trains-agent daemon --gpus 1 --queue default &
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1; I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
AgitatedDove14 I finally solved it: The problem was --network='host'
should be --network=host
I found it, the filter actually has to be an iterable: Task.get_tasks(project_name="my-project", task_name="my-task", task_filter=dict(type=["training"]))
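For reference, a minimal sketch of that call, with placeholder project/task names:

from clearml import Task

# "type" has to be an iterable (e.g. a list), even when filtering on a single task type
tasks = Task.get_tasks(
    project_name="my-project",
    task_name="my-task",
    task_filter=dict(type=["training"]),
)
print([t.id for t in tasks])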
in the UI the value is correct one (not empty, a string)
I carry this code from older versions of trains to be honest, I don't remember precisely why I did that
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem 🙂
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output'
... Was that normal? Will it work if I replace the second line with task.refresh()?
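For anyone hitting the same KeyError, a minimal sketch of the flow I ended up with (using task.reload(), which I believe is the actual name of the call; the task id is a placeholder):

from clearml import Task

task = Task.get_task(task_id="<task-id>")
task.wait_for_status()                      # block until the task reaches a final state
task.reload()                               # refresh the local task object from the server
artifact = task.artifacts["output"].get()   # the artifact registry should now be up to date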
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment to the queue at that point (so trains-agent-1 is running two experiments at the same time), that experiment stays in the queue (trains-agent-2 and trains-agent-3 will not pick it because they are also running experiments)
Ok, so after updating to trains==0.16.2rc0, my problem is different: when I clone a task, update its script and enqueue it, it does not have any Hyper-parameters/argv section in the UI
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
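Roughly, the controller does something like this (a sketch with placeholder names, not the exact code):

from clearml import Task

# Controller task that holds the argv hyper-parameters
controller = Task.init(project_name="my-project", task_name="controller")

# Clone the template task and enqueue the clone on the default queue;
# I expected the clone to inherit the Hyper-parameters/argv section
template = Task.get_task(project_name="my-project", task_name="my-task")
cloned = Task.clone(source_task=template, name="my-task clone")
Task.enqueue(cloned, queue_name="default")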
Yes 🙂 Thanks!
AgitatedDove14 How can I filter out archived tasks? I don't see this option
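What I tried in the meantime, assuming archived tasks carry the "archived" system tag and that a leading "-" excludes a tag in the filter:

from clearml import Task

# Exclude archived tasks via the system_tags filter (assumption, see above)
tasks = Task.get_tasks(
    project_name="my-project",
    task_filter=dict(system_tags=["-archived"]),
)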
# Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: ...
And since I ran the task locally with python3.9, it used that version in the docker container
What I put in the clearml.conf is the following:
agent.package_manager.pip_version = "==20.2.3"
agent.package_manager.extra_index_url: [" "]
agent.python_binary = python3.8
So either I specify agent.python_binary: python3.8 in the clearml-agent config, as you suggested, or I enforce the task locally to run with python3.8 using task.data.script.binary
yes but they are in plain text and I would like to avoid that
how would it interact with the clearml-server api service? would it be completely transparent?
Alright, I have a follow-up question then: I used the param --user-folder "~/projects/my-project", but any change I do is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to the folder on the machine. Is it possible to do so?
So I installed docker, added my user to the group allowed to run docker (so I don't have to run it with sudo, otherwise it fails), then ran these two commands and it worked
Now it starts, I'll see if this solves the issue
AMI ami-08e9a0e4210f38cb6, region: eu-west-1a