Very nice, thanks! I'm going to try the SA server + agents setup this week, let's see how it goes ✌
Makes sense
So I assume trains expects nvidia-docker to be installed on the agent machine?
Moreover, since I'm going to use Task.execute_remotely (and not go through the UI), is there a way to specify in code which docker image to use?
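Something like this is what I'm after (just a sketch of what I mean - assuming `Task.set_base_docker` is the relevant call; the image and queue name here are made up):
```python
from trains import Task

task = Task.init(project_name="examples", task_name="remote run")

# Assumption: set_base_docker tells the agent which docker image to run inside
task.set_base_docker("nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04")

# Stop executing locally and enqueue the task (queue name is hypothetical)
task.execute_remotely(queue_name="dual_gpu")
```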
```yaml
name: XXXXXXXXXX
on:
  workflow_dispatch
jobs:
  test-monthly-predictions:
    runs-on: self-hosted
    env:
      DATA_DIR: ${{ secrets.RUNNER_DATA_DIR }}
      GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.RUNNER_CREDS }}
    steps:
      # Checkout
      - name: Check out repository code
        uses: actions/checkout@v2
      # Set up python environment
      - name: Set up python environment using Poetry
        run: |
          /home/elior/.poetry/bin/poetry env use python3.9
          ...
```
AgitatedDove14 all I did was create this metric column as "last", then turn "max" and "min" on and then off again
I can't reproduce it now, but: restarting the services didn't help; deleting the columns and creating them again after a while did.
We try to break everything up into independent tasks and group them using a pipeline. The dependency on an agent caused unnecessary overhead, since we just want to execute locally. It became a burden once new data scientists joined the project: instead of just telling them "yeah, just execute this script", you now have to teach them about clearml, the role of agents, how to launch them, how they behave, how to remove them and so on... things you want to avoid with data scientists
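To make it concrete, this is the entire workflow we'd like a new data scientist to need (a minimal sketch; project and task names are made up):
```python
from trains import Task

# Running the script as-is executes the step locally - no agent involved -
# while results are still logged to the trains server
task = Task.init(project_name="our_project", task_name="preprocess_step")

# ... the actual step logic goes here ...
```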
how do I run this wizard? is this wizard trains' or AWS's?
I mean I don't get how all the pieces add up
So once I enqueue it, it is up? The docs say I can configure the queues that the autoscaler listens to (in order to spin up instances) inside the autoscale task - I wanted to make sure that this config has nothing to do with where the autoscale task itself was enqueued
and in the UI configuration I didn't understand where permission management comes into play
Okay, so let me get this straight
The autoscaling is basically an ever-running task (let's say on the services queue). Now, the actual auto scaling and which queues exist have nothing to do with that, and are configured in the autoscale task?
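If I understood correctly, the queue-to-instance mapping would live in the autoscaler task's configuration, something like this (a sketch based on my reading of the trains aws_autoscaler example - the field names and values are assumptions and may differ by version):
```python
# Resource definitions: what kind of EC2 machine each label maps to
RESOURCE_CONFIGURATIONS = {
    "aws_gpu_machine": {
        "instance_type": "g4dn.xlarge",
        "is_spot": False,
        "availability_zone": "us-east-1b",
        "ami_id": "ami-xxxxxxxx",          # placeholder AMI
        "ebs_device_name": "/dev/xvda",
        "ebs_volume_size": 100,
        "ebs_volume_type": "gp2",
    },
}

# Which queues the autoscaler listens to, and how many instances
# of each resource it may spin up per queue
QUEUES = {
    "dual_gpu": [("aws_gpu_machine", 2)],
}
```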
So prior to doing any work on the trains autoscaler service, I should first create an auto scaling group in AWS?
Oh... from the docs I understood that I don't have to run the script, and that I can either configure it in the UI or with the script (wizard), so I ignored it up until now
The Trains docs never mention what I should do on the AWS side... so I'm not sure at what point I should encounter this wizard
I'm going to play with it a bit and see if I can figure out how to make it work
What about permissions for the machines that are being spun up? For example, if I want the instances to have specific permissions to read/write to S3, how do I manage those?
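In raw AWS terms, what I mean is something equivalent to this (a sketch; the instance profile name is hypothetical, and I don't know where the autoscaler would expose this):
```python
import boto3

ec2 = boto3.client("ec2")

# Launching an instance with an IAM instance profile attached, so the
# machine itself gets the S3 read/write permissions of that role
ec2.run_instances(
    ImageId="ami-xxxxxxxx",               # placeholder AMI
    InstanceType="g4dn.xlarge",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "trains-agent-s3-access"},  # hypothetical profile
)
```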
You should try `trains-agent daemon --gpus device=0,1 --queue dual_gpu --docker --foreground`
and if it doesn't work, try quoting: `trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground`
Another Q on that - does pyhocon allow me to edit the file while keeping the comments in place?
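What I have in mind is a round-trip like this (a sketch; I haven't verified that comments actually survive it - that's exactly the question):
```python
from pyhocon import ConfigFactory, HOCONConverter

# Parse the existing config, tweak one value, and write it back
conf = ConfigFactory.parse_file("trains.conf")
conf.put("agent.default_docker.image", "nvidia/cuda:10.1-runtime")

# Do comments in trains.conf survive this rewrite?
with open("trains.conf", "w") as f:
    f.write(HOCONConverter.to_hocon(conf))
```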
I don't fully get it - it says it has to be enqueued
I was sure you were on Israel time as well, sorry for the night-time thing 😄
Cool - what kind of objects are returned by `.artifacts.__getitem__`? I want to check their docs
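i.e. what is the type of `artifact` here (a sketch; `get_local_copy` is my guess at the accessor):
```python
from trains import Task

task = Task.get_task(task_id="...")       # task id elided on purpose

artifact = task.artifacts["my_artifact"]  # what type is this object?
local_path = artifact.get_local_copy()    # assuming an accessor like this exists
```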
AgitatedDove14 just a reminder if you missed this question 😄
And yes, it makes perfect sense, thanks for the answer
I only found Project ID, and I'm not sure what that refers to - I have the project name
and then how would I register the final artifact to the pipeline? AgitatedDove14 ⬆