The only way to change it is to convert `apiserver_conf` to a dictionary object (`as_plain_ordered_dict()`) and edit it.
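A minimal sketch of the "edit it" step. It assumes `apiserver_conf` has already been turned into nested plain dicts with `as_plain_ordered_dict()`. `set_dotted` is a hypothetical helper, and the `api.web_server` key and URLs are illustrative placeholders, not values from this setup. The helper edits a copy, so the edited dict still has to be handed to whatever consumes the config.

```python
from collections import OrderedDict
from copy import deepcopy


def set_dotted(conf: dict, dotted_key: str, value):
    """Return a copy of `conf` with the nested key `dotted_key` set to `value`.

    Missing intermediate sections are created as empty OrderedDicts.
    """
    updated = deepcopy(conf)  # leave the caller's dict untouched
    *parents, leaf = dotted_key.split(".")
    node = updated
    for part in parents:
        node = node.setdefault(part, OrderedDict())
    node[leaf] = value
    return updated


# Stand-in for the output of as_plain_ordered_dict(); keys are illustrative only.
plain = OrderedDict(api=OrderedDict(web_server="http://localhost:8080"))
edited = set_dotted(plain, "api.web_server", "http://example.com:8080")
print(edited["api"]["web_server"])  # http://example.com:8080
print(plain["api"]["web_server"])   # http://localhost:8080
```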
`Config == conf_obj`, no?
This is part of a bigger process that takes quite some time and resources. I hope I can try this soon, if it will help get to the bottom of this.
```yaml
name: XXXXXXXXXX
on:
  workflow_dispatch
jobs:
  test-monthly-predictions:
    runs-on: self-hosted
    env:
      DATA_DIR: ${{ secrets.RUNNER_DATA_DIR }}
      GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.RUNNER_CREDS }}
    steps:
      # Checkout
      - name: Check out repository code
        uses: actions/checkout@v2
      # Setup python environment
      - name: Setup up python environment using Poetry
        run: |
          /home/elior/.poetry/bin/poetry env use python3.9
          ...
```
Moreover, I think I found a bug.
BTW, my site packages setting is false. Should it be true? You pasted it, but I'm not sure what it should be: in the paste it is false, but you are asking about true.
And the machine I have runs CUDA 10.2.
I also tried `nvidia/cuda:10.2-base-ubuntu18.04`, which is the latest.
When I said "not the expected behavior", I meant that following the instructions in the docs should lead to downloading the latest version.
Even assuming it suspects me, why doesn't the captcha prove my innocence? Isn't that what it is for? O_O
Makes sense
So I assume trains expects nvidia-docker to be installed on the agent machine?
Moreover, since I'm going to use `Task.execute_remotely` (and not the UI), is there a way to specify the Docker image to be used from code?
I was sure you were on Israel time as well; sorry about the nighttime thing 😄
Okay, let's go.
One sec, I'll paste the relevant pieces of code.
When spinning up the AMI, I just went with the trains recommended settings.
Increased to 20; let's see how long it will last 🙂
FriendlySquid61
Just an update: I still haven't touched this. I did not account for the time it would take me to set up the autoscaling, so I must attend to other issues now. I hope to get back to this soon and make it work.
If you want, we can do a live Zoom call or something so you can see what happens.
AgitatedDove14, I really don't know how this is possible... I tried upgrading the server and tried whatever else I could.
About a small toy example to reproduce: I just don't have the time for that, but I will paste the callback I am using along with this explanation. This is the overall logic, so you can replicate it and use my callback:
From the pipeline task, launch some sub-tasks, and set their `post_execute_callback` to the `.collect_description_tables` method from my callback class (attached below). Run t...
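The replication recipe above can be sketched without a server. The actual callback class was attached separately and is not shown here, so `DescriptionTableCollector` and what it records are hypothetical. The sketch assumes the callback is invoked as `(pipeline, node)` and that `node` exposes `name` and `executed` (the executed task ID); check both against the ClearML version in use.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


class DescriptionTableCollector:
    """Hypothetical stand-in for the callback class described above.

    Each call records which step finished and the ID of the task it ran,
    so the controller can look the child tasks up at the end of the run.
    """

    def __init__(self) -> None:
        self.collected: List[Dict[str, Any]] = []

    def collect_description_tables(self, pipeline: Any, node: Any) -> None:
        # Assumed callback shape: (pipeline, node), with node.name and node.executed.
        self.collected.append({"step": node.name, "task_id": node.executed})


collector = DescriptionTableCollector()
# Simulate two steps finishing; in a real pipeline the controller would call
# the bound method that was registered as the step's post_execute_callback.
for name, task_id in [("step_a", "id-1"), ("step_b", "id-2")]:
    collector.collect_description_tables(None, SimpleNamespace(name=name, executed=task_id))
print([row["step"] for row in collector.collected])  # ['step_a', 'step_b']
```

In a real pipeline, the bound method `collector.collect_description_tables` would be passed as `post_execute_callback` when each step is added.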
AgitatedDove14, just so you know, this is a severe problem that occurs from time to time, and we can't explain why it happens. As a reminder, we are using a pipeline controller task which, at the end of the last execution, gathers artifacts from all the child tasks and uploads a new artifact to the pipeline's task object. What happens is that `Task.current_task()` returns `None` for the pipeline's task.
AgitatedDove14, sorry for the late reply.
It happens right after all the steps have executed. We have the following block, which determines whether we run locally or remotely:
if not arguments.enqueue:
    pipe.start_locally(run_pipeline_steps_locally=True)
else:
    pipe.start(queue=arguments.enqueue)
And right after it, we have a method that calls `Task.current_task()`, which returns `None`.
Is there a more elegant way to find the process to kill? Right now I'm doing `pgrep -af trains`, but if I have multiple agents, I will never be able to tell them apart.
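One possible way to tell agents apart, sketched under the assumption that each agent was started with its own queue (for example `trains-agent daemon --queue <name>`), is to filter the `pgrep -af` output by the `--queue` argument. `agent_pids` and `live_agent_pids` are hypothetical helpers; if two agents serve the same queue, this filter cannot distinguish them, and only the first queue name after `--queue` is checked.

```python
import subprocess
from typing import List


def agent_pids(pgrep_output: str, queue: str) -> List[int]:
    """Return PIDs whose command line passes `queue` via --queue.

    `pgrep_output` is expected in `pgrep -af` format: "<pid> <command line>".
    """
    pids = []
    for line in pgrep_output.splitlines():
        pid_str, _, cmdline = line.strip().partition(" ")
        if not pid_str.isdigit():
            continue  # skip blank or malformed lines
        args = cmdline.split()
        for i, arg in enumerate(args):
            # Accept both "--queue NAME" and "--queue=NAME".
            next_arg = args[i + 1] if i + 1 < len(args) else None
            if (arg == "--queue" and next_arg == queue) or arg == f"--queue={queue}":
                pids.append(int(pid_str))
                break
    return pids


def live_agent_pids(queue: str) -> List[int]:
    """Run `pgrep -af trains` and filter its output (needs pgrep on PATH)."""
    # pgrep exits with status 1 and prints nothing when no process matches,
    # so the return code is deliberately not checked here.
    result = subprocess.run(["pgrep", "-af", "trains"], capture_output=True, text=True)
    return agent_pids(result.stdout, queue)
```

Each returned PID could then be stopped with `os.kill(pid, signal.SIGTERM)`.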
It will return a `Config` object, right?
Why not use my user and group?