Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
I killed both trains-agents and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So the bug probably appears when an error occurs while setting up a task and the agent cannot go back to the main task. I would need to do some tests to validate that hypothesis though.
I will try to isolate the bug; if I can, I will open an issue in trains-agent
I tested by installing flask in the default env -> it was installed in the ~/.local/lib/python3.6/site-packages folder. Then I created a venv with the --system-site-packages flag. I activated the venv and flask was indeed available.
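The equivalent check in pure Python would be something like this (the my-venv path and the flask import are just for illustration):
```python
import venv
import subprocess

# Build a venv that can also see user/system site-packages, equivalent to
# `python3 -m venv --system-site-packages my-venv`; "my-venv" is a placeholder path.
venv.EnvBuilder(system_site_packages=True, with_pip=True).create("my-venv")

# Inside the new venv, the flask installed under ~/.local/... should be importable.
subprocess.run(
    ["my-venv/bin/python", "-c", "import flask; print(flask.__version__)"],
    check=True,
)
```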
I finally found a workaround using cache, will detail the solution in the issue
My bad, alpine is so light it doesn't have bash
AgitatedDove14 I have a machine with two gpus and one agent per GPU. I provide the same trains.conf to both agents, so they use the same directory for caching venvs. Can it be problematic?
Is there any logic on the server side that could change the iteration number?
For new projects it works
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one doesn't exist / isn't accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title , that is: ami-04c0416d6bd8e4b1f
(Btw the instance listed in the console has no name, is that normal?)
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
Ok, deleting the installed packages list worked for the first task
Why would it solve the issue? max_spin_up_time_min should be the param defining how long to wait after starting an instance, not polling_interval_time_min, right?
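To make the distinction concrete, this is how I read the two settings; the dict name, surrounding keys and values are only illustrative, and the exact schema depends on the aws_autoscaler.py version:
```python
# Illustrative only: the two autoscaler settings being discussed, with example values.
hyper_params = {
    # maximum time to wait for a freshly launched instance to spin up and register
    "max_spin_up_time_min": 30,
    # how often the autoscaler re-polls the queues / running instances
    "polling_interval_time_min": 5,
}
```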
Btw I monkey patched ignite's function global_step_from_engine to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__call__:
```python
def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ...
```
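Roughly, the patch looks something like this (just a sketch, assuming a recent pytorch-ignite; clearml_logger and trainer are placeholders for the actual logger and engine):
```python
from ignite.engine import Engine, Events

def patched_global_step_from_engine(engine: Engine):
    # Same contract as ignite's global_step_from_engine, but prints the value
    # it hands to the ClearML logger so the iteration can be inspected.
    def wrapper(_, event_name):
        step = engine.state.get_event_attrib_value(event_name)
        print(f"global_step_transform -> {step}")
        return step
    return wrapper

# clearml_logger is a ClearMLLogger instance and trainer an ignite Engine,
# both assumed to be defined elsewhere; tag/output_transform are placeholders.
clearml_logger.attach_output_handler(
    trainer,
    event_name=Events.ITERATION_COMPLETED,
    tag="training",
    output_transform=lambda loss: {"loss": loss},
    global_step_transform=patched_global_step_from_engine(trainer),
)
```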
There is no error from this side, I think the aws autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the userdata script fails
Edited the aws_auto_scaler.py; actually I think it's just a typo, I just need to double the curly brackets
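For anyone hitting the same thing, a minimal illustration of the brace-doubling (assuming the user-data script is built with str.format; the queue name is a placeholder):
```python
# A literal "{" or "}" inside a Python format string must be written as "{{" or "}}",
# otherwise bash brace expansions are parsed as format fields.
template = (
    "#!/bin/bash\n"
    "while sudo fuser /var/{{lib/{{dpkg,apt/lists}},cache/apt/archives}}/lock; do sleep 5; done\n"
    "clearml-agent daemon --queue {queue_name}\n"
)
print(template.format(queue_name="default"))
```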
so what worked for me was the following startup userscript:
```bash
#!/bin/bash
# wait for the instance to finish booting before touching apt
sleep 120
# block until no other process holds the dpkg/apt locks
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
```
The instances take so much time to start, like 5 minutes
the deep learning AMI from nvidia (Ubuntu 18.04)
The task with id a445e40b53c5417da1a6489aad616fee is not aborted and is still running
thanks for your help!
Yeah, again I am trying to understand what I can do with what I have. I would like to be able to export, as an environment variable, the runtime environment the agent is installing into, so that an app I am using inside the Task can use the Python packages installed by the agent and I can control those packages easily through ClearML.
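Something like this is what I have in mind, from inside the Task itself (the environment variable names are just made up):
```python
import os
import sys

# Inside the Task, the interpreter the agent prepared is the one running this code,
# so its location can be handed to a child application via environment variables.
os.environ["AGENT_PYTHON_EXECUTABLE"] = sys.executable  # full path to the interpreter
os.environ["AGENT_VENV_PREFIX"] = sys.prefix            # root of the venv the agent built
```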
Just found it, yeah, very cool! Thanks!