The rest of the configuration is set with env variables
in my clearml.conf, I only have:
sdk.aws.s3.region = eu-central-1
sdk.aws.s3.use_credentials_chain = true
agent.package_manager.pip_version = "==20.2.3"
I see 3 agents in the "Workers" tab
Some context: I am trying to log an HTML file and I would like it to be easily accessible for preview
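Something like this is what I have in mind (a sketch; the Logger.report_media usage and all names/paths are just illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="html-report")  # illustrative names

# report_media uploads the file as a debug sample, which the web UI can
# render inline, instead of only offering it as a download link.
task.get_logger().report_media(
    title="report",
    series="summary",
    iteration=0,
    local_path="report.html",  # hypothetical local file
    file_extension=".html",
)
```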
continue_last_task is almost what I want; the only problem with it is that it will start the task even if the task is already completed
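For reference, this is roughly how I'm calling it (a sketch; project/task names are illustrative):
```python
from clearml import Task

# continue_last_task resumes the previous task and appends to it; as noted
# above, it does so even if that task already completed.
task = Task.init(
    project_name="examples",     # illustrative
    task_name="resumable-task",  # illustrative
    continue_last_task=True,
)
```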
As to why: this is part of the pipeline that I described in a previous message: task B requires an artifact from task A, so I pass the name of the artifact as a parameter of task B, so that B knows which artifact from A it should retrieve
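Roughly like this (a sketch; the names and parameter wiring are illustrative):
```python
from clearml import Task

task_b = Task.init(project_name="examples", task_name="task-B")  # illustrative

# Task B receives task A's id and the artifact name as parameters
# (empty defaults here; real values are set when the task is enqueued):
params = task_b.connect({"source_task_id": "", "artifact_name": "dataset"})

# Fetch the artifact from task A using those parameters.
task_a = Task.get_task(task_id=params["source_task_id"])
local_copy = task_a.artifacts[params["artifact_name"]].get_local_copy()
```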
Thanks! Unfortunately still not working, here is the log file:
I would like to try it to see if it solves the problem of some dependencies not being found even though they are installed, when using --system-site-packages
sure, will be happy to debug that 🙂
Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
I killed both trains-agents and restarted one to get a clean start. This way it correctly spins up docker containers for services tasks. So the bug probably appears when an error occurs while setting up a task: the agent cannot go back to the main task. I would need to run some tests to validate that hypothesis though
I will try to isolate the bug; if I can, I will open an issue in trains-agent 🙂
I tested by installing flask in the default env, which put it in the ~/.local/lib/python3.6/site-packages folder. Then I created a venv with the --system-site-packages flag. I activated the venv and flask was indeed available
My bad, alpine is so light it doesn't have bash
AgitatedDove14 I have a machine with two GPUs and one agent per GPU. I provide the same trains.conf to both agents, so they use the same directory for caching venvs. Could that be problematic?
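If it is, a possible workaround would be to give each agent its own conf with a distinct venv build dir (a sketch, assuming agent.venvs_dir is the relevant key; paths are illustrative):
```
# trains.conf for the agent pinned to GPU 0
agent {
    venvs_dir: ~/.trains/venvs-builds-gpu0
}
```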
Is there any logic on the server side that could change the iteration number?
For new projects it works 🙂
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one doesn't exist/isn't accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title that is: ami-04c0416d6bd8e4b1f
(Btw the instance listed in the console has no name, is that normal?)
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
Ok, deleting the installed packages list worked for the first task
Why would it solve the issue? max_spin_up_time_min should be the param defining how long to wait after starting an instance, not polling_interval_time_min, right?
btw I monkey patched ignite's global_step_from_engine function to print the iteration, and passed the modified function via ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when ClearMLLogger.OutputHandler.__call__ is invoked:
```python
def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ...
```
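For reference, here is roughly what the patch looks like (a sketch; the wrapper is mine, and global_step_from_engine may live in ignite.contrib.handlers instead of ignite.handlers depending on the ignite version):
```python
from ignite.handlers import global_step_from_engine

def patched_global_step_from_engine(engine):
    # Wrap the original transform so each call prints the step it returns.
    original = global_step_from_engine(engine)

    def wrapper(_engine, event_name):
        step = original(_engine, event_name)
        print(f"global_step_transform({event_name}) -> {step}")
        return step

    return wrapper
```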
there is no error on this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the user-data script fails
I edited aws_auto_scaler.py; actually I think it's just a typo, I just need to double the curly brackets to escape them in the format string
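i.e. literal { and } inside a str.format template must be written as {{ and }}, otherwise Python parses them as replacement fields (illustrative template, not the actual aws_auto_scaler.py code):
```python
# bad.format(queue=...) raises KeyError: 'HOME' – ${HOME} is parsed as a field.
bad = "export PATH=${HOME}/bin:$PATH && echo {queue}"

# Doubling the braces escapes them, so only {queue} is substituted.
good = "export PATH=${{HOME}}/bin:$PATH && echo {queue}"
print(good.format(queue="aws_instances"))
# -> export PATH=${HOME}/bin:$PATH && echo aws_instances
```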