Hi @<1551376687504035840:profile|StraightSealion9>
AWS Autoscaler to create a new instance when you enqueue a task to the relevant queue.
Does that mean that you were able to enqueue a Task and have it launch on the remote EC2 machine ?
When looking at the worker details, it says "No queues currently assigned to this worker"
Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing is listed. It is "listening" to the queues and launching machines, and those machines are the ones that pull from the queue. I wonder if it would be easier to also list it as "assigned" to those queues. wdyt?
@<1551376687504035840:profile|StraightSealion9> it seems to me the AWS instance simply fails to start (and thus never pulls from the queue). Can you share the instance log?
Hi @<1523701087100473344:profile|SuccessfulKoala55> and @<1523701205467926528:profile|AgitatedDove14> sorry for the delay, I was travelling. Indeed I see that the instance is not booting, it's sort of stuck there so I'm going to first try and find an AMI / instance type that boots via the AWS console and try again via the AutoScaler. Thanks so much for the tips!
So I've gotten a little further by tweaking my VPC setup. I see that the autoscaler spins up new instances w/o a public IP, so that means specifying a private subnet and having a NAT gateway on the subnet, right? I tried running the scikit-learn joblib sample on a CPU instance, and the experiment definitely ran, but I see another error. Could there be some permissions missing somewhere?
2023-04-03 15:01:00
2023-04-03 13:00:51,087 - clearml.storage - ERROR - Exception encountered while uploading Failed uploading object /_ApplicationInstances/aws-autoscaler/ClearML Autoscaler.f302dff7cab7446f9bfeaf923622a2d0/artifacts/i-017eb35a20bd622b4/i-017eb35a20bd622b4.txt (429):
2023-04-03 13:00:51,087 - clearml.metrics - WARNING - Failed uploading to
(Failed uploading object /_ApplicationInstances/aws-autoscaler/ClearML Autoscaler.f302dff7cab7446f9bfeaf923622a2d0/artifacts/i-017eb35a20bd622b4/i-017eb35a20bd622b4.txt (429): )
2023-04-03 13:00:51,087 - clearml.metrics - ERROR - Not uploading 1/1 events because the data upload failed
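On the private subnet / NAT gateway question above, one way to sanity-check the setup is to look at the subnet's route table and confirm the default route goes through a NAT gateway. Below is a minimal sketch, assuming the route table is a dict shaped like one entry of `RouteTables` in the EC2 `describe_route_tables` response. The helper name `has_nat_default_route` and the sample IDs are hypothetical, and this is not something the autoscaler itself does.

```python
from typing import Any, Dict


def has_nat_default_route(route_table: Dict[str, Any]) -> bool:
    """Return True if the table routes 0.0.0.0/0 through a NAT gateway."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        # A NAT-backed route carries a "NatGatewayId" key; an internet
        # gateway route carries "GatewayId" (e.g. "igw-...") instead.
        if is_default and route.get("NatGatewayId"):
            return True
    return False


# Hypothetical sample data (placeholder IDs, not from a real account).
private_table = {
    "RouteTableId": "rtb-EXAMPLE",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-EXAMPLE"},
    ],
}

print(has_nat_default_route(private_table))
```

If this returns False for the subnet the instances launch in, they most likely have no outbound route to the internet, which would also explain an instance that never shows up as a worker.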
Hi @<1551376687504035840:profile|StraightSealion9> , this is really strange - I don't think we ever return 429 from the .clear.ml servers...
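For context, HTTP 429 means "Too Many Requests", so wherever it comes from, the usual client-side reaction is to wait and retry. The snippet below is a generic sketch of retrying with exponential backoff, not ClearML's actual upload logic. `retry_on_429` and the `send` callable are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Tuple


def retry_on_429(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call send() until it returns a status other than 429.

    send() is assumed to perform one request and return its HTTP status.
    Returns (last_status, attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 429
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 429:
            return status, attempt
        if attempt < max_attempts:
            # Double the wait after each throttled attempt: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return status, max_attempts


# Usage with a fake sender that is throttled twice, then succeeds.
fake_responses = iter([429, 429, 200])
waits = []
result = retry_on_429(lambda: next(fake_responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting.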
@<1523701827080556544:profile|JuicyFox94> do you have any idea how this could happen?
@<1551376687504035840:profile|StraightSealion9> where do you see this error? Something is off here - is this a task running on an instance spun up by an AutoScaler application?
Sorry, I've been away and am only now able to come back to this. I'll run the scikit-learn job sample on my current setup and let you know if the error comes back.
BTW my first test worked great on a CPU instance. Might have been a temporary issue.