Hi AgitatedDove14, regarding your reply on https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045: "Basically as jobs are pulled by order, they are pushed into the k8s, then if we hit the max k8s instance limit, we stop pulling jobs until a k8s job is completed, then we continue."
For this scenario,
k8s has an instance limit of 10 (let's say)
I run Optimization (it has about 100 jobs), but only the first 10 will be pulled into k8s. After this, I run a single Deep Learning (DL)...
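To make the scenario concrete, here is a rough simulation of "pull in order, stop at the instance limit, continue when a job completes". It is only a sketch: the function name, the job durations (5 minutes per optimization job, 30 for the DL job), and the assumption that all jobs sit in one queue are mine, and it does not call any ClearML or k8s code.

```python
from collections import deque
import heapq


def simulate_capped_fifo(durations, limit):
    """Return each job's start time when at most `limit` jobs run at once.

    Jobs are pulled strictly in queue order. When `limit` jobs are already
    running, pulling pauses until the earliest running job finishes.
    """
    pending = deque(enumerate(durations))
    running = []  # min-heap of finish times of the running jobs
    start_times = [0.0] * len(durations)
    now = 0.0
    while pending:
        if len(running) >= limit:
            # Instance limit hit: wait for the next job to complete.
            now = heapq.heappop(running)
        index, duration = pending.popleft()
        start_times[index] = now
        heapq.heappush(running, now + duration)
    return start_times


# Hypothetical workload: 100 optimization jobs, then one DL job.
durations = [5.0] * 100 + [30.0]
starts = simulate_capped_fifo(durations, limit=10)
dl_job_start = starts[100]
```

With these made-up numbers the DL job only starts once all ten "waves" of optimization jobs have been pulled, which is the waiting behaviour I am asking about.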
` python3 k8s_glue_example.py --queue glue_high_q glue_low_q
usage: k8s_glue_example.py [-h] [--queue QUEUE] [--ports-mode] [--num-of-services NUM_OF_SERVICES] [--base-port BASE_PORT] [--base-pod-num BASE_POD_NUM] [--gateway-address GATEWAY_ADDRESS]
[--pod-clearml-conf POD_CLEARML_CONF] [--overrides-yaml OVERRIDES_YAML] [--template-yaml TEMPLATE_YAML] [--ssh-server-port SSH_SERVER_PORT] [--namespace NAMESPACE]
k8s_glue_example.py: error: unrecognized arguments: glue...
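The usage line shows `[--queue QUEUE]`, i.e. a single value, which would explain why the second queue name is rejected. A minimal stand-in parser reproduces this with plain `argparse`. To be clear, `glue_demo.py` and `build_parser` are hypothetical names for illustration, and the `nargs="+"` variant is just what a multi-queue option could look like, not what k8s_glue_example.py actually does.

```python
import argparse


def build_parser(multi_queue):
    """Build a toy parser; `multi_queue` toggles a single vs. list-valued --queue."""
    parser = argparse.ArgumentParser(prog="glue_demo.py")
    if multi_queue:
        # Hypothetical variant: accept one or more queue names.
        parser.add_argument("--queue", nargs="+", default=[])
    else:
        # Matches the shape of the usage text: exactly one QUEUE value.
        parser.add_argument("--queue", type=str)
    return parser


argv = ["--queue", "glue_high_q", "glue_low_q"]

# parse_known_args returns the leftovers instead of exiting with
# "unrecognized arguments", which makes the problem easy to inspect.
single_args, leftover = build_parser(False).parse_known_args(argv)

# With nargs="+", both names are consumed by --queue.
multi_args = build_parser(True).parse_args(argv)
```

Here `leftover` holds the second queue name, which is the part `parse_args` would complain about.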
Hi AgitatedDove14
I am still not very clear on using this, even after looking at the k8s_glue_example.py code.
Is it possible to give a sample usage of how this works?
python k8s_glue_example.py --ports-mode --num-of-services
Another question: I am still not sure how this resolves my original question.
https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which ex...
Hi AgitatedDove14, thanks for the explanation.
python k8s_glue_example.py --queue high_priority_q --ports-mode --num-of-services 10
python k8s_glue_example.py --queue low_priority_q --ports-mode --num-of-services 2
Would the above be a good way to simulate the below?
clearml-agent daemon --queue high_priority_q low_priority_q
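For comparison, this is how I picture the pull order of a single agent listening on two queues: strict priority by default, and round-robin if --order-fairness is used (that is my reading of the agent docs, so treat it as an assumption). The function names and task labels are made up, and the sketch ignores tasks arriving while others run.

```python
from collections import deque


def pull_priority(queues):
    """Always take from the first non-empty queue (strict priority order)."""
    queues = [deque(q) for q in queues]
    order = []
    while any(queues):
        for q in queues:
            if q:
                order.append(q.popleft())
                break  # restart from the highest-priority queue
    return order


def pull_round_robin(queues):
    """Take one task from each non-empty queue in turn (fairness order)."""
    queues = [deque(q) for q in queues]
    order = []
    while any(queues):
        for q in queues:
            if q:
                order.append(q.popleft())
    return order


high_priority_q = ["h1", "h2", "h3"]
low_priority_q = ["l1", "l2"]

priority_order = pull_priority([high_priority_q, low_priority_q])
fair_order = pull_round_robin([high_priority_q, low_priority_q])
```

Two separate glue processes would each pull from their own queue independently, so as far as I can tell they give neither of these orderings exactly; they only cap how many pods each queue can occupy.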
Hi AgitatedDove14, now we prefer to run dynamic agents instead, using python3 k8s_glue_example.py
In this case, is it still possible to pass --order-fairness at the queue level, or is this more of an Enterprise edition feature?
Is there any documentation on how we can use this ports mode? I didn't seem to find any. Thanks
The use case is, let's say I run python k8s_glue_example.py --queue glue_q
and someone pushes a hyperparameter optimization job with over 100 experiments to glue_q. One minute later, I push a simple training job to glue_q, but I will be forced to wait for the 100 experiments to finish.
AgitatedDove14 I am confused now. Is this feature not available in the k8s glue? Or is it going to be implemented?
RoughTiger69
So the Prefect tasks:
1) Load data into clearml-data
2) Run training in ClearML
3) Publish the model (manual trigger required, so the user publishes the model) and return the model URL
4) Seldon deploys the model (model URL passed in)
We also might have some other steps incorporated for other tools. We intend to have Label-Studio upstream, so we definitely needed some orchestrator tool.
Ah, so in the future, we can add non-clearml code as a step in the pipeline controller.
One use case now:
1) Load data from Label Studio (manager to manually approve)
2) Push data to clearml-data
3) Run training (manager to manually publish)
4) Push the model URI to the next step
5) Seldon deploys it
Later, if Seldon detects data drift, it will automatically run steps 2-5.
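The five steps above, including the two manual gates and the "re-run from step 2 on drift" path, can be sketched in plain Python. Everything here is a placeholder: the step functions only pass strings around, the `approve` callback stands in for the manager's manual action, and none of it calls Label Studio, ClearML, Seldon, or Prefect.

```python
def load_data(ctx):
    return {**ctx, "data": "labelled-batch"}


def push_dataset(ctx):
    return {**ctx, "dataset_id": "dataset-for-" + ctx["data"]}


def train(ctx):
    return {**ctx, "model": "model-from-" + ctx["dataset_id"]}


def pass_model_uri(ctx):
    return {**ctx, "model_uri": "models://" + ctx["model"]}


def deploy(ctx):
    return {**ctx, "deployed": ctx["model_uri"]}


# (name, function, needs manual approval before the pipeline continues)
STEPS = [
    ("load data", load_data, True),
    ("push dataset", push_dataset, False),
    ("train", train, True),
    ("pass model uri", pass_model_uri, False),
    ("deploy", deploy, False),
]


def run_pipeline(ctx, start_at=1, approve=lambda name: True):
    """Run steps from `start_at` (1-based); pause after an unapproved gated step."""
    log = []
    for number, (name, func, gated) in enumerate(STEPS, start=1):
        if number < start_at:
            continue
        ctx = func(ctx)
        log.append(name)
        if gated and not approve(name):
            log.append("paused after " + name)
            break
    return ctx, log


# Full run with every gate approved.
full_ctx, full_log = run_pipeline({})
# Drift detected later: re-run steps 2-5, reusing the earlier context.
drift_ctx, drift_log = run_pipeline(full_ctx, start_at=2)
# Manager has not published the model yet: stops after training.
paused_ctx, paused_log = run_pipeline({}, approve=lambda name: name != "train")
```

An orchestrator would replace `run_pipeline`, but the shape (ordered steps, a context passed along, gates that can pause the run) is what we need from whichever tool we pick.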
At this point, we haven't drilled all of it down yet.
Hi AgitatedDove14 ,
At this point, showing the URL of the ClearML task might be sufficient, unless in the future someone wants it to be customised.
But the bigger question is whether there is a tool to aid with this workflow building. We are currently experimenting with Airflow/Prefect.
sure, I'll post some questions once I wrap my mind around it..
Hi SuccessfulKoala55 , kkie..
1) Actually, I am now using AWS. I am trying to set up the ClearML server in K8s. However, the clearml-agents will be just another EC2 instance/Docker image.
2) For phase 2, I will try the ClearML AWS Autoscaler service.
3) At this point, I think I will have a crack at JuicyFox94's solution as well.
Thanks JuicyFox94 .
I'm not really from a DevOps background, so let me try to digest this.. 🙏
nice.. this looks a bit friendly.. 🙂 .. Let me try it.. Thanks
Hi AgitatedDove14, this isn't the issue. With or without specifying the queue, I get this error when I use the "Create version" as opposed to the "Init version".
I wonder whether this is some issue with using the "Create version" together with execute_remotely().
We have to do it on-premises. Cloud providers are not allowed for the final implementation. Of course, for now we use the cloud to test out our ideas.