single task in the DAG is an entire ClearML pipeline.
Just making sure details are not lost, "entire ClearML pipeline": the pipeline logic is process A running on machine AA.
Every step of that pipeline can be (1) a subprocess, but that means the exact same environment is used for everything, or (2) the DEFAULT behavior, where each step B runs on a different machine BB.
The non-ClearML steps would orchestrate putting messages into a queue, doing retry logic, and tr...
Hi @<1524560082761682944:profile|MammothParrot39>
By default you have the last 100 iterations there (not sure why you are only seeing the last 3), but this is configurable:
None
BattyLion34 Okay, I'll try to see if we can solve the multi-instance issue on Windows (because obviously it should be automatic)
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
Could you send me the console log of both tasks, the failing one and the passing one?
BattyLion34 is this consistent?
(Really I can't see any difference, one time it is able to create the venv and another time it fails with a permission error)
Thanks BattyLion34 I fixed the code snippet :)
UnevenDolphin73 something like this one?
https://github.com/allegroai/clearml/pull/225
Hi NastyOtter17
"Project" is so ambiguous
LOL yes, this is something GCP/GS is using:
https://googleapis.dev/python/storage/latest/client.html#module-google.cloud.storage.client
How can the first process corrupt the second
I think that something went wrong and both Agents are using the same "temp" folder to setup the experiment.
why doesn't this occur if I run pipeline from command line?
The services queue creates new dockers with everything in them, so they cannot step on each other's toes (so to speak)
I run all the processes as administrator. However, I've tested running the pipeline from command line in non-administrator mode, it works fine....
Hi RobustRat47
My guess is it's something from converting the PyTorch code to TorchScript. I'm getting this error when trying the
I think you are correct see here:
https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/examples/pytorch/train_pytorch_mnist.py#L136
you have to convert the model to TorchScript for Triton to serve it
Hi BattyLion34
I might have a solution, in order to make sure the two agents are not sharing the "temp" folder:
create two copies of ~/clearml.conf, let's call them:
~/clearml_service.conf
~/clearml_agent.conf
Then in each one select a different venvs_dir, see here:
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L90
for example:
~/.clearml/venvs-builds1
~/.clearml/venvs-builds2
Now start the two agents with:
The service age...
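The "two copies with different venvs_dir" step could be scripted roughly like this. It is only a sketch: it assumes the base file contains an uncommented `venvs_dir = ...` (or `venvs_dir: ...`) line, and the helper names `with_venvs_dir` and `write_agent_configs` are made up for illustration, they are not part of clearml-agent.

```python
import re
from pathlib import Path

# Matches an uncommented "venvs_dir = <value>" or "venvs_dir: <value>" line.
VENVS_DIR_RE = re.compile(r"^([ \t]*venvs_dir[ \t]*[:=][ \t]*).*$", re.MULTILINE)


def with_venvs_dir(conf_text, venvs_dir):
    """Return conf_text with every venvs_dir setting pointing at venvs_dir."""
    new_text, count = VENVS_DIR_RE.subn(lambda m: m.group(1) + venvs_dir, conf_text)
    if count == 0:
        raise ValueError("no uncommented venvs_dir setting found")
    return new_text


def write_agent_configs(base_conf, targets):
    """Write one config copy per (target path -> venvs_dir) pair."""
    text = Path(base_conf).expanduser().read_text()
    for target, venvs_dir in targets.items():
        Path(target).expanduser().write_text(with_venvs_dir(text, venvs_dir))
```

Calling `write_agent_configs("~/clearml.conf", {"~/clearml_service.conf": "~/.clearml/venvs-builds1", "~/clearml_agent.conf": "~/.clearml/venvs-builds2"})` would produce the two files named above, each with its own build folder.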
BattyLion34 let me see if I understand.
The same base_task_id, when cloned in the UI and enqueued on the same queue as the pipeline, works, but when the pipeline runs the same Task it fails?!
Could it be that you enqueue them on different queues ?
BattyLion34
Maybe something inside the task is different?!
Could you run these lines and send me the result:
from clearml import Task
print(Task.get_task(task_id='failing task id').export_task())
print(Task.get_task(task_id='working task id').export_task())
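Once you have the two exported dictionaries, a small recursive diff can make the differences easier to spot than reading both printouts. This is a minimal sketch that assumes the exports are plain nested dicts; `diff_dicts` and the `"<missing>"` marker are hypothetical names, not part of clearml.

```python
MISSING = "<missing>"  # marker for keys present on one side only


def diff_dicts(left, right, prefix=""):
    """Return {path: (left_value, right_value)} for every differing leaf."""
    diffs = {}
    for key in sorted(set(left) | set(right), key=str):
        path = f"{prefix}/{key}" if prefix else str(key)
        a = left.get(key, MISSING)
        b = right.get(key, MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Descend into nested sections instead of reporting the whole dict.
            diffs.update(diff_dicts(a, b, path))
        elif a != b:
            diffs[path] = (a, b)
    return diffs
```

You would pass it the two `export_task()` results (failing first, working second) and print the returned mapping; anything under the script or requirements sections would be the first place to look.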
Hi @<1533257411639382016:profile|RobustRat47>
sorry for the delay,
Hi when we try and sign up a user with github.
wait, where are you getting this link?
That should spin up an instance, right? (it currently doesn't, and I'm not sure where to debug)
Do you see the AWS scaler Task running ?
(This is the code/process that actually spins a new EC2 instance)
Hi JuicyDog96
The easiest way is:
from trains.backend_api.session.client import APIClient
client = APIClient()
client.projects.get_all()
You can just run it from a python console and check what you are getting.
Full API is https://github.com/allegroai/trains/tree/master/trains/backend_api/services/v2_8
BattyLion34 I have a theory, I think that any Task on the "default" queue will fail if a Task is running on the "service" queue.
Could you create a toy Task that just prints "." and sleeps for 5 seconds and then prints again.
Then while that Task is running, from the UI launch the Task that passed on the "default" queue. If my theory holds it should fail, then we will be getting somewhere 🙂
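The toy Task could look something like the sketch below. The project and task names are placeholders, and `toy_job` / `run_as_clearml_task` are hypothetical helper names; the only ClearML call used is `Task.init`, and it assumes clearml is installed and configured on that machine.

```python
import time


def toy_job(sleep_seconds=5.0, out=print, sleep=time.sleep):
    """Print ".", wait, then print "." again."""
    out(".")
    sleep(sleep_seconds)
    out(".")


def run_as_clearml_task():
    """Register the toy job as a Task, then run it."""
    # Imported here so toy_job can be tried without clearml installed.
    from clearml import Task

    # Placeholder names, pick whatever fits your setup.
    Task.init(project_name="debug", task_name="toy sleep task")
    toy_job()
```

Run `run_as_clearml_task()` once so the Task exists, then clone and enqueue it on the "service" queue from the UI and launch the other Task on "default" while it is sleeping.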
sorry the point where you select the interpreter for pycharm
Oh I see...
ResponsiveCamel97
could you attach the full log?
ElegantCoyote26 could you upgrade the docker-compose ?
First that is awesome to hear PanickyFish98 !
Can you send the full exception? You might be on to something...
2. Actually we thought of it, but could not find a use case, can you expand?
3. I'm not sure I follow, do you mean you expect the first execution to happen immediately?
That makes no sense to me?!
Are you absolutely sure the nntrain is executed on the same queue? (basically could it be that the nntraining is executed on a different queue in these two cases ?)
This will fix it, the issue is the "no default value" that breaks the casting:
@PipelineDecorator.component(cache=False)
def step_one(my_arg=""):
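To illustrate why a missing default can matter for casting, here is a toy model. It is not ClearML's actual code: it simply assumes arguments arrive as strings and that the target type is inferred from each parameter's default value. `cast_like_defaults`, `retries` and `step_no_default` are hypothetical names.

```python
import inspect


def cast_like_defaults(func, raw_args):
    """Cast string arguments to the type of each parameter's default value."""
    params = inspect.signature(func).parameters
    result = {}
    for name, raw in raw_args.items():
        default = params[name].default
        if default is inspect.Parameter.empty or default is None:
            # Nothing to infer a type from, so the raw string is passed through.
            result[name] = raw
        else:
            result[name] = type(default)(raw)
    return result


def step_one(my_arg="", retries=0):
    return my_arg, retries


def step_no_default(my_arg):
    return my_arg
```

In this model `step_one` gets `retries` cast to an int because its default is `0`, while `step_no_default` receives whatever raw string was stored, which is the kind of ambiguity a default value removes.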
Hmm that is odd, let me see if I can reproduce it.
What's the clearml version you are using ?
BattyLion34
if I simply clone nntraining stage and run it in default queue - everything goes fine.
When you compare the Task you clone manually and the Task created by the pipeline , what's the difference ?
No, I mean actually compare using the UI, maybe the arguments are different or the "installed packages"
Any updates on trigger and schedule docs?
I think examples are already pushed, docs still in progress.
BTW: pipeline v2 examples are also out:
https://github.com/allegroai/clearml/blob/master/examples/scheduler/trigger_example.py
https://github.com/allegroai/clearml/blob/master/examples/pipeline/full_custom_pipeline.py