Hi @<1569496075083976704:profile|SweetShells3>
Are you using the standard docker-compose? Are you using the default elastic container?
What exactly changed?
Hi WickedElephant66
Setting the pipeline controller with pipeline_execution_queue as None
actually launches the pipeline controller on your "dev" machine. I'm not sure why you have two of them?
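For reference, a minimal sketch of that setting (the names here are hypothetical; pipeline_execution_queue=None is what keeps the controller on your local machine):

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(
    name="my_pipeline",             # hypothetical name
    project="examples",             # hypothetical project
    pipeline_execution_queue=None,  # None = run the controller locally, not on an agent
)
def my_pipeline():
    pass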
might be my folder permissions hmm
That actually makes sense. Also notice that if you are running under a different user, the ~ (home folder) is different
Hi @<1523702652678967296:profile|DeliciousKoala34>
What's the clearml-server version you are working with?
Can you check with the latest RC?
pip3 install clearml==1.9.2rc2
SmallDeer34 the function Task.get_models() incorrectly returned the input model "name" instead of the object itself. I'll make sure we push a fix.
I found a different solution (hardcoding the parent tasks by hand),
I have to wonder, how does that solve the issue ?
Hi SubstantialElk6
Generally speaking, the idea is that actual code creates a Dataset (i.e. a Dataset object created from code), plus you can add some metric reporting (like table reporting) to create a preview of the stored data for better visibility, or maybe create some statistics as part of the data-ingest script. Then this ingest code can be relaunched / automated. The created Dataset itself can be tagged, renamed, or have key/value pairs added for better cataloging. wdyt?
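Something along these lines, as a rough sketch (project/dataset names, the file path, and the stats table are placeholders):

from clearml import Dataset
import pandas as pd

# ingest: create the dataset from code
ds = Dataset.create(dataset_project="data", dataset_name="my_dataset")
ds.add_files("/path/to/raw/data")

# report a small statistics table as a preview, for better visibility in the UI
stats_df = pd.DataFrame({"files": [120], "size_mb": [450]})
ds.get_logger().report_table(
    title="ingest stats", series="summary", iteration=0, table_plot=stats_df
)

ds.upload()
ds.finalize()
ds.tags = ["raw", "v1"]  # tag for cataloging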
I see... In the triton pod, when you run it, it should print the combined pbtxt. Can you print both the before/after ones, so that we can compare?
Hi GiganticTurtle0
The problem is that the packages that I define in 'required_packages' are not in the scripts corresponding
What do you mean by that? Is "Xarray" a wheel package? Is it installable from a git repo (example: pip install git+https://github.com/user/xarray.git )?
BroadMole98 as one can expect, a long answer as well 🙂
I have a workflow with 19000 job nodes in it.
Wow, 19k job nodes? As in a single pipeline with 19k steps?
The main idea of the trains-agent is to allow multi-node workloads, to let you create pipelines on top of a scheduler without worrying about docker packaging (done automatically for you), and to have a proper scheduler with priorities (which is missing from k8s)
If the first step is just "logging" all the steps, you can easily add "Task...
Nothing, except that Draft makes sense (feels like the task is being prepped), whereas Aborted feels like something went wrong
Yes, I guess that if we call execute_remotely without a queue, it makes sense for you to edit it...
Is that the case TrickySheep9?
If it is, I think we should change it to Draft when it is not queued. Sounds good to you guys?
Oh right, I missed the fact the helper functions are also decorated, yes it makes sense we add the tags as well.
Regarding nested pipelines, I think my main question is: are they independent, or are we generating everything from the same code base?
Hi @<1730758665054457856:profile|MysteriousCrab4>
do I get to have the autoscaler feature,
You have the open source one here: None
In the managed Pro tier you have the fancy UI AWS/GCP autoscaler (with some additional extra features)
And there is the Scale/Enterprise tiers with more sophisticated features like Vault on top of that
Hi DeliciousBluewhale87
You mean per Task? Is it reporting? Is it like the project overview?
Hi WorriedParrot51
So I think what you need is to map your external code into the docker, is that correct?
Also you want to always set the PYTHONPATH.
You can achieve both by configuring the trains.conf:
Here you can always add a predefined environment and mount point, regardless of the docker image or other docker arguments:
https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L98
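For example (the mount source/target here are hypothetical, adjust them to where your code actually lives):

agent {
    extra_docker_arguments: [
        "-v", "/home/user/external_code:/mnt/external_code",
        "-e", "PYTHONPATH=/mnt/external_code",
    ]
}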
Will this solve the issue?
Hi JumpyDragonfly13, just making sure, do you have an agent running on a remote machine?
Can you have a direct TCP connection to the remote machine? (the default port it will use is 10022)
Hi @<1523715429694967808:profile|ThickCrow29>
I am using the PipelineController with abort_on_failure set to False.
Is this a pipeline from code or from Tasks?
What is the clearml version?
Lastly, if a component fails and another component is dependent on its output, how would it run? And if it is not dependent, why is it a child component?
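For context, this is roughly how I picture your setup (step/task names are hypothetical):

from clearml.automation.controller import PipelineController

pipe = PipelineController(
    name="my_pipeline", project="examples",  # hypothetical names
    abort_on_failure=False,                  # a failed step does not abort the whole pipeline
)
pipe.add_step(name="step_a", base_task_project="examples", base_task_name="A")
# step_b is a child of step_a, so it depends on step_a's output
# and cannot really run if step_a failed
pipe.add_step(
    name="step_b", parents=["step_a"],
    base_task_project="examples", base_task_name="B",
)
pipe.start()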
Okay, make sure that in your trains.conf on all the trains-agent machines you add the following:
agent.extra_docker_arguments: ["-v", "/etc/hosts:/etc/hosts",]
It is currently only enabled when using ports mode; it should be enabled by default, i.e. a new feature :)
Would be very cool if you could include this use case!
I totally think we should, any chance you can open an Issue, so this feature is not lost?
Hi @<1523701868901961728:profile|ReassuredTiger98>
This should have worked; it seems like conda is not fetching the correct pytorch version (even though the conda env contains the cuda version they specify)
Let's try something, reset the Task, then edit the "Installed packages" and add:
cudatoolkit==11.1.1
Then try again.
Let's see what we get.
(The idea is that I think conda forgets it just installed cudatoolkit and assumes the env is CPU-only)
These are the prerequisites of the docker service installed on the host machine (where the agent is running)
Basically follow: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://docs.docker.com/compose/gpu-support/
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? Just pull & run:
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
unless the domain is different?
Imagine that you are working with both github and bitbucket, for example; if you are using git-ssh then git will know which of the domains to send the key to. Currently there is a single user/pass entry, so all domains will get the same credentials. But I think this is a rare use case.
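e.g., a typical ~/.ssh/config with one key per domain (the key file names are just examples):

Host github.com
    IdentityFile ~/.ssh/id_rsa_github
Host bitbucket.org
    IdentityFile ~/.ssh/id_rsa_bitbucket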
Hi VexedCat68
What type of data is it? And what type of annotations?
Streaming data into the training process is great, but is it post quality control?
Hi ReassuredTiger98
Could you add some prints before/after the artifact upload?
Also, what's the clearml version you are using?
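Something like this, just to narrow down where it hangs (project/task/artifact names and the object are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="artifact test")  # hypothetical names
my_object = {"a": 1}  # placeholder for your actual artifact
print("before artifact upload", flush=True)
task.upload_artifact("my_artifact", artifact_object=my_object)
print("after artifact upload", flush=True)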
HelplessCrocodile8 I just tried it, everything seems to work (ubuntu 20.04) 😞
What's the OS you are using? Python version? Is it conda?
If the right properties are set can the profile tab be added?
I guess that is doable; that said, some of the graphs are not straightforward to support, like this one:
https://www.tensorflow.org/guide/images/tf_profiler/trace_viewer.png