Hi @<1541954607595393024:profile|BattyCrocodile47>
Can you help me make the case for ClearML pipelines/tasks vs Metaflow?
Based on my understanding
- Metaflow cannot have custom containers per step (at least I could not find where to push them)
- DAG-only execution, i.e. you cannot have logic-driven flows
- cannot connect git repositories to different components in the pipeline
- Visualization of results / artifacts is rather limited
- Only Kubernetes is supported as underlying prov...
Would this be best if it were executed in the Triton execution environment?
It seems the issue is unrelated to the Triton ...
Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?
Yep, that should do it
I would start simple, no need to get the docker itself; it seems like a clearml credentials issue?!
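If you do end up needing the container, a rough sketch of the build command (the task ID and target image name here are placeholders):
clearml-agent build --id <serving_task_id> --docker --target clearml-serving-test
That should build a docker image from the Task's environment, which you can then run interactively.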
Like what would be the exact query given an endpoint, for requests per sec.
You mean in Grafana ?
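If so, a rough PromQL sketch for requests per second would be something like (the metric and label names here are hypothetical, check what your clearml-serving instance actually exports to Prometheus):
sum(rate(requests_total{endpoint="my_endpoint"}[1m]))
i.e. rate() over the per-endpoint request counter, summed across serving instances.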
Sorry I missed the additional "." in the _update_requirements
Let me check ....
Hi @<1569858449813016576:profile|JumpyRaven4>
- The gunicorn logs do not show anything, including any error or trace of the 502; only siege reports the 502, as well as the ALB.
Is this an ALB or an ELB ?
What's the timeout it's configured with?
Do you have GPU instances as well? What's the clearml-serving-inference docker version ?
Regarding the first direction, this was just pushed
https://github.com/allegroai/clearml/commit/597a7ed05e2376ec48604465cf5ebd752cebae9c
Regarding the opposite direction:
That is a good question, I really like the idea of just adding another section named Datasets
SucculentBeetle7 should we do that automatically?
Hi @<1610083503607648256:profile|DiminutiveToad80>
Yes, it does. They are also cached by default (on the machine with the agent)
None
Hi UptightBeetle98
The hyperparameter example assumes you have agents (trains-agent) connected to your account. These agents will pull the jobs from the queue (which they are now, aka pending), set up the environment for the jobs (venv or docker+venv), and execute the job with the specific arguments the optimizer chose.
Make sense ?
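For reference, a minimal sketch of such an optimizer script (the base task ID, queue name and parameter names below are placeholders, adjust them to your experiment):
from clearml.automation import (
    HyperParameterOptimizer, UniformParameterRange, DiscreteParameterRange)

optimizer = HyperParameterOptimizer(
    base_task_id='<template_task_id>',   # the experiment the optimizer will clone
    hyper_parameters=[
        UniformParameterRange('General/learning_rate', min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange('General/batch_size', values=[16, 32, 64]),
    ],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    max_number_of_concurrent_tasks=2,
    execution_queue='default',           # the queue your agents are listening on
)
optimizer.start()   # clones + enqueues the experiments for the agents to pull
optimizer.wait()
optimizer.stop()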
The RC you can see on the main readme (for some reason the Conda badge will show the RC and the PyPI badge won't)
https://github.com/allegroai/clearml/
If you take a look here, the returned objects are automatically serialized and stored on the files server or object storage, and also deserialized when passed to the next step.
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
You can of course do the same manually
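A minimal sketch of the decorator flow (the project/pipeline names and the CSV URL are just placeholders):
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['data'], cache=True)
def load_data(source_url):
    import pandas as pd
    # the returned DataFrame is serialized and stored as an artifact automatically
    return pd.read_csv(source_url)

@PipelineDecorator.component(return_values=['n_rows'])
def count_rows(data):
    # "data" was deserialized back into a DataFrame before this step started
    return len(data)

@PipelineDecorator.pipeline(name='example pipeline', project='examples', version='0.1')
def run_pipeline(source_url):
    data = load_data(source_url)
    return count_rows(data)

if __name__ == '__main__':
    PipelineDecorator.run_locally()   # drop this line to run the steps on agents
    run_pipeline('https://example.com/some.csv')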
GiganticTurtle0
What do you mean by "reuse_last_task_id" ? Each component always generates a new Task (unless it is cached, in which case it will reuse the previously executed one)
What am I missing here?
RoundMosquito25 how is that possible ? could it be they are connected to a different server ?
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind
How are you starting the agent?
JitteryCoyote63 I remember something with "!" in the name or maybe "/" in the name that might cause this behavior. May I suggest checking with clearml-server 1.3 ?
The other way will not work: if you start with "pip" you cannot fail ... (if you fail, it's at runtime, which is too late)
Hi JumpyPig73
import data from old experiments into the dashboard.
what do you mean by "old experiments" ?
Hi @<1607909176359522304:profile|UnevenCow76>
followed the below documentation to implement the clearml monitoring using prometheus and grafana
Did you try following this example? It includes both deploying a model and adding Grafana metrics:
None
Found it
GiganticTurtle0 you are 🧨 ! thank you for stumbling across this one as well.
Fix will be pushed later today
The easiest is export_task / update_task:
https://allegro.ai/docs/task.html#trains.task.Task.export_task
https://allegro.ai/docs/task.html#trains.task.Task.update_task
Check the structure returned by export_task, you'll find the entire configuration text there,
then, you can use that to update back the Task.
BTW:
Partial update is also supported...
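Roughly (a minimal sketch, assuming you already have the task ID):
from clearml import Task

task = Task.get_task(task_id='<task_id>')
exported = task.export_task()           # the full task structure as a dict
# ... inspect / modify e.g. the configuration section of "exported" here ...
task.update_task(task_data=exported)    # push the (partially) modified structure back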
ShortElephant92 yep, this is definitely an enterprise feature
But you can configure user/pass on the open source version, and even store the passwords hashed if you need.
Hi JitteryCoyote63
Yes I think you are correct, since torch is installed automatically as a requirement by pip, the agent is not aware of it, so it cannot download the correct one.
I think the easiest is just to add torch as an additional package:
# call before Task.init()
Task.add_requirements(package_name="torch", package_version="==1.7.1")
PungentLouse55 could you test with 0.15.2rc0 and see if there is any difference ?
TenseOstrich47 / PleasantGiraffe85
The next version (I think releasing today) will already contain scheduling, and the next one (probably an RC right after) will include triggering. That said, currently the UI wizard for both (i.e. creating the triggers) is only available in the community hosted service. Still, I think that creating them from code (triggers/schedule) actually makes a lot of sense; see the sketch below.
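A rough sketch of what the code-based scheduling could look like (the task ID, queues and the actual schedule are placeholders):
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# clone + enqueue the given task every day at 07:30 on the "default" queue
scheduler.add_task(
    schedule_task_id='<task_id>',
    queue='default',
    minute=30,
    hour=7,
    recurring=True,
)
scheduler.start_remotely(queue='services')   # or scheduler.start() to run it locally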
pipeline presented in a clear UI,
This is actually actively worked on, I think Anxious...
E.g. I might need to have different N-numbers for the local and remote (ClearML) storage.
Hmm yes, that makes sense
That'd be a great solution, thanks! I'll create a PR shortly
Thank you! 🤩
SlipperyDove40 Yes, there is TRAINS_CONFIG_FILE
https://allegro.ai/docs/faq/faq/#trains-configuration
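For example, to point the SDK at a non-default configuration file (the path here is just an example):
export TRAINS_CONFIG_FILE=/path/to/my_trains.conf
and that file will be used instead of the default ~/trains.conf.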
Done!
Thanks
fatal: unable to find a suitable socket path; use --socket
I think that's the root cause, we should probably also add https://github.com/allegroai/trains-agent/issues/16
PungentLouse55 you can find the metrics in the "original" (aka base template) experiment.
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
Interesting idea! (I assume for reporting only, not configuration)
... The agent mentioned used output from nvcc (2) ...
The dependencies I shared are not how the agent works, but how Nvidia CUDA works
Regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from nvid...
well cudnn is actually missing from the base image...