...the one for the last epoch and not the best one for that experiment.
Well, now we realized there is an option to use "min_global" for the sign. Is this what we need?
Yes 🙂 (or "max_global")
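For context, a minimal sketch of where that sign is set (the project/metric names, parameter range, and base task id are placeholders, not from this thread):
```
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",  # placeholder: the experiment to clone and optimize
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=0.0001, max_value=0.1),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    # "min_global" / "max_global" track the best value reported so far,
    # instead of the value from the last epoch
    objective_metric_sign="min_global",
)
```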
potential sources of slowdown in the training code
Is there one?
but I can't tell: is that the only way to use the services queue, or can I experiment with it?
UnevenOstrich23 I'm not sure exactly what the question is, but if you are asking whether this is limited, the answer is no, it is not limited to that use case.
Specifically, you can run as many agents in "services-mode" as you need, pulling from any queue(s), and they can run any Task that is enqueued on those queues. There is no enforced limitation. Did that answer the question?
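For reference, spinning up such an agent is a one-liner, something like this (the queue name is just an example):
```
clearml-agent daemon --services-mode --queue services --detached
```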
Thanks LethalCentipede31 , I think (3) is the most stable solution (it doesn't require adding another package, and should work on any Python version / OS)
This is actually what we do for downloads.
Do you know if there is a minimum required version of the Python requests package?
JitteryCoyote63
I agree that its name is not search-engine friendly,
LOL 🙂
It was an internal joke; the guys decided to call it "Trains" because, you know, it trains...
It was unstoppable. We should probably do a line of merch with AI 🙂 🙂
Anyhow, this one definitely backfired...
For example, store inference results, explanations, etc., and then use them in a different process. I currently use a separate database for this.
You can use artifacts for complex data and then retrieve them programmatically.
Or you can manually report scalars / plots etc. with the Logger class, and you can also retrieve them with task.get_last_scalar_metrics
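A minimal sketch of both approaches (the project/task/artifact names and values are made up for illustration):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="inference store")

# complex data -> artifact (dicts, numpy arrays, pandas DataFrames, files, ...)
task.upload_artifact("inference_results", artifact_object={"preds": [0.1, 0.9]})

# simple metrics -> Logger
task.get_logger().report_scalar(title="accuracy", series="val", value=0.93, iteration=0)

# later, from a different process:
prev = Task.get_task(task_id=task.id)
results = prev.artifacts["inference_results"].get()  # deserialize the stored object
metrics = prev.get_last_scalar_metrics()             # nested dict of last/min/max per metric
```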
I see that you guys have made a lot of progress in the last two months! I'm excited to dig in
Thank you!
You can further di...
You mean why do you have two processes?
I'm assuming you are looking for the AWS autoscaler, spinning EC2 instances up/down and running daemons on them.
https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py
https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler
Hi RobustRat47
What do you mean by "log space for hyperparameter"? What would be the difference? (Notice that on the graph itself you can switch to log scale when viewing in the UI.)
Or are you referring to hyperparameter optimization, allowing you to add a log-space parameter range?
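If it is the latter, a sketch of what a log-space range looks like (the parameter name is an example, and if I remember the semantics correctly, min/max here are exponents of the base):
```
from clearml.automation import LogUniformParameterRange

# samples uniformly in log space, i.e. 10**-5 ... 10**-1
LogUniformParameterRange("General/lr", min_value=-5, max_value=-1)
```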
I see...
Generally speaking, if that is the case, I would think it is better to use docker mode; it offers a much more stable environment, regardless of the host machine running the agent. Notice there is no need to use custom containers: the agent will basically run the venv process, only inside a container, allowing you to reuse off-the-shelf containers.
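For example (the image is just a placeholder, any off-the-shelf image works):
```
clearml-agent daemon --queue default --docker python:3.9
```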
If you were to add this, where would you put it? I can use a modified version of
clearml-agent
Yep, that would b...
I think you are correct; it seems like it is missing the boto/azure/google requirements (I will make sure this is added). In the meantime, you can stop the "triton serving engine" Task, reset it, add boto3 to the installed packages, and relaunch.
That said, your main issue might be packaging the Python model. Basically you need to create a model from the entire folder (with whatever is inside it); then Triton should be able to run it (if the config.pbtxt is correct).
` m = OutputMo...
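The truncated snippet above presumably continued along these lines (a sketch; the folder path and framework are placeholders):
```
from clearml import Task, OutputModel

task = Task.current_task()
m = OutputModel(task=task, framework="pytorch")
# package the entire model folder (weights, config.pbtxt, etc.) as one model
m.update_weights_package(weights_path="/path/to/model_dir")
```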
On my to-do list, but it will have to wait for later this week (feel free to ping on this thread to remind me).
Regarding the issue at hand, let me check the requirements it is using.
Could not find a version that satisfies the requirement pytorch~=1.7.1
Seems like pytorch 1.7.1 has no package for Python 3.7?
Do I have to specify the full URI path?
No, it should be something like "s3://bucket"
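e.g. as the default output destination (the project/task names are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="train", output_uri="s3://bucket")
```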
So model file management is not fully managed, like it is for the datasets?
They are 🙂
See the log:
Collecting keras-contrib==2.0.8
File was already downloaded c:\users\mateus.ca\.clearml\pip-download-cache\cu0\keras_contrib-2.0.8-py3-none-any.whl
So it did download it, but it failed to pass it correctly?!
Can you try with clearml-agent==1.5.3rc2 ?
pywin32 isn't in my requirements file,
CloudySwallow27 what's the OS/env?
(pywin32 is not in the direct requirements of the agent)
I want to be able to install the venv on multiple servers and start the "simple" agents on each one of them. You can think of it as a kind of one-off agent for a specific (distributed) hyperparameter search task
ExcitedFish86 Oh if this is the case:
in your clearml.conf:
```
agent.package_manager.type: conda
agent.package_manager.conda_env_as_base_docker: true
```
https://github.com/allegroai/clearml-agent/blob/36073ad488fc141353a077a48651ab3fabb3d794/docs/clearml.conf#L60
https://git...
btw: you can also configure --extra-index-url in the agent's clearml.conf
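Something like this in clearml.conf (the URL is a placeholder):
```
agent.package_manager.extra_index_url: ["https://my.private.pypi/simple"]
```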
How do different tasks know which arguments were already dispatched if the arguments are generated at runtime?
A bit about how clearml-agent works (and actually how clearml itself works).
When running manually (i.e. not executed by an agent), Task.init (and similarly task.connect etc.) will log data on the Task itself (i.e. send arguments/parameters to the server). This includes logging the argparser, for example (and any other part of the automagic or manual connect).
When run...
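For example, on the manual-run side the automagic argparser logging looks like this (a minimal sketch, the names are made up):
```
import argparse
from clearml import Task

task = Task.init(project_name="examples", task_name="argparse demo")

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
args = parser.parse_args()  # clearml hooks argparse, so the values are logged on the Task
```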
So for example, if there was an idle GPU and Q3 takes it, and then a task comes to Q2 for which we specified 3 GPUs, but Q3 has already taken some of those GPUs, what will happen?
This is a standard "race": the first one to come will "grab" the GPU and the other will wait for it.
I'm pretty sure the enterprise edition has preemption support, but this is not currently part of the open source version (btw: the dynamic GPU allocation is also, I think, part of the enterprise tier; in the open source ...
Basically there are two options. The first is to spin up the clearml-k8s-glue as a k8s service.
This service takes clearml jobs and creates k8s jobs on your cluster.
The second option is to spin up agents inside pods statically; inside the pods the agents work in venv mode.
I know the enterprise edition has a more sophisticated k8s integration, where the glue also retains the clearml scheduling capabilities.
https://github.com/allegroai/clearml-agent/#kubernetes-integration-optional
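For reference, the glue can be launched with the example runner from that repo, something like (the queue name is an example):
```
python examples/k8s_glue_example.py --queue k8s_jobs
```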
Hi @<1628565287957696512:profile|AloofBat92>
Yeah, the name is confusing; we should probably change that. The idea is that it is a low-code / high-code way to train your own LLM and deploy it. Not really a 1:1 ChatGPT comparison; more like GenAI for enterprises. Makes sense?
Hi OddShrimp85
I think numpy 1.24.x is broken in a lot of places; we have noticed scikit breaks on it, TF and others 🙂
I will make sure we fix this one
maybe this can cause the issue?
Not likely.
In the original pipeline (the one executed from the Pycharm) do you see the "Pipeline" section under Configuration -> "Config objects" in the UI?
The first pipeline step is calling init
GiddyPeacock64 Is this enough to track all the steps?
I guess my main question is: is every step in the pipeline an actual Task/Job, or is it a single small function?
Kubeflow is great for simple DAGs, but when you need to build more complex logic it is usually a bit limited
(for example, the visibility into what's going on inside each step is missing, so you cannot make a decision based on that).
WDYT?
```
callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
)
```
Might be! What's the actual value you are passing there?
Regarding the helm chart, how did you get the link? http://github.io ? And the subdomain allegroai?