I found "scheduler" on allegroai github, is it something related to the case I want to make?
MoodyCentipede68 it is exactly what you are looking for 🙂
Do notice that you need to make sure you have your services queue configured and running for that to work 🙂
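For reference, a minimal TaskScheduler sketch (just a sketch: the task ID, names, queues and schedule below are placeholders you would adjust to your setup):
```
from clearml.automation import TaskScheduler

# create the scheduler controller
scheduler = TaskScheduler()

# re-launch an existing Task (clone + enqueue) every day at 08:30
scheduler.add_task(
    name='nightly-retrain',             # placeholder name
    schedule_task_id='<your_task_id>',  # placeholder Task ID
    queue='default',                    # execution queue for the scheduled Task
    hour=8, minute=30,
    recurring=True,
)

# run the scheduler controller itself as a Task on the services queue
scheduler.start_remotely(queue='services')
```
This is also why the services queue needs an agent running: the scheduler controller itself executes there.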
Hi, I changed it to 1.13.0, but it still threw the same error.
This is odd. Just so we can make the agent better, any chance you can send the Task log?
Check the log, the container has torch 1.13.0 but the task requires torch==1.13.1
Now, the torch package inside those nvidia prepackaged containers is compiled a bit differently. What I suspect happens is that the torch wheel from pytorch is not compatible with this container. Easiest fix: change the task requirements to 1.13
Wdyt?
btw: both should work fine
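If it helps, one way to pin the torch requirement from code is roughly the sketch below (you could also just edit the Task's "Installed Packages" in the UI; the exact version here is an assumption and should match whatever is inside the container):
```
from clearml import Task

# must be called *before* Task.init to affect the recorded requirements
Task.add_requirements("torch", "1.13.0")  # pin to the container's torch version

task = Task.init(project_name="examples", task_name="train-in-nvidia-container")
```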
Hi @<1542316991337992192:profile|AverageMoth57>
is this a follow up of this thread? None
BTW: Can you also please test with the latest clearml version, 1.7.2
Hi @<1523701949617147904:profile|PricklyRaven28>
Sorry, we missed that one
we need to invoke it with accelerate launch, so we use subprocess.run
So you have two options: either you change the script entry of the Task from your "script.py" to "-m accelerate launch script.py",
or you manually do that inside your entry point (i.e. call accelerate launch)
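For the second option, a minimal sketch of what the entry point could do (the training script name and its arguments here are just placeholders):
```
import subprocess
import sys

# launch the actual training through accelerate instead of running it directly
cmd = ["accelerate", "launch", "train_script.py", "--epochs", "10"]
result = subprocess.run(cmd)

# propagate the exit code so the Task status reflects failures
sys.exit(result.returncode)
```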
BTW, I "think" we added an "auto detect" for it, so that if you launched it manually this wa...
BTW
/home/local/user/.clearml/venvs-builds/3.7/bin/python: can't open file 'train.py': [Errno 2] No such file or directory
This error is from the agent, correct? It seems it did not clone the correct code. Is train.py committed to the repository?
Container environment setup overhead?
Xeon E3-1240: 4 - 5 hours! Wow... yes, definitely worth upgrading 🙂
It may have been killed or evicted or something after a day or 2.
Actually the ideal setup is to have a single "services" pod running all these services, with clearml-agent --services-mode. This Pod should always be on and pull jobs from a dedicated queue.
Maybe a nice way to do that is to have the single Task serialize itself, then have a Pod run the Task every X hours and spin it down
So I would like to know what it sends to the server to create the task/pipeline, ...
Which works for my purposes. Not sure if there's a good way to automate it
Interesting, so if we bind to hydra.compose it should solve the issue (and of course verify we are running on a jupyter notebook)
wdyt?
WickedGoat98 are you running the agent with --gpus?
I would say 4 vCPUs and 512GB storage, but it really depends on the load you will put on it
I'm assuming those errors are from the Triton containers? Were you able to run the simple pytorch mnist example serving from the repo?
WittyOwl57
To get task IDs use (e.g. all the tasks of a specific project):
```
from clearml import Task

task_ids = Task.query_tasks(project_name="examples", task_filter={'status': ["completed"]})
```
Then per task:
```
for t_id in task_ids:
    t = Task.get_task(t_id)
    conf_dict = t.get_configuration_as_dict(name="filter")
    task_param = t.get_parameters()
    task_param['filter'] = conf_dict
    # this is to enable to forcefully update parameters post execution
    t.mark_started(force=True)
    # update hyper-parameters
    t.set_parameters(task_param)
```
Exactly! nice 🎉
I have a timeseries dataset with dimension 1,60,1 where the first dimension is the number of data points and the second one is the timestep
I think it should be --input-size 1 60 if the last dimension is the batch size?
(BTW: this goes directly to Triton configuration, it is the information Triton needs in order to run the model itself)
DefiantHippopotamus88 you can create a custom endpoint and do that, but it will be running in the same instance, is this what you are after? Notice that Triton actually supports it already, you can check the pytorch example
Yes that's the part that is supposed to only pull the GPU usage for your process (and sub processes) instead of globally on the entire system
Hi @<1569858449813016576:profile|JumpyRaven4>
- The gunicorn logs do not show anything, including any error or trace of the 502; only siege reports the 502, as well as the ALB.
Is this an ALB or an ELB?
What timeout is it configured with?
Do you have GPU instances as well? What's the clearml-serving-inference docker version?
Hi @<1547028116780617728:profile|TimelyRabbit96>
Trying to do model inference on a video, so first step in Preprocess class is to extract frames.
Basically this depends on the REST API; usually you will be sending a link to the data to be processed and it is returned synchronously.
What you should do is have a custom endpoint doing the extraction, then send the raw data into another endpoint doing the model inference, basically think "pipeline" endpoints:
[None](https://github.com/allegro...
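For reference, this is roughly the shape of a custom Preprocess class for an endpoint, based on the clearml-serving examples (the frame-extraction logic is left as a placeholder, and the exact method signatures should be checked against the repo):
```
from typing import Any


class Preprocess(object):
    """Custom pre/post-processing code loaded by the clearml-serving endpoint."""

    def __init__(self):
        # called once when the endpoint is loaded; set up any state here
        pass

    def preprocess(self, body: dict, state: dict, collect_custom_statistics_fn=None) -> Any:
        # e.g. fetch the video referenced in the request and extract frames
        # (placeholder: return the request body unchanged)
        return body

    def postprocess(self, data: Any, state: dict, collect_custom_statistics_fn=None) -> dict:
        # pack the model output into a JSON-serializable response
        return {"output": data}
```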
Hi @<1641611252780240896:profile|SilkyFlamingo57>
It is not taking a new pull from the Git repository.
When you say it's not getting the latest, are you referring to a new run of the pipeline where the component being pulled is not pulling the latest from the branch, is that the issue?
When you click on the component Task details (i.e. right hand side panel "Full details"), what's the commit ID you have?
Lastly, is the component running on the same machine as the prev...
The image is
allegroai/clearml:1.0.2-108
Yep, that makes sense, seems like a backwards compatibility issue
An exporter would be nice, I agree; not sure it is on the roadmap at the moment 😞
Should not be very complicated to implement if you want to take a stab at it.
I think this is the only mount you need:
Data persisted in every Kubernetes volume by ClearML will be accessible in /tmp/clearml-kind folder on the host.
SuccessfulKoala55 is this correct?
Hi @<1695969549783928832:profile|ObedientTurkey46>
Use --services-mode in the agent; it will run many Tasks on the same machine. This is usually associated with the services queue, but it can be run on any queue. This way you could have the same machine easily running those multiple "control" tasks.
wdyt?
Hi @<1524922424720625664:profile|TartLeopard58>
Yes, this is the default; it is designed to serve multiple models and scale horizontally.