Hi @<1561885941545570304:profile|PunyKangaroo87>
What do you mean by storing data locally?
Like clearml-data, i.e. a Dataset?
You can always use file:///root/path/folder as the destination; this will store everything into that local folder. Is that what you mean?
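For example, something like this should do it (just a sketch; the dataset/project names and local path are placeholders):

    from clearml import Dataset

    # create a dataset and add local files (names/path are illustrative)
    dataset = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
    dataset.add_files(path="/data/raw")

    # upload to a local folder instead of a remote object store
    dataset.upload(output_url="file:///root/path/folder")
    dataset.finalize()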
BattyLizard6 to my knowledge, the main issue with fractional GPUs is that there is no real restriction on GPU memory allocation (with the exception of MIG slices, which are limited in other ways).
Basically, one process/container can consume all of the GPU RAM on the allocated card (this also applies to the http://run.ai fractional solution, at least from what I understand).
This means that developer A can allocate memory such that developer B on the same GPU will start getting out-of-memory errors
(Notice in a...
why would root cause the user to become nobody with group nogroup?
That is exactly the case: they inherit the cron service user (uid/gid), which would look like nobody/nogroup.
I pass my dataset as parameter of pipeline:
@<1523704757024198656:profile|MysteriousWalrus11> I think you were expecting the dataset_df dataframe to be automatically serialized and passed, is that correct?
If you are using add_step, all arguments are simple types (i.e. str, int, etc.).
If you want to pass complex types, your code should upload the object as an artifact, and then you can pass the artifact URL (or name) to the next step (see the sketch below).
Another option is to use pipeline from dec...
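A rough sketch of the artifact approach (the dataframe variable, artifact name, and task id lookup here are only placeholders):

    from clearml import Task

    # inside step A: upload the dataframe as an artifact
    task = Task.current_task()
    task.upload_artifact(name="dataset_df", artifact_object=dataset_df)

    # inside step B: fetch the artifact from step A by its task id (or a name lookup)
    step_a_task = Task.get_task(task_id=step_a_task_id)
    dataset_df = step_a_task.artifacts["dataset_df"].get()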
WackyRabbit7 this section is what you need, uncomment it and fill it in:
https://github.com/allegroai/trains/blob/c9fac89bcd87550b7eb40e6be64bd19d4384b515/docs/trains.conf#L88
Woot woot!
awesome, this RC is stable so feel free to use it; the official release is probably due out next week :)
Hi TrickySheep9
You should probably check the new https://github.com/allegroai/clearml-server-helm-cloud-ready helm chart 😉
at the end of the manual execution
JitteryCoyote63 that makes total sense!!
The reporting subprocess is not being updated with the new value! Let me check how we can pass it along...
Hi GrievingTurkey78
I think the main issue is the lack of support for jsonargparse, is that correct?
(vanilla PyTorch Lightning uses argparse, which seems to work out of the box)
basically
would allow blocking the machine from being scaled-in when
Oh this is what I was missing 🙂 That makes sense to me!
So what you are saying is: when the AWS autoscaler agent launches a Task, it sets a "protection flag" inside the container, and when the Task ends, it unsets the "protection flag".
Is that correct?
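Just to make sure we mean the same thing, a sketch of the concept using the AWS Auto Scaling API (this is not necessarily what the autoscaler does internally; the group name and instance id are placeholders):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # when a Task starts on the instance: protect it from scale-in
    autoscaling.set_instance_protection(
        AutoScalingGroupName="clearml-autoscaler-group",
        InstanceIds=["i-0123456789abcdef0"],
        ProtectedFromScaleIn=True,
    )

    # when the Task ends: remove the protection
    autoscaling.set_instance_protection(
        AutoScalingGroupName="clearml-autoscaler-group",
        InstanceIds=["i-0123456789abcdef0"],
        ProtectedFromScaleIn=False,
    )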
Hi @<1556812486840160256:profile|SuccessfulRaven86>
it does not when I run a flask command inside my codebase. Is this expected behavior? Do you have any workarounds for this?
Hmm where do you have your Task.init ?
(btw: what's the use case of a flask app tracking?)
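For reference, something along these lines usually works (a minimal sketch, names are placeholders), i.e. calling Task.init once at import time, before the Flask routes are served:

    from clearml import Task
    from flask import Flask

    # initialize tracking once, at import time (project/task names are illustrative)
    task = Task.init(project_name="examples", task_name="flask app")

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "ok"

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)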
Then I deleted those workers,
How did you delete those workers? The autoscaler is supposed to spin the EC2 instances down when they are idle; in theory there is no need to spin them down manually.
A single query will tell you whether the agent is running anything, and for how long, but I do not think you can get the idle time ...
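Something like this for the single query (a sketch assuming the APIClient workers endpoint; the exact fields may differ between server versions):

    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    for worker in client.workers.get_all():
        # the task field is assumed to be set only while the agent is running something
        running = getattr(worker, "task", None)
        print(worker.id, "running:", running.id if running else "idle")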
ResponsiveCamel97
could you attach the full log?
Quick update, I found the issue, working on a fix 🙂
Thanks CynicalBee90, I appreciate the discussion! Since I'm assuming you will actually amend the misrepresentation in your table, let me follow up here.
1.
SSPL licensing may be a significant consideration for some, and so we thought it was important to point this out clearly.
SSPL is fully open-source compliant unless you intend to sell it as a service. I hardly think this is any user's consideration, just like anyone would be using MongoDB or Elasticsearch without think...
One additional question: if you import clearml after you import torch, does it work?
It will store the entire content of the file; you can then edit it in the UI, and when running remotely it will return a new local copy of the file (based on the data in the UI) for you to read.
Not sure what the "right way" is 🙂
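For example, with connect_configuration (a sketch; the project/task names and file name are placeholders):

    from clearml import Task

    task = Task.init(project_name="examples", task_name="config example")

    # locally: uploads the file content; remotely: returns a local copy
    # built from whatever is currently stored/edited in the UI
    config_path = task.connect_configuration(configuration="config.yaml", name="my config")

    with open(config_path, "rt") as f:
        print(f.read())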
But I do pkill -f "trains-agent --gpus 0"
This will kill a process that was started with "trains-agent --gpus 0". Notice it matches the command pattern, so it has to match the way you executed the agent. You can check it with ps -Af | grep trains-agent
So there is no copying of the data to the pod; it is simply referenced via the EFS
Correct
if so, are there any docs/examples about this?
Good point, passing to docs 🙂
https://github.com/allegroai/clearml/blob/51af6e833ddc5a8ba1efaaf75980f58616b25e85/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py#L123
I mean it is mentioned, but we should highlight it better
@<1671689437261598720:profile|FranticWhale40> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
making me realize that this may have been optional
I think it is optional, and this is why it was not entered in the first place.
Could you double check and just remove it from your manual pbtxt ?
Thanks @<1671689437261598720:profile|FranticWhale40> !
I was able to locate the issue; a fix should be released later today (or worst case tomorrow)
Hi @<1671689437261598720:profile|FranticWhale40>
Are you positive the Triton container finished syncing?
Could you provide the docker logs (both the serving and the triton containers)?
What is the clearml-serving version you are using?
Could you add a print in the "preprocess" function, just to validate you are getting to the correct model version?
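Something like this, assuming the usual clearml-serving Preprocess class layout (the exact method signature may differ between clearml-serving versions):

    # preprocess.py used by the serving endpoint (sketch)
    class Preprocess(object):
        def __init__(self):
            pass

        def preprocess(self, body, state, collect_custom_statistics_fn=None):
            # temporary debug print to verify the correct model version is being hit
            print("preprocess called, request body:", body)
            return body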
we also provide a custom aux-config file. We also had to make sure to update the name inside config.pbtxt so that Triton is happy:
Good point, what would be the logic of the automatic "config.pbtxt" patching we should employ?
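For the sake of discussion, one possible logic (purely a sketch, not the current implementation; the helper below is hypothetical) would be to read the existing config.pbtxt and only force the name field to match the model repository folder:

    import re
    from pathlib import Path

    def patch_config_pbtxt(config_path: str, model_dir_name: str) -> None:
        # hypothetical helper: make the "name" field match the model folder Triton expects
        path = Path(config_path)
        text = path.read_text()
        if re.search(r'^\s*name\s*:', text, flags=re.MULTILINE):
            text = re.sub(r'^\s*name\s*:\s*".*?"',
                          'name: "{}"'.format(model_dir_name),
                          text, count=1, flags=re.MULTILINE)
        else:
            text = 'name: "{}"\n'.format(model_dir_name) + text
        path.write_text(text)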