
I see. Are you able to manually boot a VM on GCP, then manually SSH into it and run the docker login command from there? Just to be able to rule out networking or permissions as possible issues.
Hi @<1529633462108033024:profile|SucculentCrab55> ! In which step do you get this error? I assume get_data? Does it work locally?
It is not filled in by default?
projects/debian-cloud/global/images/debian-10-buster-v20210721
Maybe you can add https://clear.ml/docs/latest/docs/references/sdk/automation_controller_pipelinecontroller/#set_default_execution_queue to your PipelineController, and have the actual value linked to a pipeline parameter? So when you create a new run, you can manually enter a queue name and the parameter will be used by the pipeline controller script to set the default execution queue.
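A minimal sketch of that idea, assuming the overridden value can be read back with get_parameters() (project, pipeline and parameter names are placeholders):

```python
from clearml import PipelineController

pipe = PipelineController(
    name="my pipeline",   # placeholder name
    project="examples",   # placeholder project
    version="1.0.0",
)

# Expose the queue name as a pipeline parameter so it can be edited in the UI
# whenever a new run is created from this pipeline.
pipe.add_parameter(name="execution_queue", default="default")

# Use the (possibly overridden) parameter value as the default execution queue
# for all steps. Reading it back via get_parameters() is an assumption here;
# adapt this to however you surface parameter values in your controller script.
queue_name = pipe.get_parameters().get("execution_queue", "default")
pipe.set_default_execution_queue(queue_name)

# ... pipe.add_function_step(...) calls go here ...

pipe.start_locally(run_pipeline_steps_locally=False)
```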
I can see 2 kinds of errors: "Error: Failed to initialize NVML" and "Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version". These 2 lines make me think something went wrong with the GPU itself. Chances are you won't be able to run nvidia-smi either.
This looks like a non-clearml issue 🙂 It might be that triton hogs the GPU memory if not properly closed down (double ctrl-c). It says the driver ver...
Thank you so much, sorry for the inconvenience and thank you for your patience! I've pushed it internally and we're looking for a patch 🙂
Hi ExasperatedCrocodile76
In terms of the Alias, yes, you don't need to specify one. Recently we changed things so that if you do add an alias, the dataset alias and ID will automatically be added in the experiment manager. So it's more a print that says: "hey, you might want to try this!"
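A minimal sketch of what that alias print refers to, assuming your clearml version's Dataset.get accepts an alias argument (project and names are placeholders):

```python
from clearml import Dataset

# The alias is optional; when given, the resolved dataset ID is logged under
# that alias in the consuming task, so the experiment manager shows exactly
# which dataset version this run used.
dataset = Dataset.get(
    dataset_project="urbansounds8k",   # placeholder project
    dataset_name="preprocessed",       # placeholder dataset name
    alias="train_data",                # optional alias registered on the task
)
local_path = dataset.get_local_copy()
```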
Second, this is my own version of the example that's no longer maintained. The official version is here: https://github.com/allegroai/clearml-blogs/tree/master/urbansounds8k
We changed the link in that v...
ExuberantBat52 The dataset alias thing giving you multiple prompts is still an issue I think, but it's on the backlog of our devs 😄
Hmm, I think we might have to make it clearer in the documentation then? What would have helped you before you figured it out? (Great job BTW, thanks for the updates on it :))
Hey @<1526371965655322624:profile|NuttyCamel41> Thanks for coming back on this and sorry for the late reply. This looks like a bug indeed, especially because it seems to be working when coming from the clearml servers.
Would you mind just copy pasting this info into a github issue on clearml-serving repo? Then we can track the progress we make at fixing it 🙂
Hi @<1547028116780617728:profile|TimelyRabbit96> Awesome that you managed to get it working!
Wow! Awesome to hear :D
Hi PanickyMoth78,
I've just recreated your example and it works for me on clearml==1.6.2, but indeed not on clearml==1.6.3rc1, which means we have some work to do before the full release 🙂 Can you try on clearml==1.6.2 to check that it does work there?
Yes, with docker, auto-restarting containers is definitely a thing 🙂 We set the containers to restart automatically (a reboot will do that too), so that when a container crashes it restarts immediately, say in a production environment.
So the best thing to do there is to use docker ps to get all running containers and then kill them using docker kill <container_id>. ChatGPT tells me this command should kill all currently running containers: docker rm -f $(docker ps -aq)
And I...
Are you running a self-hosted/enterprise server or on app.clear.ml? Can you confirm that the field in the screenshot is empty for you?
Or are you using the SDK to create an autoscaler script?
@<1547028116780617728:profile|TimelyRabbit96>
Pipelines have little to do with serving, so let's not focus on that for now.
Instead, if you need an ensemble_scheduling block, you can use the CLI's --aux-config flag to add any extra stuff that needs to be in the config.pbtxt. For example here, under step 2 of the Setup section, we use the --aux-config flag to add a dynamic batching block.
In order to prevent these kinds of collisions it's always necessary to provide a parent dataset ID at creation time, so it's very clear which dataset an updated one is based on. If multiple updates happen at the same time, they won't know of each other and both use the same dataset as the parent. This will lead to 2 new versions based on the same parent dataset, but not sharing data with each other. If that happens, you could create a 3rd dataset (potentially automatically) that can have bot...
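A minimal sketch of explicitly declaring the parent at creation time (project, dataset names and the file path are placeholders):

```python
from clearml import Dataset

# Fetch the version we want to build on top of.
parent = Dataset.get(dataset_project="examples", dataset_name="my-dataset")

# Declare the parent explicitly, so the new version's lineage stays unambiguous
# even if another writer creates a version at the same moment.
child = Dataset.create(
    dataset_project="examples",
    dataset_name="my-dataset",
    parent_datasets=[parent.id],
)
child.add_files("path/to/new_or_changed_files")
child.upload()
child.finalize()
```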
Can you share the exact error message? That will help a ton!
I'm using image and machine image interchangeably here. It is quite weird that it is still giving the same error; the error clearly asks for "Required 'compute.images.useReadOnly' permission for 'projects/image-processing/global/images/image-for-clearml'"
🤔
Also, now I see your credentials even have the role of compute admin, which I would expect to be sufficient.
I see 2 ways forward:
- Try running the autoscaler with the default machine image and see if it launches correctly
- Dou...
Indeed, that should be the case. By default Debian is used, but it's good that you ran with a custom image, because now we know it isn't made clear that more permissions are needed.
Wow, awesome! Really nice find! Would you mind compiling your findings into a GitHub issue? Then we can help you search better :) This info is enough to get us going at least!
Hi! You should add extra packages in your docker-compose through your env file; they'll get installed when building the serving container. In this case you're missing the transformers package.
You'll also get the same explanation here.
It depends on how complex your configuration is, but if config elements are all that will change between versions (i.e. not the code itself) then you could consider using parameter overrides.
A ClearML Task can have a number of "hyperparameters" attached to it. But once that task is cloned and in draft mode, one can EDIT these parameters and change them. If the task is then queued, the new parameters will be injected into the code itself.
A pipeline is no different, it can have pipeline par...
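A minimal sketch of that clone / edit / enqueue flow from the SDK (project, task and queue names are placeholders):

```python
from clearml import Task

# --- in the training script: register the parameters on the task ---
params = {"batch_size": 32, "learning_rate": 0.001}
task = Task.init(project_name="examples", task_name="train")
task.connect(params)  # overridden values are injected back here when an agent runs the task

# --- elsewhere: clone the task, edit a parameter on the draft, then enqueue it ---
template = Task.get_task(project_name="examples", task_name="train")
cloned = Task.clone(source_task=template, name="train (new lr)")
cloned.set_parameter("General/learning_rate", 0.01)  # dicts connected via task.connect() land under "General"
Task.enqueue(cloned, queue_name="default")
```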
Hi UnevenBee3, the OptimizerOptuna class should already be able to prune any bad tasks, provided the model itself is iteration-based (so not SVMs etc., since early stopping needs iterations). You can read our blogpost here: https://clear.ml/blog/how-to-do-hyperparameter-optimization-better/
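A minimal sketch of wiring up OptimizerOptuna, assuming an existing iteration-based training task to clone (the base task ID, metric names, parameter names and queue are placeholders):

```python
from clearml import Task
from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    UniformParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna

# The optimizer runs as its own controller task.
Task.init(project_name="examples", task_name="HPO with Optuna",
          task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<base_training_task_id>",  # placeholder: the training task to clone
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/batch_size", values=[16, 32, 64]),
    ],
    # the scalar the training task reports; Optuna optimizes (and prunes) on it
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",
    max_number_of_concurrent_tasks=2,
    total_max_jobs=20,
    min_iteration_per_job=10,   # pruning only makes sense for iteration-based training
    max_iteration_per_job=100,
)

optimizer.set_time_limit(in_minutes=120.0)
optimizer.start()
optimizer.wait()
optimizer.stop()
```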
Hey! Sorry, didn't fully read your question and missed that you already did it. It should not be done inside the clearml-serving-triton service but instead inside the clearml-serving-inference service. This is where the preprocessing script is run, and it seems to be where the error is coming from.
I'll update you once I have more!
This update was just to modernize the example itself 🙂
ExasperatedCrocodile76 I have run the example and even with my fix it is not ideal. The sample is relatively old now, so I'll revise it asap. Thank you for noticing it!