I still have my tasks I ran remotely and they don't show any uncommitted changes. @<1540142651142049792:profile|BurlyHorse22> are you sure the remote machine is running transformers from the latest github branch, instead of from the package?
If it all looks fine, can you please install transformers from this repo (branch main) and rerun? It might be that not all my fixes came through
Great to hear! Then it comes down to waiting for the next Hugging Face release!
effectively making us lose 24 hours of GPU compute
Oof, sorry about that, man 😞
Well I'll be had, you're 100% right, I can recreate the issue. I'm logging it as a bug now and we'll fix it asap! Thanks for sharing!!
If I'm not mistaken:
Fileserver - Model files and artifacts
MongoDB - all experiment objects are saved there.
Elastic - Console logs, debug samples and scalars are all saved there.
Redis - caching regarding agents I think
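For reference, those map roughly onto the services in the self-hosted docker-compose file. This is illustrative only (service names and images are my assumption here, check the compose file shipped with clearml-server for the real values):

```yaml
# Illustrative sketch, not the real clearml-server compose file
services:
  fileserver:       # model files and artifacts
    image: allegroai/clearml:latest
  mongo:            # experiment objects (tasks, projects, ...)
    image: mongo:4.4
  elasticsearch:    # console logs, debug samples, scalars
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
  redis:            # caching used by the agents
    image: redis:6
```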
It looks like you need to add the compute.imageUser role to your credentials.
Did you by any chance set up the autoscaler to use a custom image? It's trying to use ‘projects/image-processing/global/images/image-for-clearml’ which is a path I don't recognise. Is this your own, custom image? If so, we can add this role to the documentation as required when using a custom image 🙂
It should, but please check first. This is some code I quickly made for myself. I did make tests for it, but it would be nice to hear from someone else that it worked (as evidenced by the error above 😅 )
Wow! Awesome to hear :D
I'll update you once I have more!
Damn it, you're right 😅
```python
# Allow ClearML access to the training args and allow it
# to override the arguments for remote execution
args_class = type(training_args)
args, changed_keys = cast_keys_to_string(training_args.to_dict())
Task.current_task().connect(args)
training_args = args_class(**cast_keys_back(args, changed_keys))
```
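For context, `cast_keys_to_string` / `cast_keys_back` are helper functions from my patch. A minimal sketch of what they do (behaviour inferred from the snippet above, not the exact implementation):

```python
# Sketch of the cast helpers used above: ClearML's connect() wants string
# keys, so we stringify any non-string dict keys and remember which ones
# were changed so they can be restored afterwards.
def cast_keys_to_string(d):
    args, changed_keys = {}, {}
    for key, value in d.items():
        str_key = str(key)
        if not isinstance(key, str):
            changed_keys[str_key] = key  # remember the original key
        args[str_key] = value
    return args, changed_keys

def cast_keys_back(d, changed_keys):
    # restore any keys that were stringified on the way in
    return {changed_keys.get(key, key): value for key, value in d.items()}
```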
Doing this might actually help with the previous issue as well, because when there are multiple docker containers running they might interfere with each other 🙂
Indeed that should be the case. By default debian is used, but it's good that you ran with a custom image, because now we know the docs don't make it clear that more permissions are needed in that case
Great! Please let me know if it works when adding this permission, we'll update the docs in a jiffy!
Are you running a self-hosted/enterprise server or on app.clear.ml? Can you confirm that the field in the screenshot is empty for you?
Or are you using the SDK to create an autoscaler script?
Is it not filled in by default?
I can see 2 kinds of errors:
Error: Failed to initialize NVML and
Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
These 2 lines make me think something went wrong with the GPU itself. Chances are you won't be able to run
nvidia-smi either; this looks like a non-ClearML issue 🙂 It might be that Triton hogs the GPU memory if not properly closed down (double Ctrl-C). It says the driver ver...
What might also help is to look inside the triton docker container while it's running. You can check the example; there should be a pbtxt file in there. Just to double-check that it is also in your own folder
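For reference, a Triton model config.pbtxt looks roughly like this (model name, tensor names and dims are placeholders; compare against the one in the example folder):

```
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```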
This looks to me like a permission issue on the GCP side. Do your GCP credentials have the compute.images.useReadOnly permission set? It looks like the worker needs that permission to be able to pull the images correctly 🙂
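If it's missing, something along these lines should grant it (the project and service account names are placeholders; roles/compute.imageUser is the predefined GCP role that includes compute.images.useReadOnly):

```shell
# Placeholder project and service account, adjust to your setup
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:autoscaler@my-project.iam.gserviceaccount.com" \
  --role="roles/compute.imageUser"
```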
I'm using image and machine image interchangeably here. It is quite weird that it is still giving the same error, the error clearly asked for
"Required 'compute.images.useReadOnly' permission for 'projects/image-processing/global/images/image-for-clearml'" 🤔
Also, now I see your credentials even have the role of compute admin, which I would expect to be sufficient.
I see 2 ways forward:
- Try running the autoscaler with the default machine image and see if it launches correctly
Hi Alejandro! I'm running the exact same Chromium version, but haven't encountered the problem yet. Are there specific parameter types where it happens more often?
Nice! Well found and thanks for posting the solution!
May I ask out of curiosity, why mount X11? Are you planning to use a GUI app on the k8s cluster?
Could you try and create a new task with the tag already added? Instead of adding a tag on an existing task. It should work then. If it does, this might be a bug? Or if not, a good feature to exist 🙂
Hey @<1523701949617147904:profile|PricklyRaven28> I'm checking! Have you updated anything else and on which exact commit of transformers are you now?
I'm not quite sure what you mean here? From the docs it seems like you should be able to simply send an HTTP request to the localhost url to get the metrics. Is this not working for you? Otherwise, all the metrics end up in Prometheus, so you can also query that instead or use something like Grafana to visualize it
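To illustrate what comes back from that endpoint: Prometheus exposes metrics in a plain text format that's easy to consume directly. A simplified parser (the sample payload is made up, and real labels can contain spaces, which this sketch ignores):

```python
# Parse the Prometheus text exposition format into {metric: value}.
# Simplified: skips HELP/TYPE comment lines, assumes no spaces in labels.
def parse_prometheus_metrics(payload):
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Made-up sample of what a scrape might return
sample = """\
# HELP requests_total Total requests served
# TYPE requests_total counter
requests_total{endpoint="predict"} 42
request_latency_seconds_sum 1.5
"""
print(parse_prometheus_metrics(sample))
```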
No worries! Just so I understand fully though: you were already using the patch with success from my branch. Now that it has been merged into transformers main branch you installed it from there, and that's when you started having issues with models not being saved? Then installing transformers 4.21.3 fixes it (which should have the old ClearML integration, even before the patch)?
No worries! And thanks for putting in the time.
To be honest, I'm not completely sure as I've never tried hundreds of endpoints myself. In theory, yes, it should be possible: Triton, FastAPI and Intel OneAPI (ClearML building blocks) all claim they can handle that kind of load, but again, I've not tested it myself.
To answer the second question, yes! You can basically use the "type" of model to decide where it should be run. You always have the custom model option if you want to run it yourself too 🙂
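As a sketch of the "route by model type" idea (the type names and handler functions here are hypothetical illustrations, not ClearML Serving's API):

```python
# Hypothetical routing table: pick a serving backend based on the
# model's type tag. Names are illustrative only.
HANDLERS = {
    "onnx": lambda model: f"triton:{model}",       # GPU-backed endpoint
    "sklearn": lambda model: f"cpu-pool:{model}",  # lightweight CPU endpoint
}

def route_model(model_name, model_type, custom_handler=None):
    # "custom model" escape hatch: run it yourself
    if custom_handler is not None:
        return custom_handler(model_name)
    handler = HANDLERS.get(model_type)
    if handler is None:
        raise ValueError(f"No handler for model type {model_type!r}")
    return handler(model_name)
```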
Hi! You should add extra packages in your docker-compose through your env file; they'll get installed when building the serving container. In this case you're missing the transformers package.
You'll also get the same explanation here.
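Concretely, in the env file used by the clearml-serving docker-compose it would look something like this (variable name taken from the clearml-serving repo; the package pin is just an example):

```shell
# Extra pip packages installed into the serving container
CLEARML_EXTRA_PYTHON_PACKAGES="transformers"
```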
I must say I don't really know where this comes from. As far as I understand, the agent should install the packages exactly as they are saved on the task itself. Can you go to the original experiment of the pipeline step in question? (You can do this by selecting the step and clicking on "Full Details" in the info panel.) There, under the Execution tab, you should see which version the task detected.
The task itself will try to autodetect t...