Check your agent logs (through the ClearML console tab) and see if any error is thrown.
What is probably happening is that your agent tries to upload the model but fails due to some kind of networking/firewall/port issue. For example: make sure your self-hosted server is bound to 0.0.0.0, so it can accept external connections other than localhost.
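On the agent side, the clearml.conf should then point at an address that is actually reachable from that machine instead of localhost. A rough illustration (the IP is just a placeholder; the ports are the default server ports):
api {
    web_server: http://192.168.1.10:8080
    api_server: http://192.168.1.10:8008
    files_server: http://192.168.1.10:8081
}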
Hi! Have you run clearml-serving create ...
first? Usually you'd create what's called a "control plane task" first, which will hold all of your configuration. Step 4 in the initial setup instructions is where you'll find it!
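For reference, creating that control plane task is a single CLI call, roughly like this (the name and project are just examples):
clearml-serving create --name "serving example" --project "DevOps"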
Hi ReassuredTiger98 !
I'm not sure the above will work. Maybe I can help in another way though: when you want to set agent.package_manager.system_site_packages = true
does that mean you have a docker container with some of the correct packages installed? In case you use a docker container, there is little to no need to create a virtualenv anyway, and you might use the env var CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1
to just install all packages in the root environment.
Because ev...
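For what it's worth, a minimal sketch of how that env var could be passed into the container when the agent runs in docker mode (assuming you go through clearml.conf; the extra_docker_arguments key is from the standard agent config):
agent {
    # passed straight to docker run, so the variable is set inside the task container
    extra_docker_arguments: ["-e", "CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1"]
}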
You're not the first one with this problem, so I think I'll ask the devs to maybe add it as a parameter for clearml-agent
That way it will show up in the docs and you might have found it sooner. Do you think that would help?
Thanks! I'm checking now, but might take a little (meeting in between)
I'm sorry, but I will need more context. Where exactly is this log from? Can you confirm you're working with a self-hosted open source server? Which container/microservice is giving you this last error message?
Hmm, I can't really follow your explanation. The removed file SHOULD not exist, right? 😅 And what do you mean exactly with the last sentence? An artifact is an output generated as part of a task. Can you show me what you mean, with screenshots for example?
Indeed, that should be the case. By default a Debian image is used, but it's good that you ran with a custom image; now we know it isn't made clear that more permissions are needed.
This looks to me like a permission issue on GCP side. Do your GCP credentials have the compute.images.useReadOnly
permission set? It looks like the worker needs that permission to be able to pull the images correctly 🙂
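If it helps, that permission is included in the predefined roles/compute.imageUser role, so granting it to the worker's service account would look roughly like this (project and service account names are placeholders):
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:autoscaler@my-project.iam.gserviceaccount.com" \
    --role="roles/compute.imageUser"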
Hmm, I think we might have to make it clearer in the documentation then. How would you have been helped before you figured it out? (Great job BTW, thanks for the updates on it :))
Alright, a bit of searching later and I've found 2 things:
- You were right about the task! I've staged a fix here. It basically detects whether a task is already running (e.g. from the PipelineDecorator component) and, if so, uses that task instead. We should probably do this for all of our integrations.
- But then I found another bug. Basically the pipeline decorator task wou...
Hi @<1523701062857396224:profile|AttractiveShrimp45> , I'm checking your issue myself. Do you see any duplicate experiments in the summary table?
It is not filled in by default?
projects/debian-cloud/global/images/debian-10-buster-v20210721
Hey @<1523701949617147904:profile|PricklyRaven28> , So as discussed above there were 2 issues. The first one is still waiting on the second, it's on the backlog of our devs and should be done soon(tm).
That said, in the meantime I also wanted to do fun stuff with transformers, so I've written a quick hack that deals with the bug. It's basically 2 functions that keep track of which types of keys are in the dict.
def cast_keys_to_string(d, changed_keys=dict()):
    nd = dict()
    for k...
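For anyone reading along, the rough idea looks something like this (a sketch of the approach, not necessarily the exact code of the hack; cast_keys_back is my name for the reverse helper):
def cast_keys_to_string(d, changed_keys=dict()):
    # Recursively cast every dict key to a string, remembering the original
    # key objects so they can be restored later.
    nd = dict()
    for key, value in d.items():
        new_key = key if isinstance(key, str) else str(key)
        if new_key is not key:
            changed_keys[new_key] = key
        nd[new_key] = cast_keys_to_string(value, changed_keys) if isinstance(value, dict) else value
    return nd

def cast_keys_back(d, changed_keys):
    # Restore the original (non-string) keys recorded by cast_keys_to_string.
    nd = dict()
    for key, value in d.items():
        original_key = changed_keys.get(key, key)
        nd[original_key] = cast_keys_back(value, changed_keys) if isinstance(value, dict) else value
    return nd
The idea being that you cast the keys to strings before the dict is handed over and cast them back when you read it again.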
Hi @<1529633462108033024:profile|SucculentCrab55> ! In which step do you get this error? I assume get_data? Does it work locally?
It should, but please check first. This is some code I quickly made for myself. I did make tests for it, but it would be nice to hear from someone else that it worked (as evidenced by the error above 😅).
No inputs and outputs are ever set automatically 🙂 For e.g. Keras you'll have to specify them using the CLI when creating the endpoint, so Triton knows how to optimize, and also set them correctly in your preprocessing, so Triton receives the format it expects.
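As a rough sketch, the input/output specification is passed when adding the model; something along these lines, where the endpoint, model id and layer names are placeholders in the spirit of the Keras example:
clearml-serving --id <service_id> model add --engine triton --endpoint "test_model_keras" \
    --model-id <model_id> --preprocess preprocess.py \
    --input-size 1 784 --input-name "dense_input" --input-type float32 \
    --output-size -1 10 --output-name "activation_2" --output-type float32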
Nope! The Helm chart sets up all the infrastructure to run everything. What exactly to run is decided using the clearml-serving CLI. Using it, you can swap out models, set up A/B testing of different versions, do canary rollouts, etc. The Helm stack is only there to run what you defined using the CLI.
Hi @<1533257278776414208:profile|SuperiorCockroach75>
I must say I don't really know where this comes from. As far as I understand, the agent should install the packages exactly as they are saved on the task itself. Can you go to the original experiment of the pipeline step in question (you can do this by selecting the step and clicking on "Full Details" in the info panel)? There, under the execution tab, you should see which version the task detected.
The task itself will try to autodetect t...
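If the autodetected version turns out to be wrong, one possible workaround (a sketch; package name and version are placeholders) is to pin the requirement explicitly before calling Task.init:
from clearml import Task

# Pin the version recorded in the task's installed packages;
# must be called before Task.init()
Task.add_requirements("some_package", "1.2.3")
task = Task.init(project_name="examples", task_name="pinned requirements")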
The above works for me, so if you try and the command line version does not work, there might be a bug. Please post the exact commands you use when you try it 🙂
Did you by any chance save the checkpoint without any file extension? Or with a weird name containing slashes or dots? The error seems to suggest the content type was not properly parsed.
I'm able to reproduce, but your workaround seems to be the best one for now. I tried launching with clearml-task
command as well, but we have the same issue there: only argparse arguments are allowed.
AgitatedDove14 any better workaround for this, other than waiting for the jsonargparse issue to be fixed?
Nice find! I'll pass it through to the relevant devs, we'll fix that right up 🙂 Is there any feedback you have on the functionality specifically? I.e. would you use alias given what you know now, or would you e.g. name it differently?
This update was just to modernize the example itself 🙂
I can see 2 kinds of errors: Error: Failed to initialize NVML
and Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
These 2 lines make me think something went wrong with the GPU itself. Chances are you won't be able to run nvidia-smi
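A quick sanity check (assuming you run in docker with the NVIDIA runtime; the image tag is just an example) is to see whether the GPU is visible at all, both on the host and inside a container:
nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi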
This looks like a non-clearml issue 🙂 It might be that Triton hogs the GPU memory if not properly closed down (double ctrl-c). It says the driver ver...
What might also help is to look inside the Triton docker container while it's running. You can check the example: there should be a pbtxt file in there. Just to double-check that it is also in your own folder.
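Something like this could help (the container name is whatever docker ps shows for the Triton container; the /models path is an assumption based on the default Triton model repository layout):
docker ps                                              # find the Triton container name
docker exec -it <triton_container> ls -R /models       # assumed default model repository path
docker exec -it <triton_container> cat /models/<endpoint_name>/config.pbtxt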
Most likely you are running a self-hosted server. External embeds are not available for self-hosted servers due to difficult network routing and safety concerns (they need access from the public internet). The free hosted server at app.clear.ml does have them.
Hi NuttyCamel41 !
Your suspicion is correct: there should be no need to specify the config.pbtxt manually; normally this file is generated automatically from the information you provide on the command line.
It might be somehow silently failing to parse your CLI input to correctly build the config.pbtxt. One difference I see immediately is that you opted for the "[1, 64]" notation instead of the 1 64 notation from the example. Might be worth a try to change the input for...
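In other words, passing the sizes space-separated, roughly like this (the dimensions are just the ones from your message):
--input-size 1 64
instead of
--input-size "[1, 64]"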
Doing this might actually help with the previous issue as well, because when there are multiple docker containers running they might interfere with each other 🙂