Hi @<1523701118159294464:profile|ExasperatedCrab78> , I have a sad update on this issue. It does not seem to be completely solved yet. 😕 But I think I can at least describe it a bit better now:
- Models which are located on the ClearML servers (created by Task.init(..., output_uri=True)) still run perfectly.
- Models which are located on Azure Blob Storage cause different problems in different scenarios (which made me think we had resolved this issue):
  - When I start the docker con...
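To make the two scenarios concrete, this is roughly how the models are created in code (a sketch; project/task names are placeholders and the Azure URI format is my assumption based on the ClearML docs):
```python
from clearml import Task

# Scenario 1: model files are uploaded to the ClearML fileserver -- this works.
task = Task.init(project_name="my_project", task_name="train", output_uri=True)

# Scenario 2: model files go to Azure Blob Storage -- this is the problematic case.
# The URI format below is an assumption on my side, following the ClearML docs:
# task = Task.init(
#     project_name="my_project",
#     task_name="train",
#     output_uri="azure://<account>.blob.core.windows.net/<container>",
# )
```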
Hi @<1523701118159294464:profile|ExasperatedCrab78> , thanks for your answer. 🙂 Yes sure! I will create the issue right away.
Hi @<1523701205467926528:profile|AgitatedDove14> , I serialized a sklearn MinMaxScaler object, which I created on the training data, using pickle. So when serving the model I would like to load that pickle file in the preprocess script, such that I can perform the same normalization as done during training. Unless there is a better practice for applying the same normalization at training and serving time.
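Conceptually, this is what I am trying to do (a generic sketch, not the exact clearml-serving preprocess interface; the training data and file paths are placeholders):
```python
import pickle

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# --- training time: fit the scaler on the training data and serialize it ---
X_train = np.random.rand(100, 64)  # placeholder for my real training features
scaler = MinMaxScaler().fit(X_train)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# --- serving time: load the scaler inside the preprocess script ---
# (how to get this file into the serving container is the open question)
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

def preprocess(features):
    # apply exactly the same normalization as during training
    return scaler.transform(features)
```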
Hi ExasperatedCrab78 , thanks for your answer! In fact I used your recommended format for passing input and output size before and changed it in my debugging process. I have just tried again but got the same error message.
Also thanks for the hint to check the log for warnings, I will do this in a moment.
By the way, the example which worked for me in the beginning now also produces the same error: poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory . So there really seems to be something wrong with the docker containers.
Ok, so I killed all docker containers (the proposal by ChatGPT did not work for me, but your commands did). The result is that we have one less warning. The warning clearml-serving-triton | Warning: more than one valid Controller Tasks found, using Task ID=4709b0b383a04bb1a033e99fd325dcbf seems to be resolved. All remaining errors come up in the clearml-serving-triton service, and this is the log I get:
CLEARML_SERVING_TASK_ID=9309c20af9244d919b0f063642198c57
CLEARML_TRITON_POLL...
I got the last bit of my issue solved. I thought for a start it would be easier to provide the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY in my 'example.env' in plain text and not access my environment variables because I was not sure about the syntax. Turns out the syntax is not AZURE_STORAGE_KEY="mystoragekey123" but AZURE_STORAGE_KEY=mystoragekey123 . Same for AZURE_STORAGE_ACCOUNT . Also the syntax for accessing my environment variables is just the same as in the clear...
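So the relevant part of my example.env now simply looks like this (account name and key are placeholders):
```
AZURE_STORAGE_ACCOUNT=mystorageaccount
AZURE_STORAGE_KEY=mystoragekey123
```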
Yes I am running the agent by calling clearml-agent daemon --queue default in my virtual environment on my local computer.
You're very welcome, thank you again for the great support. :)) I followed the instructions of the clearml-serving README on github None . There is one section called 'Optional: advanced setup - S3/GS/Azure access'. Maybe the syntax could be added there? I also saw the additional link for configuring the storage access, but that page focuses on setting up the clearml.conf and I was not sure whether and how I could transfer it to the docker .env-file.
A...
What do you mean by "How are you creating the model?"? I executed a pytorch model training saved a traced version of the model so that saved with the executed task. This was also no problem with the docker container setup.
My pre- and postprocessing code should be correct, because it already worked when I used the docker container clearml-serving setup. But in case you want to have a look, here it is:
Ok, I have some weird update... I shut down and restarted the docker container just to get fresh logs and now I am getting the following error message by clearml-serving-triton
` clearml-serving-triton | clearml-serving - Nvidia Triton Engine Controller
clearml-serving-triton | Warning: more than one valid Controller Tasks found, using Task ID=433aa14db3f545ad852ddf846e25dcf0
clearml-serving-triton | ClearML Task: overwriting (reusing) task id=350a5a919ff648148a3de4483878...
Yes, I also find that very weird... I start the hyperparameter optimization via python code using the HyperParameterOptimizer class of clearml. Which configurations are you explicitly interested in?
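For reference, the setup is along these lines (a simplified sketch, not my exact configuration; the hyperparameter name and most values are placeholders):
```python
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange
from clearml.automation.hpbandster import OptimizerBOHB

optimizer = HyperParameterOptimizer(
    base_task_id="<template task id>",  # placeholder
    hyper_parameters=[
        # placeholder parameter, not my real search space
        UniformIntegerParameterRange("General/num_layers", min_value=2, max_value=8),
    ],
    objective_metric_title="validation",  # placeholders for my real metric
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=OptimizerBOHB,
    execution_queue="default",
    max_iteration_per_job=1000,
    total_max_jobs=20,  # placeholder
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```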
When comparing the logs of the two hpo tasks it seems like no logs of the subtasks are getting to the hpo task. So maybe this is the reason for the infinitely long running subtask? But what does the azure package have to do with that?
Hi @<1523701205467926528:profile|AgitatedDove14> you are right for the docker setup. But with the k8s setup I get the error Poll failed for model directory 'advanced_basic_classifier.pytorch': unexpected 'platform' and 'backend' pair, got:, pytorch when I do not specify the platform, which sounds like I should specify the platform.
Btw if I do not name the model after the 'model.<backend_name>' convention then I get this error
`Poll failed for model directory 'advanced_basic_classifi...
Hi @<1523701205467926528:profile|AgitatedDove14> the config.pbtxt for 1. looks like this: (because I do not specify input and output type and size within the command)
backend: "pytorch"
platform: "pytorch_libtorch"
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [1, 64]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [1, 11]
}
]
while the config.pbtxt for 2. looks like this: (because everything else ...
Hi @<1523701205467926528:profile|AgitatedDove14> , exactly!
I just tried the pytorch example from the clearml-serving repo and got the error about the wrong model name Poll failed for model directory 'test_model_pytorch': Invalid model name: Could not determine backend for model 'test_model_pytorch' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
Hi @<1523701205467926528:profile|AgitatedDove14> , now there are some interesting things happening: Like I wrote before, I got the error message, but one minute later the model was added successfully nonetheless. The log says
E0603 09:43:01.652550 41 model_repository_manager.cc:996] Poll failed for model directory 'test_model_pytorch': Invalid model name: Could not determine backend for model 'test_model_pytorch' with no backend in model configuration. Expected model name of the form 'mo...
Hi @<1523701827080556544:profile|JuicyFox94> I figured out what the problem is! For some recent experimentation I set an access_key and secret_key as environment variables in my OS. When I deleted them, everything worked fine, so the environment variables overwrote the keys given by the clearml.conf. Is that the desired default behaviour?
And just one tip for everybody having similar problems: Switch to using the SDK instead of the CLI for better debugging. This helped me to find the cause of m...
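For example, if the CLI in question is clearml-data, the equivalent SDK calls look roughly like this (a sketch; dataset names and paths are placeholders):
```python
from clearml import Dataset

# SDK equivalent of `clearml-data create/add/upload/close`
ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files(path="data/")  # placeholder path
ds.upload()
ds.finalize()
```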
Hi @<1523701087100473344:profile|SuccessfulKoala55> , thanks for your message! 🙂 I am aware that the console is also logged on the server, but I somehow find it not optimal to look for relevant information in the console log and would like to place the information in a more structured way.
Full log:
` Current configuration (clearml_agent v1.5.1, location: C:/Users/USER~1/AppData/Local/Temp/.clearml_agent.g6ysfs_g.cfg):
agent.worker_id = HPZBook:0
agent.worker_name = HPZBook
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >= '3.10'
agent.package_manager.system_site_packages = false
...
Hi @<1523701070390366208:profile|CostlyOstrich36> , I just have solved the issue! :) After calling clearml-serving create --name "model serving" the printed task id has to be filled in the values.yaml of the clearml-serving helm chart under clearml.servingTaskId. After installing the helm chart, the draft of the service task is started automatically so there is no need to manually enqueue it.
Would it be possible to add this info to the docs? Maybe a small hint on this page [None](https...
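Concretely, the relevant bit of my values.yaml now looks like this (a sketch; the ID is a placeholder for the one printed by clearml-serving create):
```yaml
clearml:
  servingTaskId: "<task id printed by clearml-serving create>"
```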
Thank you, I did not think about that. It helped a lot! I found out that the problem causing the unicode error was that only the 'python' command was set up on my Windows machine, but not the 'python3' command. For documentation purposes, this was the exact error:
DEBUG:clearml_agent.commands.worker:Searching for python3
Traceback (most recent call last):
File "C:\Users\User\venvs\clearml\lib\site-packages\clearml_agent\helper\process.py", line 204, in normalize_exception
yield
File "C:\...
Hi @<1523701435869433856:profile|SmugDolphin23> , thanks for your question. For now I just deleted the requirements.txt and let ClearML track the requirements automatically and it works. For long term I would still like to use a requirements.txt, so I will come back to this topic a little later.
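If I remember the docs correctly, pointing ClearML at the requirements file before Task.init should look something like this, but this is an assumption I still need to verify:
```python
from clearml import Task

# Assumption: add_requirements accepts a path to a requirements file and has to be
# called before Task.init -- to be verified against the clearml docs.
Task.add_requirements("requirements.txt")
task = Task.init(project_name="my_project", task_name="train")  # placeholders
```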
Hi @<1523701070390366208:profile|CostlyOstrich36> , of course! Here it is (with blurred urls, paths and account names)
This is the log of the hpo task with the newest azure-storage-blob version
The clearml-data call results in these two lines in the ingress logs. Is that sufficient or would you like to have a larger section of the log?
2024/03/26 16:07:10 [warn] 2879#2879: *1151249 upstream sent duplicate header line: "server: clearml", previous value: "Server: Werkzeug/3.0.1 Python/3.9.18", ignored while reading response header from upstream, client: ***.***.***.22, server: api.clearml.****.com, request: "GET /auth.login HTTP/1.1", upstream: "
", host: "api.clearm...
Hi @<1523701070390366208:profile|CostlyOstrich36> , thanks for your answer! I just updated the 'azure_storage_blob' package to the newest version and got some strange behaviour. When running the BOHB hyperparameter optimization, there is only one job executed and it is not stopped. I aborted the job after 3500 epochs because I set the max_iteration_per_job parameter to 1000 and the job seemed to run infinitely long. I just downgraded the package back to version 12.14.1 and everything works as b...
Hello CostlyOstrich36 , thanks for your question. At the moment I am training an MLP for a regression problem, and in one case I want to store the number of neurons per layer. Note that in my case it is not a hyperparameter because I calculate the number of neurons based on the number of layers and the number of model parameters. Another case is that I want to store some local paths where the models are stored, since I currently don't have any remote storage set up for my models.
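Something like this is what I have in mind (a sketch; the values are made up and connect_configuration is just one possible place to store them):
```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="train")  # placeholders

# derived architecture details and local storage paths -- not hyperparameters in my case
task.connect_configuration(
    {"neurons_per_layer": [64, 32, 16], "model_dir": "C:/models/run_42"},  # made-up values
    name="derived_settings",
)
```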
Hi @<1523701205467926528:profile|AgitatedDove14> , thanks for your answer! Can you tell me how specifically I can map my clearml.conf into the containers? By the way, the credentials are already set (and working) in the clearml.conf.