
Do you have any advice for this step (monitoring)? I feel like it's not very well documented.
Yeah I think it is complicated.
I would start with the example here: None
Basically what it does is create a histogram over time of the values the REST API gets. Then Grafana visualizes those values.
Notice that the request latency / frequency are automatically logged ...
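To illustrate the idea (this is not the actual example code), here is a minimal sketch of accumulating latency values into per-minute histogram buckets, the kind of data Grafana would then plot; the bucket bounds and names are made up:

```python
BUCKETS = [0.1, 0.5, 1.0, 5.0]  # hypothetical latency bucket upper bounds, in seconds

def record(histograms, timestamp, latency):
    """Accumulate one latency sample into its time bucket's histogram."""
    minute = int(timestamp // 60)  # one histogram per minute of wall time
    counts = histograms.setdefault(minute, [0] * (len(BUCKETS) + 1))
    for i, bound in enumerate(BUCKETS):
        if latency <= bound:
            counts[i] += 1
            break
    else:
        counts[-1] += 1  # overflow bucket for latencies above all bounds
    return histograms
```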
UnsightlyShark53 See if this one solves the problem :)
BTW: the reasoning for the message is that when running the task with "trains-agent", if the parsing of the argparser happens before the Task is initialized, the patching code doesn't know if it is supposed to override the values. But this scenario was fixed a long time ago, and I think the error was mistakenly left behind...
BTW: @<1673501397007470592:profile|RelievedDuck3> we just released 1.3.1 with better debugging, it prints full exception stack on failure to the clearml Serving Session Task.
I suggest you pull the latest image, re-run the docker compose, and check what you have on the serving session Task in the UI
with ?
multipart: false
secure: false
If so, can you post here your aws.s3 section of the clearml.conf? (of course replacing the actual sensitive information with *s)
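For reference, a hedged sketch of what that section of clearml.conf usually looks like (all values below are placeholders, replace them with your own):

```
sdk {
    aws {
        s3 {
            key: "****"
            secret: "****"
            region: ""
            credentials: [
                {
                    host: "my-minio-host:9000"   # placeholder endpoint
                    bucket: "my-bucket"          # placeholder bucket name
                    key: "****"
                    secret: "****"
                    multipart: false
                    secure: false
                }
            ]
        }
    }
}
```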
Hi RoundMosquito25
What do you mean by "local commits" ?
Ok... so I should generally avoid connecting complex objects? I guess I would create a 'mini dictionary' with a subset of params, and connect this instead.
In theory it should always work, but this specific one fails on a very pythonic paradigm (see below)
from copy import copy
an_object = copy(object)
A good rule of thumb is to connect any object/dict that you want to track or change later
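As a concrete (hypothetical) illustration of the 'mini dictionary' approach: extract just the scalars you care about and connect those, instead of the full object. All the names below are made up:

```python
# hypothetical full config object with many nested fields
full_config = {
    "model": {"layers": 4, "dropout": 0.1},
    "data": {"path": "/data", "shuffle": True},
    "lr": 0.001,
}

# flat 'mini dictionary' holding only the parameters to track/override
mini_params = {
    "model.layers": full_config["model"]["layers"],
    "lr": full_config["lr"],
}
# this mini dict is what you would pass to task.connect(...) instead of full_config
```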
Also btw, is this supposed to be a screenshot from the community version?
Hmm seems like screenshot from an enterprise version, I'll ask them to update 🙂
I am also not understanding how clearml-serving is handling the versioning for models in triton.
Basically you have two Tasks, one is the "controller" checking model changes and updating itself.
The other is the engine, checking on the "controller" Task, which models it needs to download/configure and replaces them.
This way you can ha...
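The sync step the engine performs can be sketched roughly like this (the function names are hypothetical; the real engine talks to the controller Task over the ClearML API):

```python
def sync_models(get_registered_models, deployed, deploy):
    """Compare the controller's model registry against what is deployed,
    and (re)deploy anything whose version changed."""
    for name, version in get_registered_models().items():
        if deployed.get(name) != version:
            deploy(name, version)     # download/configure the model in the engine
            deployed[name] = version  # remember what is now live
    return deployed
```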
Hi @<1571308003204796416:profile|HollowPeacock58>
could you share the full log ?
but does that mean I have to unpack all the dictionary values as parameters of the pipeline function?
I was just suggesting a hack 🙂 the fix itself is transparent (I'm expecting it to be pushed tomorrow), basically it will make sure the sample pipeline will work as expected.
regardless and out of curiosity, if you only have one dict passed to the pipeline function, why not use named arguments ?
BeefyCow3 On the plot itself click on the JSON download button
Hi DeliciousBluewhale87
So basically no webhooks, the idea is that you have a full API to query everything in the system and launch tasks based on any logic. You can check the slack monitoring example, it is basically doing that. Wdyt?
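A minimal sketch of that polling pattern, i.e. automation without webhooks (all function names here are hypothetical, not the actual ClearML API):

```python
def monitor(query_completed_tasks, launch_followup, seen):
    """Poll for completed tasks and launch follow-up work for new ones."""
    for task_id in query_completed_tasks():
        if task_id not in seen:       # only react to tasks we haven't handled
            launch_followup(task_id)  # e.g. clone & enqueue, send a notification
            seen.add(task_id)
    return seen
```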
Maybe the configuration file changed?
None
The logic is: if the name and project are the same, there are no artifacts/models, and the Task was created less than 72 hours ago, reuse the Task
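A rough sketch of that reuse check (the field names below are hypothetical, not the actual internal schema):

```python
from datetime import datetime, timedelta

def should_reuse(existing, name, project, now=None):
    """Return True if the existing task matches the reuse conditions."""
    now = now or datetime.utcnow()
    return (
        existing["name"] == name
        and existing["project"] == project
        and not existing["artifacts"]           # nothing was stored on it
        and not existing["models"]
        and now - existing["created"] < timedelta(hours=72)
    )
```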
Hi @<1577106212921544704:profile|WickedSquirrel54>
We are self hosting it using Docker Swarm
Nice!
and were wondering if this is something that the community would be interested in.
Always!
What did you have in mind? I have to admit I'm not familiar with the latest in Docker Swarm, but we all love Docker, the product and the company
The main issue is the model itself is stored on your files server that is/was configured to " None ", which means you cannot access it from anywhere other than the actual machine (i.e. from inside a container it is not accessible).
Change your configuration (i.e. clearml.conf) files_server: http://<Local_IP>:8081
Then re-run the example (importantly, re-run the training so a new model will be generated and registered under the new address, with the IP). Should work...
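For illustration, the relevant clearml.conf fragment might look like this (the IP below is a placeholder, replace it with your server machine's LAN IP):

```
# clearml.conf on every machine that needs access to the model files
api {
    files_server: "http://192.168.1.10:8081"
}
```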
Hi @<1686547375457308672:profile|VastLobster56>
where are you getting stuck? are you getting any errors ?
I'm assuming the reason it fails is that the docker network is only available to the specific docker compose. This means when you spin up another docker compose they do not share the same names. Just replace with the host name or IP and it should work. Notice this has nothing to do with clearml or serving; these are docker network configurations
Wait, @<1686547375457308672:profile|VastLobster56> per your config clearml-fileserver
who sets this domain name? Could it be that it is only on your host machine? You can quickly test by running any docker on your machine and running ping clearml-fileserver from inside the docker itself.
also your log showed "could not download None ..." , I would expect it to be None ...
, no?
Of course, I used "localhost"
Do not use "localhost"; use your IP. Then the model will be registered with a URL that points to the IP, and it will work
Hi, what is host?
The IP of the machine running the ClearML server
So I wonder - why should an agent be related to a specific user's credentials? Is the right way to go about this is to create a "fake user" for the sake of the agent?
Very true, you have to have credentials for the trains-agent so it can "report" to the trains-server. That said, the creator of the Task (i.e. the person who cloned it) will be registered as the "user" in the UI.
I would recommend creating an "agent" user and putting its credentials on the trains-agent machine (the same way...
And is "requirements-dev.txt" in your git root folder?
What is your clearml-agent version?
SweetGiraffe8 Task.init will autolog everything (git/python packages/console etc.) for your existing process.
Task.create purely creates a new Task in the system, and lets you manually fill in all the details on that Task
Make sense ?
Hi @<1570583227918192640:profile|FloppySwallow46>
Hey, I have a question: can you monitor the time for one pipeline?
you mean to see the start / end time of the pipeline?
Click on the details link on the right hand side and you will have all the details on the pipeline task, including running time
Hi @<1524922424720625664:profile|TartLeopard58>
Yes, this is the default; it is designed to serve multiple models and scale horizontally
Hi DeliciousBluewhale87
When you say "workflow orchestration", do you mean like a pipeline automation ?
BTW:
Error response from daemon: cannot set both Count and DeviceIDs on device request.
Googling it points to a docker issue (which makes sense considering):
https://github.com/NVIDIA/nvidia-docker/issues/1026
What is the host OS?
Okay, I'll make sure we always quote ", since it seems to work either way.
We will release an RC soon, with this fix.
Sounds good?