Ohh, then use the AWS autoscaler, it is basically what you want: it spins up an EC2 instance and sets up an agent there, and if the EC2 instance goes down (for example, if it is a spot instance), it will automatically spin it up again with the running Task on it.
wdyt?
Wait, @<1686547375457308672:profile|VastLobster56>, per your config: clearml-fileserver
Who sets this domain name? Could it be that it only resolves on your host machine? You can quickly test by running any docker container on your machine and running ping clearml-fileserver from inside the container.
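If you want to script that check, here is a minimal sketch (assumes a stock python docker image; the clearml-fileserver alias only resolves inside the same docker-compose network):

# run this inside a container, e.g.: docker run --rm -it python:3.10 python
import socket

try:
    # resolves only if this container shares the docker network defining the alias
    print("resolved:", socket.gethostbyname("clearml-fileserver"))
except socket.gaierror as err:
    print("clearml-fileserver does not resolve here:", err)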
Also, your log showed "could not download None ...", I would expect it to be None ..., no?
I'm assuming the reason it fails is that the docker network is only available to that specific docker compose. This means that when you spin up another docker compose, the two do not share the same service names. Just replace the service name with a host name or IP and it should work. Notice this has nothing to do with clearml or clearml-serving; these are docker network configurations.
The main issue is that the model itself is stored on your files server, which is/was configured to " None ", meaning you cannot access it from anywhere other than the actual machine (i.e. from inside a container it is not accessible).
Change your configuration (i.e. clearml.conf): files_server: http://<Local_IP>:8081
Then re-run the example (importantly, re-run the training so a new model is generated and registered under the new address, with the IP). Should work...
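To verify the fix, a rough sketch with the clearml SDK (the task ID is a placeholder):

from clearml import Task

task = Task.get_task(task_id="<your_training_task_id>")  # placeholder ID
for model in task.models["output"]:
    # after re-training, these URLs should start with http://<Local_IP>:8081
    print(model.name, model.url)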
Of course, I used "localhost"
Do not use "localhost" use your IP then it would be registered with a URL that points to the IP and then it will work
Hi @<1686547375457308672:profile|VastLobster56>
Where are you getting stuck? Are you getting any errors?
I ran the test, but there was no result.
What do you mean by no result? No data returned for the new query?
and this?
avg(100*increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m]))
try to break it into parts and understand what produces the error
for example:
increase(test12_model_custom:Glucose_bucket[1m])
increase(test12_model_custom:Glucose_sum[1m])
increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m])
and so on
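If it is easier to test programmatically, here is a small sketch against the Prometheus HTTP API (the Prometheus address is an assumption, adjust to your deployment):

import requests

PROM = "http://localhost:9090/api/v1/query"  # assumed Prometheus address
for query in (
    "increase(test12_model_custom:Glucose_bucket[1m])",
    "increase(test12_model_custom:Glucose_sum[1m])",
    "increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m])",
):
    resp = requests.get(PROM, params={"query": query}).json()
    # status should be "success"; an empty result list means that sub-expression returns no data
    print(query, "->", resp.get("status"), len(resp.get("data", {}).get("result", [])), "series")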
Also, what is the additional p doing in the last line of the screenshot?
Hi DepressedChimpanzee34
I think the main issue here is slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16GB of RAM, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine?
Hi DepressedChimpanzee34 , took me a while but I think there is a solution:
In your docker-compose file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
Hmm we might need more detailed logs ...
When you say there is a lag, what exactly does that mean? If you have enough apiserver instances answering the requests, the bottleneck might be the mongo or the elastic?
Hi MammothGoat53
Basically, what you are missing are the request headers carrying the token you have:
https://blog.logrocket.com/secure-rest-api-jwt-authentication/
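For illustration, a hedged sketch of the token flow against the ClearML API server (host and credentials are placeholders; assumes the standard auth.login endpoint):

import requests

api_server = "http://localhost:8008"  # placeholder, use your API server address

# exchange the access/secret key pair for a session token
resp = requests.post(api_server + "/auth.login", auth=("ACCESS_KEY", "SECRET_KEY"))
token = resp.json()["data"]["token"]

# every subsequent REST call carries the token in the Authorization header
headers = {"Authorization": "Bearer " + token}
print(requests.post(api_server + "/projects.get_all", headers=headers, json={}).json())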
Could it be it defaulted to the demo server instead of your own server?
Seems like a credentials error.
Do you have everything set up correctly in your ~/clearml.conf?
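A minimal sanity check (assumes the clearml SDK is installed): creating a task forces authentication against the configured api_server, so a bad ~/clearml.conf fails right here:

from clearml import Task

task = Task.init(project_name="debug", task_name="conf sanity check")
# the printed URL should point at your server, not the demo server
print("web page:", task.get_output_log_web_page())
task.close()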
Scheduled training is what I'm looking forward to
I'll try to remember to update here once we push it to the GitHub repo; feedback is always appreciated 🙂
If you hear nothing in the next two weeks, please ping here to make sure I did not forget 🙂
Hi ComfortableHorse5
Yes, this is more of a suggestion that you should write them using the platform capabilities; the UI implementation is being worked on, as well as a few helper classes. I think you'll be able to see a few in the next release 🙂
Hi @<1729309120315527168:profile|ShallowLion60>
ClearML in our case is installed on k8s using the helm chart (version: 7.11.0)
It should be done "automatically", I think there is a configuration var in the helm chart to configure that.
What URLs are you seeing now, and what should be there?
EnviousStarfish54 thanks again for the reproducible code, it seems this is a Web UI bug, I'll keep you updated.
The upload itself is in the background.
It should not take long to prepare the plot for sending. Are you experiencing a major delay?
Thanks EnviousStarfish54 !
EnviousStarfish54
and the 8 charts are actually identical
Are you plotting the same plot 8 times?
Hi EnviousStarfish54
After the pop up do you see the plot on the web UI?
EnviousStarfish54 what's your matplotlib version?
GrievingTurkey78 yes, you are correct on both.
Will the sweep functionality work?
Yes it should; that said, it will not use the trains-agent, so you are limited to the machine running the sweep.
If you want to do HPO on multi-node, check out this example 🙂
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
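For reference, a rough sketch of the multi-node flavor from that example (shown with the current clearml names; the older trains package exposed the same classes under trains.automation; queue name, metric, and IDs are placeholders):

from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

# controller task; each trial is cloned from the base task and executed by an agent
Task.init(project_name="examples", task_name="HPO controller", task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<base_training_task_id>",  # placeholder: the task to clone per trial
    hyper_parameters=[UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1)],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    execution_queue="default",  # agents listening on this queue run the trials on their machines
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()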
Hi PungentLouse55 ,
Yes, we have had Hydra integration on the todo list since it was first released; we actually know the guy behind Hydra, he is awesome!
What are your thoughts on the integration? We would love to get feedback and pointers (Hydra itself is quite capable, and we were waiting until we had multiple-configuration support; with v0.16 it was added, so now it is actually possible).
For example, opening a project or experiment page might take half a minute.
This implies a mongodb performance issue.
What's the size of the mongo DB?
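If you want to check quickly, a hedged sketch using pymongo (the backend and auth database names are what clearml-server uses by default, as far as I remember; adjust the host to your deployment):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed mongo address
for name in ("backend", "auth"):  # assumed clearml-server database names
    stats = client[name].command("dbstats")
    print(name, "dataSize:", stats["dataSize"], "storageSize:", stats["storageSize"])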
DepressedChimpanzee34
I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?
On "regular" load there is no need for multiple processes, and the memory consumption might be more important than reply lag (at least before you start to scale)
DisturbedWalrus17
By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot
Can you try with even more ...