I think the limit is a few GB, I'm not sure, I'll have to check
And yes the oldest experiments will be deleted first (with the exception of published experiments, they will be deleted last)
WickedGoat98 are you running the agent with --gpus ?
this is a snippet of the YML configuration I'm currently using:
`
agent-services:
  networks:
    - backend
  container_name: clearml-agent-services
  image: allegroai/clearml-agent-services:latest
  restart: unless-stopped
  privileged: true
  environment:
    CLEARML_HOST_IP: ${CLEARML_HOST_IP}
    CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
    CLEARML_API_HOST:
    CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
    CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
    CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}
    CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
    CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
    CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
    CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
    AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
    AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
    AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
    GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
    CLEARML_WORKER_ID: "clearml-services"
    CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
default-agent-services:
  networks:
    - backend
  container_name: clearml-default-agent-services
  image: allegroai/clearml-agent:latest
  restart: unless-stopped
  privileged: true
  environment:
    CLEARML_HOST_IP: ${CLEARML_HOST_IP}
    CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
    CLEARML_API_HOST:
    CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
    CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
    CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}
    CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
    CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
    CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
    CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
    AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
    AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
    AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
    GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
    CLEARML_WORKER_ID: "clearml-default"
    CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
`
AgitatedDove14 not so far. I just reuse the docker image as it is, and it doesn't use the gpu parameter at all. The next step will be to create my own image that runs the agent with this parameter, but then I ran into the error messages and the url http://apiserver:8008, which I don't understand
AgitatedDove14 if I run an agent on a remote system, which ports do I need to open for it to work with a clearml-server?
yes, this works, but just for completeness I wanted to add it to the composition ... never mind, maybe too much detail for an article 😉
AgitatedDove14 while playing with (and documenting) how to run clearml dockerized on the local machine, I noticed that the yml file https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml contains `CLEARML_API_HOST: http://apiserver:8008`
I duplicated this configuration section (agent-services) and adapted it to run the default queue agent with the image allegroai/clearml-agent:latest
I hoped to have GPU support by this but so far haven't seen the GPU usage line plot ...
I see an error in the results page when cloning an experiment:
2021-01-24 13:17:18,557 - clearml.metrics - WARNING - Failed uploading to (HTTPConnectionPool(host='apiserver', port=8081): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff7f0386f98>: Failed to establish a new connection: [Errno 111] Connection refused',)))
2021-01-24 13:17:18,557 - clearml.metrics - ERROR - Not uploading 1/10 events because the data upload failed
Test set: Average loss: 0.1259, Accuracy: 9599/10000 (96%)
9920512it [00:38, 257096.64it/s]
might it be that the configuration in the yml is wrong, as it refers to an unknown apiserver url?
Should it be `CLEARML_API_HOST: http://${CLEARML_HOST_IP}:8008`?
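A quick way to narrow this down is to check whether the name `apiserver` resolves at all from the place where the experiment actually runs. `apiserver` is a docker-compose service name, so it should normally only resolve for containers attached to the same compose network. The sketch below is only a diagnostic idea, not part of clearml; the helper name `can_resolve` is made up for illustration.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if `hostname` resolves to an address from this machine/container."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # The resolver could not find the name (e.g. not on the compose network).
        return False
    except UnicodeError:
        # Malformed names (such as an empty label) are rejected before any lookup.
        return False
```

Running `can_resolve("apiserver")` inside the container that executes the experiment and getting `False` would suggest the task is not on the compose network, in which case an address the task can reach (host IP or host name) would be needed instead.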
😞 when I edit the composition to use the configured host ip as apiserver, the queued work is never processed 😞
WickedGoat98 the agent itself can be executed on bare metal, no need to set up a docker for it (although fully supported)
Specifically, the docker compose has the docker running in services mode, i.e. for lightweight CPU tasks such as running pipelines.
If the agent is running on GPU, the easiest way is to run it on bare metal
WickedGoat98 you mean the server is on your home network and the agents are in a VPS?
If this is the case, then regular "clearml-server" port forwarding is the only thing you need.
TCP ports 8008/8080/8081
Notice that on the agents you will have to specify the address of your home IP.
I would recommend using a host name and not an IP, since the artifact/debug sample links will contain direct links into the file server, and it is always safer to have a host name rather than an IP that can change.
Even just defining the host name on your /etc/hosts will work 🙂
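To sanity check the port forwarding from the VPS side, something like the following could be run on the agent machine. It is a rough sketch: the helper names `server_urls` and `port_is_open` are invented here, and the mapping of ports to services (8080 web UI, 8008 API, 8081 file server) is the usual clearml-server default, so adjust it if your setup differs.

```python
import socket

# Assumed default clearml-server ports, keyed by the matching environment variable.
DEFAULT_PORTS = {
    "CLEARML_WEB_HOST": 8080,    # web UI
    "CLEARML_API_HOST": 8008,    # API server
    "CLEARML_FILES_HOST": 8081,  # file server (artifacts / debug samples)
}


def server_urls(host: str) -> dict:
    """Build the three server URLs for a given host name or IP."""
    return {name: f"http://{host}:{port}" for name, port in DEFAULT_PORTS.items()}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

For example, with a hypothetical host name `clearml.home.example` pointing at the router, calling `port_is_open("clearml.home.example", port)` for each value in `DEFAULT_PORTS` shows which of the three forwards are reachable from outside.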
AgitatedDove14 ok, and how much storage is an account allowed to use? Once reached, will the oldest experiments be deleted?
AgitatedDove14 the reason I'm asking is that I'm going to run the server in my home network and would like to run agents on virtual servers I run on a VPS provider.
As I have to configure my router to forward the requests to my local server, I need to know the ports and protocol settings (I expect TCP, not UDP) I have to configure
ok, thanks. This is enough information. You don't need to check how much space is provided to the accounts
WickedGoat98 no need to open any ports on the agent's machine, the agent is polling the clearml-server, so as long as it can reach it, we are good.
Can you see all the agents in the UI? (that basically means they are configured correctly and can connect to the server)
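To illustrate why polling means no inbound ports are needed on the agent side, here is a toy loop. It is not the clearml-agent implementation, only a sketch of the pattern; `poll_queue`, `fetch_next` and `run` are hypothetical names. Every request goes from the worker out to the server, so only the server has to be reachable.

```python
import time
from typing import Callable, Optional


def poll_queue(
    fetch_next: Callable[[], Optional[str]],
    run: Callable[[str], None],
    interval_sec: float = 5.0,
    max_polls: int = 3,
) -> int:
    """Ask the server for work `max_polls` times; return how many tasks were run."""
    executed = 0
    for _ in range(max_polls):
        task_id = fetch_next()  # outbound request: worker -> server
        if task_id is None:
            time.sleep(interval_sec)  # nothing queued, wait and ask again
        else:
            run(task_id)
            executed += 1
    return executed
```

In a real setup `fetch_next` would be an HTTP call to the API server and `run` would launch the experiment; here they are plain callables so the loop can be tried with an in-memory list.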
WickedGoat98 Notice this is not the "clearml-agent-services" docker but "clearml-agent" docker image
Also the default docker image is "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
Other than that quite similar :)
WickedGoat98 sure that will not be complicated:
try something along the lines of:
`
agent:
  networks:
    - backend
  container_name: clearml-agent
  image: allegroai/clearml-agent:latest
  restart: unless-stopped
  privileged: true
  environment:
    CLEARML_HOST_IP: ${CLEARML_HOST_IP}
    CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
    CLEARML_API_HOST:
    CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
    CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
    CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}
    CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
    CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
    CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
    CLEARML_AGENT_DEFAULT_BASE_DOCKER: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
    AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
    AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
    AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
    GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
    CLEARML_WORKER_ID: "clearml-agent"
    CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
`
seems I'm wrong. The queues are there, but the workers are not
WickedGoat98 Forever 🙂
The limitation is on the storage size
regarding the list of agents: yes, I can see the additional one I added in the list