it is reachable, since the training data (image data) is stored on the file server and is being fetched from there for training
can you attach the log of the task where ClearML SDK fails to resolve the output_uri?
the console output? or?
this is the way I'm attempting to access it with a model id:
from clearml import InputModel

cl_model_id = config.get("model_id")
# model = Model(model_id=cl_model_id)
model = InputModel(model_id=cl_model_id)
checkpoint = model.get_local_copy()  # download the registered weights to a local cache path
which logs are helpful? console output, fileserver, or api?
my main issue is that i can see that the model artefacts are here file:///root/.clearml/venvs-builds/3.11/task_repository/my_awesome_facility_model/checkpoint-33/scheduler.pt
which i believe is not persistent/retrievable with an artefact id
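A quick way to see where those weights were actually registered (just a sketch, reusing the model id from the snippet above, not from the original thread):

from clearml import InputModel

model = InputModel(model_id=cl_model_id)  # cl_model_id as in the earlier snippet
# url is wherever the weights were registered; a file:///root/.clearml/... path here
# means the checkpoint was only recorded as a local path and won't be retrievable later,
# while an http://<fileserver>:8081/... URL means it was uploaded to the fileserver
print(model.url)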
Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you add a code snippet that reproduces this + a log of the run?
this is how I'm initializing before calling my training function (this is inside my training_script):
task = Task.init(project_name=project_name, task_name=task_name, output_uri=True)
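If output_uri=True isn't resolving the way you expect, one thing to try is passing the fileserver URL explicitly (just a sketch; the host below is a placeholder, not your actual fileserver):

# an explicit output_uri skips the default_output_uri / files_server resolution entirely;
# replace the placeholder with the real fileserver address (default port 8081)
task = Task.init(
    project_name=project_name,
    task_name=task_name,
    output_uri="http://<fileserver-host>:8081",
)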
sure!
So this is how I'm queuing the job:
task = Task.create(
    project_name="multiclass-classifier",
    task_name="training",
    repo="reponame_url",
    branch="branch_name",
    script="training_script_name",
    packages=package_list,
    docker="python:3.11",
    docker_args="--privileged",
)
Task.enqueue(task, queue_name="services")  # services queue is the one with a remote worker
this task is currently running; I obfuscated some personal info
local clearml.conf:
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp,
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "somekey", "secret_key": "somekey"}
}
# Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
default_output_uri: ''
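One thing worth checking (a sketch, not from this thread): as far as I understand, output_uri=True should fall back to default_output_uri, and if that is empty, to the files_server entry, so printing what actually got resolved inside the job can narrow it down:

from clearml import Task

# run inside the training script, after Task.init(); shows where the SDK will
# upload models/artifacts once output_uri has been resolved
task = Task.current_task()
print(task.output_uri)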
I was away for a week, was anyone able to come up with any solutions for this?
Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you attach the log of the task where ClearML SDK fails to resolve the output_uri?
Is that URL reachable from inside the container running the ClearML task code?
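(A quick way to check from inside the task container, just a sketch with placeholder host/port for this deployment's fileserver:)

import socket

# raises if the fileserver port cannot be reached from inside the container
socket.create_connection(("<fileserver-host>", 8081), timeout=5)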
docker compose on server:
apiserver:
  command:
    - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  privileged: true
  restart: unless-stopped
  volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - ${FILESERVER_DATA_DIR}:/mnt/fileserver
  depends_on:
    - redis
    - mongo
    - elasticsearch
    - fileserver
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
    CLEARML_ELASTIC_SERVICE_PORT: 9200
    CLEARML_MONGODB_SERVICE_HOST: mongo
    CLEARML_MONGODB_SERVICE_PORT: 27017
    CLEARML_REDIS_SERVICE_HOST: redis
    CLEARML_REDIS_SERVICE_PORT: 6379
    CLEARML_SERVER_DEPLOYMENT_TYPE: linux
    CLEARML__apiserver__pre_populate__enabled: "true"
    CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
    CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
    CLEARML__services__async_urls_delete__enabled: "true"
    CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
    CLEARML__secure__credentials__services_agent__user_key: ${CLEARML_AGENT_ACCESS_KEY:-}
    CLEARML__secure__credentials__services_agent__user_secret: ${CLEARML_AGENT_SECRET_KEY:-}
  ports:
    - "8008:8008"
  networks:
    clearml-backend:
      aliases:
        - servername
        - servername.company.com
    clearml-frontend:

elasticsearch:
  networks:
    - clearml-backend
  container_name: clearml-elastic
  privileged: true
  environment:
    bootstrap.memory_lock: "true"
    cluster.name: clearml
    cluster.routing.allocation.node_initial_primaries_recoveries: "500"
    cluster.routing.allocation.disk.watermark.low: 500mb
    cluster.routing.allocation.disk.watermark.high: 500mb
    cluster.routing.allocation.disk.watermark.flood_stage: 500mb
    discovery.type: "single-node"
    http.compression_level: "7"
    node.name: clearml
    reindex.remote.whitelist: "'*.*'"
    xpack.security.enabled: "false"
  ulimits:
    memlock:
      soft: -1
      hard: -1
    nofile:
      soft: 65536
      hard: 65536
  image: docker.elastic.co/elasticsearch/elasticsearch:7.17.18
  restart: unless-stopped
  volumes:
    - ${ELASTICSEARCH_DATA_DIR}:/usr/share/elasticsearch/data
    - /usr/share/elasticsearch/logs
  # ports:
  #   - "8715:9200"

fileserver:
  networks:
    clearml-backend:
      aliases:
        - magrathea
        - magrathea.ghgsat.com
    clearml-frontend:
  command:
    - fileserver
  container_name: clearml-fileserver
  image: allegroai/clearml:latest
  privileged: true
  environment:
    # CLEARML__secure__credentials__fileserver__user_key: ${CLEARML_AGENT_ACCESS_KEY:-}
    # CLEARML__secure__credentials__fileserver__user_secret: ${CLEARML_AGENT_SECRET_KEY:-}
    CLEARML__fileserver__delete__allow_batch: "true"
    CLEARML__fileserver__auth__enabled: "false"
  restart: unless-stopped
  volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - ${FILESERVER_DATA_DIR}:/mnt/fileserver
    - ${CONFIG_DIR}:/opt/clearml/config
  ports:
    - "8081:8081"

mongo:
  networks:
    - clearml-backend
  container_name: clearml-mongo
  image: mongo:4.4.29
  privileged: true
  restart: unless-stopped
  command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
  volumes:
    - ${MONGO_DATA_DIR}/db:/data/db
    - ${MONGO_DATA_DIR}/configdb:/data/configdb
  ports:
    - "8714:27017"

redis:
  networks:
    - clearml-backend
  container_name: clearml-redis
  privileged: true
  image: redis:6.2
  restart: unless-stopped
  volumes:
    - ${REDIS_DATA_DIR}:/data
  ports:
    - "8713:6379"

webserver:
  command:
    - webserver
  container_name: clearml-webserver
  privileged: true
  # environment:
  #   CLEARML_SERVER_SUB_PATH: clearml-web # Allow Clearml to be served with a URL path prefix.
  image: allegroai/clearml:latest
  restart: unless-stopped
  depends_on:
    - apiserver
  ports:
    - "8612:80"
  networks:
    - clearml-backend
    - clearml-frontend

async_delete:
  depends_on:
    - apiserver
    - redis
    - mongo
    - elasticsearch
    - fileserver
  container_name: async_delete
  image: allegroai/clearml:latest
  privileged: true
  networks:
    - clearml-backend
  restart: unless-stopped
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
    CLEARML_ELASTIC_SERVICE_PORT: 9200
    CLEARML_MONGODB_SERVICE_HOST: mongo
    CLEARML_MONGODB_SERVICE_PORT: 27017
    CLEARML_REDIS_SERVICE_HOST: redis
    CLEARML_REDIS_SERVICE_PORT: 6379
    PYTHONPATH: /opt/clearml/apiserver
    CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
  entrypoint:
    - python3
    - -m
    - jobs.async_urls_delete
    - --fileserver-host
    -
  volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config

agent-services:
  networks:
    clearml-backend:
      aliases:
        - servername
        - servername.company.com
  container_name: clearml-agent-services
  image: allegroai/clearml-agent-services:latest
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  environment:
    CLEARML__AGENT__FORCE_GIT_SSH_PROTOCOL: true
    CLEARML__AGENT__ENABLE_GIT_ASK_PASS: true
    CLEARML__AGENT__GIT_HOST: repo
    CLEARML_WEB_HOST:
    CLEARML_API_HOST:
    CLEARML_FILES_HOST:
    CLEARML_API_ACCESS_KEY: ${CLEARML_AGENT_ACCESS_KEY}
    CLEARML_API_SECRET_KEY: ${CLEARML_AGENT_SECRET_KEY}
    CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
    CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
    CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
    CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
    AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
    # AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
    # AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
    GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
    CLEARML_WORKER_ID: "clearml-services"
    CLEARML_AGENT_DOCKER_HOST_MOUNT: "${AGENT_DIR}:/root/.clearml"
    SHUTDOWN_IF_NO_ACCESS_KEY: 1
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ${AGENT_DIR}:/root/.clearml
    - /var/cache/analytics_shared:/var/cache/analytics_shared
  depends_on:
    - apiserver
  entrypoint: >
    bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused '
    ' && /usr/agent/entrypoint.sh"

networks:
  clearml-backend:
    driver: bridge
  clearml-frontend:
    driver: bridge
This issue was solved by adding task.output_uri = "fileserver" in the scheduling script. For some reason this does not work when set in the Task.create call in the same script; it needs to be set after. It also doesn't work when set in the training script, so there must have been some unknown override.
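For reference, a minimal sketch of the order of operations that worked (the fileserver URL below is a placeholder, not the actual value used here):

task = Task.create(
    project_name="multiclass-classifier",
    task_name="training",
    repo="reponame_url",
    branch="branch_name",
    script="training_script_name",
    packages=package_list,
    docker="python:3.11",
    docker_args="--privileged",
)
# setting output_uri after Task.create() is what took effect in this case;
# passing it inside Task.create() or setting it in the training script did not stick
task.output_uri = "http://<fileserver-host>:8081"  # placeholder fileserver URL
Task.enqueue(task, queue_name="services")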