Hi Everyone, I'M Having Trouble Setting My Output_Uri Such That My Model Checkpoints Are Saved Outside Of The Venv And Accessible Via Id Or For Download. Im Running Clear-Ml On A Remote Server Through Docker And I Believe Clearml Is Unable To Resolve The

Answered

Hi everyone, I'm having trouble setting my output_uri such that my model checkpoints are saved outside of the venv and accessible via id or for download. Im running clear-ml on a remote server through docker and i believe clearml is unable to resolve the fileserver uri internally (i believe this as i had a similar issue accessing my data during training which was resolved by adding alias names in the docker file for the clear-ml backend). The debug samples tab is showing: fileserver ’s server IP address could not be found.

default_output_dir set in conf as None and Task.init(output_uri=True) - inside training script

experiment is kicked off with a separate script the uses task create from github url and enqueue

Thanks in advance! wracking my head on this for days

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

Votes Newest

Answers 13

Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you add a code snippet that reproduces this + a log of the run?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					CostlyOstrich36
				
					0

I was away for a week, was anyone able to come up with nay solutions for this?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

which logs are helpful? console output, fileserver, or api?

my main issue is that i can see that the model artefacts are here file:///root/.clearml/venvs-builds/3.11/task_repository/my_awesome_facility_model/checkpoint-33/scheduler.pt
which i believe is not persistent/retrievable with an artefact id

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

can you attach the log of the task where ClearML SDK fails to resolve the output_uri?

the console output? or?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

docker compose on server:


  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    privileged: true
    restart: unless-stopped
    volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - ${FILESERVER_DATA_DIR}:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: linux
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
      CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
      CLEARML__secure__credentials__services_agent__user_key: ${CLEARML_AGENT_ACCESS_KEY:-}
      CLEARML__secure__credentials__services_agent__user_secret: ${CLEARML_AGENT_SECRET_KEY:-}
    ports:
    - "8008:8008"
    networks:
      clearml-backend:
          aliases:
            - servername
            - servername.company.com
      clearml-frontend:

  elasticsearch:
    networks:
      - clearml-backend
    container_name: clearml-elastic
    privileged: true
    environment:
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.type: "single-node"
      http.compression_level: "7"
      node.name: clearml
      reindex.remote.whitelist: "'*.*'"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.18
    restart: unless-stopped
    volumes:
      - ${ELASTICSEARCH_DATA_DIR}:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs
    # ports:
    # - "8715:9200"

  fileserver:
    networks:
      clearml-backend:
          aliases:
            - magrathea
            - magrathea.ghgsat.com
      clearml-frontend:
    command:
    - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:latest
    privileged: true
    environment:
      # CLEARML__secure__credentials__fileserver__user_key: ${CLEARML_AGENT_ACCESS_KEY:-}
      # CLEARML__secure__credentials__fileserver__user_secret: ${CLEARML_AGENT_SECRET_KEY:-}
      CLEARML__fileserver__delete__allow_batch: "true"
      CLEARML__fileserver__auth__enabled: "false"
    restart: unless-stopped
    volumes:
    - ${LOGS_DIR}:/var/log/clearml
    - ${FILESERVER_DATA_DIR}:/mnt/fileserver
    - ${CONFIG_DIR}:/opt/clearml/config
    ports:
    - "8081:8081"

  mongo:
    networks:
      - clearml-backend
    container_name: clearml-mongo
    image: mongo:4.4.29
    privileged: true
    restart: unless-stopped
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
    volumes:
    - ${MONGO_DATA_DIR}/db:/data/db
    - ${MONGO_DATA_DIR}/configdb:/data/configdb
    ports:
    - "8714:27017"

  redis:
    networks:
      - clearml-backend
    container_name: clearml-redis
    privileged: true
    image: redis:6.2
    restart: unless-stopped
    volumes:
    - ${REDIS_DATA_DIR}:/data
    ports:
    - "8713:6379"

  webserver:
    command:
    - webserver
    container_name: clearml-webserver
    privileged: true
    # environment:
    #  CLEARML_SERVER_SUB_PATH : clearml-web # Allow Clearml to be served with a URL path prefix.
    image: allegroai/clearml:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "8612:80"
    networks:
      - clearml-backend
      - clearml-frontend

  async_delete:
    depends_on:
      - apiserver
      - redis
      - mongo
      - elasticsearch
      - fileserver
    container_name: async_delete
    image: allegroai/clearml:latest
    privileged: true
    networks:
      - clearml-backend
    restart: unless-stopped
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      PYTHONPATH: /opt/clearml/apiserver
      CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
    entrypoint:
      - python3
      - -m
      - jobs.async_urls_delete
      - --fileserver-host
      -


    volumes:
      - ${LOGS_DIR}:/var/log/clearml
      - /opt/clearml/config:/opt/clearml/config

  agent-services:
    networks:
      clearml-backend:
        aliases:
            - servername
            - servername.company.com
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    deploy:
      restart_policy:
        condition: on-failure
    privileged: true
    environment:
      CLEARML__AGENT__FORCE_GIT_SSH_PROTOCOL: true
      CLEARML__AGENT__ENABLE_GIT_ASK_PASS: true
      CLEARML__AGENT__GIT_HOST: repo
      CLEARML_WEB_HOST:


      CLEARML_API_HOST:


      CLEARML_FILES_HOST:


      CLEARML_API_ACCESS_KEY: ${CLEARML_AGENT_ACCESS_KEY}
      CLEARML_API_SECRET_KEY: ${CLEARML_AGENT_SECRET_KEY}
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      # AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      # AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "${AGENT_DIR}:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ${AGENT_DIR}:/root/.clearml
      - /var/cache/analytics_shared:/var/cache/analytics_shared
    depends_on:
      - apiserver
    entrypoint: >
      bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused '

' && /usr/agent/entrypoint.sh"

networks:
  clearml-backend:
    driver: bridge
  clearml-frontend:
    driver: bridge

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

sure!

So this is how im queuing the job:

task = Task.create(
    project_name="multiclass-classifier",
    task_name="training",
    repo="reponame_url",
    branch="branch_name",
    script='training_script_name',
    packages=package_list,
    docker="python:3.11",
    docker_args="--privileged"

)
task.enqueue(task, "services")  # services queue is the one with a remote worker

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

Hi @<1730033904972206080:profile|FantasticSeaurchin8> , can you attach the log of the task where ClearML SDK fails to resolve the output_uri?
Is the None URL reachable from inside the container running the ClearML task code?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

way im attempting to access with an id

cl_model_id = config.get("model_id")
    #model = Model(model_id=cl_model_id)
    model = InputModel(model_id=cl_model_id)
    checkpoint = model.get_local_copy()

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

local conf

api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:


    web_server:


    files_server:


    # Credentials are generated using the webapp,


    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "somekey", "secret_key": "somekey"}
}

 # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri:'

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

it is reachable to get data for training as image data is saved on the file server

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

This issue was solved by adding task.output_uri = "fileserver" in the scheduling script, but for some reason this does not work when setting in task.create call in the same script but needs to be set after. it also doesnt work when being set in the training script, so there must have been some unknown overriding

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

this is how im initializing before calling my training function. this is inside my training_script:

task = Task.init(project_name=project_name, task_name=task_name, output_uri=True)

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

this task is currently running, i obfuscated some personal info

  				
Posted 
	one year ago

					More
				  		
  Report
		
					FantasticSeaurchin8
				
					0
					 × 1

Write your answer

1K Views

13 Answers

one year ago