Answered
Hi Community! I'M Facing An Issue With A Self-Hosted Clearml Server. I Modified The Docker-Compose File So To Have All The Volumes Mounted In A Specific Location (

Hi community! I'm facing an issue with a self-hosted ClearML server.
I modified the docker-compose file so as to have all the volumes mounted in a specific location ( None ), but when I run docker-compose and inspect docker-compose ps, I see that clearml-elastic, mongo and redis are always restarting:

# docker-compose ps
WARNING: The ELASTIC_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_HOST_IP variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_AGENT_GIT_USER variable is not set. Defaulting to a blank string.
WARNING: The CLEARML_AGENT_GIT_PASS variable is not set. Defaulting to a blank string.
WARNING: Some services (agent-services) use the 'deploy' key, which will be ignored. Compose does not support 'deploy' configuration - use `docker stack deploy` to deploy to a swarm.
         Name                       Command                 State                                    Ports
-----------------------------------------------------------------------------------------------------------------------------------------
async_delete             python3 -m jobs.async_urls ...   Up           8008/tcp, 8080/tcp, 8081/tcp
clearml-agent-services   bash -c curl --retry 10 -- ...   Up
clearml-apiserver        /opt/clearml/wrapper.sh ap ...   Up           0.0.0.0:8008->8008/tcp,:::8008->8008/tcp, 8080/tcp, 8081/tcp
clearml-elastic          /bin/tini -- /usr/local/bi ...   Restarting
clearml-fileserver       /opt/clearml/wrapper.sh fi ...   Up           8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp,:::8081->8081/tcp
clearml-mongo            docker-entrypoint.sh --set ...   Restarting
clearml-redis            docker-entrypoint.sh redis ...   Restarting
clearml-webserver        /opt/clearml/wrapper.sh we ...   Up           0.0.0.0:8080->80/tcp,:::8080->80/tcp, 8008/tcp, 8080/tcp, 8081/tcp

and the web interface is not working: it shows a "Server Unavailable" message.
Any clue why this is happening? (The ownership of the target directory has been changed with sudo chown -R 1000:1000.)

Many thanks!

  
  
Posted 12 months ago

Answers 4


Also, I think getting the logs for the restarting docker container will help
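As a rough sketch of how those logs could be collected, the helper below loops over container names and prints the tail of each one's log. The function name `show_container_logs` and the 100-line limit are illustrative choices, not anything from the ClearML docs; the container names in the example come from the `docker-compose ps` output in the question.

```shell
# Print the most recent log lines for each container name given as an argument.
# show_container_logs is a hypothetical helper name used only for this sketch.
show_container_logs() {
  for name in "$@"; do
    printf '===== %s =====\n' "$name"
    # --tail limits the output to the last 100 lines; 2>&1 merges stderr,
    # which is where containers often write their errors.
    docker logs --tail 100 "$name" 2>&1
  done
}

# Example, using the container names shown as "Restarting" in the question:
# show_container_logs clearml-elastic clearml-mongo clearml-redis
```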

  
  
Posted 12 months ago

Hi @<1523701087100473344:profile|SuccessfulKoala55>, this seems to be related to some permission setting, as I am getting permission denied errors even though I ran the chown command as described in the tutorial.
Here are some snippets from the logs, and here is also the diff between the original docker-compose and the modified one -> None

ELASTIC

/usr/share/elasticsearch/bin/elasticsearch-env: line 158: cannot create temp file for here-document: Permission denied
/usr/share/elasticsearch/bin/elasticsearch-env: line 158: cannot create temp file for here-document: Permission denied
Exception in thread "main" java.nio.file.AccessDeniedException: /tmp/elasticsearch-17994849695337179062
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:397)
        at java.base/java.nio.file.Files.createDirectory(Files.java:700)
        at java.base/java.nio.file.TempFileHelper.create(TempFileHelper.java:134)
        at java.base/java.nio.file.TempFileHelper.createTempDirectory(TempFileHelper.java:171)
        at java.base/java.nio.file.Files.createTempDirectory(Files.java:1017)
        at org.elasticsearch.tools.launchers.Launchers.createTempDirectory(Launchers.java:55)
        at org.elasticsearch.tools.launchers.TempDirectory.main(TempDirectory.java:43)
WEBSERVER 

10.169.8.21 - - [15/Nov/2023:07:55:08 +0000] "POST /xmlrpc.php HTTP/1.1" 405 157 "-" "-"
2023/11/15 07:55:08 [error] 11#11: *44 connect() failed (111: Connection refused) while connecting to upstream, client: 10.169.8.21, server: _, request: "GET /api/getServices?name[]=$() HTTP/1.1", upstream: "", host: "10.169.8.196:8080"
2023/11/15 07:55:08 [error] 11#11: *44 open() "/usr/share/nginx/html/50x.html" failed (2: No such file or directory), client: 10.169.8.21, server: _, request: "GET /api/getServices?name[]=$() HTTP/1.1", upstream: "", host: "10.169.8.196:8080"
APISERVER

[2023-11-16 14:06:08,791] [9] [INFO] [clearml.redis_manager] Using override redis host redis
[2023-11-16 14:06:08,792] [9] [INFO] [clearml.redis_manager] Using override redis port 6379
[2023-11-16 14:06:08,804] [9] [INFO] [clearml.es_factory] Using override elastic host elasticsearch
[2023-11-16 14:06:08,806] [9] [INFO] [clearml.es_factory] Using override elastic port 9200
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 707, in connect
    sock = self.retry.call_with_retry(
  File "/usr/local/lib/python3.9/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 708, in <lambda>
    lambda: self._connect(), lambda error: self.disconnect(error)
  File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 974, in _connect
    for res in socket.getaddrinfo(
  File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/clearml/apiserver/server.py", line 6, in <module>
    from apiserver.server_init.app_sequence import AppSequence
  File "/opt/clearml/apiserver/server_init/app_sequence.py", line 9, in <module>
    from apiserver.bll.queue.queue_metrics import MetricsRefresher
  File "/opt/clearml/apiserver/bll/queue/__init__.py", line 1, in <module>
    from .queue_bll import QueueBLL
  File "/opt/clearml/apiserver/bll/queue/queue_bll.py", line 12, in <module>
    from apiserver.bll.queue.queue_metrics import QueueMetrics
  File "/opt/clearml/apiserver/bll/queue/queue_metrics.py", line 22, in <module>
    redis = redman.connection("apiserver")
  File "/opt/clearml/apiserver/redis_manager.py", line 80, in connection
    obj.get("health")
  File "/usr/local/lib/python3.9/site-packages/redis/commands/core.py", line 1816, in get
    return self.execute_command("GET", name)
  File "/usr/local/lib/python3.9/site-packages/redis/client.py", line 1266, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 1461, in get_connection
    connection.connect()
  File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 713, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error -3 connecting to redis:6379. Temporary failure in name resolution.
FILESERVER
[2023-11-15 07:55:23,884] [8] [ERROR] [werkzeug] 10.169.8.21 - - [15/Nov/2023 07:55:23] code 400, message Bad request syntax ('\x16\x03\x03\x01\x8f\x01\x00\x01\x8b\x03\x03\x91\x15(\x9dLÃ,½lv·¦Ü<Y¬ÉÄ')
[2023-11-15 07:55:24,474] [8] [ERROR] [werkzeug] 10.169.8.21 - - [15/Nov/2023 07:55:24] code 400, message Bad request version ("X£¢m[L¬ÚW¹\x81\x95üåc½MoÍ´\x9fF\x00f\x13\x02\x13\x01À,À+À0À/\x00\x9f\x00£\x00\x9e\x00¢À$À(À#À'\x00k\x00j\x00g\x00@À.À2À-À1À&À*À%À)À")
MONGO
{"t":{"$date":"2023-11-16T14:08:10.267+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Ubuntu","version":"20.04"}}}
{"t":{"$date":"2023-11-16T14:08:10.267+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"net":{"bindIp":"*"},"setParameter":{"internalQueryMaxBlockingSortMemoryUsageBytes":"196100200"}}}}
{"t":{"$date":"2023-11-16T14:08:10.267+00:00"},"s":"E",  "c":"STORAGE",  "id":20568,   "ctx":"initandlisten","msg":"Error setting up listener","attr":{"error":{"code":9001,"codeName":"SocketException","errmsg":"Permission denied"}}}
{"t":{"$date":"2023-11-16T14:08:10.267+00:00"},"s":"I",  "c":"REPL",     "id":4784900, "ctx":"initandlisten","msg":"Stepping down the ReplicationCoordinator for shutdown","attr":{"waitTimeMillis":10000}}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"COMMAND",  "id":4784901, "ctx":"initandlisten","msg":"Shutting down the MirrorMaestro"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"SHARDING", "id":4784902, "ctx":"initandlisten","msg":"Shutting down the WaitForMajorityService"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"NETWORK",  "id":4784905, "ctx":"initandlisten","msg":"Shutting down the global connection pool"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"NETWORK",  "id":4784918, "ctx":"initandlisten","msg":"Shutting down the ReplicaSetMonitor"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"SHARDING", "id":4784921, "ctx":"initandlisten","msg":"Shutting down the MigrationUtilExecutor"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"CONTROL",  "id":4784925, "ctx":"initandlisten","msg":"Shutting down free monitoring"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"STORAGE",  "id":4784927, "ctx":"initandlisten","msg":"Shutting down the HealthLog"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"STORAGE",  "id":4784929, "ctx":"initandlisten","msg":"Acquiring the global lock for shutdown"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"-",        "id":4784931, "ctx":"initandlisten","msg":"Dropping the scope cache for shutdown"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"FTDC",     "id":4784926, "ctx":"initandlisten","msg":"Shutting down full-time data capture"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"CONTROL",  "id":20565,   "ctx":"initandlisten","msg":"Now exiting"}
{"t":{"$date":"2023-11-16T14:08:10.268+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"initandlisten","msg":"Shutting down","attr":{"exitCode":48}}
{"t":{"$date":"2023-11-16T14:09:10.717+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2023-11-16T14:09:10.719+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2023-11-16T14:09:10.720+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2023-11-16T14:09:10.720+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":1,"port":27017,"dbPath":"/data/db","architecture":"64-bit","host":"54deb300813e"}}
{"t":{"$date":"2023-11-16T14:09:10.720+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.9","gitVersion":"b4048e19814bfebac717cf5a880076aa69aba481","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distmod":"ubuntu2004","distarch":"x86_64","target_arch":"x86_64"}}}}
REDIS

1:M 16 Nov 2023 14:09:19.017 * Running mode=standalone, port=6379.
1:M 16 Nov 2023 14:09:19.017 # Server initialized
1:M 16 Nov 2023 14:09:19.017 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 16 Nov 2023 14:09:19.017 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 16 Nov 2023 14:09:19.017 # Fatal error loading the DB: Permission denied. Exiting.
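Since several of these logs report "Permission denied", one way to double-check that the chown actually took effect on every file is to list anything under the mounted directory that is not owned by the expected uid:gid. This is only a minimal sketch: the function name `list_wrong_owner` and the path in the example are placeholders (the real target directory is not shown in this thread), 1000:1000 is the value from the chown command in the question, and an empty result would not rule out other causes of the errors.

```shell
# List every entry under DIR whose owner or group differs from the expected ids.
# Usage: list_wrong_owner DIR [UID] [GID]   (ids default to 1000, as in the chown above)
# list_wrong_owner is a hypothetical helper name used only for this sketch.
list_wrong_owner() {
  dir="$1"
  uid="${2:-1000}"
  gid="${3:-1000}"
  # -uid / -gid match numeric ids; an entry is printed if either one does not match.
  find "$dir" \( ! -uid "$uid" -o ! -gid "$gid" \) -print
}

# Example with a placeholder path (replace with the real volume location):
# list_wrong_owner /path/to/clearml-data
```

No output means every entry matches the expected owner and group; any printed path is a candidate for a closer look.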
  
  
Posted 11 months ago

Hi @<1630014842674876416:profile|ObnoxiousBluewhale46> , which fields did you change exactly in the docker-compose?

  
  
Posted 12 months ago

Here's a docker-compose I've been playing with. It doesn't have the same restart problem you're describing, but I did change the volume mounts: None

  
  
Posted 12 months ago