
Morning, we got to 100% used, which is what triggered this investigation. When we initially looked at overlay2 it was using 8GB, so now it seems to be acceptable.
thanks @<1523715084633772032:profile|AlertBlackbird30> this is really informative. Nothing seems to be particularly out of the ordinary though
3.7G /var/lib/
3.7G /var/lib/docker
3.0G /var/lib/docker/overlay2
followed by a whole load of files that are a few hundred KBs in size, nothing huge though
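(If it helps narrow things down, Docker itself can break down what is taking the space - standard Docker CLI, nothing ClearML-specific; a quick sketch:)
```
# Summarise how much space images, containers, volumes and build cache use
docker system df

# Verbose per-image / per-container breakdown
docker system df -v

# Attribute overlay2 usage to individual layer directories
sudo du -sh /var/lib/docker/overlay2/* | sort -h | tail -20
```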
That should be the case; we have default_output_uri set to an S3 bucket.
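(For reference, the relevant clearml.conf entry looks roughly like this - a minimal sketch, the bucket name is just a placeholder:)
```
sdk {
    development {
        # Artifacts and models are uploaded here instead of the built-in fileserver
        default_output_uri: "s3://my-clearml-bucket/artifacts"
    }
}
```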
yep, in most of them:
/opt/clearml/config
apiserver.conf
clearml.conf
/opt/clearml/data/elastic_7
/nodes
/opt/clearml/data/fileserver
<empty>
/opt/clearml/data/mongo/configdb
<empty>
/opt/clearml/data/mongo/db
collection/index files, /diagnostic.data, /journal etc
/opt/clearml/data/redis
dump.rdb
/opt/clearml/logs
apiserver.log.x, fileserver.log (0 bytes)
Not entirely sure on this, as we used the custom AMI solution. Is there any documentation on it?
is there any documentation for connecting to an S3 bucket?
Is it possible to use an IAM role rather than user credentials in the clearml.conf file?
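(For what it's worth, recent clearml.conf versions expose a use_credentials_chain flag under sdk.aws.s3, which lets boto3 fall back to its default credential chain - i.e. an EC2 instance profile / IAM role - instead of hard-coded keys. A minimal sketch, worth verifying against the docs for your SDK version:)
```
sdk {
    aws {
        s3 {
            # Leave key/secret empty and let boto3 resolve credentials via its
            # default chain: env vars, shared config, or the instance IAM role
            use_credentials_chain: true
            key: ""
            secret: ""
            region: ""
        }
    }
}
```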
Thanks Stef. With max-size, do you set it for every running service separately, or can you set it once?
Hey @<1523701205467926528:profile|AgitatedDove14>, I am helping Max to get this working. I ran clearml-agent init and now have the correct entries in the clearml.conf file.
Created an SSH key on the agent and uploaded it to the git repo, but I'm still getting this error:
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
If I manually SSH onto the agent and run:
`git clon...
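(A common cause of "Host key verification failed" is that the git host isn't in known_hosts for the user the agent runs as. A rough sketch of the usual fix, assuming the repo lives on github.com - substitute your actual git host:)
```
# Run as the same user the clearml-agent runs under
mkdir -p ~/.ssh
ssh-keyscan -H github.com >> ~/.ssh/known_hosts

# Sanity check: should authenticate without a host-key prompt
ssh -T git@github.com
```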
No, they are still rebooting. I've looked in /opt/clearml/logs/apiserver.log and there are no errors.
Thanks, I'll try that out.
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
not yet, going to try and fix it today.
If I do a df, I see this, which is concerning:
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 928K 3.9G 1% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/nvme0n1p1 20G 7.9G 13G 40% /
tmpfs 790M 0 790M 0% /run/user/1000
so it looks like the mount points are not created. When do these g...
Strange, I used one of the publicly available AMIs for ClearML (we did not upgrade from the Trains AMI, as we started fresh).
Hi @<1523701205467926528:profile|AgitatedDove14>
Yes the clearml-server AMI - we want to be able to back it up and encrypt it on our account
It looks like clearml-apiserver and clearml-fileserver are continually restarting.
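(A quick way to confirm which containers are crash-looping and see the actual error - standard Docker commands; the container names below match the default ClearML compose file:)
```
# Show status / restart state of all containers
docker ps -a --format "table {{.Names}}\t{{.Status}}"

# Tail the logs of the restarting services
docker logs --tail 200 clearml-apiserver
docker logs --tail 200 clearml-fileserver
```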
yep still referring to the S3 credentials, somewhat familiar with boto and IAM roles
or have I got this wrong, and it's the clearml-agent that needs to read/write to S3?
No, that's what I'm trying to do.
Yep, I've done all that; it didn't seem to work until I gave the deploy key write access.
After making the change yesterday to the docker-compose file, the server is completely unusable - this is all I see for the /dashboard screen
So am I right in thinking it's just the mount points that are missing, based on the output of df above?
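(One way to check that directly: ask df which filesystem backs the data directory - if /opt/clearml isn't shown as its own mount point, it's just sitting on the root volume:)
```
# Shows which filesystem contains /opt/clearml (the root volume if there's no dedicated mount)
df -h /opt/clearml

# And what is actually in the data directory
ls -la /opt/clearml/data
```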
Thank you very much 🙂 I don't think our Data team ever uses this container, so I will stop it for now and comment it out of the compose file.
Thanks. Although it's AWS related, the context was an error we see within ClearML: "ValueError: Insufficient permissions for None".
Is there a way you can allow our account to make a copy of the AMI and store it privately?
I added this to each of the containers:
logging:
  options:
    max-file: "5"
    max-size: "10m"
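(On the earlier question of setting it once rather than per service: compose files that support extension fields (format 3.4+) let you use a YAML anchor, so something like the sketch below should work - the service names here are illustrative, match them to your compose file. The same limits can also be set daemon-wide via log-opts in /etc/docker/daemon.json.)
```
x-default-logging: &default_logging
  driver: json-file
  options:
    max-file: "5"
    max-size: "10m"

services:
  apiserver:
    logging: *default_logging
  fileserver:
    logging: *default_logging
```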