Reputation
Badges 1
35 × Eureka!the problem was docker, that had as entrypoint a bash script with python train.py --epochs=300
hardcoded, so I guess it was never acutally running the task setup from clearml.
Hi AgitatedDove14 , I’m talking about the following pip install.
After that pip install, it displays agent’s conf, shows installed packages, and launches the task (no installation)
` Running in Docker mode (v19.03 and above) - using default docker image: spoter ['-e CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1', '-e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1']
Running task '3ebb680b17874cda8dc7878ddf6fa735'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.tsu2tddl.txt', '/tmp/.clearml_agent_o...
how do I mount my local ssh folder into /root/.ssh/
docker when running clearml-agent?
also, is there a way for it to not install the requirements, and simply run the task?
would it be possible to change de dataset.add_files to some function that moves your files to a common folder (local or cloud), and then use the last step in the dag to create the dataset using that folder?
I’m suggesting MagnificentWorm7 to do that yes, instead of adding the files to a ClearML dataset in each step
That’s why I’m suggesting him to do that 🙂
oh but docker-ps
shows me 8081 ports for webserver, apiserver and fileserver containers
` CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0b3f563d04af allegroai/clearml:latest "/opt/clearml/wrappe…" 7 minutes ago Up 7 minutes 8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp, :::8080->80/tcp clear...
Currently I’m changing /opt/ for my home folder
I also changed the permissions of /usr/share/elasticsearch
according to this post: https://techoverflow.net/2020/04/18/how-to-fix-elasticsearch-docker-accessdeniedexception-usr-share-elasticsearch-data-nodes/ , but I’m getting the same error
ok, I entered the container, replaced all 8081 to 8085 in every file, commited the container and changed the docker-compose.yml
to use that image instead of the allegroai/clearml:latest
and now it works 🙂
so I can run the experiments, I can see them, but no plots are saved because there is an upload problem when uploading to localhost:8085
there is no /usr/share/elasticsearch/logs/clearml.log
file (neither inside the container nor in my server)
exactly, somewhere in the docker running
I mean in the clearml-server docker
it would be easier for a sysadmin to center the credentials of the bucket in the clearml-server, without the need to distribute them…every user in the server has the same credentials, and they don’t need to know them..makes sense?
no because every user that is trying to write in the bucket has the same credentials
yes, I’m talking about AWS/GCP only
Hey! When you say it wasn’t enough, what do you mean? Can you launch the web UI?
ok, but if you were to run it from a different machine (or a different user!) it wouldn’t work
Hi! What the error is saying is that it is looking for the the ctbc/image_classification_CIFAR10.py
file in your repo.
So when you created the task you were inside a git repo, and ClearML assumed that all your files in it were commited and pushed. However your repo https://github.com/gradient-ai/PyTorch.git doesn’t contain these files
where is the dataset stored? maybe you deleted the credentials by mistake? or maybe you are not installing the libraries needed (for example if using AWS you need boto3, if GCP you need google-cloud-storage)
Is not direcly cached in the ~/.clearml
folder. There are some directories inside (one for storage, one for pip, another for venvs, etc.
So in your case it would be stored in ~/.clearml/cache/storage_manager/datasets/ds_{ds_id}/my_file.json
can you share you clearml.conf file? it should do that automatically if you set the development.default_output_uri key to “s3://{your_bucket}”
Right, you don’t want ClearML to track that package, but there isn’t much you can do there I believe. I was trying to tackle how to run your code with an agent given those dependencies.
I think that if you clone the experiment, and remove that line in the dependencies sections in the UI you should be able to launch it correctly (as long as your clearml-agent
has the correct permissions)
and btwif "
@" in line: line = line.replace("
@", "https://")
should be the same as
...
To give more context, he is running an hyper params optimization script, that internally clones a base task and runs it with certain params and checks if a metric increases or decreases. It is when the agent tries to run this task that the error raises.
ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
clearml_agent: ERROR: Could not install task requirements!
Command '['~/.clearml/venvs-builds/3.8/bin/python', '-m', 'pip', '--disable-pip-v...
can you share your clearml.conf
file (remove the critical information first)?