Eureka! I see. Thank you very much. For my current problem, giving priority according to queue priority would kinda solve it. For experimentation I will sometimes enqueue a task and then later enqueue another one of a different kind, but even though this could be trivially solved, I have to wait for the first one to finish. I guess this is only a problem for people with small "clusters" where SLURM does not make sense, but no scheduling at all is also suboptimal.
However, I...
As in if it was not empty it would work?
I think such an option can work, but actually if I had free wishes I would say that the clearml.Task code needs some refactoring (but I am not an experienced software engineer, so I could be totally wrong). It is not clear what Task.init does and how it does it, and the very long method declaration is confusing. I think there should be two ways to initialize tasks:
Specify a lot manually, e.g.
` task = Task.create()
task.add_requirements(from_requirements_files(..))
task.add_entr...
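Just to make the idea concrete, here is a minimal sketch of that "manual" style with the current API as I understand it (the project/task names, script, and queue are made up, and the Task.create signature should be double-checked against the docs):
` from clearml import Task

# create a task without running the code locally (sketch)
task = Task.create(
    project_name="my_project",            # hypothetical project name
    task_name="manual_task",              # hypothetical task name
    script="train.py",                    # hypothetical entry point
    requirements_file="requirements.txt", # explicit requirements, no auto-detection
)

# hand it to an agent ("default" queue is an assumption)
Task.enqueue(task, queue_name="default") `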
Afaik, clearml-agent will use existing installed packages if they fit the requirements.txt. E.g. pytorch >= 1.7 will only install PyTorch if the environment does not already provide some version of PyTorch greater than or equal to 1.7.
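For illustration, a requirements line like the following (assuming the PyPI package name torch) would be skipped by the agent whenever the pre-installed PyTorch already satisfies the constraint:
` # requirements.txt (illustrative)
# the agent reuses the environment's PyTorch if it is already >= 1.7
torch >= 1.7 `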
So it seems to be definitely a problem with docker and not with clearml. However, I do not get why it works for you but on none of my machines (all Ubuntu 20.04 with docker 20.10).
` apiserver:
  command:
    - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  restart: unless-stopped
  volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
  depends_on:
    - redis
    - mongo
    - elasticsearch
    - fileserver
    - fileserver_datasets
  environment:
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
    CLEARML_...
Sure, no problem!
I got the idea from an error: when the agent was configured to use pip, it tried to install BLAS (for PyTorch, I guess) and threw an error.
I just updated my server to 1.0 and now the services agent is stuck in restarting:
It seems like the services-docker is always started with Ubuntu 18.04, even when I use task.set_base_docker("continuumio/miniconda:latest -v /opt/clearml/data/fileserver/:{}".format(file_server_mount))
Ah, very cool! Then I will try this, too.
Thanks for the answer. So currently the cleanup is done based on the number of experiments that are cached? If I have a few big experiments, this could make my agent's cache overflow?
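For context, this is the clearml.conf section I mean, as I remember it (the venvs_cache option names and defaults are from memory, so treat them as assumptions and verify against the agent docs):
` # clearml.conf (agent side) — illustrative values
agent {
    venvs_cache {
        # keep at most N cached environments, regardless of their size
        max_entries: 10
        # free disk space (GB) below which the cache gets cleaned
        free_space_threshold_gb: 2.0
        path: ~/.clearml/venvs-cache
    }
} `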
I created an issue on using conda as package manager: https://github.com/allegroai/clearml-agent/issues/44
Outside of the clearml.Task?
Ah, nevermind. I thought wrong here.
Unfortunately, I do not know that. It must have been before October 2021 at least. I know I asked here how to use the preinstalled version and AgitatedDove14 helped me to get it to work. But I cannot find the old thread 😕
These are the errors I get if I use files_server without a bucket ( s3://my_minio_instance:9000 ):
` 2022-11-16 17:13:28,852 - clearml.storage - ERROR - Failed creating storage object
Reason: Missing key and secret for S3 storage access ( )
2022-11-16 17:13:28,853 - clearml.metrics - WARNING - Failed uploading to ('NoneType' object has no attribute 'upload_from_stream')
2022-11-16 17:13:28,854 - clearml.storage - ERROR - Failed creating storage object
Reason: Missing key...
btw: I also tested the clearml-agent running on a different machine and with python 3.8 and I get the same problems.
Can you tell me how I create tasks correctly? PipelineController.add_step takes the task-id/task-name, but I would rather just define a function that returns the task directly, since the base task may not already be on the clearml-server.
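Depending on the clearml version, PipelineController.add_function_step may already cover this; a minimal sketch (the pipeline/step names and the step body are made up):
` from clearml import PipelineController

def prepare_data(seed=42):
    # hypothetical step body — any plain function can become a step
    return [seed, seed + 1]

pipe = PipelineController(
    name="my_pipeline",    # hypothetical names
    project="my_project",
    version="0.0.1",
)

# instead of referencing a pre-existing base task by id/name,
# register the function itself as a step
pipe.add_function_step(
    name="prepare_data",
    function=prepare_data,
    function_kwargs=dict(seed=42),
    function_return=["data"],
)

pipe.start_locally(run_pipeline_steps_locally=True) `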
It could be that the clearml-server has bad behaviour either while the cleanup is ongoing or even after it.
SmugDolphin23 Good catch. I have a good but unsatisfying message for you guys: I restarted the whole machine (server and agent) and now it works fine ...
So with pipeline decorators can I implement this logic?
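In case it helps, this is a minimal sketch of the decorator API as I understand it (step logic, names, and values are made up):
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["doubled"])
def double(x):
    # runs as its own task when the pipeline is executed remotely
    return x * 2

@PipelineDecorator.pipeline(name="demo_pipeline", project="my_project", version="0.0.1")
def pipeline_logic(start=21):
    doubled = double(start)
    print(doubled)

if __name__ == "__main__":
    # run everything in the local process for quick debugging
    PipelineDecorator.run_locally()
    pipeline_logic(start=21) `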
AgitatedDove14 Yes, you understood correctly. But Task.create is used by Task.init, something like this, right?
` def init(project_name, task_name):
    if not Task.exists_already(project_name, task_name):
        task = Task.create(...)
    else:
        task = load_existing_task()
    return task `
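For reference, a sketch of how this could look with the public API as I understand it (Task.exists_already and load_existing_task above are made-up helpers; Task.get_tasks matches by project and task name and returns a list):
` from clearml import Task

def init(project_name, task_name):
    # look for existing tasks with this name (task_name acts as a pattern)
    existing = Task.get_tasks(project_name=project_name, task_name=task_name)
    if existing:
        return existing[0]
    return Task.create(project_name=project_name, task_name=task_name) `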
I am going to try it again and send you the relevant part of the logs in a minute. Maybe I am interpreting something wrong.
I guess it started with the usage of the cleanup_service.