@<1523701070390366208:profile|CostlyOstrich36> I keep failing to execute the task with clearml-agent because of the environment settings. How can I adjust the clearml.conf file so that the agent uses a specific local environment?
@<1523701070390366208:profile|CostlyOstrich36> here it is!
FYI, I set the options for HyperParameterOptimizer() as follows:
- compute_time_limit=None,
- total_max_jobs=100,
- min_iteration_per_job=None,
- max_iteration_per_job=None,
- max_number_of_concurrent_tasks=1
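For reference, the options listed above could be passed to the optimizer roughly as in the sketch below. The search strategy (RandomSearch), the hyperparameter name "General/lr", the metric title/series "validation"/"loss", and the queue name are placeholders assumed for illustration; they are not taken from the original message.

```python
# Options quoted in the message above; None means "no limit".
OPTIMIZER_OPTIONS = {
    "compute_time_limit": None,
    "total_max_jobs": 100,
    "min_iteration_per_job": None,
    "max_iteration_per_job": None,
    "max_number_of_concurrent_tasks": 1,
}


def build_optimizer(template_task_id, queue="default"):
    """Create the optimizer. Needs the clearml package and a reachable server."""
    # Imported lazily so the options above can be inspected without clearml.
    from clearml.automation import (
        HyperParameterOptimizer,
        RandomSearch,
        UniformParameterRange,
    )

    return HyperParameterOptimizer(
        base_task_id=template_task_id,
        # Hypothetical search space: one learning-rate parameter.
        hyper_parameters=[
            UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
        ],
        # Hypothetical objective: minimize the "validation"/"loss" scalar.
        objective_metric_title="validation",
        objective_metric_series="loss",
        objective_metric_sign="min",
        optimizer_class=RandomSearch,
        execution_queue=queue,
        **OPTIMIZER_OPTIONS,
    )
```

A typical flow would be to call Task.init() for the controller first, then build_optimizer("<template_task_id>"), followed by optimizer.start(), optimizer.wait(), and optimizer.stop().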
@<1722061354531033088:profile|TroubledCamel37> But I guess task.close() would terminate the optimization task, not the single experiment. Am I misunderstanding something?
Yeah, the problem was the fileserver connection, like you said!
I was running the experiment on a remote server, and solved the issue by opening the port for the fileserver! Thanks!
Plus, the first experiment terminated with early stopping.
@<1722061354531033088:profile|TroubledCamel37> Thanks! I'll look into the connectivity issue that you mentioned.
@<1722061354531033088:profile|TroubledCamel37> No, I didn't add "task.close()" in the code. This link is what I followed.
Even after completing one experiment, the console and UI don't seem to terminate the task.
I figured out the metrics should be provided in list format.
@<1523701070390366208:profile|CostlyOstrich36>
Actually, I've got another question about datasets!
I tried add_external_files from AWS S3 as a simple test.
And in the web UI, it says it's been uploading for 16 hours now.
The zip file I tried to upload is under 50MB.
Is something wrong here?
Also, I'm wondering if I could add files that are not "zipped" files, for example a directory containing various files.
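On the question of non-zipped content: as far as the Dataset API goes, add_external_files accepts a link to a folder as well as to a single file, so a directory of assorted files should be registrable. The sketch below assumes a hypothetical bucket path and dataset names, and the helper function itself is illustrative, not part of ClearML.

```python
def register_external_files(source_url, dataset_name, dataset_project):
    """Register files that already live in external storage as a dataset.

    source_url may point at one file or at a folder prefix, e.g. the
    hypothetical "s3://my-bucket/raw-data/".
    """
    # Imported lazily; requires the clearml package and configured credentials.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_name=dataset_name,
        dataset_project=dataset_project,
    )
    # External files are registered as links; they are not copied or re-zipped.
    dataset.add_external_files(source_url=source_url)
    # upload() pushes any locally added files and the dataset state;
    # finalize() closes the dataset version.
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

For a directory on the local disk rather than in S3, dataset.add_files(path=...) would be the analogous call.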
I've figured out what's wrong and fixed it! Thanks!
@<1523701070390366208:profile|CostlyOstrich36>
My code is supposed to automatically clone an optimization task with template_task_id and execute each experiment. I didn't remove any lines from the logs.
When I run the code locally, I run it with a virtual environment activated. However, if I use clearml-agent daemon to execute the task, it seems to use a default Docker image, and I don't know how to change the corresponding settings in the clearml.conf file!
@<1523701070390366208:profile|CostlyOstrich36> Would you mind looking over this issue?
But I would also like to know how to run this with Docker!
This is the log file!
Thanks a lot!!
@<1523701205467926528:profile|AgitatedDove14>
Thanks!
Would you mind walking me through the process?
As I understand it, first I'm going to build a self-hosted server with Docker on my Windows computer.
Secondly, I'm going to connect other Windows computers to the server. To do that, I need a token from my server, so that I can copy and paste it when I execute the command clearml-agent init --token <my_token> --queue default from the other Windows computers.
Lastly, I just execute `c...
I found this in the docs.
Am I not supposed to run the agent in Docker mode on a Windows computer?
@<1523701070390366208:profile|CostlyOstrich36> I didn't specify the remote version. Where can I check the version and adjust it?
@<1523701070390366208:profile|CostlyOstrich36> I'm using python 3.9.11 and pytorch 1.11.0+cu113.
@<1523701070390366208:profile|CostlyOstrich36>
I have a follow-up question to the first question.
I initiated a task, did get_local_copy of a dataset, and then I executed and finished the task (training).
In the web UI, I don't see any information saying that the task and dataset are related or linked.
What should I do to connect or link those two or find the information about it?
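One possible workaround, not necessarily the officially recommended way, is to record the dataset's ID on the task yourself so that it shows up in the task's configuration in the web UI. The helper name and the "Datasets" section name below are assumptions made for illustration.

```python
def fetch_dataset_and_record(task, dataset_id):
    """Download a dataset and note which one was used on the given task.

    `task` is expected to be the object returned by clearml's Task.init().
    """
    # Imported lazily; requires the clearml package and a reachable server.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    local_path = dataset.get_local_copy()
    # Store the dataset identity as task parameters under a section
    # called "Datasets" (section name chosen here, not mandated by ClearML).
    task.connect(
        {"dataset_id": dataset.id, "dataset_name": dataset.name},
        name="Datasets",
    )
    return local_path
```

After the task runs, the dataset ID and name should then be visible among the task's parameters, which makes it possible to trace the training run back to the data it used.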
@<1523701070390366208:profile|CostlyOstrich36> Hi! Actually, I changed it to run the training in a local environment now (not Docker)!
I ran a queued task from the web UI by clicking 'enqueue', and I got this error!
# Error logs
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manag...
Sorry for the late reply.
I believe this is why it's not working (from the console log):
adfba156d16e: Pull complete
Digest: sha256:0ce15c07d55860dfd2eeae535c42d85383a664821da5ff18d10448b5a2993e5a
Status: Downloaded newer image for ultralytics/yolov5:latest
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0...