Docker mode. They do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to store the TensorBoard logs in this folder? This could lead to "No such file or directory: 'runs'" if one process is deleting it while the other is trying to access it, or similar scenarios.
Hi SuperiorDucks36
Could you post the entire log?
(could not resolve host seems to be coming from the "git clone" call).
Are you able to manually clone the repository on the machine running trains-agent?
See if this helps
Hi HealthyStarfish45
You can disable the entire TB logging:
Task.init('examples', 'train', auto_connect_frameworks={'tensorflow': False})
Great!
I'll make sure the agent outputs the proper error.
Basically you should not use Task.create to log the current execution. It is used to create a Task externally and then enqueue it for remote execution. Make sense?
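Something along these lines (just a sketch; the repo/script/queue names are placeholders):
# log the current execution - Task.init
from clearml import Task
task = Task.init(project_name='examples', task_name='my_training_run')

# create a Task externally and enqueue it for remote execution - Task.create
remote_task = Task.create(
    project_name='examples',
    task_name='remote_training_run',
    repo='https://github.com/user/repo.git',  # placeholder repository
    script='train.py',
)
Task.enqueue(remote_task, queue_name='default')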
Are you running the agent in docker mode? or venv mode ?
Can you manually ssh on port 10022 to the remote agent's machine?
ssh -p 10022 root@agent_ip_here
Is there a way to filter experiments in a hyperparameter sweep based on a given range of a parameter/metric in the UI?
Are you referring to the HPO example? or the Task comparison ?
Okay let me see if I can think of something...
Basically crashing on the assertion here ?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L495
Could it be you are passing "Args/resume" True, but not specifying the checkpoint?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
I think I know what's going on:
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train...
Any chance you can zip the entire folder? I can't figure out what's missing, specifically "from config_files", i.e. I have no package or file named config_files.
Oh dear, I think your theory might be correct, and this is just MongoDB preallocating storage.
Which means the entire /opt/trains just disappeared
Interesting use case, do you already have the connect_configuration in the code? or do we need to somehow create it ?
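For reference, a minimal sketch of what I had in mind (the file name and config name are placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='config_demo')
# register the configuration file with the task; when executed remotely the
# returned path points to the (possibly edited) configuration from the UI
config_path = task.connect_configuration('config.yaml', name='my_config')
with open(config_path) as f:
    cfg = f.read()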
The notebook path goes through a symlink a few levels up the file system (before hitting the repo root, though)
Hmm sounds interesting, how can I reproduce it?
The notebook kernel is also not the default kernel,
What do you mean?
Yes, that is an issue for me. Even if we could centralize an environment today, it leaves a concern that whenever we add a model, possible package changes are going to cause issues with older models.
Yeah, changing the environment on the fly is tricky, it basically means spinning up an internal HTTP service per model...
Notice you can have many clearml-serving sessions; they are not limited, so this means you can always spin up new serving instances with new environments. The limitation is changing an e...
Hi @<1751777160604946432:profile|SparklingDuck54>
Actually the Dataset ID on the task can be easily pulled via:
Task.get_task("task_uid_with_dataset").get_parameter("Datasets/<dataset_alias_name>")
or
Task.get_task("task_uid_with_dataset").get_parameters_as_dict().get("Datasets")
Hi StickyMonkey98
a very large number of running and pending tasks, and doing that kind of thing via the web-interface by clicking away one-by-one is not a viable solution.
Bulk operations are now supported, upgrade the clearml-server to 1.0.2.
Is it possible to fetch a list of tasks via Task.get_tasks,
Sure:
Task.get_tasks(project_name='example', task_filter=dict(system_tags=['-archived']))
I created my own docker image with a newer python and the error disappeared
I'm not sure I understand how that solved it?!
Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Correct, I think the "vault" option is only available on the paid tier.
but how should we do this for the credentials?
I'm not sure how to pass them; wouldn't it make sense to give the agent all-access credentials?
Basically the default_output_uri will cause all models to be uploaded to this server (with a specific subfolder per project/task).
You can have the same value there as the files_server.
The files_server is where you have all your artifacts / debug samples
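If you prefer setting it per Task rather than in clearml.conf, a rough equivalent (the URL is a placeholder) is passing output_uri directly:
from clearml import Task

# same effect as default_output_uri, but only for this task;
# using the files_server address here works fine
task = Task.init(
    project_name='examples',
    task_name='upload_demo',
    output_uri='http://files.example.com:8081',
)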
Hi TenseOstrich47, what's the matplotlib version and clearml version you are using?
I presume it is via the project_name and task_name parameters.
You are correct in your assumption, it only happens when you call Task.init, but two distinctions:
1. ArgParser arguments are overridden (with trains-agent) even before Task.init is called.
2. Task.init, when running under trains-agent, will totally ignore the project/task name; it receives a pre-made task id and uses it. So the project name and experiment are meaningless if you are running the tas...
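To illustrate both points (the argument name and values are placeholders):
from argparse import ArgumentParser
from clearml import Task

parser = ArgumentParser()
parser.add_argument('--lr', type=float, default=0.001)
# per point 1: even when parsed before Task.init, the agent can override these values
args = parser.parse_args()

# per point 2: project/task name are ignored under the agent (it uses a pre-made task id)
task = Task.init(project_name='examples', task_name='train')
print(args.lr)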
Yes that makes total sense to me. How about a GitHub issue on the clearml-docs ?
SIGINT (Ctrl-C) only.
Because flushing state (i.e. sending requests) might take time, we only do that when users interactively hit Ctrl-C. Make sense?
Hi SteadyFox10
Short answer, no.
Long answer, full permissions are available in the paid tier, alongside a few more advanced features.
Fortunately in this specific use case, the community service allows you to share a single (or multiple) experiments with a read-only link. Would that work ?
Try this one:
HyperParameterOptimizer.start_locally(...)
https://clear.ml/docs/latest/docs/references/sdk/hpo_optimization_hyperparameteroptimizer#start_locally
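For context, a rough sketch of how it fits together (the base task id, parameter and metric names are all placeholders):
from clearml.automation import HyperParameterOptimizer, UniformParameterRange, RandomSearch

optimizer = HyperParameterOptimizer(
    base_task_id='base_task_id_here',
    hyper_parameters=[
        UniformParameterRange('General/learning_rate', min_value=0.0001, max_value=0.1),
    ],
    objective_metric_title='validation',
    objective_metric_series='loss',
    objective_metric_sign='min',
    optimizer_class=RandomSearch,
    max_number_of_concurrent_tasks=2,
)
# run the optimization controller in the current process instead of enqueueing it
optimizer.start_locally()
optimizer.wait()
optimizer.stop()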
Basically try with the latest RC:
pip install trains==0.15.2rc0
In both cases, if I get the element from the list, I am not able to get when the task started. Where is this info stored?
If you are using client.tasks.get_all(...), it should be under the started field.
Specifically you can probably also do:
queried_tasks = Task.query_tasks(additional_return_fields=['started'])
print(queried_tasks[0]['id'], queried_tasks[0]['started'])
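And if you are going through the APIClient, something like this should also expose it (the project id is a placeholder):
from clearml.backend_api.session.client import APIClient

client = APIClient()
tasks = client.tasks.get_all(project=['project_id_here'], only_fields=['id', 'started'])
for t in tasks:
    print(t.id, t.started)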
Okay, good news: there is a fix. Bad news: the sync to GitHub will only happen tomorrow.