Hi WackyRabbit7,
Regarding git credentials, see here in the trains.conf: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L18
Trains assumes one of two (almost three) possible setups:
Your code/script is in a git repository. When executing manually, all the git references, including uncommitted changes, are stored. When executing with the trains-agent, it will clone the code based on these references, apply the uncommitted changes, and run your code. To do that the ...
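For reference, the git credentials section of that trains.conf looks roughly like this (the values are placeholders you fill in yourself):
` agent {
    # credentials the trains-agent uses when cloning the repository
    git_user: "my-git-username"
    git_pass: "my-git-password-or-token"
} `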
Hi OutrageousReindeer5
Is NetApp S3 protocol enabled or are you referring to NFS mounts?
trains[azure] gives you the possibility to do the following:
` from trains import StorageManager

my_local_cached_file = StorageManager.get_local_copy('azure://bucket/folder/file.bin') `
This means you do not have to manually download files and maintain the local cache; the StorageManager will do that for you.
If you do not need that ability, there is no need to install trains[azure],
you can just install trains
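Concretely (both are published on PyPI; pick whichever you need):
` pip install trains           # core package only
pip install "trains[azure]"  # core + Azure storage support `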
Unfortunately, we haven't had the time to upgrade to the Azure storage v...
Exactly 🙂
If you feel like PR-ing a fix, it will be greatly appreciated 🙂
Hi ElegantCoyote26
but I can't see any documentation or examples about the updates done in version 1.0.0
So actually the docs are only for 1.0... https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving
Hi there, are there any plans to add better documentation/example
Yes, this is work in progress, the first Item on the list is custom model serving example (kind of like this one https://github.com/allegroai/clearml-serving/tree/main/examples/pipeline )
about...
On the host machine, or inside the containers that are spinning on the host machine?
BTW: the same holds for tagging multiple experiments at once
"Updates a few seconds ago"
That just means that the process is not dead.
Yes that seemed to be stuck 😞
Any chance you can verify with the RC version?
I'll try to dig into the commits, maybe I can come up with an explanation ...
RobustGoldfish9 do you see the trains-agent listed as a machine in the UI (under Workers)?
Sure, venv mode
Here you go 🙂
(using trains_agent for easier access to all the data)
` from trains_agent import APIClient

client = APIClient()
log_events = client.events.get_scalar_metric_data(task='11223344aabbcc', metric='valid_average_dice_epoch')
print(log_events) `
Hmm GreasyLeopard35 can you specify the range you are passing to the HPO, as well as the type of optimization class? (grid/random/optuna, etc.)
GreasyLeopard35 I think you are on to something, I think UniformParameterRange just misses a min value:
https://github.com/allegroai/clearml/blob/fcad50b6266f445424a1f1fb361f5a4bc5c7f6a3/clearml/automation/parameters.py#L168
Should be:
` [self.min_value + v*step_size for v in range(0, int(steps))] `
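A quick way to see the off-by-one (a sketch; the parameter name and values here are arbitrary):
` from clearml.automation.parameters import UniformParameterRange

r = UniformParameterRange('lr', min_value=0.1, max_value=0.5, step_size=0.1)
print(r.to_list())
# without the fix the generated values start at 0.0 instead of at min_value (0.1) `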
Hmm I see, add this for example
extra_docker_shell_script: ["rm ~/.bashrc", "echo removed bashrc"]
I came across it before but thought it's only relevant for credentials
We are working on improving the docs, hopefully it will get clearer 😉
The agent cannot use another user (it literally has no way of getting credentials). I suspect this is all a byproduct of the actual mount point.
You cannot change the user once you have mounted the shared folder with either CIFS or NFS.
Yes, it could; crontab runs as the user it was invoked from (root if used with sudo).
Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following:
` trains-agent execute --docker --id aabbcc `
This means you manually spin up two simultaneous copies of the same experiment. Once they are up and running, will your code be able to make the connection between them? (i.e. OpenMPI, torch distributed, etc.)
So that means your home folder is always mapped to ~/ on any machine you ssh to?
Hi UpsetCrocodile10
First, I perform many experiments in one process, ...
How about this one:
https://github.com/allegroai/trains/issues/230#issuecomment-723503146
Basically you could utilize create_function_task
This means you have Task.init() on the main "controller" and each "train_in_subset" as a "function_task". Then the controller can wait on them and collect the data (like the HPO does).
Basically:
` controller_task = Task.init(...)
children = []
for i, s in enumer...
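A rough sketch of how that could look (the project/task names, the subsets, and the body of train_in_subset are placeholders, not part of the original snippet):
` from trains import Task

def train_in_subset(subset):
    # your actual training code for one subset goes here
    pass

controller_task = Task.init(project_name='examples', task_name='controller')
children = []
for i, s in enumerate(['subset_a', 'subset_b', 'subset_c']):
    # create_function_task wraps the function call in a new child Task
    child = controller_task.create_function_task(
        train_in_subset, func_name='train_{}'.format(i),
        task_name='train on {}'.format(s), subset=s)
    children.append(child) `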
UpsetCrocodile10
Does this method expect my_train_func to be in the same file as
As long as you import it and you can pass it, it should work.
Child exp gets aborted immediately ...
It seems it cannot find the file "main.py"; it assumes all code is part of a single repository. Is that the case? What do you have under the "Execution" tab for the experiment?
Hi UpsetCrocodile10
execute them and return scalars.
This should be a good start (I hope 🙂 )
` for child in children:
    # put the Task into an execution queue
    Task.enqueue(child, queue_name='my_queue_here')
    # wait for the task to finish
    child.wait_for_status(status=['completed'])
    # reload all the metrics
    child.reload()
    # get the metrics
    print(child.get_last_scalar_metrics()) `