Lately I've heard of groups that do slices of datasets for distributed training, or who "stream" data.
Hmm, so maybe a "glob"-like parameter, e.g. get_local_copy(select_filter='subfolder/*')?
So would that be "tags", "parents"?
Hi @<1541954607595393024:profile|BattyCrocodile47>
Can you help me make the case for ClearML pipelines/tasks vs Metaflow?
Based on my understanding
- Metaflow cannot have custom containers per step (at least I could not find where to push them); ClearML supports this, see the sketch after this list
- DAG only execution. I.e. you cannot have logic driven flows
- cannot connect git repositories to different components in the pipeline
- Visualization of results / artifacts is rather limited
- Only Kubernetes is supported as underlying prov...
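For context, here is a minimal ClearML sketch of the per-step container and per-step repository support mentioned in the list above (the project, image, repo URL, and queue names are placeholders, not from the thread):
```python
from clearml import PipelineController

def train(dataset_id: str):
    # placeholder step body; each step runs as its own Task on the chosen queue
    print(f"training on {dataset_id}")

pipe = PipelineController(name="demo-pipeline", project="examples", version="1.0")
pipe.add_function_step(
    name="train",
    function=train,
    function_kwargs={"dataset_id": "abc"},
    docker="nvcr.io/nvidia/pytorch:23.10-py3",      # custom container for this step only
    repo="https://github.com/org/other-repo.git",   # different git repository for this step
    execution_queue="default",
)
pipe.start()  # enqueues the pipeline controller (an agent is required to actually run it)
```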
@<1523706266315132928:profile|DefiantHippopotamus88> seems like you are missing the ports 🙂
CLEARML_WEB_HOST="<web URL>"
CLEARML_API_HOST="<api URL>"
CLEARML_FILES_HOST="<files URL>"
Any recommendation or working combinations of AMI
I would take the deep learning AMIs from NVIDIA on AWS; I think they work on both CPU and GPU machines.
In terms of dockers: python dockers for CPU, and nvidia runtime dockers for GPU.
[https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86d[…]d2c01646d599352f6ddd9893420eb815a06c3b90619f8?context=explore](https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86db7f6b1b286d2c01646d599352f6ddd98...
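As an illustration, that combination would look something like the following (the queue names and the GPU image are placeholders, not a recommendation from the thread):
```
clearml-agent daemon --queue cpu --docker python:3.11.2-bullseye
clearml-agent daemon --queue gpu --docker nvcr.io/nvidia/pytorch:23.10-py3 --gpus 0
```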
Hi SmallDeer34
ClearML automagical logging works on the current python process. But in your example, your bash command is running another python script (that has nothing to do with the original notebook), hence the clearml automagic is not aware of it (i.e. it cannot "patch" the tensorboard calls).
In order to make it work, you should do something like:
from joeynmt import train
train.main(...)
Or something similar 🙂
Make sense ?
Hi @<1576381444509405184:profile|ManiacalLizard2>
Yeah that should work, assuming credentials are set in your clearml.conf
The idea is that it is not necessary: using the trains-agent you can not only launch the experiment on a remote machine, you can also override the parameters, not just cmd line arguments, but any dictionary you connected with the Task or configuration...
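A minimal sketch of that pattern (project/task names are placeholders): connect a dictionary, and its values can then be overridden from the UI when the Task is cloned and executed by an agent:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="param demo")
params = {"lr": 0.001, "batch_size": 32}
params = task.connect(params)  # values become editable in the UI and are overridden on remote runs
print(params["lr"])            # on an agent run, this reflects whatever was set in the UI
```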
OK, so if I've got, like, 2x16GB GPUs ...
You could do:
clearml-agent daemon --queue "2xGPU_32gb" --gpus 0,1
Which will always use the two gpus for every Task it pulls
Or you could do:
clearml-agent daemon --queue "1xGPU_16gb" --gpus 0
clearml-agent daemon --queue "1xGPU_16gb" --gpus 1
Which will have two agents, one per GPU (with 16gb per Task it runs)
Or:
clearml-agent daemon --queue "2xGPU_32gb" "1xGPU_16gb" --gpus 0,1
Which will first pull Tasks from the "2xGPU_32gb" qu...
Hi SmallDeer34
Hmm, I'm not sure you can; the code will by default use rglob with the last part of the path as the wildcard selection 🙂
You can of course manually create a zip file...
How would you change the interface to support it ?
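Until something like a select_filter parameter exists, one workaround sketch (dataset project/name are placeholders) is to fetch the cached copy and glob inside it:
```python
import glob
import os
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
local_root = ds.get_local_copy()                                   # cached, read-only copy of the dataset
selected = glob.glob(os.path.join(local_root, "subfolder", "*"))   # keep only the sub-folder you need
```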
And when running get, the files on the parent dataset will be available as links.
BTW: if you call get_mutable_copy() the files will be copied, so you can work on them directly (if you need)
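For illustration (dataset names and target folder are placeholders), the difference looks like this:
```python
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
read_only = ds.get_local_copy()                    # cached copy, parent-dataset files appear as links
writable = ds.get_mutable_copy("/tmp/my_dataset")  # real copies of the files, safe to modify in place
```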
StickyLizard47 apologies for the https://github.com/allegroai/clearml-server/issues/140 not being followed (probably slipped through the cracks of backend guys, I can see the 1.5 release happened in parallel). Let me make sure it is followed.
SarcasticSquirrel56 specifically, did you also spin up a clearml-k8s glue? Or are the agents statically allocated in the helm chart?
Hi NastyFox63
What do you mean by not all of them are shown?
Do they have different series/titles? Are they plots or scalars? How are you reporting them?
HandsomeCrow5 I see, my bad.
BTW: Did you see this one?
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
And the helper classes here: https://github.com/allegroai/trains/tree/master/trains/automation
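A condensed sketch of the pattern in that example (trains-era API; the template task id and parameter name are placeholders):
```python
from trains import Task
from trains.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

task = Task.init(project_name="examples", task_name="HPO demo", task_type=Task.TaskTypes.optimizer)
optimizer = HyperParameterOptimizer(
    base_task_id="<template task id>",            # the experiment used as a template for every trial
    hyper_parameters=[DiscreteParameterRange("batch_size", values=[16, 32, 64])],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=RandomSearch,
    execution_queue="default",
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```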
Hi TrickyRaccoon92
If you are reporting to tensor-board, then "iteration" equals step. Is this the case?
Hi QuaintJellyfish58
This is odd, this "undefined" project is also marked as "Example" which would explain why you cannot delete it, but not how you ended up with one
Any idea on what changed on your server ?
This is good news, it means the k8s glue created a k8s job and pushed the Task into the "k8s_scheduler" queue for visibility (i.e. it is now the k8s job's responsibility to launch the pod).
Can you check on the Task Info tab what is the status/message ? (it should reflect the k8s pod status)
Great! If this is what you do, how come you need to change the entry script in the UI?
Hi @<1585078763312386048:profile|ArrogantButterfly10>
Now i want to clone the pipeline and change the hyperparameters of train task, is it possible? If so, how??
The pipeline arguments are for the pipeline DAG/logic; you need to pass one of them as an argument to the training step/task. Make sense?
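As a sketch (project/task names are placeholders), one way to wire a pipeline argument into the training step is parameter_override; the pipeline argument is then editable when you clone the pipeline run:
```python
from clearml import PipelineController

pipe = PipelineController(name="train-pipeline", project="examples", version="1.0")
pipe.add_parameter(name="lr", default=0.001)   # pipeline argument, editable on a cloned pipeline run
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train task",
    parameter_override={"General/lr": "${pipeline.lr}"},  # forward the pipeline argument to the step
)
pipe.start()
```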
Do you have to have a value there ?
FileNotFoundError: [Errno 2] No such file or directory
Could it be the file you are trying to run is not in the repository ?
Are you running inside a docker ?
Any chance you can send the full log ?
BTW: if you want to sync between artifacts / settings, I would recommend calling task.reload() to get the latest values back from the server.
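A tiny sketch of that (the task id is a placeholder):
```python
from clearml import Task

task = Task.get_task(task_id="<task id>")
task.reload()                       # refresh the local object with the latest values from the server
print(list(task.artifacts.keys()))  # now reflects artifacts registered since the object was created
```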
It was installed by 'pip install kwcoco' while my conda env was active.
Well I guess my question is, how does conda know where to install it from, if this is not on the public channels? Is there a specific conda channel you added (or preconfigured)?
Hi MinuteGiraffe30
Are you saying that when you are running your code locally with a gitea repository, ClearML incorrectly adds a link to gitlab?
NastyOtter17
Usually the first report will happen after 30 seconds, could that be the difference ?
ColossalDeer61 btw, it turns out the docker-compose services configuration was broken on GitHub 🙂 I suggest you get the latest copy of it:
curl ... -o docker-compose.yml
Hmm, is there a way to do this via code?
Yes, clone the Task with Task.clone()
Then do data = task.export_task()
and edit the data object (see the execution section),
then update back with task.update_task(data)
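Putting it together as a sketch (the task id and the edited field are placeholders; the exact dict layout follows the server's task structure):
```python
from clearml import Task

original = Task.get_task(task_id="<task id>")
cloned = Task.clone(source_task=original)        # creates a new draft Task
data = cloned.export_task()                      # full task definition as a dict
data["script"]["entry_point"] = "my_script.py"   # e.g. edit something under the execution section
cloned.update_task(data)                         # push the edited definition back to the server
```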