Reputation
Badges 1
25 × Eureka!Quick update, I found the issue, working on a fix 🙂
I pull all the parameters, and then manually filter on the HP keys (manually=I have to plug them in, they are not part of optimizer object)
So is this an improvement to optimizer._get_child_tasks_ids(...) interface ?
e.g. return a structure like:[ { 'id': task_id, 'hp1': value, 'hp2': value, 'hp3': value, 'objective': dict(title='title', series='series', value=42 }, ]
Hi @<1674588542971416576:profile|SmarmyGorilla62>
You mean on your elastic / mongo local disk storage ?
I mean the caching will work, but it will reinstall this repository on top of the cached copy.
make sense ?
Profile page top left corner
WittyOwl57 this is what I'm getting on my console (Notice New lines! not a single one overwritten as I would expect)
` Training: 10%|█ | 1/10 [00:00<?, ?it/s]
Training: 20%|██ | 2/10 [00:00<00:00, 9.93it/s]
Training: 30%|███ | 3/10 [00:00<00:00, 9.89it/s]
Training: 40%|████ | 4/10 [00:00<00:00, 9.87it/s]
Training: 50%|█████ | 5/10 [00:00<00:00, 9.87it/s]
Training: 60%|██████ | 6/10 [00:00<00:00, 9.88it/s]
Training: 70%|███████ | 7/10 [00:00<00...
JitteryCoyote63 I think there is a ClearML logger , no?
Hi UnevenDolphin73 , are those per user/project/system environment variables ?
If these are secrets (that you do not want to expose), maybe it is best just to have them on he agent's machine ?
BTW, I think there is some "vault" support in the paid tiers for these kind of secret, not sure on which level (i.e. user/system/project)
Maybe failed pipelines with zero steps count as completed
zero steps counts as successful.
That said, how could it have zero steps if one of the steps failed? no?
All the 3 steps can be found here:
https://github.com/allegroai/trains/tree/master/examples/pipeline
I just called exit(0) in a notebooke and it closed it (the kernel) no exception
To be honest, I'm not sure I have a good explanation on why ... (unless on some scenarios an exception was thrown and caught silently and caused it)
but we run everything in docker containers. Will it still help?
As long as you are running with clearml-agent(in docker mode), all the cache folders (this one included) are mounted on the host machine for persistency
Also, How do I make the files other than entry script visible to the job?
The assumption for clearml (regradless on how you create a Task) is that you code is either a standlone script (or jupyter notebook) or inside a git repository. In case of a git repository cleamrl-agent will clone the git repository of the code, apply the uncommitted changes and run your code.
StaleButterfly40 are you sure you are getting the correct image on your TB (toy255) ?
single task in the DAG is an entire ClearML
pipeline
.
just making sure detials are not lost, "entire ClearML pipeline ." : the pipeline logic is process A running on machine AA.
Every step of that pipeline can be (1) subprocess, but that means the exact same environement is used for everything, (2) The DEFAULT behavior, each step B is running on a different machine BB.
The non-ClearML steps would orchestrate putting messages into a queue, doing retry logic, and tr...
Hi @<1524560082761682944:profile|MammothParrot39>
By default you have the last 100 iterations there (not sure why you are only seeing the last 3), but this is configurable:
None
BattyLion34 Okay, I'll try to see if we can solve the multi-instance issue on Windows (because obviously it should be automatic)
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
Could you send me the cosnole log of both tasks, failing and passing one?
BattyLion34 is this consistent?
(Really I can't see eny difference, one time it is able to create the venv and another it is failing with permission error)
Thanks BattyLion34 I fixed the code snippet :)
UnevenDolphin73 something like this one?
https://github.com/allegroai/clearml/pull/225
Hi NastyOtter17
"Project" is so ambiguous
LOL yes, this is something GCP/GS is using:
https://googleapis.dev/python/storage/latest/client.html#module-google.cloud.storage.client
How can the first process corrupt the second
I think that something went wrong and both Agents are using the same "temp" folder to setup the experiment.
why doesn't this occur if I run pipeline from command line?
The services queue is creating new dockers with everything in them so they cannot step on each others toes (so to speak)
I run all the processes as administrator. However, I've tested running the pipeline from command line in non-administrator mode, it works fine....
Hi RobustRat47
My guess is it's something from the converting PyTorch code to TorchScript. I'm getting this error when trying the
I think you are correct see here:
https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/examples/pytorch/train_pytorch_mnist.py#L136
you have to convert the model to TorchScript for Triton to serve it
Hi BattyLion34
I might have a solution, in order to make sure the two agents are not sharing the "temp" folder:
create two copies of ~/clearml.conf , let's call them :
~/clearml_service.conf ~/clearml_agent.confThen in each one select a different venvs_dir see here:
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L90
for example:
~/.clearml/venvs-builds1 ~/.clearml/venvs-builds2Now start the two agents with:
The service age...