Hi @<1561885941545570304:profile|PunyKangaroo87>
What do you mean by store data locally?
Like clearml-data? i.e. a Dataset?
You can always use file:///root/path/folder as destination, this will store everything into the local folder, is that it?
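A minimal sketch of the `file://` suggestion, assuming the "destination" in question is the `output_uri` argument of `Task.init`. The project and task names are placeholders, and `local_destination` is a hypothetical helper, not part of ClearML.

```python
def local_destination(folder: str) -> str:
    """Turn an absolute POSIX folder path into a file:// URI."""
    if not folder.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return "file://" + folder


def init_task_with_local_storage(folder: str):
    """Create a Task whose output destination is a local folder."""
    # Imported lazily so the helper above works without clearml installed.
    from clearml import Task

    return Task.init(
        project_name="examples",          # placeholder
        task_name="local storage demo",   # placeholder
        output_uri=local_destination(folder),
    )


print(local_destination("/root/path/folder"))
```

Calling `init_task_with_local_storage("/root/path/folder")` needs a configured ClearML setup. The helper on its own only builds the URI.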
BTW:
This is very odd "~/.clearml/venvs-builds.3/3.6/bin/python" it thinks it is using "python 3.6" but it is linked with python 2.7 ...
No idea how that could happen
Hi UnevenDolphin73 , are those per user/project/system environment variables ?
If these are secrets (that you do not want to expose), maybe it is best just to have them on the agent's machine ?
BTW, I think there is some "vault" support in the paid tiers for this kind of secret, not sure on which level (i.e. user/system/project)
After it finishes the 1st Optimization task, what's the next job which will be pulled ?
The one in the highest queue (if you have multiple queues)
If you use fairness it will pull in round robin from all queues, (obviously inside every queue it is based on the order of jobs).
fyi, you can reorder the jobs inside the queue from the UI 🙂
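To make the two pulling orders concrete, here is a small simulation. It is only a sketch of the behaviour described above, not the agent's actual code. The queue contents and function names are made up for illustration.

```python
from collections import deque


def pull_priority(queues):
    """Always pull from the first (highest priority) queue that has a job."""
    pending = [deque(q) for q in queues]
    order = []
    while any(pending):
        first_non_empty = next(q for q in pending if q)
        order.append(first_non_empty.popleft())
    return order


def pull_round_robin(queues):
    """Take one job from each non-empty queue in turn (the fairness mode)."""
    pending = [deque(q) for q in queues]
    order = []
    while any(pending):
        for q in pending:
            if q:
                order.append(q.popleft())
    return order


# Queues are listed from highest to lowest priority.
queues = [["hpo_2", "hpo_3"], ["train_1"]]
print(pull_priority(queues))
print(pull_round_robin(queues))
```

In both modes the order inside each queue is preserved. Only the choice of which queue to pull from next differs.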
DeliciousBluewhale87 wdyt?
You could change infrastructure or hosting, and now your data is associated with the wrong URL
Yeah that makes sense, so have it on a specific dns name? (this is usually the case with k8s deployments)
This is odd it says 1.0.0 but then, it was updated t weeks ago ...
For classification it's F1 score but for other task it maybe and I don't think that's problem. we just have to log it right?
Correct 🙂
Give me a few days, I will work on your suggestions and then let you know if I am not able to do this
Sounds good!
BTW:
previous_tasks = Task.get_tasks(task_filter={'tags': ['best']})
local_model_file = previous_tasks[0].artifacts['my_model'].get_local_copy()
Probably not the case the other way around.
Actually the other way around, the new pip version uses a new package dependency resolver that can conclude that a previous package setup is not supported (because of version conflicts) even though it worked...
It is tricky, pip is trying to get better at resolving package dependencies, but it means that old resolutions might not work, which would mean old environments cannot be restored (or "broken" env). This is the main reason not to move to p...
Yes, albeit not actually "intercept", as the user will be able to directly put Tasks in queues B_machine_a/B_machine_b , but any time the user is pushing Tasks into queue B, this service will pull them and push them to the individual machines' queues.
what do you think?
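A rough sketch of that dispatcher idea, using plain Python containers instead of real queues. The round-robin choice of target machine is an assumption made for illustration. The message above does not say how the service should pick a machine.

```python
from collections import deque
from itertools import cycle


def dispatch(shared_queue, machine_queues):
    """Move every task from the shared queue into per-machine queues."""
    # Assumption: targets are picked round robin, in sorted name order.
    targets = cycle(sorted(machine_queues))
    while shared_queue:
        task = shared_queue.popleft()
        machine_queues[next(targets)].append(task)
    return machine_queues


queue_b = deque(["task_1", "task_2", "task_3"])
machines = {"B_machine_a": [], "B_machine_b": []}
dispatch(queue_b, machines)
print(machines)
```

Users can still append directly to `B_machine_a` or `B_machine_b`. The dispatcher only touches what lands in the shared queue.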
Yes, consider VexedCat68 txt file the Dataset "content" , this will enable you to safely get the list of files, and then you can use the StorageManager to download them extend this concept and have it built into the Dataset itself, i.e. allow you to add files as links and make sure it will just download them. The caveat here is that the Dataset at the end, returns a folder with the files, when you specify links, you have to also specify the target location locally (at the end you want a fol...
Does this mean that I need to create multiple ssh keys? 1 key for each user?
I think so
Use .git-credentials
This might also support multiple user/repo
UpsetBlackbird87 pipeline.start() will launch the pipeline itself on a remote machine (a machine running the services agent).
This is why your pipeline is "stuck", it is not actually running.
When you call start_locally() the pipeline logic itself is running on your machine and the nodes are running on the workers.
Makes sense ?
Hi @<1526371965655322624:profile|NuttyCamel41>
. I do that because I do not know how to get the pickle file into the docker container
What would the pickle file do?
and load the MinMaxScaler within the script, as the sklearn dependency is missing
what do you mean by that? are you getting an error when loading your model ?
I would like to force the usage of those requirements when running any script
How would you force it? Will you just ignore the "Installed Packages" section ?
and this?
avg(100*increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m]))
Yes, i basically plan to use ClearML as user-friendly cluster manager
and it is 🙂
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never used, and then have the DDP master process push into the high-priority queue, which will ensure these are the next Tasks to be executed (now the only thing that is missing is preemption to running Tasks, but this automation policy is unfortunate...
Can't say I have noticed that, is this a delay on the send ? Which for some reason is correlated with the epochs ? What was the case with 0.17.5?
setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question to GS storage, maybe we should open an issue there, their backend does the rate limit
My main concern now is that this may happen within a pipeline leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
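To illustrate the trade-off being discussed, here is a standalone sketch with a thread pool and a fake upload. It is not ClearML or GS code. `upload_all` and `fake_upload` are hypothetical names, and the sleep stands in for the network transfer.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def upload_all(files, max_workers):
    """Run a stand-in upload per file and report the peak concurrency."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_upload(name):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the network transfer
        with lock:
            active -= 1
        return name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        uploaded = list(pool.map(fake_upload, files))
    return uploaded, peak


uploaded, peak = upload_all(["a", "b", "c", "d"], max_workers=1)
print(uploaded, peak)
```

With `max_workers=1` only one request is ever in flight, which is why it avoids a backend rate limit, at the price of uploading the files one after another.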
If
...
Hi RipeGoose2
What exactly is being uploaded ? Are those the actual model weights or intermediate files ?
You can try just pulling the "metric" section of the Task, but I cannot imagine the network bandwidth is the issue?
Could it be load on the clearml-server (i.e. it needs to handle lots of requests ?)
CharmingStarfish14 can you check something from code, just to see if this would solve the issue?
JitteryCoyote63 while it's running, could you give me a few details on the setup, maybe I can reproduce it.
Is it using pytorch distributed ?
Are all models uploaded to S3 ?
etc.
Then in theory (since the backend is python based) you just need to find a base docker image to build it on.
So if you are using the latest clearml (i.e. +1.3) reenqueuing the pipeline will automatically continue it from where it stopped.
With previous versions (which is your case, I think), you clone the pipeline Task, change the parameter and enqueue it.
(The state itself of the pipeline is stored on the Task, and when you clone it, you are cloning the state as well).
Make sense ?
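The clone / change parameter / enqueue flow could look roughly like this. It is a sketch, assuming the `Task.clone`, `update_parameters` and `Task.enqueue` calls of the clearml SDK. `rerun_pipeline` and its `task_cls` argument are hypothetical, added only so the flow can be exercised without a server.

```python
def rerun_pipeline(pipeline_task_id, overrides, queue_name, task_cls=None):
    """Clone a pipeline Task, update its parameters, and enqueue the clone."""
    if task_cls is None:
        # Imported lazily; pass task_cls to try the flow without clearml.
        from clearml import Task as task_cls

    # The clone carries the stored pipeline state along with it.
    cloned = task_cls.clone(source_task=pipeline_task_id)
    cloned.update_parameters(overrides)
    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned
```

A call would be something like `rerun_pipeline("<pipeline task id>", {"Args/my_param": 1}, "services")`, where the ID, the parameter name and the queue name are all placeholders.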
it overwrites the previous run?
It will overwrite the previous run if it is under 72h from the last execution and no artifact/model was created.
You can control it with "reuse_last_task_id=False" passed to Task.init
Task name itself is Not unique in the system, think of it as short description
Make sense ?
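The rule above can be written down as a small predicate. This only models the rule as stated in this message, it is not the library's actual logic, and `will_overwrite_previous` is a made-up name.

```python
from datetime import datetime, timedelta

REUSE_WINDOW = timedelta(hours=72)


def will_overwrite_previous(last_run, now, has_artifacts_or_models,
                            reuse_last_task_id=True):
    """Return True when a new run would reuse (overwrite) the previous Task."""
    if not reuse_last_task_id:
        return False  # the caller asked for a fresh Task every time
    recent = (now - last_run) < REUSE_WINDOW
    return recent and not has_artifacts_or_models


now = datetime(2024, 1, 4)
print(will_overwrite_previous(datetime(2024, 1, 2), now, False))
```

The last argument mirrors the `reuse_last_task_id=False` option of `Task.init` mentioned above.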
Hmm can you try with additional configuration, next to "secure: true" in your clearml.conf, can you add "verify: false"
Hi UpsetBlackbird87
This is an Optuna decision on how many concurrent tests to run simultaneously.
You limited it to 100, but remember Optuna does a Bayesian optimization process, where it decides on the best set of arguments based on the performance of the previous set, this means it will first try X trials, then decide on the next batch.
That said, you can add a pruner to Optuna specifying how it should start
https://optuna.readthedocs.io/en/v1.4.0/reference/pruners.html#optuna.pruners.Median...
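For intuition, here is a simplified sketch of the median pruning idea. It is not Optuna's implementation. `should_prune` is a hypothetical helper, the metric is assumed to be minimized, and `n_startup_trials` is named after the Optuna parameter that delays pruning until enough trials have reported.

```python
from statistics import median


def should_prune(current_value, previous_values_at_step, n_startup_trials=5):
    """Median rule for a metric that is minimized.

    previous_values_at_step holds the intermediate values that earlier
    trials reported at the same step.
    """
    if len(previous_values_at_step) < n_startup_trials:
        return False  # not enough history yet, let the trial run
    return current_value > median(previous_values_at_step)


history = [0.5, 0.6, 0.7]
print(should_prune(0.9, history, n_startup_trials=3))
print(should_prune(0.4, history, n_startup_trials=3))
```

The startup count is what controls "how it should start": until that many earlier results exist, nothing gets pruned.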
BTW: the same holds for tagging multiple experiments at once