GentleSwallow91 what you are looking for is here 🙂
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
you should have something like 192.168... or 10.0 ....
Sounds good, I assumed that was the case but I was not sure.
Let's make sure that in the clearml.conf
we write it in the comment above the use_credentials_chain
option, so that when users look for IAM role configuration they can quickly find it 🙂
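For reference, a sketch of where that would sit in clearml.conf (the nesting is assumed from the default config layout):
```
sdk {
    aws {
        s3 {
            # IAM roles: let boto3 resolve credentials via the AWS credentials chain
            # instead of explicit key/secret pairs
            use_credentials_chain: true
        }
    }
}
```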
SmugOx94 Yes, we just introduced it 🙂 with 0.16.3
Discussion was here (I'll make sure to update the issue that the version is out)
https://github.com/allegroai/trains/issues/222
In your trains.conf
add the following line:
```
sdk.development.store_code_diff_from_remote = true
```
It will store the diff from the remote HEAD instead of the local one.
Bad news, there isn't a nice interface to get the table from the Optimizer object (I will make sure we add it, no reason not to).
But you can very easily get all the information you need and more:
```python
all_the_tasks = an_optimizer.get_top_experiments(top_k=100)
```
Then for every task in the list you can get all the information:
```python
for task in all_the_tasks:
    task_params_as_dict = task.get_parameters()
    task_scalars = task.get_last_scalar_metrics()
```
Basically the Task object enables you to que...
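Until a built-in table interface exists, a minimal sketch that flattens it into rows (assuming an_optimizer is an existing HyperParameterOptimizer instance):
```python
# sketch: one row per top experiment, params + last scalars merged in
rows = []
for task in an_optimizer.get_top_experiments(top_k=100):
    row = {"task_id": task.id, "name": task.name}
    row.update(task.get_parameters())           # flat dict of hyper-parameters
    row.update(task.get_last_scalar_metrics())  # nested dict of last scalar values
    rows.append(row)

# rows can now be written to CSV or loaded into pandas for a quick table
```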
I found "scheduler" on allegroai github, is it something related to the case I want to make?
MoodyCentipede68 it is exactly what you are looking for 🙂
Do notice that you need to make sure you have your services queue configured and running for that to work 🙂
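i.e., an agent running in services mode and listening on that queue, something like (a sketch; flags from the clearml-agent CLI):
```
clearml-agent daemon --services-mode --queue services --docker
```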
But the missing implementation of LogUniformRange for hpbandster still causes problems.
wdym?
Hmm GreasyLeopard35 can you specify the range you are passing to the HPO, as well as the type of optimization class? (grid/random/optuna etc.)
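For reference, the kind of range definition in question would look something like this (a sketch; the parameter name is a placeholder, and note that in clearml.automation the min/max values are base-10 exponents):
```python
from clearml.automation import LogUniformParameterRange

# samples learning rates between 10**-5 and 10**-1 on a log scale
lr_range = LogUniformParameterRange("General/lr", min_value=-5, max_value=-1)
```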
Hi SourSwallow36
- The same docker image is used for all three jobs, simply because it is easier to manage and faster to download. The full code is available on the trains-server GitHub. If you want to spin up the containers manually, check the docker-compose.yml in the main repo; it has all the commands there
- Fork the trains-server, commit the changes and don't forget to PR them ;)
- Elasticsearch is a database; we use it to log all the experiment outputs, console logs, metrics, etc. This...
Hi RotundHedgehog76
I think it should work out of the box; I mean, in the end both spin up Jupyter notebooks, which is what clearml interacts with. Are you getting any errors?
The problem is that I currently don't have a way to get them "from outside".
Maybe as a hack (until we add the model object)
```python
from clearml.binding.frameworks import WeightsFileHandler

class MyModelCB:
    current_args = dict()

    @classmethod
    def callback(cls, load_save, model_info):
        # only rename on save operations
        if load_save != "save":
            return model_info
        model_info.name = "my new name " + str(cls.current_args)  # make a name from the args
        return model_info

WeightsFileHandler.add_pre_callback(MyModelCB.callback)
MyModelCB.current_args = {"args": "value"}
```
wdyt?
I don't know how I would be able to get the description and name?
Good point. How about doing that in code? Then you have all the information, and you can store it as JSON / pickle next to the data folder.
wdyt?
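e.g., a minimal sketch (the file name and fields are just placeholders):
```python
import json

# store the dataset's name/description alongside the data itself
meta = {"name": "my_dataset", "description": "free-text description of the data"}
with open("data_folder/metadata.json", "w") as f:
    json.dump(meta, f, indent=2)
```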
DepressedChimpanzee34 the backslash character will almost always be converted into \\ because otherwise it would not be possible to support \t or \n etc.
What I'm looking for here is some logic that will allow us not to break backwards compatibility on the one hand, but will still allow you to have an entry like "first\second".
WDYT? any ideas? (I really want to make sure we fix it as soon as possible)
BTW:
```
In [4]: str('\.')
Out[4]: '\\.'

In [5]: str(('\.', ))
Out[5]: "('\\\\.',)"
```
This is just python str casting
FYI:
```
ssh -R 8080:localhost:8080 -R 8008:localhost:8008 -R 8081:localhost:8081 replace_with_username@ubuntu_ip_here
```
solved the issue 🙂
instead of the one that I want or the one of the env which it is started from.
The default is the python that is used to run the agent. To override that, in clearml.conf:
```
agent.ignore_requested_python_version = true
agent.python_binary = /my/selected/python3.8
```
"warm" as you do not need to sync it with the dataset, every time you access the dataset, clearml
will make sure it is there in the cache, when you switch to a new dataset the new dataset will be cached. make sense?
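Concretely, something like (a sketch; project/name are placeholders):
```python
from clearml import Dataset

# fetches into the local cache (a no-op if already cached) and returns the path
local_path = Dataset.get(
    dataset_project="my_project", dataset_name="my_dataset"
).get_local_copy()
```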
Hi UpsetBlackbird87
I might be wrong, but it seems like ClearML does not monitor GPU pressure when deploying a task to a worker, but rather relies only on its configured queues.
This is kind of accurate. The way the agent works is that you allocate a resource for the agent (specifically a GPU), then set the queues (plural) it listens to (by default in priority order). Each agent then individually pulls jobs and runs them on its allocated GPU.
If I understand you correctly, you want multiple ...
I would ideally just want to have NVIDIA drivers and Docker on the on-prem nodes (along with the clearML agents). Would that allow me to get by with basic job scheduling/queues through clearML?
Yes, this is fully supported and very easy to set up.
Regarding limiting users' usage: this is doable. I think the easiest solution, both for the users and for management of the cluster, is introducing priority into the queues; basically any user can push jobs into the low-priority queue, and only some users can push into the high...
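For example, a sketch of that setup (queue names are placeholders):
```
# one agent per GPU, pulling from high_priority first, then low_priority
clearml-agent daemon --gpus 0 --queue high_priority low_priority --docker
```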
correct on both.
notice that with upload
you can specify any storage (S3/GS/Azure etc.)
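e.g., if this is a Dataset upload (an assumption on my part), something like this, with the bucket URI being a placeholder:
```python
# upload the dataset contents to your own storage target
dataset.upload(output_url="s3://my-bucket/datasets")
```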
```python
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
```
Okay a bit of theoretical "how it actually works" (and I might be mistaken here...)
Console logging is being reported because the underlying DDP infra (gloo) is piping stdout to the main process, where clearml will catch it (I think). The scalars not working on the subprocess & the flush wait getting stuck are, I think, related, as the wait actually waits for the flush process, and it seems it cannot actually "talk" to i...
Could you download and send the entire log ?
Yes clearml is much better 🙂
(joking aside, mlops & orchestration in clearml is miles better)
CheerfulGorilla72 What are you looking for?
Ohh, two options:
From the script itself you can do:
```python
from clearml import Task

task = Task.init(...)
task.execute_remotely(queue='default')
```
Then run the script locally; it will run until the execute_remotely call, quit the process, and re-launch it on the "default" queue.
Option B:
Use the clearml-task CLI:
```
$ clearml-task --folder <where the script is> --project ...
```
See https://github.com/allegroai/clearml/blob/master/docs/clearml-task.md#launching-a-job-from-a-local-script
StorageManager 🙂
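e.g., a minimal sketch (the remote URI is a placeholder):
```python
from clearml import StorageManager

# downloads the remote object into the local cache and returns the local path
local_path = StorageManager.get_local_copy(remote_url="s3://bucket/path/to/file")
```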