Hi @<1631102016807768064:profile|ZanySealion18>
ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.
What do you mean by "does not pick up"? Is it that the container is up but not executed with --gpus, so there is no GPU access?
I don't know whether you have access to the backend,
Creepy, no I do not 🙂
I can't make anything appear in the console part of the ui
clearml_task.logger.report_text("some text")
should work
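For reference, a minimal sketch of the full flow (the project and task names below are placeholders):
```
from clearml import Task

# create (or attach to) a task; project/task names here are placeholders
task = Task.init(project_name="examples", task_name="console-logging-demo")

# anything reported this way should show up in the Console tab of the UI
task.get_logger().report_text("some text")

# plain stdout is also captured and shown in the Console section
print("stdout is captured as well")
```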
connect_configuration
seems to take about the same amount of time unfortunately!
I think it is a better solution; that said, from your description it sounds like the issue is the upload bandwidth (i.e. JSON-ing the dict itself), could that be it?
(and even 1000 entries seems like something that would end up as a ~1MB upload, which is not that much)
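For reference, a rough sketch of connect_configuration with a dictionary (the project/task names and config values are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="config-demo")

# an illustrative configuration dict; when executed remotely, the values
# edited in the UI override what is defined here
config = {"learning_rate": 0.001, "batch_size": 64, "layers": [128, 64, 32]}
config = task.connect_configuration(config, name="training_config")
```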
Nice 🙂
@<1523710674990010368:profile|GreasyPenguin14> for future reference, the agent section in clearml.conf is only created when you call clearml-agent init (it is not needed for the Python SDK). The full default configuration is here:
None
You can however change the prefix, and you can always have access to these links.
Any reason for controlling the exact output destination ?
(BTW: You can manually upload via StorageManager, and then register the uploaded link)
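As a rough sketch of that manual route (the local path and the S3 destination are placeholders, and reporting the link via report_media's url argument is just one way to register it, assuming the installed clearml version supports it):
```
from clearml import Task, StorageManager

task = Task.init(project_name="examples", task_name="manual-upload-demo")

# upload a local file to a destination you fully control
# (the local path and the s3 destination are placeholders)
remote_url = StorageManager.upload_file(
    local_file="predictions/output.png",
    remote_url="s3://my-bucket/my-prefix/output.png",
)

# one way to register the already-uploaded link on the task:
# report it as media pointing at the remote URL
task.get_logger().report_media(
    title="predictions",
    series="model_a",
    iteration=0,
    url=remote_url,
)
```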
Correct (copied == uploaded)
It is for storing the predictions a trained model makes, so two different models do create slightly different images
That actually makes sense.
So how would you create exactly the same file (i.e. why do you need to manually control the upload folder, wouldn't creating a new unique folder suffice ?)
BTW: GreasyPenguin14 you can also upload them as debug samples (when setting the output_uri, the debug samples will be uploaded to the same destination)
https://github.com/allegroai/clearml/blob/6b9297660e0ed83a77bce3da2fab384c552206fd/examples/reporting/image_reporting.py#L21
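A minimal sketch of reporting a prediction image as a debug sample (the output_uri destination and file names are placeholders):
```
from clearml import Task

# output_uri controls where artifacts and debug samples get uploaded
# (the bucket below is just an example destination)
task = Task.init(
    project_name="examples",
    task_name="debug-samples-demo",
    output_uri="s3://my-bucket/clearml-outputs",
)

# report a locally generated prediction image as a debug sample;
# it will appear under Debug Samples in the UI
task.get_logger().report_image(
    title="predictions",
    series="model_a",
    iteration=0,
    local_path="predictions/output.png",
)
```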
Hi GreasyPenguin14
Sure you can, although a bit convoluted (I'll make sure we have a nice interface 🙂):
```
import hashlib

title = hashlib.md5('epoch_accuracy_title'.encode('utf-8')).hexdigest()
series = hashlib.md5('epoch_accuracy_series'.encode('utf-8')).hexdigest()
task_filter = {
    'page_size': 2,
    'page': 0,
    'order_by': ['last_metrics.{}.{}'.format(title, series)]
}
queried_tasks = Task.get_tasks(project_name='examples', task_filter=task_filter)
```
Will they get ordered ascending or descending?
Good point, I'll check the docs... but I think they do not specify
https://clear.ml/docs/latest/docs/references/sdk/task#taskget_tasks
From the code it seems the order is not guaranteed.
You can however pass '-last_update' to order_by, which will give you the latest updated first:
```
task_filter = {
    'page_size': 2,
    'page': 0,
    'order_by': ['last_metrics.{}.{}'.format(title, series), '-last_update']
}
Task.get_tasks(...
```
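For completeness, one possible way to inspect the returned tasks afterwards (assuming get_last_scalar_metrics is available in your clearml version):
```
from clearml import Task

# task_filter as built above (here a minimal version that just orders by last update)
task_filter = {'page_size': 2, 'page': 0, 'order_by': ['-last_update']}
queried_tasks = Task.get_tasks(project_name='examples', task_filter=task_filter)

for t in queried_tasks:
    # get_last_scalar_metrics() returns {title: {series: {'last': ..., 'min': ..., 'max': ...}}}
    print(t.id, t.get_last_scalar_metrics())
```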
Still feels super hacky though, I think it would be nice to have a simpler way, or at least some nice documentation.
YES you are absolutely correct, we should add it to the Task interface.
Any chance you add a GitHub issue so we do not forget ?
ETA for the next release is the end of the month / early March; it is planned to include many other improvements 🙂
Hi GreasyPenguin14
Yes, I think you are right the series name should be next to the title. Let me check it...
EcstaticGoat95 any chance you have an idea on how to reproduce? (even 1 out of 6 is a good start)
In the sidebar you get the titles of the graphs; when you click on them you can see the different series on the graphs themselves.
BTW: get_tasks has a project_name argument, I would just use it 🙂
That is odd, can you send the full Task log? (Maybe some oddity with conda/pip ?!)
I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.
The first line of the Task console log should have the exact docker command that was used, this could be a good start
Also check whether there is any chance another agent is listening to this queue; maybe the task actually runs somewhere without a GPU at all?
New version will contain much more advanced search (including all the task fields)
are there any more fields in this function with partial matching? for example project? tags?
Yes they can all be filtered (basically everything you see in the UI)
notice: tags are strings (you can provide a list of tags), project is the ID of the project
(Use Task.get_project_id, I think)
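As a rough sketch of such a query (the project name and tag values are placeholders; the tags argument assumes a reasonably recent clearml version):
```
from clearml import Task

# project_name and the tag values here are placeholders
tasks = Task.get_tasks(
    project_name="examples",
    tags=["production"],                          # tags are plain strings
    task_filter={"order_by": ["-last_update"]},   # newest first
)

for t in tasks:
    print(t.id, t.name)
```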
Oh I see, yes the "metrics" include both scalars / plots & console outputs,
I also think they are updated only once a day (or maybe twice a day?), so even if you delete them it will take time to update.
(Archiving is not deleting; you then need to go to the archived view and delete it from there.)
Hi JitteryCoyote63
cleanup_service task in the DevOps project: Does it assume that the agent in services mode is on the trains-server machine?
It assumes you have an agent connected to the "services" queue 🙂
That said, it also tries to delete the tasks artifacts/models etc, you can see it here:
https://github.com/allegroai/trains/blob/c234837ce2f0f815d3251cde7917ab733b79d223/examples/services/cleanup/cleanup_service.py#L89
The default configuration will assume you are running i...
PompousBeetle71 so in one project the experiment works as expected, while in the other it fails on credentials? Both running on the same trains-agent machine?
Hi StaleButterfly40
but if I sync more than once I get a duplication of each line in the log
Hmm.. let me check if we can "force" overwriting (it might require you to have a more stateful code for the sync process)
sometime we resume training
How would that work in offline mode? The offline process cannot sync with the backend... Are you saying you would like to get a new capability, "continue-offline-session" ?
Thanks GreasyPenguin66! Please keep us updated 🙂
PompousParrot44 now that I think about it, you might be able to limit the cpu affinity, would that help?
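For illustration only, a minimal sketch of limiting CPU affinity on Linux (the core IDs are arbitrary; this is plain Python/OS level, not a ClearML feature):
```
import os

# restrict the current process (and any children it spawns) to cores 0-3;
# os.sched_setaffinity is Linux-only
os.sched_setaffinity(0, {0, 1, 2, 3})

print("allowed cores:", os.sched_getaffinity(0))
```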
The use case I have is to allow people from my team to run their workloads on a set of servers without stepping over each other.
So does that mean CPU-only workloads?
Also, are we concerned about fairness? (i.e. someone "taking" all the CPU for themselves)
Hi PompousParrot44
Well, this kind of control is tricky. If you don't mind processes "fighting over the CPU" you can just spin up two trains-agents in cpu-mode. It will work as long as each has a different TRAINS_WORKER_NAME.
The other option (might be a bit of an overkill) is to use K8s, which will set the CPU % for the entire agent.
What do you think?
PompousParrot44 unfortunately not yet 🙂
But the gist is :
MongoDB stores experiment data (i.e. execution parameters, git ref etc.)
ElasticSearch stores results (i.e. metrics console logs, debug image links etc.)
Does that help?
Okay yes, that's exactly the reason!! Cross-origin blocks the file link.