
hmm that is odd, let me check
This will mount the trains-agent machine's hosts file into the docker container
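For reference, a minimal sketch of requesting this from code, assuming the older single-string form of Task.set_base_docker (the project/task names and docker image here are just placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="hosts mount sketch")
# Ask the agent to run this task inside a docker container, passing an
# extra -v argument so the host machine's /etc/hosts is mounted into it
task.set_base_docker("ubuntu:20.04 -v /etc/hosts:/etc/hosts")
```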
Not sure I follow, you mean to launch it on the Kubernetes cluster from the ClearML UI?
(like the clearml-k8s-glue ?)
Oh, then no, you should probably do the opposite 🙂
What is the flow like now? (meaning, what are you using Kubeflow for and how?)
You mean the entire organization already has Kubeflow, or is it to better organize something? (If it is the latter, what are we organizing, pipelines?)
This is odd, what is the parameter?
I assume it needs sorting, and one time the value is an Integer and the next it is a String, so the server cannot sort based on it. Could that be it?
Is the error consistent, meaning does it happen with other integer values?
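If that's the cause, one workaround is to keep the type consistent on the code side. A minimal sketch, assuming the parameter is connected via task.connect (the project and parameter names here are hypothetical):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="consistent param types")
# Cast the value to a single type (str here) before connecting it, so every
# run stores the same type and the server can sort on that column
params = {"batch_size": str(32)}
task.connect(params)
```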
What's the clearml-server version you are running?
Is this reproducible with the hydra example?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
AttractiveCockroach17 I verified this is an issue with hyperparameters containing "." or section names containing ".", thank you for noticing!
I will make sure I pass it along, should be part of the next version (ETA a week) 🙂
the second seems like a botocore issue :
https://github.com/boto/botocore/issues/2187
Hi SmugSnake6
I think it was just fixed, let me check if the latest RC includes the fix
VexedCat68
So the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
Are you uploading the checkpoints manually as artifacts, or are they auto-logged & uploaded?
Also, why not reuse and overwrite the older checkpoints?
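For example, a minimal sketch of the overwrite approach, assuming PyTorch with framework auto-logging (model and names are placeholders):
```python
import torch
import torch.nn as nn
from clearml import Task

task = Task.init(project_name="examples", task_name="overwrite checkpoints")

model = nn.Linear(4, 2)
for epoch in range(3):
    # ... training step ...
    # Saving to the same fixed filename means the auto-logging registers a
    # single output model that is overwritten each epoch, instead of
    # accumulating a new checkpoint per epoch
    torch.save(model.state_dict(), "checkpoint.pt")
```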
I guess I just have to make sure that the total memory usage of all parallel processes is not higher than my GPU's memory.
Yep, unfortunately I'm not aware of any way to do that automatically 🙂
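If it helps, the manual check itself is easy enough; a minimal sketch using pynvml (this is not something ClearML does for you):
```python
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # first GPU
info = nvmlDeviceGetMemoryInfo(handle)
# Decide whether there is room for another process before launching it
print(f"free GPU memory: {info.free / 1024 ** 2:.0f} MiB")
```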
Wait, how did you end up with clearml_task_id = os.environ['CLEARML_TASK_ID'] printing "01b77a220869442d80af42efce82c617"?
This means you are running with an agent?!
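(For context, a minimal sketch of what I'd expect: the agent sets CLEARML_TASK_ID when it executes a task, so running locally the variable should simply be missing:)
```python
import os

# CLEARML_TASK_ID is set by the clearml-agent when it executes a task;
# when running locally the variable should not exist at all
task_id = os.environ.get("CLEARML_TASK_ID")
if task_id:
    print(f"running under an agent, task id: {task_id}")
else:
    print("running locally")
```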
ReassuredTiger98
Okay, but you should have seen the prints "uploading artifact" and "done uploading artifact".
So I suspect something is going on with the agent.
Did you manage to run any experiment on this agent ?
EDIT: Can you try with the artifacts example we have in the repo:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
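Or, even shorter, a minimal sanity check along the same lines (the project and artifact names are just placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact sanity check")
# Upload a simple object; the console should print "uploading artifact"
# followed by "done uploading artifact"
task.upload_artifact(name="numbers", artifact_object={"a": 1, "b": 2})
task.flush(wait_for_uploads=True)
```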
But a warning instead of an error would be good.
Yes, that makes sense, I'll make sure we do that
Does this sound like a reasonable workflow, or is there a better way maybe?
makes total sense to me, will be part of next RC 🙂
Hi @<1523708901155934208:profile|SubstantialBaldeagle49>
If you report on the same iteration with the same title/series you are essentially overwriting the data (as expected)
Regarding the plotly report size.
Two options:
- Round down the numbers (by default it will store all the digits; usually anything past the fourth is quite useless, and rounding will drastically decrease the plot size)
- Use logger.report_scatter2d, it is more efficient and has a mechanism to subsample extremely large graphs (see the sketch below).
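A minimal sketch combining both options (project/series names are placeholders):
```python
import numpy as np
from clearml import Task, Logger

task = Task.init(project_name="examples", task_name="scatter2d sketch")
# Round to 4 decimal places to cut the stored plot size, then report via
# report_scatter2d, which can subsample extremely large graphs
points = np.round(np.random.randn(10_000, 2), 4)
Logger.current_logger().report_scatter2d(
    title="example",
    series="rounded",
    iteration=0,
    scatter=points,
    xaxis="x",
    yaxis="y",
    mode="markers",
)
```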
Hi MelancholyChicken65
I'm not sure you can control it; the UI deduces the URL based on the address you are browsing to. So if you go to http://app.clearml.example.com you will get the correct ones, but you have to put them on the right subdomains:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#subdomain-configuration
Yeah, but I still need to update the links in the clearml server
yes... how many are we talking about here?
CrookedWalrus33 from the log it seems the code is trying to use "kwcoco" but it is not listed under "Installed packages", nor is there any attempt to install it. Can you confirm?
SmarmyDolphin68
Debug Samples tab and not the Plots,
Are you doing plt.imshow?
Also make sure you pass report_image=False when calling report_matplotlib_figure (if it is True, it will upload the figure as an image to "Debug Samples").
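Something like this minimal sketch (the project/plot names are placeholders):
```python
import matplotlib.pyplot as plt
from clearml import Task, Logger

task = Task.init(project_name="examples", task_name="matplotlib to plots")
fig = plt.figure()
plt.plot([1, 2, 3], [4, 5, 6])
# report_image=False keeps the figure under the Plots tab;
# report_image=True would upload it as an image to Debug Samples instead
Logger.current_logger().report_matplotlib_figure(
    title="example plot",
    series="series",
    iteration=0,
    figure=fig,
    report_image=False,
)
```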
JitteryCoyote63 I think I found the bug in clearml-task
it adds it at the end instead of before everything else