Hm GiganticTurtle0 let me check quickly it
Ohh try to add --full-monitoring to the clearml-agent execute
None
Please go ahead with the PR π
okay this points to an issue with the k8s glue, I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue ?
Ohh if this is the case, and this is a stream of constant inference Results, then yes, you should push it to some stream supported DB.
Simple SQL tables would work, but for actual scale I would push into a Kafka stream then pull it (serially) somewhere else and push into a DB
CharmingBeetle38 try adding "General/" before the arguments. This means batch_size becomes General/batch_size. This is only because we are accessing the parameters externally, when the task is executed it is resolved automatically
If you choose between skipping or logging like nan, then here I find it difficult, it seems that it is better to log than skip, but you need to think.
So I "think" the issue is plotly (UI), doesn't like NaN and also elastic (storing the scalar) is not a NaN fan. We need to check if they both agree on the representation, that it should be easy to fix...
Maybe you could open a github issue, so we do not forget?
It seems that I solved the problem by moving all of the local codeΒ (local repos) imports to after the Task.init
PunyPigeon71 I'm confused, how did that solve the issue on the remote machine?
DeterminedToad86 were you running a jupyter notebook or a jupyter console ?
Actually scikit implies joblib π (so you should use scikit, anyhow I'll make sure we add joblib as it is more explicit)
What we would like ideally, is a system where development, training, and deployment are almost one and the same thing, to reduce the lead time from development code to production models.
This is very aligned with the goals of ClearML π
I would to understand more on what is currently missing in ClearML so we can better support this approach
my inexperience in using them a lot until recently. I can see how that is a better solution
I think I failed in explaining my self, I me...
Hi @<1547390415320125440:profile|SilkySparrow85>
because it is trying to send a debug-sample to fileserver!
Yes, you should always configure the "files server" to point to your minio S3, basically:
None
files_server: "
"
But do not forget to also configure the credentials here:
[None](https://github.com/allegroai/clearml/blob/40c6db9d95016382c721546d42...
Maybe I can plot it using other lib.
I remember a while back there was integration with network visualization but it was hard to support and failed to many times...
If you have library that converts the network into html or image you can report it as debug sample?
callbacks.append( tensorflow.keras.callbacks.TensorBoard( log_dir=str(log_dir), update_freq=tensorboard_config.get("update_freq", "epoch"), ) )Might be! what's the actual value you are passing there?
Ohh, I see now, yes that should be fixed as well π
What's the "working dir" ? (where in the repo the script is executed from)
SoreDragonfly16 . In the hyper parameters Tab, you have "parallel coordinates" (next to the "add experiment" the button saying "values" press on it and there should be " parallel coordinates")
Is that it?
The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...
Does it work if I launch the clearml-agent on a docker and pip doesn't know the packages to install
Not sure I follow... the "detect_with_pip_freeze" flag (when set) will tell clearml (at runtime) to create the "installed packages" directly from pip freeze (instead of analyzing the code)
Ohh then we can definitely support it, could you maybe post a toy example for testing? Or even better PR it to the examples/tensorboardX folder?
There are also "completed, aborted, queued" .
Archived is actually a tag (system tag, not user tag). There is a "state machines" of moving from one state to the other. The special case is "published" that we probably should have called "locked". The idea is that if a Task/Model is published, you cannot reset it (and even deleting requires force flag).
I would use additional user tags (or even system-tags) to mark "deployed" state, wdyt?
PompousParrot44 I see what you mean, yes multiple context switching might cause a bit of decline in performance. not sure how much though ... The alternative of course is to set cpu affinity... Anyhow if you do get there we can try to come up with something that makes sense, but at the end there is no magic there π
SubstantialElk6 when you say "Triton does not support deployment strategies" what exactly do you mean?
BTW: updated documentation already up here:
https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving
Yes π
BTW: do you guys do remote machine development (i.e. Jupyter / vscode-server) ?
You might be able to write a script to override the links ... wdyt?