Maybe I can plot it using other lib.
I remember a while back there was an integration with network visualization, but it was hard to support and failed too many times...
If you have a library that converts the network into HTML or an image, you can report it as a debug sample?
callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
)
Might be! What's the actual value you are passing there?
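To make the question about the passed value concrete, here is a minimal sketch that resolves and checks update_freq before the callback is built. The helper name tensorboard_kwargs is hypothetical (not part of TensorFlow or ClearML), and the check assumes the Keras TensorBoard callback accepts "batch", "epoch", or a positive integer for update_freq.

```python
def tensorboard_kwargs(log_dir, tensorboard_config):
    """Build keyword arguments for a TensorBoard callback (hypothetical helper)."""
    # Same default as in the snippet above: fall back to "epoch".
    update_freq = tensorboard_config.get("update_freq", "epoch")

    is_named = update_freq in ("batch", "epoch")
    # bool is a subclass of int, so exclude it explicitly.
    is_step_count = (
        isinstance(update_freq, int)
        and not isinstance(update_freq, bool)
        and update_freq > 0
    )
    if not (is_named or is_step_count):
        raise ValueError(f"unexpected update_freq: {update_freq!r}")

    return {"log_dir": str(log_dir), "update_freq": update_freq}
```

The returned dict can then be unpacked into the callback, e.g. tensorflow.keras.callbacks.TensorBoard(**tensorboard_kwargs(log_dir, tensorboard_config)), so a bad config value fails early with a readable message.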
Ohh, I see now, yes that should be fixed as well 🙂
What's the "working dir" ? (where in the repo the script is executed from)
SoreDragonfly16, in the hyperparameters tab you have "parallel coordinates": next to "add experiment" there is a button saying "values"; press it and there should be a "parallel coordinates" option.
Is that it?
The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...
Does it work if I launch the clearml-agent in a docker and pip doesn't know the packages to install?
Not sure I follow... the "detect_with_pip_freeze" flag (when set) will tell clearml (at runtime) to create the "installed packages" directly from pip freeze (instead of analyzing the code)
Ohh then we can definitely support it, could you maybe post a toy example for testing? Or even better PR it to the examples/tensorboardX folder?
There are also "completed", "aborted" and "queued".
Archived is actually a tag (a system tag, not a user tag). There is a "state machine" for moving from one state to the other. The special case is "published", which we probably should have called "locked". The idea is that if a Task/Model is published, you cannot reset it (and even deleting requires the force flag).
I would use additional user tags (or even system-tags) to mark "deployed" state, wdyt?
PompousParrot44 I see what you mean, yes, multiple context switches might cause a bit of a decline in performance. Not sure how much though ... The alternative of course is to set CPU affinity... Anyhow, if you do get there we can try to come up with something that makes sense, but in the end there is no magic there 🙂
SubstantialElk6 when you say "Triton does not support deployment strategies" what exactly do you mean?
BTW: updated documentation already up here:
https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving
Yes 🙂
BTW: do you guys do remote machine development (i.e. Jupyter / vscode-server) ?
You might be able to write a script to override the links ... wdyt?
Can the ClearML file server be configured to use any kind of storage? For example HDFS or even a database, etc.
DeliciousBluewhale87 long story short, no 🙂 the file server will just store/retrieve/delete files from a local/mounted folder
Is there any way we can scale this file server when our data volume explodes? Maybe it wouldn't be an issue in the K8s environment anyway. Or can it also be configured such that all data is stored in HDFS (which helps with scalability)? I would su...
In theory it should have worked.
Can you send me the full Task log? (with cache and everything?)
I suspect since these are not the default folders, something is misconfigured / missing
(you can DM the log, so it won't end up on a public channel)
you could also use:
https://github.com/allegroai/clearml/blob/ce7e77a00e869a2690f31cbc578636ce88bc4613/docs/clearml.conf#L188
and set up the clearml.conf on the user's machine to automatically log the environment variables at run time (stored under the Configuration tab).
Then the agent will pull these same variables at execution time and set them.
MagnificentSeaurchin79
"requirements.txt" is ignored if the Task has an "installed packages" section (i.e. not completely empty).
Task.add_requirements('pandas') needs to be called before Task.init() (I'll make sure there is a warning if called after).
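A minimal sketch of that call order, in case it helps. The wrapper init_with_requirements is a hypothetical helper, not a ClearML function; it only assumes the class passed in exposes add_requirements(package_name, package_version) and init(project_name=..., task_name=...), as clearml.Task does.

```python
def init_with_requirements(task_cls, project_name, task_name, requirements):
    """Register extra packages first, then initialize the Task.

    task_cls     -- normally clearml.Task (passed in so the order is easy to test)
    requirements -- iterable of (package_name, version_or_None) pairs
    """
    # Requirements must be registered BEFORE Task.init() is called.
    for package_name, package_version in requirements:
        task_cls.add_requirements(package_name, package_version)

    # Only now create / connect the Task.
    return task_cls.init(project_name=project_name, task_name=task_name)


# Usage (needs the clearml package and a configured server; names are examples):
#   from clearml import Task
#   task = init_with_requirements(Task, "examples", "pandas demo", [("pandas", None)])
```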
Hi WonderfulJellyfish65
BTW, the training script connects to apiserver via the internal IP address
That is a big issue, because as you noticed the links to data generated by the code will have the internal IP ...
You basically need every component to use the same address (URL).
with tensorboard logging, it works fine when running from my machine, but not when running remotely in an agent.
This is odd, could you send the full Task log?
Hmm, let me check, there is a chance the level is dropped when manually reporting (it might be saved for internal critical reports). Regardless, I can't see any reason we could not allow controlling it.
In the UI, find the task (just search for the Task ID, it will find it), then right click it, and select "reset"
Could not find a version that satisfies the requirement pytorch~=1.7.1
Seems like pytorch 1.7.1 has no package for python 3.7 ?
docstring ?
Usually the preferred way is StorageManager
https://clear.ml/docs/latest/docs/references/sdk/storage
https://clear.ml/docs/latest/docs/integrations/storage
Why is it using an OutputModel and an InputModel?
So calling OutputModel will create the new Model entity and upload the data, InputModel will store it as required input Model.
Basically on the Task you have input & output section, when you clone the Task you are copying the input section into the newly created Task, and the assumption is that when you execute it, your code will create the output section.
Here when you clone the Task you will be cloning the reference to the InputModel (i...
Hi SmallDeer34
On the SaaS you can right click on an experiment and publish it 🙂
This will make the link available for everyone, would that help?
task.set_script(working_dir=dir, entry_point="my_script.py")
Why do you have this part? Isn't it the same code? The script entry point is auto detected.
... or when I run my_script.py locally (in order to create and enqueue the task)?
The latter, when the script is running locally
So something like
os.path.join(os.path.dirname(__file__), "requirements.txt")
is the right way?
Sure this will work 🙂
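For reference, a small sketch of the same idea with pathlib, resolving requirements.txt relative to the script rather than the current working directory. The function name requirements_path is hypothetical and only standard-library calls are used.

```python
from pathlib import Path


def requirements_path(script_file):
    """Return the requirements.txt that sits next to the given script file."""
    # resolve() makes the path absolute, so the result does not depend on
    # the directory the script happens to be launched from.
    return Path(script_file).resolve().parent / "requirements.txt"


# Usage inside my_script.py:
#   req_file = requirements_path(__file__)
```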
