
I'm really for adding an interface, but I was not able to locate a simple integration option with basically anything. WDYT?
Hi @<1610808279263350784:profile|FriendlyShrimp96>
Is there a way to get a list of variants given a metric, or even just a full list of metrics and variants for a given task id?
Try this
from clearml.backend_api.session.client import APIClient
c = APIClient()
metrics = c.events.get_task_metrics(tasks=["TASK_ID_HERE"], event_type="training_debug_image")
print(metrics)
I think API ...
Thanks JitteryCoyote63 , once we have a reproducible example the fix should be very quick to push (with these things reproducing it is the challenge)
Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific cuda version inside the docker (or on the host machine, if used in virtual-env mode). It seems to fail finding the specific version "torch==1.6.0.dev20200421+cu101"
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory you can add it to the trains.conf https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L...
SubstantialElk6
Regarding cloning the executed Task:
In the pip requirements syntax, "@" is a hint that tells pip where to find the package if it is not preinstalled.
Usually when you find "@ /tmp/folder", it means the package was preinstalled (usually preinstalled in the docker).
What is the exact scenario that caused it to appear? (This was always the case, before v1 as well.)
For example, the zipp package is installed from PyPI by default and not from a local temp file.
Your fix b...
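To illustrate the "@" syntax, here is a minimal sketch in plain Python (not how clearml-agent actually handles it). It splits a requirement line into name and location and flags the ones pointing at a local file. The helper names and the sample lines are made up for the example, and it assumes every line starts with a package name.

```python
def split_direct_reference(line):
    """Split 'name @ location' into (name, location); location is None for a plain requirement."""
    # Drop any environment marker (the part after ';'), then look for the '@' separator
    requirement = line.split(";", 1)[0].strip()
    if "@" not in requirement:
        return requirement, None
    name, location = requirement.split("@", 1)
    return name.strip(), location.strip()


def is_local_reference(line):
    """True when the requirement points at a local file, i.e. it was installed from disk."""
    _, location = split_direct_reference(line)
    return location is not None and location.startswith("file://")


# Hypothetical lines, similar to what `pip freeze` can print
lines = [
    "zipp @ file:///tmp/build/zipp",
    "requests==2.31.0",
]
for line in lines:
    print(split_direct_reference(line), is_local_reference(line))
```

Lines flagged this way are the ones that will not resolve on another machine, since the temp folder only existed where the package was originally installed.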
RoundMosquito25 how is that possible ? could it be they are connected to a different server ?
MotionlessCoral18 I think there is a fix in the latest clearml-agent RC 1.4.0rc0, can you test and update if you are still having this issue?
@<1556812486840160256:profile|SuccessfulRaven86> is the issue with flask reproducible? If so, could you open a github issue, so we do not forget to look into it?
Hi @<1547028116780617728:profile|TimelyRabbit96>
Notice that if running with docker compose you can pass an argument to the clearml triton container and use shared memory. You can do the same with the helm chart
Hi @<1547028074090991616:profile|ShaggySwan64>
If I have a local repo cloned with ssh, the agent will attempt to replace the repo url with https,
Yes, if you provide git user/pass (or user / app-pass) the agent will automatically replace an ssh:// repo link with the equivalent https:// and use the user/pass for authentication
but it seems that it doesn't remove the 2222 port in my case. That leads to
Hmm... what's the clearml-agent version? if this is not the latest 2.0.0r...
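For reference, the expected rewrite looks roughly like the sketch below. This is only an illustration using the standard library, not the agent's actual code; the function name, host, and credentials are hypothetical.

```python
from urllib.parse import quote, urlsplit


def ssh_to_https(repo_url, user, password):
    """Rewrite an ssh:// repository URL as https:// with credentials, dropping the ssh port."""
    parts = urlsplit(repo_url)
    if parts.scheme != "ssh":
        return repo_url  # leave anything that is not ssh:// untouched
    # parts.hostname excludes both the "git@" user and the ":2222" port
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"https://{credentials}@{parts.hostname}{parts.path}"


# Hypothetical repository and credentials
print(ssh_to_https("ssh://git@git.example.com:2222/team/repo.git", "me", "app-pass"))
```

The important part is that the ssh port (2222 here) must not be carried over, since the https endpoint normally listens on a different port.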
I want the model to be stored in a way that clearml-serving can recognise it as a model
Then OutputModel or task.update_output_model(...)
You have to serialize it, in a way that later your code will be able to load it.
With XGBoost, when you do model.save clearml automatically picks it up and uploads it for you,
assuming you created the Task with Task.init(..., output_uri=True)
You can also manually upload the model with task.update_output_model or equivalent with OutputModel class.
if you want to dis...
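A minimal sketch of the manual upload path, assuming clearml is installed and configured against a server. The function name, project name, task name, and model name are placeholders; only Task.init, update_output_model, and close are clearml calls.

```python
def register_model_file(model_path, project="examples", task_name="register model"):
    """Attach an already serialized model file to a new Task (placeholder names)."""
    # Imported here so the sketch only needs clearml when the function is called
    from clearml import Task

    # output_uri=True asks clearml to upload model files to the default files server
    task = Task.init(project_name=project, task_name=task_name, output_uri=True)
    # Manually register the serialized file as the task's output model
    task.update_output_model(model_path, name="my_model")
    task.close()
```

You would call it with the path you saved the model to, e.g. register_model_file("model.json"), after serializing the model in a format your serving code can load back.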
is no agent listening to the "k8s_scheduler"
There should not be one, this is purely "virtual" , so users understand the k8s cluster is spinning their pod (sometimes it takes time, imagine EKS etc. , just visibility)
unfortunately I can't get info from the cluster
You should be able to see the pod in the cluster, no?!
What does the Task Info panel say? Can you share a screenshot?
Hi DepressedChimpanzee34
if you try to extend it more than the width of the column to the right, it doesn't do anything..
You mean outside of the window? or are you saying you cannot extend it?
Just verifying, we are talking about the latest version of clearml-server ?
LOL, thanks!
Sure thing, any vanilla AMI will work, as long as it has python3 and docker preinstalled (obviously if you need GPU support then the drivers need to be preinstalled as well)
Ohh I see now, okay there are two entries on an artifact, the actual artifact (link to file somewhere) and the text preview of the artifact . I think the "preview" is the issue
But pytorch has no specific backend, it uses TB.
No?! Can you point me to an example? What I mostly find is how to calc metrics not standard way to then store them...
No, clearml uses boto, this is an internal boto error, which points to a bucket size limit, see the error itself
Great to hear SourSwallow36 , contributions are always appreciated 🙂
Regarding (3), MongoDB was not built for large scale logging; elastic-search, on the other hand, was built and designed to log millions of reports and give you the possibility to search over them. For this reason we use each DB for what it was designed for: MongoDB to store the experiment documents (a.k.a. env, meta-data etc.) and elastic-search to log the execution outputs.
Also, I would like to add some other plots t...
Something like the TYPE_STRING that Triton accepts.
I saw the github issue, this is so odd , look at the triton python package:
https://github.com/triton-inference-server/client/blob/4297c6f5131d540b032cb280f1e[…]1fe2a0744f8e1/src/python/library/tritonclient/utils/init.py
Hi RobustRat47
What do you mean by "log space for hyperparameter" , what would be the difference ? (Notice that on the graph itself you can switch to log scale when viewing in the UI) ?
Or are you referring to the hyper parameter optimization, allowing you to add log space ?
DeliciousSeal67
are we talking about the agent failing to install the package ?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this is the launching, not the actual execution), but this is just a way to limit it.
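To show what "limit the simultaneous launches" means, here is a generic sketch using concurrent.futures from the standard library (not the actual clearml code; the job count and the launch function are made up). The pool size caps how many launch calls are in flight at once, while all the jobs still get launched eventually.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_LAUNCHES = 16  # same number as in the question; purely illustrative

active = 0
peak = 0
lock = threading.Lock()


def launch(job_id):
    """Stand-in for launching a job; only the launch call occupies a pool thread."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend the launch request takes a moment
    with lock:
        active -= 1
    return job_id


# 100 jobs go through the pool, but never more than 16 launches run at the same time
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_LAUNCHES) as pool:
    launched = list(pool.map(launch, range(100)))

print(len(launched), "jobs launched, at most", peak, "launches in flight at once")
```

Once a launch call returns, its thread is free to launch the next job, regardless of whether the launched job itself is still running.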