I guess it's on me to check whether this slowdown is negligible or not
Usually the performance impact is negligible, especially with a GPU
But if you really want the best:
Add --security-opt seccomp=unconfined
to the extra_docker_arguments
See details:
https://betterprogramming.pub/faster-python-in-docker-d1a71a9b9917
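For reference, a minimal sketch of how that could look in the agent section of clearml.conf (assuming the default section layout; merge with any arguments you already pass):

agent {
    # extra arguments passed to "docker run" by the agent
    extra_docker_arguments: ["--security-opt", "seccomp=unconfined"]
}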
Sorry if it's something trivial. I recently started working with ClearML.
No worries, this actually has more to do with how you work with Dask
The Task ID is the unique ID of any Task in the system (task.id will return the UID str)
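For example, a minimal sketch (project/task names are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="toy task")
print(task.id)  # the unique Task ID string
# later, the same ID can be used to fetch the Task, e.g. Task.get_task(task_id=task.id)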
Can you post a toy Dask snippet here? I'll explain how to make it compatible with clearml 🙂
Hi SubstantialElk6
We try to push a fix the same day a HIGH CVE is reported. That said, since the external API interface is relatively far away from the DBs / OS, and since as a rule of thumb authorized users are trusted (basically, inheriting agent code execution means they have to be), it is an exception to have a CVE that affects the system. I think even this high profile one does not actually have an effect on the system, as even if ELK were susceptible (which it is not), only authorized users co...
a bit sad that there is no working integration with one of the leading time series frameworks...
You mean a series Darts reports? If it does report it, where does it do so? Are you suggesting we add a Darts integration (which sounds like a good idea)?
I think it should be treated as failed,
I'm not sure where I stand on the default behavior, but it could easily be an argument for the pipeline controller
That was the idea behind the feature (and BTW any feedback on usability and debugging will be appreciated here, pipelines are notoriously hard to debug 🙂 )
the ability to execute without an agent. I was just talking about this functionality the other day in the community channel
What would be the use case ? (actually the infrastructure now supports it)
So this is very odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.*
because by default it ignores the 4th+ version parts (they are unstable and torch has a tendency to remove them), and for some reason pip will not match 2.1.0.*
with, for example, "2.1.0.dev20230306+cu118"
but based on the docs it should work:
see here: None
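If you want to poke at the matching behaviour locally, a quick sketch with the packaging library (which pip vendors internally) could look like this; the torch nightly string is just the example from above, and the pre-release comments are an assumption about where the mismatch may come from:

from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet("==2.1.0.*")
v = Version("2.1.0.dev20230306+cu118")
print(spec.contains(v, prereleases=True))  # the prefix match itself succeeds once pre/dev releases are allowed
print(v in spec)                           # may be False by default: dev releases are treated as pre-releases and skipped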
As a workaround you can always edit it and change it to the final URL, for example: so ...
for a GPU with more than 16GB of GRAM and less than 40GB, so sometimes we need to provision an A100 to get the training speed we want, but we don't use all the GRAM
Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html
BattyLion34 I have a theory, I think that any Task on the "default" queue will fail if a Task is running on the "service" queue.
Could you create a toy Task that just prints ".", sleeps for 5 seconds, and then prints again?
Then, while that Task is running, launch the Task that passed on the "default" queue from the UI. If my theory holds it should fail, and then we will be getting somewhere 🙂
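A minimal sketch of such a toy Task (project/task names are placeholders):

import time
from clearml import Task

task = Task.init(project_name="debug", task_name="toy sleeper")
for _ in range(60):   # keep the Task alive for ~5 minutes
    print(".")
    time.sleep(5)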
Hi VirtuousFish83 ,
Is it throwing an exception? Are you seeing the plot in the UI but the title is incorrect?
Hi SteadyFox10, the way it works is that Trains limits the debug image history by reusing the same file names, so the UI will only present the iterations for which the debug images are relevant. With your sample code it looks like it exposes a bug: the generated link should contain the iteration number, but it does not, and so it overwrites the debug images every iteration. Here is the image link: https://demofiles.trains.allegro.ai/Test/test_images.6ed32a2b5a094f2da47e6967bba1ebd0/metrics/Test/te...
SmarmySeaurchin8
args = parse.parse()
task = Task.init(project_name=args.project or None, task_name=args.task or None)
You should probably look at the docstring 🙂
:param str project_name: The name of the project in which the experiment will be created. If the project does not exist, it is created. If project_name is None, the repository name is used. (Optional)
:param str task_name: The name of Task (experiment). If task_name is None, the Python experiment
...
but I think they did it for a reason, no?
Not a very good one, they just installed everything under the user and used --user for the pip.
It really does not matter inside a docker; the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with the "root" user, but with user id 1000.
At the top there should be the URL of the notebook (I think)
Copy the trains.conf from any machine, it just needs the definition of the trains-server address.
Specifically, if you run in offline mode there is no need for the trains.conf, and you can just copy the one on GitHub
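For reference, a minimal sketch of such a trains.conf (the server addresses are placeholders for your own trains-server; credentials can be added the same way the GitHub template shows):

api {
    # trains-server addresses
    web_server: "http://my-trains-server:8080"
    api_server: "http://my-trains-server:8008"
    files_server: "http://my-trains-server:8081"
}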
- In a notebook, create a method and decorate it with fastai.script's @call_parse.
Any chance you have a very simple code/notebook to reference (this will really help in fixing the issue)?
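A minimal sketch of that decorator pattern (assuming fastai/fastcore's script helpers; the function body is just a placeholder):

from fastcore.script import call_parse, Param  # older fastai versions: from fastai.script import ...

@call_parse
def train(epochs: Param("number of epochs", int) = 3):
    # placeholder body, replace with the actual notebook logic
    print(f"training for {epochs} epochs")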
BTW: there is a full Pipeline class that does everything for you, example here:
https://github.com/allegroai/clearml/tree/master/examples/pipeline
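A rough sketch of what that looks like (project, task and queue names here are placeholders):

from clearml.automation import PipelineController

pipe = PipelineController(name="toy pipeline", project="examples", version="1.0.0")
pipe.add_step(name="stage_one", base_task_project="examples", base_task_name="toy task")
pipe.start(queue="services")  # or pipe.start_locally() to run the controller without an agent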
@<1577468638728818688:profile|DelightfulArcticwolf22>
How can I tell clearml-agent not to run pip install unless my requirements.txt file was changed?
The agent has a built-in cache; it will reuse the previous venv if nothing changed (the cache is local on the agent's machine).
Make sure this line is not commented:
None
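I can't tell from here which exact line the link points to, but assuming it refers to the venv cache section of the agent's clearml.conf, it would look roughly like this, with the path line uncommented:

agent {
    venvs_cache: {
        max_entries: 10
        free_space_threshold_gb: 2.0
        # uncommenting the path enables venv caching
        path: ~/.clearml/venvs-cache
    }
}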
Martin, I told you I can't access the resources in the cluster, unfortunately 🙂
So it seems there is some misconfiguration of the k8s glue: we can see it can "talk" to the clearml-server, but it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod and better understand the issue.
wdyt?
with ?
multipart: false
secure: false
If so, can you post your aws.s3 section of the clearml.conf here? (of course replacing the actual sensitive information with *s)
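For reference, a sketch of that section (it lives under the sdk section of clearml.conf; the endpoint and credentials below are placeholders):

aws {
    s3 {
        # default credentials
        key: "****"
        secret: "****"
        credentials: [
            {
                # per-endpoint settings, e.g. a non-AWS S3-compatible storage
                host: "my-minio-host:9000"
                key: "****"
                secret: "****"
                multipart: false
                secure: false
            }
        ]
    }
}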
It's always the details... Is the new Task running inside a new subprocess ?
basically there is a difference between:
- a remote task spawning new tasks (as subprocesses, or as jobs on a remote machine), with the remote task still running
- a remote task being replaced by a spawned task (same process?!)
UnevenDolphin73 am I missing a 3rd option? Which of these is your case?
p.s. I have a suspicion that there might be a misuse of "Task" here?! What are you considering a Task? (from clearml perspective a Task...
where people can do @'s for experiments/projects/tasks and even comparisons ...
Ohhh I like that! For me this points directly to Slack integration.
I think my main question is, "is the discussion ephemeral?" In other words, is this an ongoing discussion that later no one will care about, or are we creating some "knowledge base" that we want to share later?
Also, by "address bar at the top", I assume you mean the address URL, right?
yes... apologies for the phrasing, it was w...
Hmm, I cannot think of something that would provide credentials on a per-user basis.
Wouldn't a global set of credentials that the agent is using be enough ?
(on the local machine, user can keep using the "definitions.py")
DefiantHippopotamus88 you are sending the curl to the wrong port; it should be 9090 on your setup (based on what I remember from the unified docker-compose)
BitingKangaroo95 nice work 🙂
I think that what did it was changing the sshd_config so that it allows port forwarding, agent forwarding and X11 forwarding (see the sketch below).
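For reference, the relevant sshd_config directives would look roughly like this (a sketch; your file may already contain some of them):

# /etc/ssh/sshd_config
AllowTcpForwarding yes
AllowAgentForwarding yes
X11Forwarding yes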
But just in case, it might be that there was a pre-existing SSH identifier on your machine, and hence the error.
Clearing known_hosts under ~/.ssh is also something I would try 🙂
JitteryCoyote63 next week is the next Trains release with the upgrade to ES 7, do you want to wait or sort out a solution for this one?
(BTW: I think that you can mount a license file or delete one, and it should be okay; I'll ask the backend guys regardless)
Hi RoughTiger69
unfortunately, the model was serialized with a different module structure - it was originally placed in a (root) module called model ....
Is this like a pickle issue?
Unfortunately, this doesn't work inside clear.ml since there is some mechanism that overrides the import mechanism using import_bind.__patched_import3
What error are you getting? (meaning why isn't it working)
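If it does turn out to be a plain pickle "module moved" problem, a common workaround sketch is aliasing the old root module name before loading (the module and file names below are hypothetical, not taken from the original setup):

import pickle
import sys

import my_package.model as new_model_module  # hypothetical new location of the model code

# make pickle resolve the old root module name "model" to the new module
sys.modules["model"] = new_model_module

with open("model.pkl", "rb") as f:  # hypothetical serialized file
    model = pickle.load(f)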
so I assume clearml moves them from one queue to the other?
Correct. When it creates the k8s job and launches it on the cluster it moves it into the queue.
Can you see it on your k8s cluster (meaning the job/pod)?
clearml_agent: ERROR: Can not run task without repository or literal script in script.diff
This is odd ...
OutrageousSheep60 when you launch clearml-session it tells you the session ID (which is also a Task ID). Can you look for it in the UI and check whether there is something in the repo/uncommitted-changes section?
Hi RipeGoose2
Yes, the slider feature is definitely on the to-do list (a lot of users have asked for it).
Unfortunately, other than actually PR-ing to the UI repo, there is no easy way to add customization (if you have an idea on how we could provide an easy interface, that would be great).
I'll check what the status is with the slider, maybe we will be lucky enough to see it in the next update 🙂