BTW: CloudyHamster42 I think this issue was discussed on GitHub, and the final "verdict" was that we should have an option to split/combine graphs on the UI side (i.e. similar to the "smoothing" or wall-time axis options, etc.)
Ohh, yes that makes sense, so just send them as a list of links in a single call: dataset.source_url(["s3://", "s3://"], ...)
This will be a single update
https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/clearml/datasets/dataset.py#L430
Hurray conda.
Notice it does include cudatoolkit, but conda ignores it:
cudatoolkit~=11.1.1
Can you test the same one, only search and replace ~= with == ?
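For a quick way to run that test, here is a minimal sketch in plain Python (not part of ClearML or conda) that rewrites compatible-release pins into exact pins. The helper name `pin_exact` is hypothetical, and it only handles the `~=` operator.

```python
import re

# Matches the compatible-release operator "~=" (with optional spaces around it),
# e.g. in "cudatoolkit~=11.1.1".
_COMPATIBLE = re.compile(r"\s*~=\s*")


def pin_exact(requirements):
    """Return a copy of the requirement lines with "~=" replaced by "=="."""
    return [_COMPATIBLE.sub("==", line) for line in requirements]


print(pin_exact(["cudatoolkit~=11.1.1", "numpy>=1.19"]))
```

Lines that use other operators are passed through unchanged, so the list can be fed back as-is.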
Hi StaleHippopotamus38
I imagine I could make the changes specified in the warning to /etc/security/limits.conf
Yep, seems like an elastic memory issue, but I think the helm chart takes care of it.
You can see a reference in the docker compose:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L41
I'm sorry wrong line reference:
I'm assuming the error is due to ulimit missing:
try adding 16777216 to both soft/hard ulimit
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L58
Hi EagerOtter28
The agent knows how to do the http->ssh conversion on the fly. In your clearml.conf (on the agent's machine) set force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/docs/clearml.conf#L25
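To make the http->ssh conversion concrete, here is a rough illustration of what such a rewrite looks like. This is only a sketch and not the agent's actual implementation; the function name `https_to_ssh` is hypothetical, and the `git@host:path` form is an assumption based on the common SCP-like git URL syntax.

```python
from urllib.parse import urlparse


def https_to_ssh(url):
    """Convert "https://host/owner/repo.git" to "git@host:owner/repo.git".

    URLs that are not http(s) are returned unchanged.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return url
    # parsed.hostname drops any "user:password@" prefix and any port.
    return "git@{}:{}".format(parsed.hostname, parsed.path.lstrip("/"))


print(https_to_ssh("https://github.com/allegroai/clearml-agent.git"))
```

With force_git_ssh_protocol enabled the agent does this for you, so the repository link stored on the Task can stay in its https form.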
Can you test with the hydra example? If the example works, any chance you can send a toy example to reproduce it?
https://github.com/allegroai/clearml/tree/master/examples/frameworks/hydra
Regarding the helm, how did you get the link? http://github.io ? And the subdomain allegroai?
Yes, though the main caveat is the data is not really immutable
ElegantCoyote26 point me to where Keras stores the data
If in the process of integration you had to add a logger/callback to your Keras code, that is the equivalent of using the TB.
Hi NastyFox63
What do you mean not all of them are shown?
Do they have different series/titles? Are they plots or scalars? How are you reporting them?
is no agent listening to the "k8s_scheduler"
There should not be one, this is purely "virtual" , so users understand the k8s cluster is spinning their pod (sometimes it takes time, imagine EKS etc. , just visibility)
unfortunately I can't get info from the cluster
You should be able to see the pod in the cluster, no?!
What does the Task Info panel say? Can you share a screenshot?
Question - why is this the expected behavior?
It is. I mean, the original python version is stored, but pip does not support replacing the python version. It is doable with conda, but then you have to use conda for everything...
They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
You should not have this entry in the conf file; the "worker_id" should be unique (and is based on the "worker_name" as a prefix). You can control it via env variables: CLEARML_WORKER_ID
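One way to randomize the id per machine is to build it before launching the agent, something like the sketch below. The helper `unique_worker_id` and the "prefix:random-suffix" format are assumptions for illustration; only the CLEARML_WORKER_ID variable name comes from the discussion above.

```python
import os
import socket
import uuid


def unique_worker_id(prefix=None):
    """Build a worker id such as "ubuntu:3f2a9c1b" (the format is illustrative)."""
    prefix = prefix or socket.gethostname()
    return "{}:{}".format(prefix, uuid.uuid4().hex[:8])


# Environment for the agent process: pass it via subprocess.Popen(..., env=env),
# or export the same value in the shell that launches the agent.
env = dict(os.environ, CLEARML_WORKER_ID=unique_worker_id("ubuntu"))
print(env["CLEARML_WORKER_ID"])
```

Remember to remove the worker_id entry from the conf file first, as noted above.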
Yes the clearml-server AMI - we want to be able to back it up and encrypt it on our account
I think the easiest and safest way for you is to actually have full control over the AMI, and recreate once from scratch.
Basically any ubuntu/centos + docker and docker-compose should do the trick, wdyt ?
Hi PompousParrot44
So do you mean something like:
` task_model_a = Task.get_task(task_id='id_a')
task_model_b = Task.get_task(task_id='id_b')
model_a_file = task_model_a.models['output'][-1].get_local_copy()
model_b_file = task_model_b.models['output'][-1].get_local_copy() `
your account has 2FA enabled and you must use a personal access token instead of a password.
I'm assuming you have created the personal access token and used it, not the pass
Hmm I see what you mean. It is on the roadmap (ETA the next version 0.17, 0.16 is due in a week or so) to add multiple models per Task so it is easier to see the connections in the UI. I'm assuming this will solve the problem?
BitterStarfish58 I would suspect the upload was corrupted (I think this is the discrepancy between the logged file size and the actual uploaded file size)
Hi DepressedChimpanzee34
I think the main issue here is slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16GB, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine?
Thanks!
In the conf file, I guess this will be where ppl will look for it.
It reflects what is stored by Keras, so if Keras stores the best model this is what you get. BTW if you pass output_uri=True it will automatically upload the models
Hi FranticCormorant35, the Reporter is an internal implementation the Logger uses. In general you should use the Logger.
How does deferred_init affect the process?
It defers all the networking and stuff to the background (usually the part that might slow down the Task initialization process)
Also, is there a way of specifying a blacklist instead of a whitelist of features?
BurlyPig26 you can whitelist per framework and file name, for example: task = Task.init(..., auto_connect_frameworks={'pytorch' : '*.pt', 'tensorflow': ['*.h5', '*.hdf5']} )
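To see which files such a whitelist would let through, here is a small sketch using glob-style matching from the standard library. It only illustrates the pattern semantics; whether ClearML matches with `fnmatch` internally is an assumption, and the helper `is_whitelisted` is hypothetical.

```python
from fnmatch import fnmatch

# Same shape as the auto_connect_frameworks value in the reply above.
whitelist = {"pytorch": "*.pt", "tensorflow": ["*.h5", "*.hdf5"]}


def is_whitelisted(framework, filename, rules=whitelist):
    """Return True if filename matches any pattern listed for framework."""
    patterns = rules.get(framework, [])
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatch(filename, pattern) for pattern in patterns)


print(is_whitelisted("pytorch", "model.pt"))
print(is_whitelisted("tensorflow", "model.ckpt"))
```

In this sketch a framework missing from the dict never matches, which is a simplification and may differ from how ClearML treats unlisted frameworks.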
What am I missing ?
Hi LittleReindeer37
This is truly a great discussion to have. Personally I think the main difference is that software development is a somewhat linear process, and git captures it very well. But ML is a much wider, nonlinear process, which to me means that trying to force the same workflow into a dev tree will end up failing. The way ClearML thinks about it (and I think the analogy to source control is correct) is probably closer to how you think about proj...
Hi UpsetTurkey67
You mean https://github.com/Lightning-AI/torchmetrics ?
Where are those stored?
BitingKangaroo95 nice work
I think that what did it was: change the sshd_config so that it allows port forwarding, agent forwarding and x11 forwarding
But just in case, it might be that there was a pre-existing SSH identifier on your machine, and hence the error.
Clearing known_hosts under ~/.ssh was also something I would try
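If you want to double check the sshd_config part, a rough sketch like the one below reports the three forwarding settings. AllowTcpForwarding, AllowAgentForwarding and X11Forwarding are standard sshd_config keywords; the helper `forwarding_status` is hypothetical, and it ignores Match blocks and the "keyword=value" form.

```python
# sshd_config keywords for the three kinds of forwarding mentioned above.
FORWARDING_OPTIONS = ("AllowTcpForwarding", "AllowAgentForwarding", "X11Forwarding")


def forwarding_status(config_text):
    """Map each forwarding keyword to True if its first setting is "yes".

    Keywords absent from the text are omitted, since the sshd defaults apply.
    """
    wanted = {name.lower(): name for name in FORWARDING_OPTIONS}
    status = {}
    for raw in config_text.splitlines():
        # Drop comments, then split into keyword and value.
        line = raw.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name = wanted.get(parts[0].lower())
        # sshd uses the first value it finds for a keyword.
        if name and name not in status:
            status[name] = parts[1].strip().lower() == "yes"
    return status


sample = "AllowTcpForwarding yes\n#X11Forwarding no\nX11Forwarding yes\n"
print(forwarding_status(sample))
```

On a real machine you would pass in the contents of the sshd_config file instead of the sample string.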