Hi QuaintPelican38
Assuming you have opened the default SSH port 10022 on the EC2 instance (and assuming the AWS permissions are set so that you can access it), you need to use the --public-ip
flag when running clearml-session. Otherwise it "thinks" it is running on a local network and registers itself with the local IP. With the flag on it gets the public IP of the machine, so the clearml-session running on your machine can connect to it.
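Something along these lines (a rough sketch; the queue name is a placeholder and the exact flag syntax is per clearml-session --help):
# ask the remote machine to register its public IP instead of the local one
clearml-session --queue aws_ec2 --public-ip true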
Make sense ?
And you cannot see it in Trains UI?
Thanks JitteryCoyote63 !
Any chance you want to open github issue with the exact details or fix with a PR ?
(I just want to make sure we fix it as soon as we can 🙂 )
I think you are onto a good flow, quick iterations / discussions here, then if we need more support or an action-item then we can switch to GitHub. For example with feature requests we usually wait to see if different people find them useful, then we bump their priority internally, this is best done using GitHub Issues 🙂
Makes total sense!
Interesting, you are defining the sub-component inside the function, I like that, this makes the code closer to how this is executed!
from clearml import Task

task = Task.get_task('task_id_here')  # 'task_id_here' is the ID of the existing Task
task.mark_started(force=True)         # re-open the Task so it can be modified
task.upload_artifact(..., wait_on_upload=True)
task.mark_completed()
I can definitely feel you!
(I think the implementation is not trivial: metrics data size is collected and stored as a cumulative value on the account, so going over it per Task is actually quite taxing for the backend. Maybe it should be an async request, like "get me a list of the X largest Tasks"? How would the UI present it? Fyi, keeping some sort of book keeping per task is not trivial either, hence the main issue)
Yes the one you create manually is not really of the same "type" as the one you create online, this is why you do not see it there 😞
If you edit the requirements to have
https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
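For example, the edited "installed packages" section could look something like this (the other entries are illustrative, keep whatever else your code needs):
# direct CPU wheel instead of the generic torch entry
https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
clearml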
No, I was pointing out the lack of one
Sounds like a great idea, could you open a github issue (if not already opened) ? just so we do not forget
set the pytorch lightning trainer argument log_every_n_steps to 1 (default 50) to prevent the ClearML iteration logger from timing out
Hmm that should not have an effect on the training time, all logs are sent in the background, that said checkpoints might slow it a bit (i.e. i...
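For reference, the setting mentioned above is just the Trainer argument (a minimal sketch, assuming a recent PyTorch Lightning version):
from pytorch_lightning import Trainer

# report on every step instead of every 50, so ClearML keeps receiving iteration updates
trainer = Trainer(log_every_n_steps=1)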
eval built-in. wdyt?
eval is never recommended, as basically you could do Args/float='os.system("rm ...")' 🙂
In theory the type is stored on the hyper parameter (this is a relatively new feature the backend supports)
The casting, though, is done based on the original value type, which means Task.connect needs to be called with the original dict. Is there a specific reason for using get_parameters instead of task.connect ?
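i.e. connecting the original dict would look something like this (a minimal sketch, project/task names are placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='connect demo')
params = {'lr': 0.001, 'batch_size': 32}  # original python types
task.connect(params)  # types are stored with the hyper parameters, so values are cast back correctly on a cloned run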
Hi ElegantCoyote26, in theory there is no limit, but that depends on how you spun up the services queue agent:
https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_daemon
See "services mode":
To limit the number of simultaneous tasks run in services mode, pass the maximum number immediately after the --services-mode option (e.g. --services-mode 5)
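For example (a sketch; "services" is just the conventional queue name):
clearml-agent daemon --services-mode 5 --queue services --docker --detached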
so you have a repo with poetry that some users update and some do not?
All working on the same branch ?
Oh :)
task.get_parameters_as_dict()
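i.e. (a quick sketch, the task id is a placeholder):
from clearml import Task

task = Task.get_task(task_id='task_id_here')
params = task.get_parameters_as_dict()  # nested dict of parameter sections, e.g. {'Args': {...}, 'General': {...}}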
Do you mean the Task already exists, or do you want to create a Task from the code ?
FlatStarfish45
In the parent task, the libs appear installed.
What do you mean by "parent Task"? Is this the base task we are optimizing (i.e. the experiment / model we are optimizing) ?
Or is it the "Optimization Task" itself?
Hi DrabCockroach54
Do we know if gpu_0_mem_usage and gpu_0_mem_used_gb, both shows current GPU usage?
the first is the percentage used (memory % used at any specific moment) and the second is the memory used in GiB, both for the video memory
How to know from this how much GPU is reserved for the task if this task is in progress?
What do you mean by how much is reserved ? Are you running with an agent?
but when I run the same task again it does not map the keys..
SparklingElephant70 what do you mean by "map the keys" ?
It runs into the above error when I clone the task or reset it.
from here:
AssertionError: ERROR: --resume checkpoint does not exist
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change? In other words, why would it be looking for the file only when cloning? Could it be you put the state on the Task, then you clone it (i.e. clone the exact same dict), and now the newly cloned Task "thinks" it is resuming ?!
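To illustrate the theory (a purely hypothetical sketch, the parameter names are made up):
from clearml import Task

task = Task.init(project_name='examples', task_name='train')
args = {'resume': False, 'checkpoint': ''}   # defaults in the code
task.connect(args)
# if the original run ended up storing resume=True and a local checkpoint path on the Task,
# cloning copies those stored values, and when the agent runs the clone connect() overrides
# the defaults with them, so the code goes looking for a checkpoint that does not exist
# on the new machine, hence the assertion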
So just to be clear - the file server has nothing to do with the storage?
Think of it as a quick and dirty "minio", storing files and serving them over http. If you have minio (or any object storage) you can replace it altogether 🙂
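For example, in clearml.conf you could point everything at object storage instead (a sketch, the bucket/path is a placeholder):
api {
    files_server: "s3://my-bucket/clearml"
}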
I think I found something, let me test my theory
Hi @<1624941407783358464:profile|GrievingTiger47>
I think you should try to contact the sales guys here: None
Hi @<1554275779167129600:profile|ProudCrocodile47>
Do you mean @ clearml.io ?
If so, then this is the same domain (.ml is sometimes flagged as spam, I'm assuming this is why they use it)
btw:
# in another process
How do you spin the subprocess, is it with Popen ?
also what's the OS and python version you are using?
the issue moving forward is if we restart the pod we will have to manually update that again.
Can't you map the nginx configuration file ? (making the changes persistent across pods)
so if the node went down and then some other node came up, the data is lost
That might be the case. where is the k8s running ? cloud service ?
p.s. you should remove this line 🙂
extra_index_url: ["git@github.com:salimmj/xxxx"]