Can you clone the git repository with the .ssh credentials on the host machine?
If so, can you do the same manually inside a docker container (i.e. spin up a container with -v /home/hostuser/.ssh:/root/.ssh mounted)?
JitteryCoyote63
So there will be no concurrent access to cached files in the cache dir?
No concurrent creation of the same entry 🙂 It is optimized...
SmallDeer34 the function Task.get_models() incorrectly returned the input model "name" instead of the object itself. I'll make sure we push a fix.
I found a different solution (hardcoding the parent tasks by hand),
I have to wonder, how does that solve the issue ?
but actually that path doesn't exist and it is giving me an error
So you are saying you only uploaded the "meta-data" i.e. a text file with links to the files, and this is why it is missing?
Is there a way to change the path inside the .txt file to clearml cache, because my images are stored in clearml cache only
I think a good solution would be to store the path in the txt file as a relative path, i.e. instead of /Users/adityachaudhry/data/folder... use ./data/folder
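Something along these lines should do the rewrite (just a sketch; the index file name and dataset root below are assumptions, adjust to your layout):
```python
from pathlib import Path

dataset_root = Path("/Users/adityachaudhry/data")   # assumed dataset root on the local machine
index_file = dataset_root / "train.txt"             # assumed index file listing absolute image paths

lines = index_file.read_text().splitlines()
relative = [
    "./" + str(Path(p).relative_to(dataset_root)) if p.strip() else p
    for p in lines
]
index_file.write_text("\n".join(relative))
```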
I'm assuming those errors are from the triton containers? Were you able to run the simple pytorch mnist serving example from the repo?
What's the general pattern for running a pipeline - train a model, evaluate metrics, and publish the model if satisfactory (based on a threshold, for example)?
Basically I would do:
parameters for pipeline:
TaskA = Training model Task (think of it as our template Task)
Metric = title/series/sign we want to choose based on, where sign is max/min
Project = Project to compare the performance so that we could decide to publish based on the best Metric.
Pipeline:
Clone TaskA, change TaskA argu...
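A rough sketch of that flow with the SDK (project, queue, and metric names are placeholders, and the publish call assumes the training task registers an output model):
```python
from clearml import Task

# "TaskA": the template training task
template = Task.get_task(project_name="examples", task_name="train_model")
cloned = Task.clone(source_task=template, name="train_model (pipeline run)")
cloned.set_parameters({"Args/epochs": 10})        # change TaskA arguments as needed
Task.enqueue(cloned, queue_name="default")
cloned.wait_for_status()                          # block until the run finishes

# Metric = title/series/sign, e.g. maximize validation/accuracy
score = cloned.get_last_scalar_metrics()["validation"]["accuracy"]["last"]

# Project = compare against previous runs in the same project
def last_accuracy(t):
    return t.get_last_scalar_metrics().get("validation", {}).get("accuracy", {}).get("last", float("-inf"))

previous = [t for t in Task.get_tasks(project_name="examples", task_name="train_model") if t.id != cloned.id]
if not previous or score > max(last_accuracy(t) for t in previous):
    cloned.models["output"][-1].publish()         # publish the newly trained model
```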
WickedGoat98 if this is the case, you can check this example. Same idea only "manual":
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
The difference is that running the agent in daemon mode means the "daemon" itself is a job in SLURM.
What I was saying is pulling jobs from the clearml queue and then pushing them as individual SLURM jobs, does that make sense ?
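Something like this is what I mean, just as an illustration (the status filter, sbatch options and polling interval are all assumptions, not a supported integration):
```python
import subprocess
import time

from clearml import Task

while True:
    # naive filter: grab any queued task (a real glue would pull from a specific queue)
    for task in Task.get_tasks(task_filter={"status": ["queued"]}):
        # each ClearML task becomes its own SLURM job that runs a one-off agent
        sbatch_script = "#!/bin/bash\n" f"clearml-agent execute --id {task.id}\n"
        subprocess.run(["sbatch", "--gres=gpu:1"], input=sbatch_script.encode(), check=True)
    time.sleep(30)
```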
For setting up trains-server I would recommend the docker-compose; it is very easy to set up, and you just need a single fixed compute instance. Details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md With regards to the "low prio clusters", are you asking how they could be connected with the trains-agent, or whether running code that uses trains will work on them?
Hi TrickySheep9
Long story short, clearml-session fully supports k8s (using k8s glue)
The --remote-gateway alongside ports mode will basically allow you to set up a k8s service so that every session registers with a specific port; k8s does the ingress for you and routes the SSH connection to the pod itself, and everything else is tunneled over the original SSH connection.
Make sense ?
WackyRabbit7 hmmm, seems like a non-regular character inside the diff.
Let me check something
The experiment finished completely this time again
With the RC version or the latest ?
Hi RoundMosquito25
The main problem here is that there is no way to know, before running the Task, how much memory it will need ... And without that parameter, maximizing GPU utilization is quite challenging. wdyt?
Is it being used to ssh to the instance?
It is used for the SSH client so it "knows" the SSH server (does that make sense) ?
but clearml-agent will still raise the same error
which one?
Hmm I just tested on the community version and it seems to work there. Let me check with the frontend guys. Can you verify it works for you on https://app.community.clear.ml/ ?
Hi ExcitedCat13
Sure, download the plugin from the git repo (install instructions are in the repo).
Regarding remote debugging, are you referring to ssh?
The plugin itself is designed to make sure that when you work on a remote machine with PyCharm, ClearML will log the local git repo and changes (as the .git folder is not synced to the remote machine)
WittyOwl57 could it be the EC2 instance is too small (i.e. not enough storage / memory) ?
I understand I can change the docker image for a component in the pipeline, but for the
it isn't possible.
you can always call Task.current_task().connect() from the pipeline function itself to connect more configuration; arguments you basically add via the function itself, as all the pipeline logic function arguments become pipeline arguments, it's kind of neat 🙂 Regarding docker, the idea is that you use a very basic python docker (the default for services) queue for all...
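For example, something like this (names and values are made up, just to show the shape of it):
```python
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(name="my_pipeline", project="examples", version="1.0")
def pipeline_logic(learning_rate: float = 0.01, epochs: int = 10):
    # every argument of the pipeline logic function becomes a pipeline argument
    extra_config = {"augmentation": True, "batch_size": 32}
    Task.current_task().connect(extra_config)   # attach additional configuration to the pipeline task
    ...
```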
StickyBlackbird93 the agent is supposed to solve for the correct version of pytorch based on the CUDA version in the container. Sounds like for some reason it fails? Can you provide the log of the Task that failed? Are you running the agent in docker-mode, or inside a docker?
That's the right place but
like you would use hydra --override, which in your case I think should be "accelerator.gpu",
You can also change `allow_omegaconf_edit` in the UI to True, and then you could just edit the OmegaConf in the UI (if you do not change `allow_omegaconf_edit` then the edit in the UI is ignored)
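For reference, a minimal shape of that Hydra setup (config path/name and project are placeholders): once Task.init() runs inside the hydra main, clearml picks up the composed OmegaConf and the Hydra section in the UI acts like command-line overrides.
```python
import hydra
from omegaconf import DictConfig, OmegaConf

from clearml import Task

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    Task.init(project_name="examples", task_name="hydra run")
    # e.g. accelerator.gpu can then be overridden from the UI (or the OmegaConf
    # edited directly if allow_omegaconf_edit is set to True)
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```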
SmarmySeaurchin8 regrading (2)
I'm not sure the current visualization supports it. I mean we can put "{}", but that would imply you can edit it, which we then have to support, possible but weird, and this is why:
`task.connect({'a': {}, 'b': {'nested': 'value'}})` will become
`'a' = '{}'`
`'b/nested' = 'value'`
But then if you edit to:
`'a' = '{'nested': 'value'}'`
`'b/nested' = 'value'`
you have two different ways of presenting the same type of structure...
Hi DepressedChimpanzee34
if you try to extend it more than the width of the column to the right, it doesn't do anything...
You mean outside of the window? or are you saying you cannot extend it?
Just verifying, we are talking about the latest version of clearml-server ?
Hmm I tested on chromium and it seemed to work, let me see if I can reproduce it...
Hi @<1558624430622511104:profile|PanickyBee11>
You mean this is not automatically logged? do you have a callback that logs it in HF?
Hi FierceFly22
Hi, does anyone know where trains stores tensorboard data
Tensorboard data is stored wherever you point your file-writer to 🙂
What trains is doing is: while tensorboard writes its own data to disk, it takes the data (in-flight) and sends it to the trains-server. The trains-server puts everything in the DB, so later everything is viewable & searchable.
Basically you don't need to store your TB files after your experiment is done, you have all the data in the trains-s...
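In code it looks like this (project/task names and the log_dir are just examples): TB keeps writing its event files locally, while the reported scalars are also captured and sent to the server.
```python
from torch.utils.tensorboard import SummaryWriter

from clearml import Task   # `from trains import Task` on the older package

task = Task.init(project_name="examples", task_name="tensorboard logging")
writer = SummaryWriter(log_dir="./tb_logs")              # TB event files stay on local disk

for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)  # picked up in-flight and sent to the server

writer.close()
```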
Wait, how do I reproduce it on community server? Maybe it has something to do with number of columns ? Or whether it is already wider than the screen? What's your browser / OS ?
Hi WittyOwl57
That's actually how it works (the original idea/design was borrowed from libcloud): basically you need to create a Driver, then the storage manager will use it.
Abstract class here:
https://github.com/allegroai/clearml/blob/6c96e6017403d4b3f991f7401e68c9aa71d55aa5/clearml/storage/helper.py#L51
Is this what you had in mind ?
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything.
Sure man 🙂 no rush, I appreciate the gesture regardless of the outcome
Many thanks!