Let me know if I can be of help :)
I would like to start off by saying that I absolutely love clearml.
@<1547028031053238272:profile|MassiveGoldfish6> thank you for saying that! :)
Is it possible to download individual files from a dataset without downloading the entire dataset? If so, how do you do that?
Well by default files are packaged into multiple zip files, you can control the size of the zip file for finer granularity, but at the end when you download, you are downloading the entire packaged ...
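To make the granularity point concrete, here is a toy sketch in plain Python. It is not ClearML's actual packaging code: the function names and the greedy grouping are assumptions for illustration only. It shows that fetching one file costs the whole chunk that contains it, so smaller chunks mean less to download.

```python
from typing import List, Tuple


def pack_into_chunks(files: List[Tuple[str, int]], max_chunk_mb: int) -> List[List[str]]:
    """Greedily group (name, size_mb) pairs into chunks of at most max_chunk_mb."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size_mb in files:
        # Start a new chunk when adding this file would exceed the limit
        if current and current_size + size_mb > max_chunk_mb:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size_mb
    if current:
        chunks.append(current)
    return chunks


def download_cost_mb(files: List[Tuple[str, int]], chunks: List[List[str]], wanted: str) -> int:
    """MB that must be fetched to obtain `wanted`: the whole chunk holding it."""
    sizes = dict(files)
    for chunk in chunks:
        if wanted in chunk:
            return sum(sizes[name] for name in chunk)
    raise KeyError(wanted)


# Hypothetical file names and sizes (in MB)
files = [("a.jpg", 100), ("b.jpg", 200), ("c.jpg", 150), ("d.jpg", 50)]
big = pack_into_chunks(files, max_chunk_mb=512)    # everything lands in one chunk
small = pack_into_chunks(files, max_chunk_mb=256)  # three smaller chunks
print(download_cost_mb(files, big, "c.jpg"), download_cost_mb(files, small, "c.jpg"))
```

In ClearML itself I believe the zip size is set at upload time (the chunk_size argument of Dataset.upload), but please check the docs for your version.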
so I didn't have much time to upgrade all the packs because I have some issues with that but it is on my todo list
No worries :)
Quick question, if you run https://github.com/allegroai/trains/blob/master/examples/frameworks/keras/legacy/keras_tensorboard.py
Do you see models in the artifacts tab?
WickedGoat98 nice!!
Can you also pass the login screen (i.e. can you access the api server)
Let me know if you managed to get it working, then we can see if we can detect it automatically.
Hi @<1523711619815706624:profile|StrangePelican34>
if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,
Main limitation is Triton's ability to dynamically load / unload models. We know Nvidia is adding this capability, but I think this is still not out, once they support it, it should be transparent
i had a misconception that the conf comes from the machine triggering the pipeline
Sorry, this one :)
Hi SkinnyPanda43
Are you trying to access the same Task or an external one ?
Hi @<1529633468214939648:profile|CostlyElephant1>
Is it possible to get user ID of the current user
On the Task.data object itself there should be a field named "user"; that's the user ID of the owner (creator) of the Task.
You can filter based on this id with
Task.get_tasks(..., task_filter={'user': ["user-id-here"]})
wdyt?
BTW: the agent will resolve pytorch based on the installed CUDA version.
Hi @<1573119962950668288:profile|ObliviousSealion5>
Hello, I don't really like the idea of providing my own github credentials to the ClearML agent. We have a local ClearML deployment.
if you own the agent, that should not be an issue, no?
forward my SSH credentials using
ssh -A
and then starting the clearml agent?
When you are running the agent and you force git cloning with SSH, it will automatically map the .ssh into the container for git to use
Ba...
Yes... I think that this might be a bit much automagic even for clearml :)
that does happen when you create a normal local task, that's why i was confused
The parts that are not passed in both cases are the configurations from the conf file. Only the environment is passed (e.g. git, python packages, etc.). For example, if you have storage credentials in your conf file, they are not passed to a remote agent; instead, the credentials from the remote agent are used when it runs the task.
make sense?
So could it be that pip install --no-deps . is the missing issue ?
what happens if you add to the installed packages "/opt/keras-hannd" ?
btw: both should work fine
owning the agent helps, but still it's much better if the credentials don't show up in logs,
They are not, they are always filtered out,
- how does force_git_ssh_protocol help please? it doesn't solve the issue of the agent simply not having access
It automatically maps the host .ssh into the container, so that git can use SSH to clone.
What exactly is not working?
and how are you configuring it?
Hi @<1547028031053238272:profile|MassiveGoldfish6>
Is there a way for ClearML to simply save the model once training is done and to ignore the model checkpoints?
Yes, you can simply disable the auto logging of the model and manually save the checkpoint:
task = Task.init(..., auto_connect_frameworks={'pytorch': False})
...
task.update_output_model("/my/model.pt", ...)
Or for example, just "white-label" the final model
task = Task.init(..., auto_connect_frameworks={'pyt...
Can you clone the git with the .ssh credentials on the host machine ?
If so, can you do the same manually inside a docker (i.e. spin a docker with mount -v /home/hostuser/.ssh:/root/.ssh) ?
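For the manual docker check, something like the sketch below would build the command. It only assembles and prints it; the image name and repo URL are placeholders (assumptions), and the image needs git and an ssh client installed.

```python
import shlex
from typing import List


def build_ssh_clone_test_cmd(host_ssh_dir: str, image: str, repo_url: str) -> List[str]:
    """Build a `docker run` command that mounts the host .ssh and tries a git clone."""
    return [
        "docker", "run", "--rm",
        # Same mount as in the message above: host .ssh -> /root/.ssh in the container
        "-v", f"{host_ssh_dir}:/root/.ssh",
        image,
        "git", "clone", repo_url, "/tmp/repo",
    ]


# Placeholders: replace the image and repo URL with your own
cmd = build_ssh_clone_test_cmd(
    "/home/hostuser/.ssh", "<image-with-git>", "git@<host>:<org>/<repo>.git"
)
print(shlex.join(cmd))  # copy-paste the printed line into a shell to run it
```

If the clone works on the host but fails inside the container, the problem is most likely in how the keys are mounted or their permissions rather than in the agent itself.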
JitteryCoyote63
So there will be no concurrent cached files access in the cache dir?
No concurrent creation of the same entry :) It is optimized...
SmallDeer34 the function Task.get_models() incorrectly returned the input model "name" instead of the object itself. I'll make sure we push a fix.
I found a different solution (hardcoding the parent tasks by hand),
I have to wonder, how does that solve the issue ?
but actually that path doesn't exist and it is giving me an error
So you are saying you only uploaded the "meta-data" i.e. a text file with links to the files, and this is why it is missing?
Is there a way to change the path inside the .txt file to clearml cache, because my images are stored in clearml cache only
I think a good solution would be to store the path in the txt file as relative path, i.e. instead of /Users/adityachaudhry/data/folder... as ./data/folder
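A minimal sketch of the relative-path idea in plain Python. The helper names and the example paths are hypothetical; the local root would be whatever folder holds the data on the current machine (for a ClearML dataset, presumably the folder returned by Dataset.get_local_copy()).

```python
from pathlib import PurePosixPath


def to_relative(abs_path: str, base_dir: str) -> str:
    """Rewrite an absolute path as './...' relative to base_dir (done when writing the txt file)."""
    rel = PurePosixPath(abs_path).relative_to(PurePosixPath(base_dir))
    return "./" + rel.as_posix()


def resolve(rel_path: str, local_root: str) -> str:
    """Re-anchor a stored relative path under the folder that holds the data now."""
    return (PurePosixPath(local_root) / PurePosixPath(rel_path)).as_posix()


# Hypothetical paths, for illustration only
stored = to_relative("/Users/someuser/data/folder/img_001.jpg", "/Users/someuser")
local = resolve(stored, "/home/worker/.clearml/cache/my_dataset")
print(stored)
print(local)
```

This way the txt file never contains a machine-specific prefix, and each machine resolves the entries against its own copy of the data.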
I'm assuming those errors are from the triton containers? Were you able to run the simple pytorch mnist example serving from the repo?
What's the general pattern for running a pipeline - train model, evaluate metrics and publish the model if satisfactory (based on a threshold, for example)?
Basically I would do:
parameters for pipeline:
TaskA = Training model Task (think of it as our template Task)
Metric = title/series/sign we want to choose based on, where sign is max/min
Project = Project to compare the performance so that we could decide to publish based on the best Metric.
Pipeline:
Clone TaskA Change TaskA argu...
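The "decide to publish" step of this pattern can be sketched in plain Python. This is only the comparison logic, assuming the Metric values of the other Tasks in the Project were already collected; the function names are hypothetical, and the actual publish call on the cloned Task is left out.

```python
from typing import Sequence


def is_better(candidate: float, best_so_far: float, sign: str) -> bool:
    """Compare two metric values, where sign is 'max' or 'min' as in the Metric parameter."""
    if sign == "max":
        return candidate > best_so_far
    if sign == "min":
        return candidate < best_so_far
    raise ValueError(f"sign must be 'max' or 'min', got {sign!r}")


def should_publish(candidate: float, previous_values: Sequence[float], sign: str) -> bool:
    """Publish when the new value beats every previous value in the project (or none exist)."""
    if sign not in ("max", "min"):
        raise ValueError(f"sign must be 'max' or 'min', got {sign!r}")
    if not previous_values:
        return True
    best = max(previous_values) if sign == "max" else min(previous_values)
    return is_better(candidate, best, sign)


# Hypothetical values: accuracy (higher is better) and loss (lower is better)
print(should_publish(0.93, [0.90, 0.91], "max"))
print(should_publish(0.40, [0.35, 0.50], "min"))
```

A fixed threshold works the same way: pass the threshold as the only previous value.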
WickedGoat98 if this is the case, you can check this example. Same idea only "manual":
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
The difference is that running the agent in daemon mode, means the "daemon" itself is a job in SLURM.
What I was saying is pulling jobs from the clearml queue and then pushing them as individual SLURM jobs, does that make sense ?
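Roughly, the "one SLURM job per Task" idea could look like the sketch below. It only builds the sbatch command line; how the Task ID is pulled from the clearml queue is not shown, and the helper name and job name are assumptions. It relies on sbatch's --wrap option and on clearml-agent execute --id to run a single Task.

```python
import shlex
from typing import List


def build_sbatch_cmd(task_id: str, job_name: str = "clearml-task") -> List[str]:
    """Build an sbatch command that runs one ClearML Task as its own SLURM job."""
    # The command SLURM will run on the allocated node
    inner = f"clearml-agent execute --id {shlex.quote(task_id)}"
    return ["sbatch", f"--job-name={job_name}", f"--wrap={inner}"]


# Hypothetical Task ID, for illustration only
sbatch_cmd = build_sbatch_cmd("abc123")
print(shlex.join(sbatch_cmd))
```

The point is that the long-running piece only pulls from the queue and submits, while each Task gets its own SLURM allocation.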
For setting up trains-server I would recommend the docker-compose, it is very easy to set up, and you just need a single fixed compute instance, details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md With regards to the "low prio clusters", are you asking how they could be connected with the trains-agent or if running code that uses trains will work on them?
Hi TrickySheep9
Long story short, clearml-session fully supports k8s (using k8s glue)
The --remote-gateway alongside ports mode will basically allow you to set up a k8s service so that every session registers with a specific port, so k8s does the ingress for you and routes the SSH connection to the pod itself; everything else is tunneled over the original SSH connection.
Make sense ?
WackyRabbit7 hmmm seems like a non-regular character inside the diff.
Let me check something