
PricklyRaven28 did you set the IAM role support in the conf?
https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/docs/clearml.conf#L86
So I assume, trains assumes I have nvidia-docker installed on the agent machine?
docker + nvidia-docker-runtime are assumed to be installed
The nvidia/cuda docker image is pulled when requested (like any other container image)
Moreover, since I'm going to use Task.execute_remotely (and not through the UI) is there any code way to specify the docker image to be used?
Sure, task.set_base_docker(docker_cmd='nvidia/cuda -v /mnt:/tmp')
Notice that you can not only pass the dock...
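A minimal sketch of how the two calls mentioned above could be combined. The helper name and the queue name are hypothetical; `set_base_docker(docker_cmd=...)` and `execute_remotely` are used as quoted in the thread, and newer SDK versions may prefer separate image/argument parameters.

```python
def run_remotely_in_docker(task, docker_cmd, queue_name):
    """Attach a docker image (plus optional docker arguments) to a task, then enqueue it.

    `task` is expected to be a Task object, e.g. the one returned by Task.init(...).
    """
    # Image name first, followed by any extra docker arguments.
    task.set_base_docker(docker_cmd=docker_cmd)
    # Hand the task over to an agent listening on the given queue.
    task.execute_remotely(queue_name=queue_name)


# Example usage (needs a configured server; the queue name "default" is an assumption):
# from clearml import Task
# task = Task.init(project_name="examples", task_name="remote docker")
# run_remotely_in_docker(task, "nvidia/cuda -v /mnt:/tmp", "default")
```

Keeping the helper free of imports means it can be exercised with a stand-in object before pointing it at a real server.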
VexedCat68 both are valid. In case the step was cached (i.e. already executed) the node.job will be None, so it is probably safer to get the Task based on the "executed" field which stores the Task ID used.
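A small sketch of that lookup order, assuming the node exposes `executed` and `job` as described above and that the job object offers a `task_id()` method. The helper name and the stand-in node are hypothetical.

```python
from types import SimpleNamespace


def resolve_step_task_id(node):
    """Return the Task ID behind a pipeline node, or None if it has not run yet."""
    # "executed" stores the Task ID that was used, including for cached steps.
    executed = getattr(node, "executed", None)
    if executed:
        return executed
    # Fall back to the live job object; it is None when the step was cached.
    job = getattr(node, "job", None)
    return job.task_id() if job is not None else None


# Stand-in for a cached pipeline node; "abc123" is a made-up Task ID.
cached_node = SimpleNamespace(executed="abc123", job=None)
print(resolve_step_task_id(cached_node))
```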
In the documentation it warns about .close(): "Only call Task.close if you are certain the Task is not needed."
Maybe this is not clear enough. It means you no longer need to automatically Add/Log/Track things into the Task in the current process.
This does Not mean you cannot access the Task or its artifacts.
Mark closed means to externally (i.e. not from the process that created the Task, maybe even from a different machine) close and mark the task as completed (this...
Well, in that case, just change the order, that should solve it (I'll make sure we have that as the default):
conda_channels: ["pytorch", "conda-forge", "defaults", ]
It should solve the issue :)
BTW: how are you using them? Should we have a direct interface to those?
Hi TrickyFox41
is there a way to cache the docker containers used by the agents
You mean for the apt-get install part, or the venv?
(the apt packages themselves are cached on the host machine)
for the venv I would recommend turning on cache here:
https://github.com/allegroai/clearml-agent/blob/76c533a2e8e8e3403bfd25c94ba8000ae98857c1/docs/clearml.conf#L131
For example, could you test if this one works:
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
Hi ShortElephant92
This isn't an issue if the user is using a Service Account JSON Key,
Are you saying that when you are using GS python sdk directly it works?
For context, the google cloud storage SDK allows an authorized user credentials.
ClearML actually uses the google python SDK; the JSON is just a way to pass the credentials to the google SDK. I'm not sure it points to "service account". Where did that requirement come from?
is it from here ` Service account info was n...
seems it was fixed :)
MagnificentWorm7 thank you for noticing! :)
Hi WickedGoat98
This sounds like a great design (obviously you have scale in mind :) ). Feel free to ask "stupid" questions; based on what you already wrote I doubt they will be.
A few questions that come to mind (probably a few others after):
You mentioned FS synchronization, from where? i.e. what is the single source of truth? K8s (Rancher 2.0 is basically a k8s manager) can take care of mounting volumes, so no need to sync. Is this a valid solution?
BTW : (you can drag and drop an i...
Hi LazyFish41
Could it be some permission issue on /home/quetalasj/.clearml/cache/ ?
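One quick way to test that hypothesis is to check the folder's access bits from the same user that runs the process. This is only a sketch: the helper name is hypothetical, and `~/.clearml/cache` is assumed to be the cache location for the current user.

```python
import os


def check_cache_dir(path):
    """Return a list of human-readable problems found for a cache directory."""
    path = os.path.expanduser(path)
    problems = []
    if not os.path.isdir(path):
        problems.append("not a directory or missing: " + path)
        return problems
    # The process needs to list, create and enter entries in the folder.
    for flag, name in ((os.R_OK, "read"), (os.W_OK, "write"), (os.X_OK, "traverse")):
        if not os.access(path, flag):
            problems.append("no %s permission on %s" % (name, path))
    return problems


print(check_cache_dir("~/.clearml/cache") or "cache folder looks accessible")
```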
SarcasticSquirrel56
if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist?
Basically the agents register themselves on your clearml-server, and they register which Queue(s) they listen to. In other words, the interface for choosing between the different types of machines/GPUs is enqueuing the Task to different queues.
For example: Queue(1): "CUDA11_GPUx1" , Queue(2): "CUDA10_GPUx1"
Make sense ?
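As a rough illustration of routing by queue, the sketch below builds queue names in the same pattern as the example above. The helper is hypothetical, and the naming scheme is just the convention from this thread, not something the server enforces.

```python
def queue_for(cuda_major, gpu_count=1):
    """Build a queue name such as "CUDA11_GPUx1" from the resources a task needs."""
    return "CUDA{}_GPUx{}".format(cuda_major, gpu_count)


print(queue_for(11))
print(queue_for(10))

# Enqueuing a task to the matching queue (needs a configured server):
# from clearml import Task
# Task.enqueue(task, queue_name=queue_for(11))
```

Whichever agent listens on that queue picks the Task up, so choosing the queue is effectively choosing the machine type.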
EDIT:
I guess to achieve what I w...
On my to do list, but will have to wait for later this week (feel free to ping on this thread to remind me).
Regarding the issue at hand, let me check the requirements it is using.
I changed them to the one exposed to the users (the same I have in my local clearml.conf) and things work.
Nice!
But I can't really figure out why that would be the case...
So the thing is, the links to the files are generated by the clients, which means the actual code generated an internal link to the file server (i.e. a link that only works inside the k8s cluster). When you wanted to see the image/plot you were accessing it from outside the cluster, and the link simply ...
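To make the internal-versus-external distinction concrete, here is a small sketch that rewrites the host part of a stored link. The hostnames and the helper are made up for illustration; in practice the fix described above was to configure the clients with the externally exposed address so the links are generated correctly in the first place.

```python
from urllib.parse import urlsplit, urlunsplit


def to_external_url(url, internal_netloc, external_netloc):
    """Swap a cluster-internal host for an externally reachable one, leaving other URLs alone."""
    parts = urlsplit(url)
    if parts.netloc != internal_netloc:
        return url
    return urlunsplit(parts._replace(netloc=external_netloc))


# Hypothetical addresses: an in-cluster service name and a public hostname.
link = "http://clearml-fileserver:8081/project/plot.png"
print(to_external_url(link, "clearml-fileserver:8081", "files.example.com"))
```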
i've tried setting up a clearml application on openshift
First, my condolences :) openshift ...
Second, what you need to make sure is that each container (i.e. ELK/Mongo etc.) has its own PV for persistent storage. I'm assuming this is the root cause of the error.
Make sense to you ?
SoreDragonfly16, in the hyper parameters tab you have "parallel coordinates": next to "add experiment" there is a button saying "values", press it and there should be "parallel coordinates".
Is that it?
because comparing experiments using graphs is very useful. I think it is a nice to have feature.
So currently when you compare the graphs you can select the specific scalars to compare, and it updates in real time!
You can also bookmark the actual URL and it is fully reproducible (i.e. full state is stored)
You can also add custom columns to the experiment table (with the metrics) and sort / filter based on them, and create a summary dashboard (again like all pages in the web app, URL is...
Hmmm, are you running inside pycharm, or similar ?
Could you maybe send a screenshot? This is very strange. Also, what's the trains version?
Oh, I was assuming you are passing the entire DB backups to the cloud.
Are you saying you just want the file server on the cloud? If this is the case, I would just use S3.
This is the thread checking the state of the running pods (and updating the Task status, so you have visibility into the state of the pod inside the cluster before it starts running)
Can you print the actual values you are passing? (i.e. local_file, remote_url)
BTW: this is probably more efficient than pickling
https://pandas.pydata.org/pandas-docs/version/1.1.5/reference/api/pandas.DataFrame.to_parquet.html
Hi WobblyFrog79
. When I execute the pipeline remotely in Kubernetes, those components
Two things. One, make sure you specify the repo you need the components from in the decorator function. What will happen is that the repo will be cloned into the container running on k8s, then your script (i.e. the pipeline step) will run inside the repo root.
[None](https://github.com/clearml/clearml/blob/9c93aa9e538075c848647dcd88e3e12bec051b5f/clearml/automation/con...
Hi ShallowArcticwolf27
from the command line to a remote machine while loading a local .env file as a configuration object?
Where would the ".env" go to? Are we trying to pass it to the remote machine somehow?