
Yes, I think it's only related to the UI. Do you think it can be fixed somehow? It would be the easiest way to launch new experiments with a different configuration
Back to the feature request, if this is taken care of (both adding the missing package and the S3 upload), do you still believe there is room for this kind of feature?
Well, I can add import s3fs even if I don't really use it in my own code. One problem could be if this happens for a lot of packages, since then I'd need to add these imports to all the entry points of all my repos. Whereas if I just install the right packages from the requirements.txt, then I don't need to think about...
Does it work if I launch the clearml-agent in docker mode and pip doesn't know which packages to install?
Yes it does 👍 Btw, for now I added import s3fs to my entry point and it's working, thank you!
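For reference, this is roughly what the top of my entry point looks like now (project and task names are placeholders; I believe Task.add_requirements("s3fs") before Task.init() would work as an alternative to the explicit import):

import s3fs  # noqa: F401  # not used directly; only here so the package analysis records it

from clearml import Task

# alternative to the explicit import: declare the requirement before Task.init()
# Task.add_requirements("s3fs")

task = Task.init(project_name="examples", task_name="training with s3 checkpoints")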
Hi AgitatedDove14 , do you mean the k8s glue autoscaler here https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py ? If yes, I understood that this service deploys pods on the nodes in the cluster, but I'd prefer to have a new instance deployed for each new experiment, and that it also terminates when no new experiments are queued
I've just seen it is a known issue https://clearml.slack.com/archives/CTK20V944/p1611763839133700 . Has a new version been released in the meantime?
AgitatedDove14 that seems like the best option. Once the aws autoscaler is inside a docker container I can deploy it inside a kube pod or a job. This, however, requires that I slightly modify the clearml helm chart with the aws-autoscaler deployment, right?
Please let me know if my explanation is not really clear
Nice, I didn't know that 🙂
If it can help understand, this is what I'm doing
Hi AgitatedDove14
I implemented the pipeline manually as you suggested. I also used task.wait_for_status() after each task.enqueue() so I was able to implement a full pipeline in one script. It seems to be working correctly. Thank you!
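In case it's useful to anyone else, the structure is roughly this (project, task and queue names are placeholders):

from clearml import Task

# hypothetical, already-created template tasks cloned into a sequential pipeline
steps = ["step1 - preprocess", "step2 - train", "step3 - evaluate"]

for name in steps:
    template = Task.get_task(project_name="examples", task_name=name)
    step = Task.clone(source_task=template, name=f"{name} (pipeline run)")
    Task.enqueue(step, queue_name="default")
    # block until the step finishes before enqueueing the next one
    step.wait_for_status(
        status=(Task.TaskStatusEnum.completed,),
        raise_on_status=(Task.TaskStatusEnum.failed, Task.TaskStatusEnum.stopped),
    )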
FriendlySquid61 Your solution seems to have solved the problem, but only after I removed the
export CLEARML_API_HOST={api_server}
export CLEARML_WEB_HOST={web_server}
export CLEARML_FILES_HOST={files_server}
commands from the bash script executed when the instance is launched
I also removed 'sudo' from all the commands, as suggested in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html , but that wasn't the cause of the problem
Hi AgitatedDove14 , what I meant is whether it's possible to associate the autoscaler's EC2 instances with an IAM role, in order to grant permissions to the applications running on those instances, for example access to S3 buckets that can only be accessed with a certain IAM role's permissions. I'm not completely sure that what I'm saying makes sense, but I'm referring to something similar to what's described here https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role...
Hi AgitatedDove14 , FriendlySquid61 ! I managed to grant permission to the AWS autoscaler to spin up instances using the instance profile, as suggested by FriendlySquid61 . The instances are created and terminated correctly, however the new instances don't execute the queued tasks and shut down immediately. I noticed that the clearml credentials at
self.web_server = Session.get_app_server_host()
self.api_server = Session.get_api_server_host()
self.files_server = S...
"Pytorch Lightning need the s3fs " s3fs is not needed, let PL store the model locally and use "output_uri" to automatically upload the model to your S3 bucket.
So I can set output_uri = "s3://<bucket_name>/prefix" and the local models will be uploaded to the s3 bucket by ClearML ?
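Something along these lines is what I mean (bucket name and prefix are placeholders):

from clearml import Task

# let Lightning write checkpoints locally and have ClearML upload them to S3
task = Task.init(
    project_name="examples",
    task_name="train with s3 output",
    output_uri="s3://my-bucket/clearml-models",
)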
As an example, in Task.create() there is the possibility to install packages using a requirements.txt, and if not specified, it uses the requirements.txt of the repository. I'd like something like that for Task.init(), if possible
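For context, this is roughly what I do with Task.create() today (repo URL, commit and file names are placeholders); I'd like the equivalent for Task.init():

from clearml import Task

# the agent installs the packages listed in requirements.txt
task = Task.create(
    project_name="examples",
    task_name="train from repo",
    repo="https://github.com/my-org/my-repo.git",
    commit="abc123",              # specific commit (or use branch=)
    script="train.py",
    requirements_file="requirements.txt",
)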
if in the "installed packages" I have all the packages installed from the requirements.txt than I guess I can clone it and use "installed packages"
Make sure you have the S3 credentials in your agent's clearml.conf
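Presumably that's the sdk.aws.s3 section of clearml.conf, something like (values are placeholders):

sdk {
    aws {
        s3 {
            # default credentials used by the agent / SDK for S3 access
            key: "ACCESS_KEY"
            secret: "SECRET_KEY"
            region: "us-east-1"
        }
    }
}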
Ok this could be a problem, as right now I'm using EC2 instances with an instance profile (I use it in the autoscaler), so by default they have the right S3 permissions. But I'll try it anyway
Because at the moment I'm having a problem with the s3fs package where I have it in my requirements.txt but the import manager at the entry point doesn't install it
My problem right now is that Pytorch Lightning needs the s3fs package to store model checkpoints in S3 buckets, but it is not listed in my "installed packages" and I get an import error
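To illustrate, this is the kind of setup that pulls in s3fs (bucket and path are placeholders):

from pytorch_lightning.callbacks import ModelCheckpoint

# writing checkpoints directly to S3 goes through fsspec, which needs s3fs installed
checkpoint_cb = ModelCheckpoint(dirpath="s3://my-bucket/checkpoints/")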
Hi AgitatedDove14 , thank you for your answer!
At the moment I can't configure both internal and external access with the same DNS. Before changing the server infrastructure, I'm trying a workaround where I upload the artifact with the internal file server path, and then upload a second string artifact containing the first artifact's URL with the internal DNS replaced by the external DNS, which I use to download the artifact from the UI.
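Roughly, the workaround looks like this (DNS names and artifact names are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="artifact dns workaround")

# upload the artifact normally; its stored URL points at the internal file server
task.upload_artifact(name="model", artifact_object="model.pkl", wait_on_upload=True)

# store a second, string artifact holding the same URL with the internal DNS swapped for the external one
internal_url = task.artifacts["model"].url
external_url = internal_url.replace("internal-fileserver:8081", "files.example.com")
task.upload_artifact(name="model_external_url", artifact_object=external_url)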
Hi AgitatedDove14 , I'm interested in this feature to run the agent and force it to install packages from requirements.txt. Is it available?
Hi TimelyPenguin76 , I tried your approach and it works, thank you! However, it's a bit different from what I was trying to do: instead of cloning an existing task, I'd like to specify the repository and a specific commit tag to use, as is done in Task.create. If this is possible with the API client it would be perfect
Hi TimelyPenguin76 , I used api_client.tasks.create and it works, thank you!
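In case it helps, the call looks roughly like this (IDs, repo and field values are placeholders, and some field names may be slightly off):

from clearml.backend_api.session.client import APIClient

api_client = APIClient()

# create a task pointing at a specific repository and commit tag
task = api_client.tasks.create(
    name="train from tagged commit",
    project="<project_id>",
    type="training",
    script={
        "repository": "https://github.com/my-org/my-repo.git",
        "tag": "v1.2.0",
        "entry_point": "train.py",
        "working_dir": ".",
    },
)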
I created this toy example so you don't need any external conf files. Btw, if I first launch the task as python example.py port=80
then the task prints the message "Is this a webserver" correctly. If I then clone the same task in the UI, override the port with ['port=43']
for example, and run the experiment, I still get the message "Is this a webserver", so the port didn't change
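The toy example is roughly along these lines (the config is defined inline through Hydra's ConfigStore so no external conf files are needed; project and task names are placeholders):

from dataclasses import dataclass

import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import DictConfig

from clearml import Task

@dataclass
class Config:
    port: int = 8080  # overridden from the command line, e.g. python example.py port=80

cs = ConfigStore.instance()
cs.store(name="config", node=Config)

@hydra.main(config_name="config")
def main(cfg: DictConfig) -> None:
    Task.init(project_name="examples", task_name="hydra override toy example")
    if cfg.port == 80:
        print("Is this a webserver")

if __name__ == "__main__":
    main()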
Hi AgitatedDove14 , sorry for the late reply. Btw, I tried with the latest RC and the issue is still there. So if I clone an experiment and modify an override param, e.g. ['training.max_epochs=10']
my experiment runs with the old configuration. Therefore it seems that it doesn't change the OmegaConf configuration.