
AgitatedDove14 that seems like the best option. Once the AWS autoscaler is inside a Docker container I can deploy it inside a kube pod or a job. This, however, requires that I slightly modify the ClearML Helm chart to add the aws-autoscaler deployment, right?
Yes, the workaround is working 🙂
It's working correctly, thank you!
Yes it was set to nvidia/cuda:10.1-runtime-ubuntu18.04... ok I'll try again and see if that was the problem, thank you
Hi AgitatedDove14, you can try with this toy example. If I run the task with `python example.py ui.width=2048`, the task runs correctly and prints `Title=My app, size=2048x768 pixels`. However, in the UI I'm not allowed to change `ui.width` in the Hydra parameters section: the 'Save' button is frozen.
```
from clearml import Task
from dataclasses import dataclass
import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import OmegaConf

@dataclass
class MySQLConfig:
    host: str = "localhost"
    port: int = 3306

@dataclass
class UserInterface:
    title: str = "My app"
    width: int = 1024
    height: int = 768

@dataclass
class MyConfig:
    db: MySQLConfig = MySQLConfig()
    ui: UserInterface = UserInterface()

cs = ConfigStore.instance()
cs.store(name="config", n...
```
Actually I had the same issue even with that value set to False
Yes, I think it's only related to the UI. Do you think it can be fixed somehow? It would be the easiest way to launch new experiments with a different configuration.
Hi TimelyPenguin76, I tried your approach and it works, thank you! However, it's a bit different from what I was trying to do: instead of cloning an existing task, I'd like to specify the repository and a specific commit tag to use, as is done in `Task.create`. If this is possible with the API client it would be perfect.
```
# ClearML - Hydra Example
from clearml import Task
from dataclasses import dataclass
import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import OmegaConf

@dataclass
class MySQLConfig:
    host: str = "localhost"
    port: int = 3306

cs = ConfigStore.instance()
# Registering the Config class with the name 'config'.
cs.store(name="config", node=MySQLConfig)

@hydra.main(config_name="config")
def my_app(cfg: MySQLConfig) -> None:
    # type (DictConfig) -> None
    ...
```
Hi TimelyPenguin76, I used `api_client.tasks.create` and it works, thank you!
FriendlySquid61, your solution seems to have solved the problem, but only after I removed the
`export CLEARML_API_HOST={api_server}`
`export CLEARML_WEB_HOST={web_server}`
`export CLEARML_FILES_HOST={files_server}`
commands from the bash script that is executed when the instance is launched.
Hi AgitatedDove14, FriendlySquid61! I managed to grant the AWS autoscaler permission to spin up instances using the instance profile, as suggested by FriendlySquid61. The instances are created and terminated correctly; however, the new instances don't execute the queued task and shut down immediately. I noticed that the clearml credential at
`self.web_server = Session.get_app_server_host()`
`self.api_server = Session.get_api_server_host()`
`self.files_server = S...`
I also removed 'sudo' from all the commands as is suggested in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html but that wasn't the cause of the problem
Hi Sapir, no, that didn't solve the problem unfortunately. I SSHed into the machine (after removing the shutdown so that it doesn't terminate) and in the log I saw the error: "clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server http://apiserver:8008 ?"
So it is a credential problem
Hi AgitatedDove14, thank you for your answer!
At the moment I can't configure both internal and external access with the same DNS. Before changing the server infrastructure, I'm trying a workaround: I upload the artifact with the internal file server path, and then I upload a second, string artifact containing the first artifact's URL with the internal DNS name replaced by the external one, and use that to download the artifact from the UI.
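The URL rewrite in this workaround could look roughly like the sketch below. It is only an illustration of the host swap, not ClearML code: the function name `swap_host` and both hostnames are hypothetical, and it assumes the internal and external servers differ only in host name (same scheme, port and path).

```python
from urllib.parse import urlsplit, urlunsplit


def swap_host(url: str, internal_host: str, external_host: str) -> str:
    """Return url with internal_host replaced by external_host (host part only)."""
    parts = urlsplit(url)
    if parts.hostname != internal_host:
        return url  # not an internal URL, leave it untouched
    # Keep an explicit port if the original URL had one.
    netloc = external_host if parts.port is None else f"{external_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Hypothetical hostnames, for illustration only.
internal_url = "http://files.internal.example:8081/project/task/artifacts/model.pkl"
external_url = swap_host(internal_url, "files.internal.example", "files.example.com")
print(external_url)
```

Replacing only the network location, rather than doing a plain string replace on the whole URL, avoids touching a path segment that happens to contain the internal host name.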
Nice, I'll try also with the extra_bash_script, thank you!
Hi AgitatedDove14, sorry for the late reply. Btw, I tried with the latest RC and the issue is still there. If I clone an experiment and modify an override parameter, e.g. `['training.max_epochs=10']`, my experiment runs the old configuration. So it seems that the override doesn't change the OmegaConf configuration.
However, if I edit the OmegaConf directly in the UI, then the port changes correctly. I'd still prefer to override the Args so I can change an entire sub-configuration, e.g. `['dataset=cifar']` to `['dataset=imagenet']`, instead of having to change all the parameters inside the OmegaConf.
I created this toy example so you don't need any external conf files. Btw, if I first launch the task as `python example.py port=80`, then the task prints the message "Is this a webserver" correctly. If I then clone the same task in the UI, override the port with `['port=43']`, for example, and run the experiment, I still get the message "Is this a webserver", so the port didn't change.
Hi AgitatedDove14, I noticed that in the Hydra parameters section it is not possible to add parameter keys that contain dots: ".(dot) $(dollar) and space are not allowed in parameter key."
However, it's very useful to add parameters with a dot to change something in a sub-configuration, for example `training.max_epochs=10`. Do you think it's possible to allow this?
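To make clear what a dotted key is expected to do, here is a minimal pure-Python sketch that applies `a.b=value` style overrides to a nested dict. It is not Hydra's or ClearML's implementation (Hydra's real override grammar is much richer); `apply_overrides` and `_cast_like` are hypothetical helpers, and the config mirrors the toy `ui` example from earlier in the thread.

```python
from typing import Any, Dict, List


def _cast_like(current: Any, raw: str) -> Any:
    """Cast raw to the type of the existing value when it is bool, int or float."""
    if isinstance(current, bool):
        return raw.lower() == "true"
    if isinstance(current, (int, float)):
        return type(current)(raw)
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' overrides to a nested dict, in place."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the nested dicts, creating missing levels on the way.
            node = node.setdefault(name, {})
        node[leaf] = _cast_like(node.get(leaf), raw_value)
    return config


config = {"ui": {"title": "My app", "width": 1024, "height": 768}}
apply_overrides(config, ["ui.width=2048"])
ui = config["ui"]
print(f"Title={ui['title']}, size={ui['width']}x{ui['height']} pixels")
```

The dot is what selects the sub-configuration, which is why a UI that rejects dots in keys cannot express this kind of override.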
Hi AgitatedDove14, what I meant is whether it is possible to associate the autoscaler's EC2 instances with an IAM role, in order to grant permissions to applications running on those instances, for example access to an S3 bucket that can only be accessed with a certain IAM role's permissions. I'm not completely sure that what I'm saying makes sense, but I'm referring to something similar to what is specified here https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role...
Also, if I want to modify another parameter, e.g. ui.height I have this problem:
Ok, now I noticed that if I change the value of the port inside the Hydra parameters section (not the overrides), it does actually change in the experiment as well. The overrides don't seem to be working.
I've just seen it is a known issue https://clearml.slack.com/archives/CTK20V944/p1611763839133700 . Has a new version been released in the meantime?
If it can help understand, this is what I'm doing
Hi AgitatedDove14
I implemented the pipeline manually as you suggested. I also used task.wait_for_status() after each task.enqueue() so I was able to implement a full pipeline in one script. It seems to be working correctly. Thank you!
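The enqueue-then-wait pattern described above can be sketched as follows. To keep it runnable without a ClearML server, it uses a hypothetical `FakeTask` stand-in and a hypothetical `fake_enqueue` function; only the control flow (enqueue one step, block until it finishes, then move on) reflects the message.

```python
from typing import Callable, Iterable, List


class FakeTask:
    """Hypothetical stand-in for a clearml Task, so the sketch runs offline."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def wait_for_status(self) -> None:
        # A real task would block here until it completes or fails.
        self.log.append(f"finished {self.name}")


def run_in_sequence(tasks: Iterable, enqueue: Callable, queue_name: str) -> None:
    """Enqueue each task and wait for it before enqueuing the next one."""
    for task in tasks:
        enqueue(task, queue_name)
        task.wait_for_status()


log: List[str] = []


def fake_enqueue(task: FakeTask, queue_name: str) -> None:
    # Hypothetical replacement for the real enqueue call.
    log.append(f"enqueued {task.name} on {queue_name}")


steps = [FakeTask("preprocess", log), FakeTask("train", log)]
run_in_sequence(steps, fake_enqueue, "default")
print(log)
```

With real ClearML tasks, the assumption is that `enqueue` would wrap something like `Task.enqueue(task, queue_name=...)` and that `wait_for_status()` is the task's own method, as mentioned in the message.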