Hi @<1533620191232004096:profile|NuttyLobster9>
base_task_factory is a function that gets the node definition and returns a Task to be enqueued. Pseudo-code looks like:
def my_node_task_factory(node: PipelineController.Node) -> Task:
    # build the Task that this pipeline node will enqueue
    task = Task.create(...)
    return task
Make sense ?
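For completeness, a minimal sketch of wiring such a factory into a pipeline step, assuming the base_task_factory argument of PipelineController.add_step; project, task and script names are placeholders:
from clearml import Task
from clearml.automation import PipelineController

def my_node_task_factory(node: PipelineController.Node) -> Task:
    # build a concrete Task for this node (names and script are illustrative)
    return Task.create(
        project_name="examples",
        task_name=f"step task for {node.name}",
        script="run_step.py",
    )

pipe = PipelineController(name="pipeline", project="examples", version="1.0.0")
pipe.add_step(name="step_one", base_task_factory=my_node_task_factory)
pipe.start()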
Hi GracefulDog98
The agent will map the ~/.ssh folder automatically into the docker's /root/.ssh
It will also convert HTTP git links to SSH if you set force_git_ssh_protocol
in your clearml.conf:
https://github.com/allegroai/clearml-agent/blob/351f0657c3dcf707659875d7e0a52fa387709978/docs/clearml.conf#L25
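For reference, the relevant entry would look roughly like this (a minimal sketch of the agent section in clearml.conf):
agent {
    # rewrite http(s) git URLs to SSH before cloning inside the container
    force_git_ssh_protocol: true
}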
EnviousPanda91
in your clearml.conf I think you are missing a section:
agent.git_user=""
agent.git_pass=""
agent.git_host=""
agent.force_git_ssh_protocol: true
You can see in the log that it tries to download an artifact from a specific IP/URL; is that link a valid one?
(this seems like the main cause of the error, first line in the screenshot)
Hi ReassuredTiger98
Could you add some prints before / after the artifact upload?
Also, what's the clearml version you are using?
Yep, I think you are correct: you should have had the same output as a local Jupyter notebook, and it seems that in SageMaker Studio it is not working
Let me check something
BitingKangaroo95 can you post here the entire console output of clearml-session (including full command line) ?
currently I'm doing it by fetching the latest dataset, incrementing the version and creating a new dataset version
This seems like a very good approach; how would you improve it?
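For reference, a minimal sketch of that flow with the Dataset API (project and dataset names are placeholders):
from clearml import Dataset

# fetch the latest version of the dataset
latest = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")

# create a new version that builds on top of it
new_version = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[latest.id],
)
new_version.add_files(path="./new_data")
new_version.upload()
new_version.finalize()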
HandsomeCrow5
client.events.debug_images(metrics=[dict(task='6adb929f66d14731bc76e3493ab89d80', metric='image')])
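In context, a hedged sketch of calling that endpoint through the APIClient (the task id and metric name are the ones quoted above):
from clearml.backend_api.session.client import APIClient

client = APIClient()
# query debug images reported under the 'image' metric of the given task
res = client.events.debug_images(
    metrics=[dict(task='6adb929f66d14731bc76e3493ab89d80', metric='image')],
)
print(res)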
"regular" worker will run one job at a time, services worker will spin multiple tasks at the same time But their setup (i.e. before running the actual task) is one at a time..
OutrageousGiraffe8 so basically replacing it with:
self.d1 = ReLU()
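A minimal sketch of what that replacement could look like, assuming a standard PyTorch module (layer sizes are placeholders):
from torch.nn import Linear, Module, ReLU

class Net(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(128, 64)
        # keep the activation as a module attribute instead of a functional call
        self.d1 = ReLU()

    def forward(self, x):
        return self.d1(self.fc1(x))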
What's the trains version / trains-server version ?
See here:
https://pip.pypa.io/en/stable/user_guide/#environment-variables
Pass these environment variables as part of the YAML template you are using with k8s.
Should work for both 🙂
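For instance, a hedged snippet of what that could look like in the pod template spec (the variable name and value are illustrative):
containers:
  - name: clearml-agent
    env:
      # any pip environment variable, e.g. a private package index
      - name: PIP_INDEX_URL
        value: "https://my-private-index/simple"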
Hi @<1533620191232004096:profile|NuttyLobster9>
First, nice workaround!
Second, could you send the full log? When the venv is skipped, pytorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned)? You are correct, it should have been the same.
Nice! TrickySheep9 any chance you can share them ?
Hi @<1569858449813016576:profile|JumpyRaven4>
- The gunicorn logs do not show anything, including any error or trace of the 502; only siege reports the 502, as well as the ALB.
Is this an ALB or an ELB?
What's the timeout it's configured with?
Do you have GPU instances as well? What's the clearml-serving-inference docker version?
What if I register the artifact manually?
task.upload_artifact('local folder', artifact_object='
')
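For instance, a minimal sketch of registering a folder manually (project, task name and path are hypothetical):
from clearml import Task

task = Task.init(project_name="examples", task_name="manual artifact upload")
# register a local folder; it is packaged and uploaded to the configured files server
task.upload_artifact('local folder', artifact_object='/path/to/local/folder')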
This one should be quite quick, it's updating the experiment
I don't have the compose file, or at least can't seem to find it in /opt
you can manually take down all docker containers with:
docker ps
then
docker stop <container id>
for each container id
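Equivalently, a one-liner sketch (it stops every running container, so use with care):
docker stop $(docker ps -q)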
What happens when you call:
from clearml.backend_interface.task.repo import ScriptInfo
print(ScriptInfo._ScriptInfo__legacy_jupyter_notebook_server_json_parsing(None))
Hmm, would uploading it as a YAML string be better?
Hi @<1523701868901961728:profile|ReassuredTiger98> when you get to it...
please download the wheel, then install it with
pip3 install -U clearml_agent-0.17.3rc0-py3-none-any.whl
Then run the daemon with the additional --debug argument, basically:
clearml-agent --debug daemon --foreground ...
Once the agent is running, please send the Task's log from your console 🙂
Expected behaviour is that it reads the last iteration correctly. At least that is what the docs state.
This is exactly what should happen; are you saying that for some reason it fails?
Wait @<1715900788393381888:profile|BitingSpider17>, are you passing it on a single Task? These values are read by the daemon (i.e. running on the host), which means it is not getting them from the Task context (which leads to zero effect on the mount points).
Notice that in new versions of the clearml-agent the SDK mount point was changed to: sdk_cache: "/clearml_agent_cache"
exactly to solve for non-root containers:
https://github.com/allegroai/clearml-agent/blob/6b31883e4579...
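For reference, a hedged view of that setting in clearml.conf, assuming the docker_internal_mounts section of recent agent versions:
agent {
    docker_internal_mounts {
        # SDK cache mount point inside the container, writable for non-root users
        sdk_cache: "/clearml_agent_cache"
    }
}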
I get the same "white" image in both TB & ClearML