
Hi BroadMole98
A bit hacky but doable 🙂
task = Task.get_task(task_id='aabbcc')
task.get_logger().report_scalar(...)
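Filled in, that could look something like this (a minimal sketch; the task id and metric names are placeholders, and report_scalar takes title, series, value and iteration):
from clearml import Task

task = Task.get_task(task_id='aabbcc')  # placeholder task id
logger = task.get_logger()
# report a single scalar point under title/series at iteration 0
logger.report_scalar(title='loss', series='validation', value=0.05, iteration=0)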
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production?
(I'm assuming by event-based you mean triggered by events, not streaming data, i.e. ETL etc.)
I know of at least a few large organizations doing that as we speak, so I cannot see any reason not to.
That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Sure, then use Metaflow. The main issue with Metaflow...
and when you remove the "." line does it work?
Maybe different API version...
What's the trains-server version?
I think this is due to the label map including some keys with a "." in them.
Hi TenseOstrich47, what do you mean by "label"?
We should probably add set_task_type :)
Hi BattyLion34
No problem asking here 🙂
Check your ~/clearml.conf or ~/trains.conf:
There is a section named api; under it you will find the definition of your trains-server 🙂
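For reference, that section looks roughly like this (the URLs below are just an example for a local server; yours will point at your own deployment):
api {
    # the trains-server endpoints this client talks to (example values)
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: http://localhost:8081
    credentials {
        access_key: "..."
        secret_key: "..."
    }
}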
No, they're not in Tensorboard
Yep that makes sense
Logger.current_logger().report_scalar("test", test_metric, posttrain_metrics[test_metric], 0)
That seems like a great solution
Okay, this is odd, the request returned exactly 100 out of 100.
It seems not all of them were reported?!
Could you post the toy code? I'll check what's going on.
MysteriousBee56 I would do Task.create()
you can get the full Task internal representation with task.data
Then call task._edit(script={'repo': ...}) to edit/update all the Task entries.
You can check the full details of the task object here: https://github.com/allegroai/trains/blob/master/trains/backend_api/services/v2_8/tasks.py#L954
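Putting the above together, a minimal sketch (the repo URL and names are placeholders, and remember _edit is an internal call):
from clearml import Task

# create a new task entry (project/task names are placeholders)
task = Task.create(project_name='examples', task_name='edited-task')
print(task.data)  # the full internal representation of the Task
# update the script section; _edit is internal API, use with care
task._edit(script={'repo': 'https://github.com/user/repo.git'})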
BTW: when you have a sample script working, consider PR-ing it, I'm sure it will be useful for others 🙂 (also a great way to get us involved with debugging...
One additional thing to notice: Docker will not actually limit the "view of the memory", it will just kill the container if you pass the memory limit; this is a limitation of the Docker runtime.
BTW: get_tasks has a project_name argument, I would just use it 🙂
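i.e. something like (project name is a placeholder):
from clearml import Task

# let the server filter by project instead of fetching everything
tasks = Task.get_tasks(project_name='my_project')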
Then by default it is the free space on the home folder (`~/.clearml`) that is running out.
If you use this one for example, will the component have pandas as part of the requirements?
def step_two(...):
import pandas as pd
# do stuff
If so (and it should), what's the difference? How is "internal.repo" different from pandas?
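For context, a component like that is usually declared along these lines (a sketch; the function body and names are illustrative):
from clearml import PipelineDecorator

# imports inside the function body are picked up as requirements
@PipelineDecorator.component(return_values=['df'])
def step_two(data_url):
    import pandas as pd
    df = pd.read_csv(data_url)
    return df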
SmugOx94 could you please open a GitHub issue with this request, otherwise we might forget π
We might also get some feedback from other users
Okay so my thinking is, on the PipelineController / decorator we will have:
abort_all_running_steps_on_failure=False
(if True, on a step failing it will abort all running steps and leave)
Then per step / component decorator we will have:
continue_pipeline_on_failure=False
(if True, on a step failing, the rest of the pipeline DAG will continue)
GiganticTurtle0 wdyt?
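As a sketch, the proposed interface would look something like this (these flags are only the suggestion above, not an existing API):
# hypothetical flags, taken from the proposal above
@PipelineDecorator.pipeline(name='pipe', project='examples',
                            abort_all_running_steps_on_failure=False)
def pipeline_logic():
    ...

@PipelineDecorator.component(continue_pipeline_on_failure=False)
def step_one():
    ...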
The problem is not really for the agents to wait (that is easily solved by an additional high-priority queue); the problem is whether you will have a "free" agent... you see my point?
Okay that looks good. Now in the UI start here and then go to the Artifacts tab.
Is it there?
Hi @<1569858449813016576:profile|JumpyRaven4>
What's the clearml-serving version you are running?
This happens even though all the pods are healthy and the endpoints are processing correctly.
The serving pods are supposed to ping "I'm alive", and that should verify the serving control plane is alive.
Could it be no requests are being served?
My only point is, if we have no force_git_ssh_port
or force_git_ssh_user
we should not touch the SSH link (i.e. less chance of us messing with the original URL if no one asked us to)
In that case you should probably mount the .ssh from the host file-system into the docker, for example:
docker run -v /home/user/.ssh:/root/.ssh ...
WickedGoat98 the above assumes you are running the docker manually; if you are using a docker-compose.yml file, the same mount should be added to the docker-compose.yml
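i.e. something along these lines under the relevant service (paths and service name are just an example):
services:
  agent:
    volumes:
      - /home/user/.ssh:/root/.ssh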
GentleSwallow91 notice this part:
Hi Martin. Sorry - missed your reply.
Yep, I am aware that docker_internal_mounts is inside the agent section.
'-v', '/tmp/ssh-XXXXXXnfYTo5/agent.8946:/tmp/ssh-XXXXXXnfYTo5/agent.8946', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-XXXXXXnfYTo5/agent.8946',
It is creating a copy of the ssh folder and setting the SSH_AUTH_SOCK env to it. You can just map the entire ssh folder automatically by un-setting SSH_AUTH_SOCK before running the agent:
SSH_AUTH_SOCK= clearml-agent ...
now, I need to pass a variable to the Preprocess class
you mean for the construction?
Ok no, it only helps as long as I don't log the figure.
you mean if you create the matplotlib figure without the automagic connect, you still see the mem leak?
EnviousStarfish54 good news, this is fully reproducible
(BTW: for some reason this call will pop the logger handler clearml installs, hence the lost console output)
If it helps, you can override it on the clients with the OS environment variable CLEARML_FILES_HOST
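e.g. (the URL is a placeholder):
# point file uploads at a different files server for this client
export CLEARML_FILES_HOST="http://files.example.com:8081"
python train.py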
GreasyPenguin14 what's the clearml version you are using, and which OS & Python version?
Notice this happens on the "connect_configuration" that seems to be called after the Task was closed. Could that be the case?