ConfusedPig65 could you send the full log (console) of this execution?
Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server
This is fully supported 🙂
You can think of queues as a way to simplify resources for users (you can do more than that, but let's start simple)
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin up the agents, you attach each agent to the "correct" queue for its type of machine.
Int...
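A minimal sketch of how this could look on a single multi-GPU server, assuming a hypothetical 4-GPU on-prem machine split into two 1-GPU agents and one 2-GPU agent. The queue names come from the list above; the `agent_command` helper and the `PLAN` layout are made up for illustration. The sketch only builds the command lines (using the `clearml-agent daemon` `--queue` and `--gpus` options), it does not start anything.

```python
import shlex
from typing import List, Sequence, Tuple


def agent_command(queue: str, gpu_ids: Sequence[int]) -> str:
    """Build the command line for one agent bound to one queue and a set of GPUs."""
    gpus = ",".join(str(g) for g in gpu_ids)
    return shlex.join(["clearml-agent", "daemon", "--queue", queue, "--gpus", gpus])


# Hypothetical split of a 4-GPU server: (queue name, GPU indices for that agent).
PLAN: List[Tuple[str, List[int]]] = [
    ("on_prem_1gpu", [0]),
    ("on_prem_1gpu", [1]),
    ("on_prem_2gpus", [2, 3]),
]

commands = [agent_command(queue, gpus) for queue, gpus in PLAN]

for cmd in commands:
    print(cmd)
```

Users then only pick a queue name when enqueuing a job; which physical GPUs serve that queue is decided by how the agents were attached.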
Hi DilapidatedDucks58 ,
I'm not aware of anything of this nature, but I'd like to get a bit more information so we could check it.
Could you send the web-server logs? Either from the docker or from the browser itself.
🙏 thank you so much @<1556450111259676672:profile|PlainSeaurchin97> !!!
Hi GrotesqueOctopus42 ,
BTW: is it better to post the long error message on a reply to avoid polluting the channel?
Yes, that is appreciated 🙂
Basically logs in the thread of the initial message.
To fix this I had to spin the agent using the --cpu-only flag (--docker --cpu-only)
Yes, if you do not specify --cpu-only it will default to trying to access GPUs
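For reference, a small sketch of the agent command line with and without the flag. The `daemon_command` helper and the `default` queue name are assumptions for illustration only; the executable name is a parameter because older installs use `trains-agent` instead of `clearml-agent`.

```python
import shlex


def daemon_command(queue: str, docker: bool = False, cpu_only: bool = False,
                   executable: str = "clearml-agent") -> str:
    """Build an agent daemon command line; nothing is executed here."""
    args = [executable, "daemon", "--queue", queue]
    if docker:
        args.append("--docker")
    if cpu_only:
        # Without this flag the agent defaults to trying to access GPUs.
        args.append("--cpu-only")
    return shlex.join(args)


print(daemon_command("default", docker=True, cpu_only=True))
```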
Nice!
I think this all ties into the non-standard git repo definition. I cannot find any other reason for it. Is it actually stuck for 5 min at the end of the process, waiting for the repo detection?
What's the "working directory" ?
What's the trains-agent version?
(yes this should have worked, as long as the package "test" is there)
now it stopped working locally as well
At least this is consistent 🙂
How so ? Is the "main" Task still running ?
models been trained stored ...
mongodb will store url links, the upload itself is controlled via the "output_uri" argument to the Task
If None is provided, Trains logs the locally stored model (i.e. a link to where you stored your model); if you provide one, Trains will automatically upload the model (into a new subfolder) and store the link to that subfolder.
- how can I enable the tensorboard and have the graphs stored in trains?
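A minimal sketch of the two cases, assuming the current package name `clearml` (on older Trains installs the import would be `from trains import Task`). The `task_init_kwargs` and `start_task` helpers, the project/task names and the `s3://my-bucket/models` destination are hypothetical; `output_uri` is the `Task.init` argument discussed above. The import is done lazily so the helper can be inspected without a configured server.

```python
from typing import Any, Dict, Optional


def task_init_kwargs(project_name: str, task_name: str,
                     output_uri: Optional[str] = None) -> Dict[str, Any]:
    """Collect Task.init arguments; output_uri is only passed when given."""
    kwargs: Dict[str, Any] = {"project_name": project_name, "task_name": task_name}
    if output_uri is not None:
        # With a destination set, models are uploaded and the link to the
        # uploaded copy is stored, instead of a link to the local file.
        kwargs["output_uri"] = output_uri
    return kwargs


def start_task(**kwargs: Any):
    """Call Task.init; needs the clearml package and a configured server."""
    from clearml import Task  # lazy import, assumption: package is installed
    return Task.init(**kwargs)


# Link only (no upload):
local_only = task_init_kwargs("examples", "train model")
# Upload to a (hypothetical) bucket:
with_upload = task_init_kwargs("examples", "train model",
                               output_uri="s3://my-bucket/models")
# task = start_task(**with_upload)
```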
Basically if you call Task.init all your...
Yeah I think using voxel for forensics makes sense. What's your use case ?
I mean test with:
pipe.start_locally(run_pipeline_steps_locally=False)
This actually creates the steps as Tasks and launches them on remote machines
AgitatedTurtle16 could you check with the latest clearml RC (I remember a similar issue was fixed).
pip install clearml==0.17.5rc3
Then run again
clearml-task ...
this is the code for task scheduler
So it makes sense the first "scheduled" job is epoch time 0 (1970) because "executes_immediately" basically means it sets a date that passed, so it triggers it. does that make sense ?
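To make the reasoning concrete, here is a toy model in plain Python. It is not the scheduler's actual code; `first_run_time`, `is_due` and the `execute_immediately` parameter name are hypothetical. It only shows why a "run now" job can be represented by a date that has already passed, such as epoch time 0.

```python
from datetime import datetime, timedelta, timezone

# Epoch time 0, i.e. 1970-01-01 00:00:00 UTC.
EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)


def first_run_time(execute_immediately: bool, scheduled_for: datetime) -> datetime:
    """Pick the first trigger time: a date in the past means 'trigger right away'."""
    return EPOCH if execute_immediately else scheduled_for


def is_due(next_run: datetime, now: datetime) -> bool:
    """A job is due as soon as its trigger time is not in the future."""
    return next_run <= now


now = datetime.now(timezone.utc)
tomorrow = now + timedelta(days=1)

immediate = first_run_time(True, tomorrow)
regular = first_run_time(False, tomorrow)

print(immediate.year, is_due(immediate, now), is_due(regular, now))
```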
Could not locate channel name 'gg_clearml'
CheerfulGorilla72 these are the permissions:
https://github.com/allegroai/clearml/blob/427b98270cc846b5d7e4af49f9732e3eb8d7d3ae/examples/services/monitoring/slack_alerts.py#L13
channels:join channels:read chat:write
My understanding is that on remote execution Task.init is supposed to be a no-op right?
Not really a no-op, it would sync Argparser and the like, start background reporting services etc.
This is so odd! literally nothing printed
Can you tell me something about the node "mrl-plswh100:0" ?
Is this like a SageMaker node? We have seen similar cases where Python threads / subprocesses are not supported, and instead of Python crashing it just hangs there
Questions
I want to trigger a retrain task when F1
That means that in inference you are reporting the F1 score, correct?
As part of the retraining I have to train all the models and then have to choose best one and deploy it
Are you passing output_uri to Task.init? Are you storing the model as an artifact?
You can tag your model/task with a "best" tag (and untag the previous one). Then in production, look for the "best" task and get its model
Thoughts?
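A toy in-memory sketch of the tag swap, just to pin down the logic. The `promote` and `find_best` helpers and the task ids are hypothetical, and a plain dict stands in for the real tasks; with ClearML the same steps would be done on the actual Task or model tags.

```python
from typing import Dict, List, Optional

BEST_TAG = "best"


def promote(tags_by_task: Dict[str, List[str]], new_best: str) -> None:
    """Move the 'best' tag to new_best, removing it from any other task."""
    for task_id, tags in tags_by_task.items():
        if task_id != new_best and BEST_TAG in tags:
            tags.remove(BEST_TAG)
    if BEST_TAG not in tags_by_task[new_best]:
        tags_by_task[new_best].append(BEST_TAG)


def find_best(tags_by_task: Dict[str, List[str]]) -> Optional[str]:
    """What production would do: look up the single task tagged 'best'."""
    for task_id, tags in tags_by_task.items():
        if BEST_TAG in tags:
            return task_id
    return None


# Stand-in for the retraining results: task id -> tags.
registry = {"model_a": ["best"], "model_b": [], "model_c": []}
promote(registry, "model_b")
print(find_best(registry))
```

Keeping exactly one task tagged this way means the production side never needs to know which retraining run produced the model.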
Back to the feature request, if this is taken care of (both adding a missed package, and the S3 upload), do you still believe there is room for this kind of feature?
Hi GiddyTurkey39
Is the config file connected to the Task via Task.connect_configuration?
Hi UptightMouse31
First, thank you 😊
And to your question:
variable in the project is the kpi,
You mean like add it to the experiment table and get kind of leader-board ?
Damn, okay I'll make sure we fix the order.
Could you verify the ~= works as intended (if the order is correct)
MagnificentSeaurchin79
Can this be solved by using a docker image with the preinstalled packages at a user level?
Yes 🙂
BTW: I think I missed how you managed to install the object_detection API in the first place?
Is it the git repo of the Task? did you fork it? is it a submodule of your git repo?
p.s.
Yes, Slack is quite good at reminding you, but generally speaking always prefer @, it will send me an email if I miss the message :)
That makes total sense, this is exactly an OS scenario for signal 9 🙂
So it makes sense it installs v8.0.1
(maybe originally you provided no version and it installed the latest one)
This is basically pip doing the package version resolution
Well, PipelineDecorator actually allows you to do the same thing, with the same abilities, that is clone / modify / enqueue.
(I mean, Pipeline with tasks is also great, I just want to clarify that they have the same capabilities in this respect).