Reputation
Badges 1
25 × Eureka!. I can't find any actual model files on the server though.
What do you mean? Do you see the specific models in the web UI? is the link valid ?
RipeGoose2 yes that will work 🙂
That said, we should probably fix the S3 credentials popup 😉
- but the
pytorch/main.py
file doesn't run.
What do you have on the Task itself? is this the correct script ?
Any chance you can send a full log ? (you can DM it if it helps)
Hmm DepressedChimpanzee34 my bad it seems the loading is done via YAML loader, but the dumping is straight forward str casting...
https://github.com/allegroai/clearml/blob/6e6271fb91f2aeb2aa7a13c6d07d4e635baaa670/clearml/backend_interface/task/task.py#L934
What would you expect to get (BTW "value\blah"
is Not a valid string assignment in python as there is no \b escape character, it should be "value\blah" which translates into the text "value\blah")
Hmm, can you send the full log of the pipeline component that failed, because this should have worked
Also could you test it with the latest clearml python version (i.e. 1.10.2)
PompousBeetle71 notice that starting with this version when you set model tags they will be stored as user tags , which you can change and edit in UI. So if you still need the system tags you have to access them directly.
Hi WackyRabbit7 ,
Yes we had the same experience with kaggle competitions. We ended up having a flag that skipped the task init :(
Introducing offline mode is on the to do list, but to be honest it is there for a while. The thing is, since the Task object actually interacts with the backend, creating an offline mode means simulation of the backend response. I'm open to hacking suggestions though :)
Do you think the local agent will be supported someday in the future?
We can take this ode sample and extent it. can't see any harm in that.
It will enable very easy to ran "sweeps" without any "real agent" installed.
I'm thinking roll out multiple experiments at once
You mean as multiple subprocesses, sure if you have the memory for it
- Agent on laptop, Server on Kube - Fail
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube
Yeah that feels like a network config issue...
Is there a verbose setting in the agent that could help us diagnose,
yes running with debug turned on on.
since you managed to reproduce on your latop you can try to run the agent with --debug to test, specifically:
clearml-agent --debug daemon ....
if you are running it in venv mode (which I think ...
CurvedHedgehog15 the agent has two modes of opration:
single script file (or jupyter notebook), where the Task stores the entire file on the Task itself. multiple files, which is only supported if you are working inside a git repository (basically the Task stores a refrence to the git repository and the agent pulls it from the git repo)Seems you are missing the git repo, could that be?
because comparing experiments using graphs is very useful. I think it is a nice to have feature.
So currently when you compare the graphs you can select the specific scalars to compare, and it Update in Real Time!
You can also bookmark the actual URL and it is fully reproducible (i.e. full state is stored)
You can also add custom columns to the experiment table (with the metrics) and sort / filter based on them, and create a summary dashboard (again like ll pages in the web app, URL is...
ConvolutedChicken69
, does it take the agent off the queue? does it know it's not available to take tasks?
You mean will it "release" the GPU? (i.e. the agent will pull another Task) ?
If so, then no it will not, an "Interactive Session" session is (from the agent's perspective) a Task that will end sometime, and it will continue to monitor and run it, until you manually close it. The idea is that you are actually using the GPU, hence no on else can run a job on it.
To shut it down, ...
basically
would allow blocking the machine from being scaled-in when
Oh this is what I was missing 🙂 That makes sense to me!
So what you are saying is that the AWS autoscaler agent, when it is launching a Task, inside the container you will set "protection flag" when the Task ends, you will unset "protection flag"
Is that correct?
Hi NonsensicalSeaanemone47
I'm assuming you mean k8s as compute cluster?
If so, then yes clearml adds priority scheduling on top of your existing kl8s cluster. It also allows you to reuse images as the k8s spins the base container image and then inside the container image the agent sets the environment of the experiment (clones code, apply diff, install missing python packages etc.)
It also gives visibility into the executed pods.
Make sense ?
Hi @<1636175432829112320:profile|PlainSealion45>
- I used this initial model to create the endpoint with
model add
command.
I think that the initial model needs to be added with model auto-aupdate
Not with model add
basically do not call model add - this is static, always using the model ID specified (you can deploy new models with manually callign model add on the same endpoint and specifying diffrent model ID , but again manual)
To Automatically have the m...
So if any step corresponding to 'inference_orchestrator_1' fails, then 'inference_orchestrator_2' keeps running.
GiganticTurtle0 I'm not sure it makes sense to halt the entire pipeline if one step fails.
That said, how about using the post_execution callback, then check if the step failed, you could stop the entire pipeline (and any running steps), what do you think?
Hi @<1573119962950668288:profile|ObliviousSealion5>
Hello, I don't really like the idea of providing my own github credentials to the ClearML agent. We have a local ClearML deployment.
if you own the agent, that should not be an issue,, no?
forward my SSH credentials using
ssh -A
and then starting the clearml agent?
When you are running the agent and you force git clonening with SSH, it will autmatically map the .ssh into the container for the git to use
Ba...
Hi CheekyElephant36
First you need to run it once on your machine, once this is done (only a few steps is enough), you can one it and enqueue it. Then to actually connect the aws autoscaler (the part that spins machines and runs tasks) go to applications and select the aqs autoscaler.
Btw i think the next video will be about YOLO + autoscaler
Hi UnevenDolphin73 , are those per user/project/system environment variables ?
If these are secrets (that you do not want to expose), maybe it is best just to have them on he agent's machine ?
BTW, I think there is some "vault" support in the paid tiers for these kind of secret, not sure on which level (i.e. user/system/project)
JitteryCoyote63 I found it 🙂
Are you working in docker mode or venv mode ?
You are correct, the agent will clone the git and install the requirements, as written in the task installed packages section. Regrading the git branch, notice it will pull the specific commit id as stated in the execution section, and it will apply any uncommitted changes. You can edit the execution section and change the commit to the latest in a specific version (you should probably also clear the uncommitted changes of you do that)
Sorry found the code on the Task, duh 🙂
` # get_ipython().magic('pip install clearml')
import clearml
from clearml import Task
task = Task.init(project_name='examples', task_name='test param', reuse_last_task_id=False)
param = {
'tuple_double_quotes_r': (r"value\blah", 1),
'tuple_double_quotes': ("value\blah", 1),
'tuple_single_quotes': ('value\blah', 1),
"double_quotes_r": r"value\blah",
'double_quotes': "value\blah",
'single_quotes': 'value\blah'
...
That is correct.
Obviously once it is in the system, you can just clone/edit/enqueue it.
Running it once is a mean to populate the trains-server.
Make sense ?
Its stored on the Task, you can see it under the execution tab in the UI
ShaggyHare67 are you saying the problem is trains
fails discovering the packages in the manual execution ?
using this is it possible to add to requirements of task with task_overrides?
Correct, but you will be replacing (not adding) requirements
, how do different tasks know which arguments were already dispatched if the arguments are generated at runtime?
A bit of how clearml-agent works (and actually on how clearml itself works).
When running manually (i.e. not executed by an agent), Task.init (and similarly task.connect etc.) will log data on the Task itself (i.e. will send arguments /parameters to the server), This includes logint the argparser for example (and any other part of the automagic or manuall connect).
When run...
Could it be pandas was not installed on the local machine ?
Hi UpsetCrocodile10
First, I perform many experiments in one process, ...
How about this one:
https://github.com/allegroai/trains/issues/230#issuecomment-723503146
Basically you could utilize create_function_task
This means you have Task.init() on the mainn "controller" and each "train_in_subset" as a "function_task". Them the controller can wait on them, and collect the data (like the HPO does.
Basically:
` controller_task = Task.init(...)
children = []
for i, s in enumer...