Hi RotundHedgehog76
Notice that the "queued" refers to the state of the Task, as well as the tag
We tried to enqueue the stopped task at the particular queue and we added the particular tag
What do you mean by specific queue ? Will this trigger on any Queued Task with the 'particular-tag' ?
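For reference, a minimal sketch of what such a trigger could look like with clearml.automation (the exact kwargs are my assumption based on the TriggerScheduler interface, and the Task ID/queue names are placeholders, so verify against your clearml version):
```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3.0)
trigger.add_task_trigger(
    name="on-queued-tagged-task",            # hypothetical trigger name
    trigger_tags=["particular-tag"],         # fire only on Tasks carrying this tag
    trigger_on_status=["queued"],            # fire only when a Task enters the "queued" state
    schedule_task_id="<task-id-to-launch>",  # placeholder for the Task to clone & run
    schedule_queue="default",
)
trigger.start()
```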
this is not the case as all the scalars report the same iterations
MassiveHippopotamus56 could it be the machine statistics? (i.e. cpu/gpu etc. these are considered scalars as well...)
BTW: CloudyHamster42 I think this issue was discussed on GitHub, and the final "verdict" was that we should have an option to split/combine graphs on the UI side (i.e. similar to the "smoothing" or wall-time axis etc.)
Hi WackyRabbit7 ,
Running in Docker mode gives you greater flexibility in terms of environment control, from switching cuda versions to pre-compiled packages that are needed (think apt-get) etc. Specifically for DL, if you are using multiple tensorflow versions, they are notorious for compiling against a specific CUDA version, and the only easy way to switch between them would be different dockers. If you are a PyTorch user, then you are in luck, they have all the pytorch ver...
This means that if something happens with the k8s node the pod runs on,
Actually if the pod crashed (the pod, not the Task) k8s should re-spin it, no?
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
From the k8s perspective, if the task ended (failed/completed) it always returns with exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks with exceptions to litter the k8s with endless r...
Hi PanickyMoth78
it was uploading fine for most of the day but now it is not uploading metrics and at the end
Where are you uploading metrics to (i.e. where is the clearml-server) ?
Are you seeing any retry logging on your console ?
packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events
This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens
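If you want to actively test whether events reach the server, a quick sketch (this assumes Task.init was already called in the process):
```python
from clearml import Task

task = Task.current_task()  # the Task already initialized in this process
# Block until all pending metric events are flushed to the server;
# connection retries (if any) should show up on the console
task.flush(wait_for_uploads=True)
```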
Hi DrabCockroach54
I think the Kubernetes integration (k8s glue) is not part of the open-source features, and is only available as an enterprise feature 😞
RoundMosquito25 good news, no need to open any ports 🙂
Basically the agents are always polling the server for "jobs"; the connection is an http/s request from them to the server, so all connections are outgoing connections. Firewall is intact 🙂
Hmm seems like everything is working, can you check in the UI if you see the serving session ID in the DevOps project? Maybe there are two, and you configured one and the docker-compose is running another ?
Hi ConvolutedSealion94
Just making sure, you spun up the docker-compose of clearml-serving as well ?
yes, I do, I added a
auxiliary_cfg
and I saw it immediately both in CLI and in the web ui
How many Tasks do you see in the UI in the DevOps project with the system tag SERVING-CONTROL-PLANE ?
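You could also check programmatically; a sketch, assuming task_filter accepts a system_tags field (worth verifying against your clearml version):
```python
from clearml import Task

serving_sessions = Task.query_tasks(
    project_name="DevOps",
    task_filter={"system_tags": ["SERVING-CONTROL-PLANE"]},  # assumed filter field
)
print(serving_sessions)  # list of matching Task IDs - you'd expect exactly one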
TBH our Preprocess class has an import in it that points to a file that is not part of the preprocess.py so I have no idea how you think this can work.
ConvolutedSealion94 actually you can add an entire folder as preprocessing, including multiple files
See example des...
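Roughly, the folder could look like this (file and function names here are just for illustration, check the clearml-serving examples for the exact Preprocess interface):
```python
# preprocess_folder/preprocess.py
# 'helpers' is another file (helpers.py) shipped in the same preprocessing folder
from helpers import normalize

class Preprocess(object):
    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # Turn the raw request payload into model input
        return normalize(body["data"])

    def postprocess(self, data, state, collect_custom_statistics_fn=None):
        # Turn the model output into a serializable response
        return {"result": list(data)}
```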
AttractiveCockroach17 could it be Hydra actually kills these processes?
(I'm trying to figure out if we can fix something with the hydra integration so that it marks them as aborted)
clearml-task
seems to not allow me to pass the
run
argument without a value
EnviousStarfish54 did you try --args run=True
I'm assuming run is a boolean of a sort ?
a. The submitted job would automatically download data from an internal data repository, but it will be time-consuming if the data is re-downloaded every time. Does ClearML cache the data somewhere?
What do you mean by the agent will download the data ? Are you referring to Dataset ?
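If you mean a ClearML Dataset, then yes, there is local caching: get_local_copy() downloads once per machine and reuses the cached copy afterwards. Something like (project/name below are placeholders):
```python
from clearml import Dataset

# The returned folder is a cached local copy;
# calling this again on the same machine will not re-download the data
dataset = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
local_folder = dataset.get_local_copy()
```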
So I wonder - why should an agent be related to a specific user's credentials? Is the right way to go about this is to create a "fake user" for the sake of the agent?
Very true, you have to have credentials for the trains-agent so it can "report" to the trains-server. That said, the creator of the Task (i.e. the person who cloned it) will be registered as the "user" in the UI.
I would recommend creating an "agent" user and putting its credentials on the trains-agent machine (the same way...
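For reference, the credentials go under the api section of the trains.conf on the agent machine, roughly like this (all values below are placeholders):
```
api {
    web_server: "http://trains-server:8080"
    api_server: "http://trains-server:8008"
    credentials {
        access_key: "AGENT-USER-ACCESS-KEY"
        secret_key: "AGENT-USER-SECRET-KEY"
    }
}
```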
FrothyShark37 any chance you can share snippet to reproduce?
Sorry, you are correct this is where the json is created:
https://github.com/huggingface/transformers/blob/040283170cd559b59b8eb37fe9fe8e99ff7edcbc/src/transformers/feature_extraction_utils.py#L470
The other links are the functions calling it. Make sense ?
Won't it be too harsh to have a system-wide restriction like that ?
Hi @<1576381444509405184:profile|ManiacalLizard2>
If you make sure all server access is via a host name (i.e. instead of IP:port, use host_address:port), you should be able to replace it with a cloud host on the same port
I see..
Generally speaking, if that is the case, I would think it might be better to use docker mode; it offers a way more stable environment, regardless of the host machine running the agent. Notice there is no need to use custom containers, as the agent will basically run the venv process, only inside a container, allowing you to reuse off-the-shelf containers.
If you were to add this, where would you put it? I can use a modified version of
clearml-agent
Yep, that would b...
is an implementation of this kind interesting for you, or do you suggest to fork?
You mean adding a config map storing a default trains.conf for the agent?
Hey SarcasticSparrow10 see here 🙂
https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_linux_mac.html#upgrading
Maybe this one?
https://github.com/allegroai/clearml/issues/448
I think it is already there (i.e. 1.1.1)
Hi BeefyHippopotamus73
I checked the template task and the list of "Installed Packages" indeed does not have one of my required packages in the list.
Basically the "installed packages" is auto populated based on the directly imported packages n your code base.
Could it be you do not have import snowflake-connector-python
and this is a derivative package (i.e. required by a different package)
BTW: when you clone your Task in the UI you can edit and add the missing packages,...
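Alternatively, you can force it from code; a minimal sketch (package name assumed to be snowflake-connector-python, project/task names are placeholders):
```python
from clearml import Task

# Must be called *before* Task.init() so the package ends up in "Installed Packages"
Task.add_requirements("snowflake-connector-python")
task = Task.init(project_name="examples", task_name="snowflake-job")
```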
DeliciousSeal67
are we talking about the agent failing to install the package ?
JitteryCoyote63 so now everything works as expected ?
Could you send the "installed packages" section of the Task that was created in the notebook ?
LovelyHamster1 what do you mean by "assume the permissions of a specific IAM Role" ?
In order to spin an ec2 instance (aws autoscaler) you have to have correct credentials, to pass those credentials you must create a key/secret pair to pass to the autoscaler. There is no direct support for IAM Role. Make sense ?
MysteriousBee56 yes, please change the trains code!!! Yippee! If you think someone else can benefit, feel free to PR :)
Regarding the double entry, that seems like an odd bug, how can I reproduce it?
The problem is not really for the agents to wait (this is easily solved by an additional high-priority queue); the problem is whether you will have a "free" agent... you see my point ?