ColossalDeer61 btw, it turns out the docker-compose services section on GitHub was misconfigured 😞 I suggest you get the latest copy of it:
curl -o docker-compose.yml
but not as a component (using the decorator)
Hmm yes, I think that a component calling another component as an external component is not supported yet
(basically the difference is: is it actually running as a function, or running on a different machine as another pipeline component)
I noticed that when a pipeline step returns an instance of a class, it tries to pickle.
Yes, this is how the serialization works when we pass data from one node to another (by design it supports multiple machines)...
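If it helps, here is a rough sketch of what that looks like with the decorator (class and step names are made up, not from your code); the returned instance is pickled when it crosses the step boundary, so it has to be picklable and importable on the consuming side as well:

from clearml.automation.controller import PipelineDecorator

# hypothetical class returned by a step; it gets pickled when passed to the next step
class ModelBundle:
    def __init__(self, weights):
        self.weights = weights

@PipelineDecorator.component(return_values=["bundle"])
def produce_bundle():
    return ModelBundle(weights=[0.1, 0.2])

@PipelineDecorator.component()
def consume_bundle(bundle):
    print(bundle.weights)

@PipelineDecorator.pipeline(name="pickle demo", project="examples", version="0.1")
def run_pipeline():
    consume_bundle(produce_bundle())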
I'll try to create a more classic image.
That is always better, though I remember we have some flag to allow that, you can try with:
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1 clearml-agent ...
Hi PompousBeetle71
I remember it was an issue, but it was solved a while ago. Which Trains version are you using?
Still, this issue inside a child thread was not detected as a failure and the training task resulted in "completed". This error happens now with the Task.init inside the
if __name__ == "__main__":
as seen above in the code snippet.
I'm not sure I follow, the error seems like an issue in your internal code, does that mean clearml works as expected?
DeliciousBluewhale87 this is exactly how it works,
The glue puts a k8s job with the requested docker image (the one on the Task), the job itself (k8s job) starts the agent inside the requested docker, then the agent inside the docker will install all the required packages.
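For reference, the glue itself is usually launched with the example script from the clearml-agent repo, something along these lines (the queue name is just a placeholder):

# a sketch, assuming the k8s glue example script shipped in the clearml-agent repository
python k8s_glue_example.py --queue k8s_scheduler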
Instead you can do:
TRAINS_WORKER_NAME="trains-agent:$DYNAMIC_INSTANCE_ID"
Then the Worker ID will have the running instance appended to the worker name. This means that even if you use the same $DYNAMIC_INSTANCE_ID twice, you will not have two agents registering under the same name.
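Something along these lines, as a sketch (how you derive the instance id is up to you, the hostname here is only an example):

# give each instance a unique worker name by appending its dynamic id
DYNAMIC_INSTANCE_ID=$(hostname)
TRAINS_WORKER_NAME="trains-agent:$DYNAMIC_INSTANCE_ID" trains-agent daemon --queue default --docker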
Hi PunyPigeon71
Can you send the log from the remote execution?
Can you see on the Task in the UI, under the execution tab, the correct git repo reference, commit ID, and uncommitted changes?
When I have:
n = 20
duration = 1000
now = time.mktime(time.localtime())
timestamps = np.linspace(now, now + duration, n)
dates = [dt.datetime.fromtimestamp(ts) for ts in timestamps]
values = np.sin((timestamps - now) / duration * 2 * np.pi)
fig = go.Figure(data=go.Scatter(x=dates, y=values, mode='markers'))
task.get_logger().report_plotly(title="plotly", series="b", iteration=0, figure=fig)
Everything looks okay
From creating the event to actually sending it ... 30 min sounds like enough "time"...
As I suspected, from your log:
agent.package_manager.system_site_packages = false
Which is exactly the problem of the missing tensorflow (basically it creates a new venv inside the docker, but without the flag turned on it does not inherit the docker preinstalled packages)
This flag should have been true.
Could it be that the clearml.conf you are providing for the glue includes this value?
(basically you should only have the sections that are either credentials or missing from the default, there...
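For reference, a minimal sketch of the override in the clearml.conf you pass to the glue (only the relevant section shown, everything else can stay at the defaults):

agent {
    package_manager {
        # let the venv created inside the docker inherit the docker preinstalled packages
        system_site_packages: true
    }
}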
I'll make sure we fix the example, because as you pointed out, it is broken :(
Hi @<1597399925723762688:profile|IrritableStork32>
I think that if you have clearml installed and configured on your machine it should just work:
None
Hi WickedBee96
Queue1 will take 3GPUs, Queue2 will take another 3GPUs, so in Queue3 can I put 2-4 GPUs??
Yes exactly !
if there are idle GPUs so take them to process the task?
Correct, basically you are saying: this queue needs a minimum of 2 GPUs, but if there are more available, allocate them to the Task it pulled (up to a maximum of 4 GPUs)
Make sense ?
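If it helps, a rough sketch of what that could look like with the agent's dynamic GPU allocation (queue names and GPU ranges are placeholders, and the exact flags depend on your clearml-agent version):

# one agent managing all 8 GPUs: 3 per Task from queue_a, 3 from queue_b, 2-4 from queue_c
clearml-agent daemon --dynamic-gpus --gpus 0-7 --queue queue_a=3 queue_b=3 queue_c=2-4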
Hi ResponsiveCamel97
The agent generates a new configuration file to be mounted into the docker, with all the new folders as they will be seen inside the docker itself. One of the changes is system_site_packages, as inside the docker we want the new venv to inherit everything from the docker's system installed packages.
Make sense ?
Hover over the border (I would suggest to use the full screen, i.e. maximize)
Hi LazyLeopard18
I remember someone deploying, specifically on the Azure k8s (can't remember now what they call it).
What is exactly the feedback you are after?
ElegantKangaroo44 my bad 😞 I missed the nuance in the description
There seems to be an issue in the web ui -> viewing plots in "view in experiment table" doesn't respect the "scalars to display" one sets when viewing in "view in fullscreen".
Yes, the info-panel does not respect the full view selection. It's on the to-do list to add this ability, but it is still not implemented...
I set up the alert rule on this metric by defining a threshold to trigger the alert. Did I understand correctly?
Yes exactly!
Or the new metric should...
basically combining the two, yes looks good.
Can you verify it fixes the timeout issue as well? (or some insight on how to reproduce the issue?)
JitteryCoyote63 oh dear, let me see if we can reproduce (version 1.4 is already in internal testing, I want to verify this was fixed)
just got the pipeline to run
Nice!
using the default queue okay?
Using the default queue is fine. The different queue is the "services" queue, for which by default the "trains-server" is running an agent that will pull jobs from it.
With "services" mode, an agent will pull jobs right after the other (not waiting for the previous job to finish), as opposed to regular queue (any other) that the trains-agent will pull a job only after the previous one completed .
Are you running a jupyter notebook inside vscode ?
It's always the details... Is the new Task running inside a new subprocess ?
basically there is a difference between:
1. remote task spawning new tasks (as subprocesses, or as jobs on a remote machine), remote task still running
2. remote task is being replaced by a spawned task (same process?!)
UnevenDolphin73 am I missing a 3rd option? Which of these is your case? (rough sketch of both options below)
p.s. I have a suspicion that there might be a misuse of "Task" here?! What are you considering a Task? (from clearml's perspective a Task...
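To make the two options concrete, a rough sketch (project and queue names are placeholders, not from your setup):

from clearml import Task

# option 1: the running task stays alive and spawns a new task as a job on a remote machine
child = Task.create(project_name="examples", task_name="spawned job",
                    repo="https://github.com/your/repo.git", script="train.py")
Task.enqueue(child, queue_name="default")

# option 2: the current task re-launches itself on a queue and the local process exits
task = Task.init(project_name="examples", task_name="replace me")
task.execute_remotely(queue_name="default", exit_process=True)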
so what should the value of "upload_uri" be set to, fileserver_url e.g. ?
yes, that would work.
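For example, one way to do that from code is roughly the following (this uses the output_uri argument of Task.init; the address is a placeholder, put your own files server URL there):

from clearml import Task

# send artifact/model uploads to the files server (same value as api.files_server in clearml.conf)
task = Task.init(project_name="examples", task_name="upload demo",
                 output_uri="http://files.your-clearml-server:8081")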
No worries, basically they are independent, spin up your JupyterHub, then every user will have to set their own credentials on the JupyterLab instance they use. Maybe there is a way to somehow connect a specific OS environment user->JupyterLab in JupyterHub, that would mean users do not have to worry about credentials. wdyt?
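For example, each user could export something like this in their own JupyterLab environment (values are placeholders, taken from the credentials they create in the webapp; older trains setups use the TRAINS_ prefixed equivalents):

# per-user ClearML credentials via environment variables, instead of a clearml.conf
export CLEARML_API_HOST="http://your-clearml-server:8008"
export CLEARML_API_ACCESS_KEY="<user-access-key>"
export CLEARML_API_SECRET_KEY="<user-secret-key>"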
