
Hi WackyRabbit7
I have a pipeline controller task, which launches 30 tasks. Semantically there are 10 applications, and I run 3 tasks for each (those 3 are sequential, so in the UI it looks like 10 lines of 3 tasks).
In one of those 3 tasks that run for every app, I save a dataframe under the name "my_dataframe".
I'm assuming you mean as an artifact:
What I want to achieve is once all tasks are over, to collect all those "my_dataframe" artifacts (10 in number), extract a sin...
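For that collection step, something like this sketch could work (the project name and status filter are assumptions):

from clearml import Task

# gather all completed child tasks and pull the "my_dataframe" artifact from each
tasks = Task.get_tasks(
    project_name="my_project",              # assumed: the pipeline's project
    task_filter={"status": ["completed"]},  # only tasks that already finished
)
dataframes = [
    t.artifacts["my_dataframe"].get()       # deserializes the stored dataframe
    for t in tasks
    if "my_dataframe" in t.artifacts
]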
where people can do @'s for experiments/projects/tasks and even comparisons ...
Ohhh I like that! For me this points directly to a Slack integration.
I think my main question is: "is the discussion ephemeral?" In other words, is this an ongoing discussion that later no one will care about, or are we creating some "knowledge base" that we want to share later?
Also, by "address bar at the top", I assume you mean the address URL, right?
yes... apologies for the phrasing, it was w...
Oh, this is so that, internally, the background thread can signal it is not deferred. Are you saying there is a bug, or that the code is odd?
Hi TrickyRaccoon92
Are you sure plotly (the front-end module displaying the plots in the UI) supports it?
So this is very odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.*
because by default it ignores the 4th+ version parts (they are unstable and torch has a tendency to remove them), and for some reason pip will not match 2.1.0.*
with, for example, "2.1.0.dev20230306+cu118",
but based on the docs it should work.
As a workaround you can always edit the requirement and change it to the final URL, for example: so ...
Hi, what is host?
The IP of the machine running the ClearML server
Regarding the project name:
set_project will support project_name in the next version 🙂
Until then: project_id = [p.id for p in Task.get_projects() if p.name == project_name][0]
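A minimal sketch of that workaround in context (project and task names are assumptions):

from clearml import Task

task = Task.init(project_name="examples", task_name="move me")

# until set_project() accepts project_name, resolve the project ID by name first
project_name = "my_project"  # assumed target project
project_id = [p.id for p in Task.get_projects() if p.name == project_name][0]
task.set_project(project_id)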
I'm checking the possibility of our firewall between the clearml-agent machine and the local computer running the session
Maybe... the thing is, how come the session creates a Task and pushes it into the queue, but the Task itself is empty?
Hence my request for the clearml-session console log, i.e. an actual copy-paste of what you have in the terminal, not the Task log from the UI.
Oh!
I see, this is using the Colab as a remote agent (i.e. to launch jobs on it),
[ColabKernelApp] CRITICAL | Bad config encountered during initialization: The 'kernel_class' trait of <__main__.ColabKernelApp object at 0x7fa41b29e5c0> instance must be a type, but 'google.colab._kernel.Kernel' could not be imported
Can you send the full log?
SubstantialElk6 try to add -e CLEARML_AGENT_EXTRA_PYTHON_PATH=/code/app/flair
It should add it to the runtime PYTHONPATH
(add the -e flag to the BASE DOCKER IMAGE section on the Task itself)
When looking at the worker details, it says "No queues currently assigned to this worker"
Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing shows there. It is "listening" to queues and launching machines; those machines will be listening to the queue. I wonder if it would be easier to also make sure it is listed as "assigned" to those queues. wdyt?
I guess the thing that's missing from offline execution is being able to load an offline task without uploading it to the backend.
UnevenDolphin73 you mean, as in getting the Task object from it?
(This might be doable, the main issue would be the metrics / logs loading)
What would be the use case for the testing?
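For reference, a minimal sketch of the current offline flow (as far as I know, import_offline_session uploads the session to the backend; there is no call that rebuilds the Task object purely locally):

from clearml import Task

# record everything locally instead of sending it to the server
Task.set_offline(offline_mode=True)
task = Task.init(project_name="examples", task_name="offline run")
task.get_logger().report_scalar("metric", "series", value=0.5, iteration=0)
task.close()

# later, when online: importing the session zip uploads it to the backend
# Task.import_offline_session("path/to/offline_session.zip")  # path assumed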
Hi @<1523706645840924672:profile|VirtuousFish83>
Could it be you have some permission issues?
: Forbidden: updates to statefulset spec for fields other than 'replicas',
It might be that you will need to take it down and restart it, rather than changing it while it is running.
(do make sure you backup your server 🙂 )
re-running this code produces the same printouts
Just to be clear, you are saying the "random" results are consistent over runs?
If I don't specify the type for N in the component I get an error because N is interpreted as a string.
Yes, the default value is used for proper casting. In the next version we will use the type hints for that as well 🙂
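A sketch of what that looks like (component and parameter names are assumptions):

from clearml.automation.controller import PipelineDecorator

# the typed default (and the int hint) let the pipeline cast the incoming
# argument; without them, N would arrive as a string
@PipelineDecorator.component(return_values=["result"])
def times_two(N: int = 1):
    result = N * 2
    return result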
If I un-comment the last two lines and rerun this script, the second pipeline call results in an error:
I think that if you need multiple p...
Have a grid view (e.g. 3 plots per line instead of just one)
Yes, the plots are resizable: move the cursor to the separating line and drag 🙂
2. Check the "Group by" section; plots can be split per series (like in TB)
Generally speaking I would say the Nvidia deep-learning AMI:
https://aws.amazon.com/marketplace/pp/prodview-7ikjtg3um26wq
BTW: we are now adding "dataset chunks" for more efficient large dataset storage
Yes, I think writer.add_figure somehow crops the image.
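A minimal repro sketch for checking that (log directory and figure size are arbitrary):

import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

# log a wide matplotlib figure and inspect whether the stored image is cropped
writer = SummaryWriter(log_dir="./tb_logs")
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(range(100))
writer.add_figure("debug/figure", fig, global_step=0)
writer.close()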
We have some other parts, and in some cases the initialization time can be about 10 times the experiment time.
Before I dive into some agent in agent hacking, I would consider "caching" this preprocessing on an auxiliary Task as an artifact. Basically add another argument for the auxiliary Task, and fetch the data from it (obviously you will need to run it once before the optimizer launches the first experiment).
Now that is out of the way (which really would be the preferred engin...
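A rough sketch of that caching idea (project, task, and artifact names are assumptions):

import pandas as pd
from clearml import Task

# run once, before the optimizer launches the first experiment:
aux = Task.init(project_name="examples", task_name="preprocessing")
df = pd.DataFrame({"x": [1, 2, 3]})  # stand-in for the real preprocessing output
aux.upload_artifact("preprocessed_data", artifact_object=df)
aux.close()

# inside every optimizer experiment, fetch the cached result instead of recomputing:
aux_task = Task.get_task(project_name="examples", task_name="preprocessing")
df = aux_task.artifacts["preprocessed_data"].get()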
Hmm TrickyRaccoon92 take a look at the cleanup service; I think you can hack it so that instead of deleting the artifacts, it archives them somewhere (you can also change the filter, e.g. only act on experiments with a specific user tag)
What do you think?
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
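A hypothetical variation on that service (tag and project names are assumptions; the "archive somewhere" step is left as a stub):

from clearml import Task

# only touch completed experiments carrying a specific user tag
tasks = Task.get_tasks(
    project_name="examples",                # assumed
    tags=["cleanup-ok"],                    # assumed user tag filter
    task_filter={"status": ["completed"]},
)
for t in tasks:
    for name, artifact in t.artifacts.items():
        local_copy = artifact.get_local_copy()  # download instead of deleting
        # ... move local_copy to long-term storage here ...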
Oh, I was assuming you are passing the entire DB backups to the cloud.
Are you saying you just want the file server on the cloud? If that is the case, I would just use S3.
sudo curl -L "
-s)-$(uname -m)" -o /usr/local/bin/docker-compose
Hi @<1576381444509405184:profile|ManiacalLizard2>
If you make sure all server access is via a host name (i.e. instead of IP:port, use host_address:port), you should be able to replace it with cloud host on the same port
follow the backup procedure, it is basically the same process
Is this consistent for the same file? Can you provide a code snippet to reproduce it (or to understand the flow)?
Could it be two machines are accessing the same cache folder ?
Is there any way to make that increment from the last run?
pipeline_task = Task.clone("pipeline_id_here", name="new execution run here")
Task.enqueue(pipeline_task, queue_name="services")
wdyt?
If a Task is in the 'Completed' state, I think the only option is to 'Reset' it (see image).
In the UI, yes; in code you can do task.mark_aborted(force=True)
You do clear the previous run's execution, but I think for a repetitive task this is fine.
I would avoid that, no?