What about the epochs though? Is there a recommended number of epochs when you train on that new batch?
I'm assuming you are also using the "old" images ?
The main factor here is the ratio between the previously used data and the newly added data; you might also want to resample (i.e., train more on) the new data vs. the old data. Makes sense?
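A minimal sketch of the resampling idea above: repeat each newly added item a few times so the effective old/new ratio shifts toward the new batch. The function name and the oversampling factor are assumptions, not ClearML API.

```python
import random

def build_training_list(old_items, new_items, new_oversample=3, seed=0):
    """Sketch: oversample newly added data relative to the old pool.

    new_oversample controls how many times each new item is repeated,
    shifting the old/new ratio toward the new batch.
    """
    combined = list(old_items) + list(new_items) * new_oversample
    random.Random(seed).shuffle(combined)
    return combined
```

With two old images and one new image at `new_oversample=3`, the new image ends up in 3 of the 5 training entries.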
Now I can't download either of them.
It would be nice if the addresses of the artifacts (state and zips) were assembled on the fly and not hardcoded into the DB.
The idea is that this is fully federated; the server is not actually aware of it, so users can manage multiple storage locations in a transparent way.
If you have any tips on how to fix it in the MongoDB, that would be great...
Yes, that should be similar, but the links would be in the artifacts property on the Task object.
not exactly...
To automate the process, we could use a pipeline, but first we need to understand the manual workflow
I've tried setting up a ClearML application on OpenShift.
First, my condolences 🙂 OpenShift...
Second, what you need to make sure is that each container (i.e. ELK/Mongo etc.) has its own PV for persistent storage; I'm assuming this is the root cause of the error.
Makes sense to you?
Nice debugging experience
Kudos on the work !
BTW, I feel weird to add an issue on their github, but someone should, this generic setup will break all sorts of things ...
Hi VexedCat68
Could it be the Python version is not the same? (This is the only reason it would fail to find a specific Python package version.)
This task is picked up by the first agent; it runs the DDP launch script for itself, then creates clones of itself with task.create_function_task() and passes its address as an argument to the function.
Hi UnevenHorse85
Interesting use case, just for my understanding, the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
passes its address as argument to the function
This seems like a great solution.
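A hedged sketch of the flow discussed above: the main task spawns helper tasks with create_function_task(), passing the master's address so each clone can join the PyTorch DDP process group. The worker function, host address, port, and task names here are all placeholders, not values from the thread.

```python
def ddp_worker(master_addr, master_port=29500, rank=1, world_size=2):
    """Runs inside the cloned task; builds the init string DDP needs."""
    init_method = "tcp://{}:{}".format(master_addr, master_port)
    # In the real worker you would now join the process group, e.g.:
    # torch.distributed.init_process_group("nccl", init_method=init_method,
    #                                      rank=rank, world_size=world_size)
    return init_method

if __name__ == "__main__":
    # Requires a reachable ClearML server; project/task names are placeholders.
    from clearml import Task

    task = Task.init(project_name="debug", task_name="ddp master")
    # Each call clones the current task and schedules ddp_worker to run
    # remotely with the given arguments (picked up by an agent).
    task.create_function_task(
        ddp_worker, func_name="worker_rank1", task_name="ddp worker 1",
        master_addr="10.0.0.5", rank=1, world_size=2)
```

The master resolves its own address first, then hands it to every clone so they all point at the same rendezvous endpoint.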
the queu...
Is there any way to make that increment from the last run?
` pipeline_task = Task.clone("pipeline_id_here", name="new execution run here")
Task.enqueue(pipeline_task, queue_name="services") `
wdyt?
There was a problem with the index order when converting from a PyTorch tensor to a NumPy array.
HealthyStarfish45 I'm assuming you are sending NumPy arrays to report_image (which makes sense). If you want to debug it, you can also test TensorBoard's add_image or matplotlib's imshow; both will send debug images.
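The index-order issue mentioned above is usually the channels-first vs. channels-last layout: PyTorch tensors are typically (C, H, W) while image reporting expects (H, W, C). A small sketch of the conversion (the helper name is mine, and the commented reporting call assumes an already-initialized Task):

```python
import numpy as np

def chw_to_hwc(img):
    """Convert a (C, H, W) tensor-style array to (H, W, C) for image reporting."""
    assert img.ndim == 3, "expected a 3-D array"
    return np.transpose(img, (1, 2, 0))

# Usage with ClearML (names are placeholders):
# Task.current_task().get_logger().report_image(
#     "debug", "sample", iteration=0,
#     image=chw_to_hwc(tensor.cpu().numpy()))
```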
https://stackoverflow.com/questions/60860121/plotly-how-to-make-an-annotated-confusion-matrix-using-a-heatmap
MagnificentSeaurchin79 see plotly example here:
https://allegro.ai/clearml/docs/docs/examples/reporting/plotly_reporting.html
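Tying the two links together, here is a hedged sketch of building an annotated confusion-matrix heatmap as a plain Plotly-figure dict (so it needs no plotly import) and handing it to ClearML's report_plotly. The helper name and the commented reporting call's title/series are assumptions; a `plotly.graph_objects.Figure` works there as well.

```python
import numpy as np

def confusion_heatmap_figure(cm, labels):
    """Build a Plotly-figure dict (heatmap + per-cell annotations)
    from a confusion matrix."""
    cm = np.asarray(cm)
    annotations = [
        {"x": labels[j], "y": labels[i], "text": str(cm[i, j]), "showarrow": False}
        for i in range(cm.shape[0])
        for j in range(cm.shape[1])
    ]
    return {
        "data": [{"type": "heatmap", "z": cm.tolist(), "x": labels, "y": labels}],
        "layout": {"title": "Confusion matrix", "annotations": annotations},
    }

# Reporting with ClearML (assumes an initialized Task):
# Task.current_task().get_logger().report_plotly(
#     title="confusion", series="val", iteration=0,
#     figure=confusion_heatmap_figure([[5, 1], [2, 7]], ["cat", "dog"]))
```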
Probably not the case the other way around.
Actually it's the other way around: the new pip version uses a new package dependency resolver that can conclude that a previous package setup is not supported (because of version conflicts) even though it worked...
It is tricky: pip is trying to get better at resolving package dependencies, but it means that old resolutions might not work, which would mean old environments cannot be restored (or are "broken" envs). This is the main reason not to move to p...
The file is never touched; nowhere in the process is that file deleted.
It should never have gotten there; this is not the git repo folder, it is one level above...
But I still need the load balancer...
No, you are good to go. As long as something registers the pods' IPs automatically on a DNS service (local/public), you can use the registered address instead of the IP itself (obviously with the port suffix).
Thanks for your support
With pleasure!
If you are using the "default" queue for the agent, notice you might need to run the agent with --services-mode to allow for multiple pipeline components on the same machine.
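A minimal sketch of the agent invocation described above; the queue name "services" is an assumption, substitute your own.

```shell
# Start an agent on the "services" queue in services mode, so several
# pipeline controllers/components can share one machine.
clearml-agent daemon --queue services --services-mode --detached
```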
Hi RobustGoldfish9 ,
I'd much rather just have trains-agent just automatically build the image defined there than have to build the image separately and make it available for all the agents to pull.
Do you mean there is no docker image in the artifactory built based on your Dockerfile ?
SmarmyDolphin68
BTW: there is no automatic reporting when you have task = Task.get_task(task_id='your_task_id')
It's only active when you have one "main" task.
You can also check the continue_last_task argument in Task.init; it might be a good fit for your scenario.
https://allegro.ai/docs/task.html#trains.task.Task.init
Hmm, what's the OS and Python version?
Is this simple example working for you?
Hi GreasyPenguin66
Is this for the client side? If it is, why not set them in the clearml.conf?
Hi PompousBeetle71
I remember it was an issue, but it was solved a while ago. Which Trains version are you using?
Okay, this is a bit tricky (and come to think about it, we should allow a more direct interface):
` pipe.add_step(
    name='train',
    parents=['data_pipeline', ],
    base_task_project='xxx',
    base_task_name='yyy',
    task_overrides={'configuration.OmegaConf': dict(value=yaml.dump(MY_NEW_CONFIG), name='OmegaConf', type='OmegaConf YAML')},
) `
Notice that if you had any other configuration on the base task, you should add them as well (basically it overwrites the configurati...
PipelineController works with the default image, but it incurs an overhead of 4-5 minutes.
You can try to spin up the "services" queue without docker support; if there is no need for containers, it will accelerate the process.
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
This error is about failing to clone the pipeline code repo; how is that connected to changing the container?!
Can you provide the full log?
This, however, requires that I slightly modify the clearml helm chart with the aws-autoscaler deployment, right?
Correct 🙂
Hi DilapidatedDucks58
Is this something new?
Usually copy-pasting directly from the UI parses everything, no?
Done HandsomeCrow5 +1 added 🙂
BTW: if you feel you can share how your reports look (a screenshot is great), that would greatly help in supporting this feature. Thanks!
` from time import sleep
from clearml import Task
import tqdm

task = Task.init(project_name='debug', task_name='test tqdm cr cl')
print('start')
for i in tqdm.tqdm(range(100)):
    sleep(1)
print('done') `
The above example code will output a line every 10 seconds (with the default console_cr_flush_period=10); can you verify it works for you?