
BurlyRaccoon64 by default, if `.ssh` exists in the host user's home folder, it should be mounted into the container (actually, a copy of it is mounted). Do you have logs of two tasks from two different machines, one failing and one passing? Because this is quite odd (assuming the setup itself is identical).
That makes sense to me, what do you think about the following:
```python
from clearml import PipelineDecorator

class AbstractPipeline(object):
    def __init__(self):
        pass

    @PipelineDecorator.pipeline(...)
    def run(self, run_arg):
        data = self.step1(run_arg)
        final_model = self.step2(data)
        self.upload_model(final_model)

    @PipelineDecorator.component(...)
    def step1(self, arg_a):
        # do something
        return value

    @PipelineDecorator.component(...)
    def step2(self, arg_b):
        # do ...
```
Hi ContemplativePuppy11
This is a really interesting point.
Maybe you can provide a pseudo-class abstract of your current pipeline design? That would help in understanding what you are trying to achieve and how to make it easier to get there.
Hi @<1554275802437128192:profile|CumbersomeBee33>
what do you mean by "will the dependencies will be removed or not" ?
The next time the agent spins up a new Task, it will create a new venv and delete the previous one.
EnormousWorm79 do you mean getting the DAG graph of the Dataset (like you see in the Plots section)?
LudicrousDeer3 when using Logger you can provide the `iteration` argument — is this what you are looking for?
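For illustration, here is a minimal hedged sketch of explicit iteration numbering. The `report_losses` helper and the loss values are hypothetical (not from this thread); in a real script you would pass `Logger.current_logger()` obtained after `Task.init`:

```python
# Hypothetical helper: report each scalar at an explicit iteration number.
# `logger` is expected to be a clearml Logger instance; pass None to skip
# actual reporting (useful for a dry run).
def report_losses(logger, losses):
    pairs = []
    for iteration, loss in enumerate(losses):
        pairs.append((iteration, loss))
        if logger is not None:
            # `iteration` sets the x-axis position of this point,
            # instead of an auto-incremented counter
            logger.report_scalar(title="train", series="loss",
                                 value=loss, iteration=iteration)
    return pairs
```

With a real Logger, each reported point lands at the iteration you pass, so you fully control the x-axis of the scalar plot.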
Yes, I was thinking of a separate branch.
The main issue with telling git to skip submodules is that it will be easily forgotten and will break stuff. BTW the git repo itself is cached, so the second time there is no actual pull. Lastly, it's not clear where one could pass a git argument per task. Wdyt?
ThickDove42 looking at the code, I suspect it fails interacting with the actual jupyter server (that is running on the same machine, but still).
Any chance you have a firewall on the Windows machine ?
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:
`Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0`
But torchvision is downloaded from the cuda 11 folder...
I...
Notice the configuration parameters:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L160
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L162
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L156
Are these experiments logged too (with the train-valid curves, etc)?
Yes, every run is logged as a new experiment (with its own set of HP). Do notice that the execution itself is done by the trains-agent. Meaning the HP process creates experiments with a new set of HP and puts them into the execution queue, then trains-agent pulls them from the queue and starts executing them. You can have multiple trains-agents on as many machines as you like, with specific GPUs etc. each one ...
Hi SourSwallow36
What do you mean by "Log each experiment separately"? How would you differentiate between them?
Hi @<1523706645840924672:profile|VirtuousFish83>
could it be you have some permission issues ?
: Forbidden: updates to statefulset spec for fields other than 'replicas',
It might be that you will need to take it down and restart it — not while it is running.
(do make sure you backup your server 🙂 )
Hi ContemplativeCockroach39
Seems like you are running the exact code as in the git repo:
Basically it points you to the exact repository https://github.com/allegroai/clearml and the script examples/reporting/pandas_reporting.py
Specifically:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/reporting/pandas_reporting.py
GreasyPenguin14 I think the default is reporting on failed tasks only? could that be?
Obviously if you click on them you will be able to compare based on specific metric / parameters (either as table or in parallel coordinates)
Assuming you are using docker-compose, the console output is a good start
GrievingTurkey78 can you send the entire log?
If the manual execution (i.e. pycharm) was working it should have stored it on the Pipeline Task.
looks like a great idea, I'll make sure to pass it along and that someone reply 🙂
Sure thing, let me know ... 🙂
LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance 🙂
Hi FiercePenguin76
It seems it fails detecting the notebook server and thinks this is a "script running".
What is exactly your setup?
docker image ?
jupyter-lab version ?
clearml version?
Also are you getting any warning when calling Task.init ?
Oh, so the pipeline basically makes itself their parent, this means you can get their IDs:
```python
steps_ids = Task.query_tasks(task_filter=dict(parent=<pipeline_id_here>))
for task_id in steps_ids:
    task = Task.get_task(task_id)
```