Hi JealousParrot68
clearml tracking of experiments run through kedro (similar to tracking with mlflow)
That's definitely very easy. I'm still not sure how Kedro scales on clusters, though. From what I saw (and I might have missed it), it seems more like a single instance with sub-processes, with no real ability to set up a different environment for the different steps in the pipeline. Is this correct?
I think the challenge here is to pick the right matching abstraction. E.g. should a node in kedro (w...
in the docker-compose file. Still strange...
hmm yes it is... If you have an idea on what went wrong let me know, we would love to fix it
Could you send the "installed packages" section of the Task that was created in the notebook?
BTW:
======> WARNING! Git diff too large to store (1327kb), skipping uncommitted changes <======
This means all your git changes are stored as an artifact, which is consistent with the "wait for upload" message.
Hi SmugOx94
Hmm, are you creating the environment manually, or is it done by Task.init?
(Basically Task.init will store the entire environment of conda, and if the agent is working with conda package manager it will use it to restore it)
https://github.com/allegroai/clearml-agent/blob/77d6ff6630e97ec9a322e6d265cd874d0ab00c87/docs/clearml.conf#L50
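That setting looks roughly like this; a sketch of the relevant clearml.conf excerpt:

```
agent {
    package_manager {
        # "pip" (default) or "conda"; with conda the agent restores
        # the conda environment recorded by Task.init.
        type: conda
    }
}
```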
Hi HealthyStarfish45
You can disable the entire TB logging:
Task.init('examples', 'train', auto_connect_frameworks={'tensorflow': False})
It is the folder clearml creates and the folder we create ourselves to store the predictions.
I see... If that is the case, the only solution I can think of is manually uploading the files with StorageManager(...), getting the URL, and registering it as debug media or an artifact:
logger.report_media("image", "type a", iteration=iteration, url="...")
task.upload_artifact('a link', artifact_object='...')
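A minimal sketch of that flow, assuming an S3 destination (the local path and bucket are placeholders):

```
from clearml import Task, StorageManager

task = Task.init(project_name='examples', task_name='manual media upload')

# Upload the local file to your own storage and get back the remote URL
# ('/tmp/img.png' and the bucket path are placeholders).
url = StorageManager.upload_file('/tmp/img.png', 's3://my-bucket/media/img.png')

# Register it as a debug sample...
task.get_logger().report_media('image', 'type a', iteration=0, url=url)

# ...or as an artifact stored by reference.
task.upload_artifact('a link', artifact_object=url)
```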
```
Collecting inplace-abn==1.0.12
  Downloading inplace-abn-1.0.12.tar.gz (137 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/ubuntu/.clearml/venvs-builds/3.8/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3qf6et/inplace-abn_15b6998cb4af4199a7692be5d3a3538f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xf3qf6et/inplace-abn_15b6998cb4af4199a7692be5d3a3538f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f...
```
CurvedHedgehog15 the agent has two modes of operation:
- single script file (or jupyter notebook), where the Task stores the entire file on the Task itself
- multiple files, which is only supported if you are working inside a git repository (basically the Task stores a reference to the git repository and the agent pulls it from the git repo)
Seems you are missing the git repo, could that be?
VirtuousFish83
could it be that "inplace-abn" needs torch while the package is being installed?
I look forward to your response on Github.
Great, I would like to make this discussion a bit more open and accessible so GitHub is probably better
I'd like to start contributing to the project...
That will be awesome!
Hi LazyFox65
So the idea is that you add two lines of code to your codebase:
from clearml import Task
task = Task.init(project_name='examples', task_name='change me')
And you run it once; it will create the experiment, environment, arguments, etc.
Now that you have it in the UI you can clone / change all the fields and send for execution.
That said you can also create an experiment from CLI (basically pointing to a repo and entry point)
You can read here:
https://github.com/allegroa...
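For reference, a minimal clone-and-enqueue sketch (the parameter name and queue are placeholders, not something from this thread):

```
from clearml import Task

# Grab the template experiment created by the initial run.
template = Task.get_task(project_name='examples', task_name='change me')

# Clone it, override whatever fields you need, and send it to an agent queue.
cloned = Task.clone(source_task=template, name='change me - cloned')
cloned.set_parameter('General/learning_rate', 0.01)  # hypothetical parameter
Task.enqueue(cloned, queue_name='default')
```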
Should I make a new issue or just reply on the one I mentioned above?
Maybe a new issue with the merge, and then the hack+fix? what do you think?
EnviousStarfish54 regarding the file server: you have one built into the trains-server, and it is the default location to store all artifacts. You can also use external solutions like S3, GS, Azure, etc.
Regarding the models: any model store/load is automatically logged as long as you are using one of the supported frameworks (TF, Keras, PyTorch, scikit-learn).
If you want your model to be automatically uploaded, just add output_uri:
task = Task.init('examples', 'model', output_uri='http://trai...
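A fuller sketch of that behavior; the file-server address below is a placeholder for your own trains-server host:

```
import torch
from clearml import Task

# output_uri is where model snapshots get uploaded
# ('http://localhost:8081' stands in for your trains-server file server).
task = Task.init('examples', 'model', output_uri='http://localhost:8081')

# Any checkpoint saved through a supported framework is now uploaded
# automatically, e.g. with PyTorch:
torch.save({'weights': 1}, 'model.pt')
```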
Hi EnthusiasticCoyote38
But once one process finished, it changed the task status to completed. Maybe you know some safe way to deal with such a situation? Or maybe the best way is to check the task status before uploading the object?
Well, you can actually forcefully set the state of the Task to running, then add artifacts, then close it?
would that work?
```
my_other_task.reload()
my_other_task.mark_started(force=True)
my_other_task.upload_artifact(...)
my_other_task.flush(wait_for_uploads=True)
my_othe...
```
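The last line is cut off; presumably it restores the finished state. A complete sketch under that assumption (the task id and artifact are placeholders):

```
from clearml import Task

my_other_task = Task.get_task(task_id='abc123')  # placeholder task id

my_other_task.reload()
my_other_task.mark_started(force=True)      # forcefully re-open the completed task
my_other_task.upload_artifact('results', artifact_object={'acc': 0.9})  # placeholder
my_other_task.flush(wait_for_uploads=True)  # make sure the upload finished
my_other_task.mark_stopped()                # assumption: mark it finished again
```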
Hi EnviousStarfish54
Artifacts are stored per experiment, that means that storage wise every experiment uploading an artifact (even if it is the same file content as previous execution) will create a new file on the central storage (default being the trains-server)
As for the preferred way to share data / artifacts. Where do you have your trains server ? Is it local ? Cloud? Where do you access it from home? VPN?
It is recommended to create a git TOKEN with read only permissions and use it (more secure) 🙂
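In the agent's clearml.conf that would look roughly like this (values are placeholders):

```
agent {
    # Git credentials the agent uses to clone the experiment repository.
    # With a personal access token, git_pass holds the token itself.
    git_user: "my-git-username"
    git_pass: "my-read-only-token"
}
```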
MelancholyElk85 notice there is the pipeline controller queue (i.e. which agent will run the logic of the pipeline), and the default queue for the pipeline steps (i.e. the actual steps of the pipeline).
The default queue for the pipeline logic itself is services; you can change it with pipeline.start(..., queue='another_q').
Make sense?
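A minimal sketch of the two queues (project/task/queue names are placeholders):

```
from clearml import PipelineController

pipe = PipelineController(name='my pipeline', project='examples', version='1.0')

# Each step executes on its own agent queue.
pipe.add_step(
    name='step_one',
    base_task_project='examples',
    base_task_name='step one template',
    execution_queue='default',
)

# The queue argument controls where the pipeline *logic* runs
# (it defaults to 'services' when omitted).
pipe.start(queue='another_q')
```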
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞
SmarmyDolphin68
Debug Samples tab and not the Plots,
Are you doing plt.imshow?
Also make sure you have report_image=False when calling report_matplotlib_figure (if it is true, it will upload the figure as an image to "debug samples").
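A minimal sketch of keeping a matplotlib figure in the Plots tab:

```
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name='examples', task_name='matplotlib plots')

fig = plt.figure()
plt.plot([1, 2, 3], [4, 5, 6])

# report_image=False keeps the figure under Plots;
# True would upload it as an image under Debug Samples.
task.get_logger().report_matplotlib_figure(
    title='my figure', series='series a', iteration=0,
    figure=fig, report_image=False,
)
```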
As we use a custom CUDA image, we do not want this running on user login, and get ugly error messages about missing symlinks.
You can customize the startup bash script (running inside any container) here:
https://github.com/allegroai/clearml-agent/blob/bf07b7f76d3236c1118b81730c6d9718705a795a/docs/clearml.conf#L145
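Something along these lines in clearml.conf (the apt packages are just examples):

```
agent {
    # Bash commands executed inside every container the agent spins up,
    # before the experiment itself starts.
    docker_init_bash_script = [
        "apt-get update",
        "apt-get install -y git",
    ]
}
```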
LackadaisicalOtter14 Would that help?
So what will you query?
Maybe that's the issue :
https://github.com/googleapis/python-storage/issues/74#issuecomment-602487082
suspect permissions, but not entirely sure what and where
Seems like it.
Check the config file on the agent machine
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L18
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L19
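Those two lines are the server addresses the agent connects to; roughly (hosts are placeholders for your own deployment):

```
api {
    # Web UI and API endpoints of your clearml/trains server.
    web_server: http://localhost:8080
    api_server: http://localhost:8008
}
```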
Hi OutrageousGrasshopper93
which framework are you using?
- trains-agent will pull the correct torch based on the cuda version it detects, but there is no such thing for TF.
- In the default venv mode, trains-agent creates a new venv for the experiment (not conda) and everything is installed there. If you need conda, you need to change the package_manager to conda: https://github.com/allegroai/trains-agent/blob/de332b9e6b66a2e7c6736d12614de9870eff48bc/docs/trains.conf#L49
The safest way to control CUDA dri...
Hmm I'm assuming something wrong here:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L119
What's the host machine OS?