I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.
The first line of the Task console log should have the exact docker command that was used, this could be a good start
also check if there is any chance there is another agent listening to this queue, maybe it actually runs somewhere without a gpu at all?
Oh in that case add --remote-gateway <external_ip>
It will connect to the provided address instead of the local one. (you can also add --public-ip
which will automatically resolve the public IP of the server
Hi DefeatedCrab47
You mean by trains-agent, or accumulated over all experiences ?
You can see the class here:
https://github.com/allegroai/clearml/blob/9b962bae4b1ccc448e1807e1688fe193454c1da1/clearml/binding/frameworks/init.py#L52
Basically you do:
` def my_callback(load_or_save, model):
# type: (str, WeightsFileHandler.ModelInfo) -> WeightsFileHandler.ModelInfo
assert load_or_save not in ('load', 'save')
# do something
if skip:
return None
return model
WeightsFileHandler.add_pre_callback(my_callback) `
GrievingTurkey78 Actually it is in progress, see the GitHub issue for details:
https://github.com/allegroai/trains/issues/219
Your code should have worked, i.e. you should see the 'model.h5' in the artifacts tab. What do you have there?
It should look something like this one:
https://demoapp.trains.allegro.ai/projects/531785e122644ca5b85b2e19b0321def/experiments/e185cf31b2634e95abc7f9fbdef60e0f/artifacts/output-model
BTW:
To manually register any model:
from trains import Task, OutputModel task = Task.init('examples', 'my model') OutputModel().update_weights('my_best_model.h5')
HealthyStarfish45 my apologies, they do have it (this ability needs support for both trains-agent and server) but not in the open-source ...
Hi JitteryCoyote63
What do you have in the agent.cuda_version
?
(you can see it printed at the beginning of the log)
Hi ColossalDeer61 ,
the next trains-agent RC (solving the #196 issue) will also solve the double install issue 🙂
Hi TrickyRaccoon92 , TB is automatically collected and converted into data stored on the system The UI uses plotly to display the data itself (on your web browser).
You still have the original TB protobuf file, if you want to dive deeper and debug the data (it is not automatically uploaded, but some users do upload it as additional artifact on the experiment)
Make sense ?
t = Task.get_task('aabbcc') t.update_task(task_data={'task_type': "optimizer"})
on the host machine or inside the containers that are spinning on the host machine ?
Firstly, thank you for your efforts and your support.
Thanks SmugOx94 !
Are you running trains-agent
in docker mode? The aforementioned scripts are executed before, the experiment is being cloned, they are meant to be a part of the docker setup, not a per experiment script.
You could try to edit the experiment and have:
Working Directory: "."
(that means the root of the repository)
Script Path: "experiments_that_uses_library/train.py"
This will make sure you can do "import l...
Well I guess you can say this is definitely not self explanatory line 😉
but, it is actually asking whether we should extract the code, think of it as:if extract_archive and cached_file: return cls._extract_to_cache(cached_file, name)
Hmm, you are missing the entry point in the execution (script path).
Also as I mentioned you can either have a git repo or script in the uncommitted changes, but not both (if you have a git repo then the uncommitted changes are the git diff)
Yes, that seems to be the case. That said they should have different worker IDs agent-0 and agent-1 ...
What's your trains-agent version ?
I prefer serving my models in-house and only performing the monitoring via ClearML.
clearml-serving
is an infrastructure for you to run models 🙂
to clarify, clearml-serving
is running on your end (meaning this is not SaaS where a 3rd party is running the model)
By the way, I saw there is a project dashboard app which might support the visualization I am looking for. Is it suitable for such use case?
Hmm interesting, actually it might, it does collect matrices over time ...
Hi UnsightlySeagull42
Basically you can get the agent to always add additional arguments for the docker run, such as -v for mounting:
https://github.com/allegroai/clearml-agent/blob/948fc4c6ce1ecf33a74619ad570d69b8188f6db9/docs/clearml.conf#L133
do I need to have the repo that I am running on my account
If it is a public repo, then no need, credentials are only needed for private repos 🙂
Am I missing something ?
WickedGoat98 I suspect the main difference is with GitHub your are cloning with https (i.e. not credentials needed) , but with gitlab you are using SSH authentication to clone the repository .If on the machine running the trains-agent
you can "git clone" your repository (i.e. from command line), the trains-agent should be able to do the same (basically make sure you have the SSH keys in your ~/.ssh folder.
Are you testing the trains-agent service from (i.e. from the docker compose) o...
im not running in docker mode though
hmmm that might be the first issue. it cannot skip venv creation, it can however use a pre-existing venv (but it will change it every time it installs a missing package)
so setting CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 in non docker mode has no affect
Hmm I see, add this for example
extra_docker_shell_script: ["rm ~/.bashrc", "echo removed bashrc"]
Maybe this one?
https://github.com/allegroai/clearml/issues/448
I think it is already there (i.e. 1.1.1)
Hi UnsightlyShark53 , just a quick FYI, you can also log the entire config file config.json
this will be stored as model configuration, and you can see it in the input/output models under the artifacts tab.
See example here you can path either the path to the configuration file, or the dictionary itself after you loaded the json, whatever is more convenient :)
Hi MagnificentPig49 unfortunately it's only in the trains-server docker, we are working on making it "presentable" and uploading it to it's repo.
It's written in Angular (v8 I think). Do you want to help out, it will definitely incentive the guys to tidy up the code and upload it :)
Hi WorriedParrot51
Let me shed some light on this complicated mechanism, because this is not very straight forward.
Basically the agent signals the trains package it should ignore the code calls, and use a specific Task in the backend (i.e. if in manual mode, the trains package logs the data into the trains-server, in agent mode (remote mode), it does the opposite and takes the data from the trains-server "into" the code)
Specifically, just like in manual mode, calling argparse.parse is be...
Regulatory reasons and proprietary data is what I had in mind. We have some projects that may need to be fully self hosted in the end
If this is the case then, yes do self-hosted, or talk to clearml sales to get the VPC option, but SaaS is just not the right option
I might take a look at it when I get a chance but I think I'd have to see if ClearML is a good fit for our use case before I can justify the commitment
I hope it is 🙂
: For artifacts already registered, returns simply the entry and for artifacts not existing, contact server to retrieve them
This is the current state.
Downloading the artifacts is done only when actually calling get()/get_local_copy()