Hi ExcitedFish86
Of course, this is what it was designed for. Notice in the UI under Execution you can edit this section (Setup Shell Script). You can also set it programmatically via task.set_base_docker
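For example, something along these lines (a rough sketch; the image, arguments and setup commands are placeholders, and the exact keyword arguments depend on your clearml version):
from clearml import Task

task = Task.init(project_name="examples", task_name="base docker demo")

# Set the container image, extra docker arguments and the setup shell script
# (newer versions accept these keyword arguments; older ones take a single docker_cmd string)
task.set_base_docker(
    docker_image="nvidia/cuda:11.4.0-runtime-ubuntu20.04",   # placeholder image
    docker_arguments="--ipc=host",
    docker_setup_bash_script=["apt-get update", "apt-get install -y git"],
)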
Hi SpotlessFish46 ,
Is the artifact already in S3 ?
Is the S3 bucket configured as the default files_server in the trains.conf ?
You can always use the StorageManager to upload to wherever you want and register the url on the artifacts.
You can also programmatically change the artifact destination server to S3, then upload the artifact as usual (quick sketch below).
What would be the best match for you?
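Something along these lines (a rough sketch; the bucket path and artifact names are placeholders, and output_uri behaviour may differ slightly between versions):
from clearml import Task, StorageManager

# Option 1: point the task's default output destination at S3, then upload the artifact as usual
task = Task.init(
    project_name="examples",
    task_name="artifact to s3",
    output_uri="s3://my-bucket/artifacts",  # placeholder bucket
)
task.upload_artifact("my_data", artifact_object={"foo": "bar"})

# Option 2: upload the file yourself with StorageManager and keep the returned URL
remote_url = StorageManager.upload_file(
    local_file="data.csv",                              # placeholder local file
    remote_url="s3://my-bucket/artifacts/data.csv",
)
print(remote_url)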
Hi WhimsicalLion91
You can always explicitly send a value:
from trains import Logger
Logger.current_logger().report_scalar("title", "series", iteration=0, value=1337)
A full example can be found here:
https://github.com/allegroai/trains/blob/master/examples/reporting/scalar_reporting.py
So are you saying the large file size download is the issue ? (i.e. network issues)
MysteriousBee56 Edit in your ~/trains.conf:
api_server: http://localhost:8008
to
api_server: http://192.168.1.11:8008
and obviously the same for web & files
I'll make sure we fix the trains-agent to output an error message instead of trying to silently keep accessing the API server
Getting your machine ip:
just run:
ifconfig | grep 'inet addr:'
Then you should see a bunch of lines, pick the one that does not start with 127 or 172
Then to verify, run:
ping <my_ip_here>
JitteryCoyote63 this is standard ssh authorized server removal
https://superuser.com/a/30089
specifically you can try:
ssh-keygen -R 10.105.1.77
Hi CluelessElephant89
I'm thinking that different users might want to comment on results of an experiment and stuff. I'm sure these things can be done externally on a github thread attached to the experiment
Interesting! Like a "comment section" on top of a Task ?
Or should it be a project ?
Basically I have this intuition that Task granularity might be too small (I would want to talk about multiple experiments, not a single one?) and a project might be too generic ?
wdyt?
btw: The addr...
I think that clearml should be able to do parameter sweeps using pipelines in a manner that makes use of parallelisation.
Use the HPO, it is basically doing the same thing with a more sophisticated algorithm (e.g. BOHB):
https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
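A minimal sketch of wiring it up (the base task id, queue and parameter names are placeholders; OptimizerBOHB also needs the hpbandster package installed):
from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange
from clearml.automation.hpbandster import OptimizerBOHB

task = Task.init(project_name="examples", task_name="HPO controller", task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<template_task_id>",   # placeholder: the experiment to clone per trial
    hyper_parameters=[
        UniformIntegerParameterRange("General/batch_size", min_value=16, max_value=128, step_size=16),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerBOHB,
    max_number_of_concurrent_tasks=2,
    execution_queue="default",
    total_max_jobs=20,
)
optimizer.start()
optimizer.wait()
optimizer.stop()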
For example - how would this task-based example be done with pipelines?
Sure, you could do something like:
` from clearml import Pi...
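A rough sketch of a parameter-sweep pipeline (the step names, training function and queue are illustrative, assuming PipelineController from a recent clearml version):
from clearml import PipelineController

def train_step(learning_rate):
    # placeholder training function; report/return whatever metric you sweep on
    return {"lr": learning_rate}

pipe = PipelineController(name="param sweep", project="examples", version="1.0")

# one step per parameter value; the agents can execute the steps in parallel
for i, lr in enumerate([0.1, 0.01, 0.001]):
    pipe.add_function_step(
        name=f"train_{i}",
        function=train_step,
        function_kwargs={"learning_rate": lr},
        function_return=["result"],
        execution_queue="default",
    )

pipe.start(queue="services")  # or pipe.start_locally() to debug the pipeline logic on this machine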
Is there a helper function option at all that means you can flush the clearml-agent working space automatically, or by command?
On every Task execution the agent clears the venv (packages are cached locally, but the actual venv is cleared). If you want, you can turn on the venv cache, but there is no need to manually clear the agent's cache.
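To turn on the venv cache, uncomment the venvs_cache section in the agent's clearml.conf (these are the defaults from the sample config, adjust the path as needed):
agent {
    venvs_cache: {
        max_entries: 10
        free_space_threshold_gb: 2.0
        path: ~/.clearml/venvs-cache
    }
}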
ReassuredTiger98 both are running with pip as package manager, I thought you mentioned conda as package manager, no?
agent.package_manager.type = pip
Also the failed execution is looking for "ruamel_yaml_conda" but it is nowhere to be found on the original one?! how is that possible ?
Hi ReassuredTiger98
Could you send the log of both runs ?
(I'm not sure this is a bug, or some misconfiguration , but the scenario should have worked...)
preinstalled in the environment (e.g. nvidia docker). These packages may not be available via pip, so the run will fail.
Okay that's the part that I'm missing, how come in the first run the package existed and in the cloned Task they are missing? I'm assuming agents are configured basically the same (i.e. docker mode with the same network access). What did I miss here ?
Hi ExuberantParrot61 the odd thing is this message:
No repository found, storing script code instead
when you are actually running from inside the repo...
is it saying that on a specific step, or is it on the pipeline logic itself?
Also any chance you can share the full console output ?
BTW:
you can manually specify a repo branch for a step:
https://github.com/allegroai/clearml/blob/a492ee50fbf78d5ae07b603445f4983feb9da8df/clearml/automation/controller.py#L2841
Example:
https:/...
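As a rough sketch (repo URL, branch, step and function names are placeholders), assuming a recent clearml version:
from clearml import PipelineController

def preprocess_data(dataset_id):
    # placeholder step body
    return dataset_id

pipe = PipelineController(name="repo branch demo", project="examples", version="1.0")
pipe.add_function_step(
    name="preprocess",
    function=preprocess_data,
    function_kwargs={"dataset_id": "abc123"},        # placeholder value
    repo="https://github.com/user/my-repo.git",      # placeholder repo url
    repo_branch="main",                              # the branch the step will run from
    execution_queue="default",
)
pipe.start(queue="services")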
(only works for pytorch because they have different wheels for different cuda versions)
Hi GrievingTurkey78
Turning off pytorch auto-logging:
Task.init(..., auto_connect_frameworks={'pytorch': False})
To manually log a model:
from clearml import OutputModel
OutputModel().update_weights('my_best_model.pt')
I see, let me check the code and get back to you, this seems indeed like an issue with the Triton configuration in the model monitoring scenario.
Oh that makes sense.
So now you can just get the models as a dict as well (basically clearml allows you to access them both as a list, so it is easy to get the last one created, and as a dict so you can match by filename)
This one will get the list of models:
print(task.models["output"].keys())
Now you can just pick the best one:
model = task.models["output"]["epoch13-..."]
my_model_file = model.get_local_copy()
It said the command --aux-config got invalid input
This seems like an interface bug.. let me see if we can fix that 🙂
BTW: this seems like a triton LSTM configuration issue, we might want to move the discussion to the Triton server issue, wdyt?
Definitely!
Could you start an issue https://github.com/triton-inference-server/server/issues , and I'll join the conversation?
. Is there any reference about integrating kafka data streaming directly to clearml-serving...
That's not possible, right?
That's actually what the "start_locally" does, but the missing part is starting it on another machine without the agent (I mean it is totally doable, and if important I can explain how, but this is probably not what you are after)
I really need to have a dummy experiment pre-made and have the agent clone the code, set up the env and run everything?
The agent caches everything, and actually can also just skip installing the env entirely. which would mean ...
@<1540142651142049792:profile|BurlyHorse22> do you mean the one refereed in the video ? (I think this is the raw data in kaggle)
But only 1 node will copy it.
They can only copy it after the first one is finished, and they are not aware it is trying to set up the exact same venv, hence the race
think this is because of the version of xgboost that serving installs. How can I control these?
That might be
I absolutely need to pin the packages (incl main DS packages) I use.
you can basically change CLEARML_EXTRA_PYTHON_PACKAGES
https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L100
for example:
export CLEARML_EXTRA_PYTHON_PACKAGES="xgboost==1.2.3 numpy==1.2.3"
Using the dataset.create command and the subsequent add_files, and upload commands I can see the upload action as an experiment but the data is not seen in the Datasets webpage.
ScantCrab97 it might be that you need the latest clearml package installed on the client end (as well as the new server with the UI)
What is your clearml package version ?
basically @<1554638166823014400:profile|ExuberantBat24> you can think of hyper-datasets as a "feature-store for unstructured data"
Plan is to have it out in the next couple of weeks.
Together with a major update in v0.16
Hmmm that sounds like a good direction to follow, I'll see if I can come up with something as well. Let me know if you have a better handle on the issue...
DilapidatedDucks58 so is this more like a pipeline DAG that is built ?
I'm assuming this is more than just grouping ?
(by that I mean, accessing a Task's artifact does necessarily point to a "connection", no? Is it a single Task everyone is accessing, or a "type" of a Task ?)
Is this process fixed, i.e. for a certain project we have a flow: (1) execute a Task of type A, then a Task of type (B) using the artifacts from Task (A). This implies we might have multiple Tasks of types A/B but they are alw...