The wheel you download from pip, for example torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl, actually includes both CPU and CUDA 11.7 support
Is the team open to PRs from external people?
Yes, please do! PRs are welcome! I thought we fixed the GitHub readme to reflect it; anyhow, I'll make sure we do 🙂
Then what happens is that Task.current_task() returns None for the pipeline's task...
Hmm, that sounds like the pipeline Task was closed?! Could that be? Where (in the code) is the call to Task.current_task()?
Do we have it on the GitHub issue?
What's the general pattern for running a pipeline: train a model, evaluate metrics, and publish the model if satisfactory (based on a threshold, for example)?
Basically I would do the following (see the sketch below):
Parameters for the pipeline:
TaskA = the training-model Task (think of it as our template Task)
Metric = the title/series/sign we want to choose by, where sign is max/min
Project = the project to compare performance against, so we can decide whether to publish based on the best Metric.
Pipeline:
Clone TaskA
Change TaskA argu...
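A rough sketch of that flow in code (just an illustration of the idea; the template Task id, project name, metric title/series, queue name, and the publish call are all placeholders/assumptions, not from this thread):
```python
from clearml import Task

# All names/ids below are placeholders for illustration only
TEMPLATE_TASK_ID = "<TaskA id>"                  # the training "template" Task
PROJECT_NAME = "my_project"                      # project used for the comparison
METRIC_TITLE, METRIC_SERIES, SIGN = "validation", "accuracy", "max"

# Clone the template training Task, override its arguments, and enqueue it
cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name="pipeline training step")
cloned.set_parameters({"Args/learning_rate": 0.001})   # "Change TaskA arguments"
Task.enqueue(cloned, queue_name="default")
cloned.wait_for_status()                         # block until the run finishes


def metric_value(t):
    # get_last_scalar_metrics() returns {title: {series: {"last": ..., "min": ..., "max": ...}}}
    scalars = t.get_last_scalar_metrics()
    return scalars.get(METRIC_TITLE, {}).get(METRIC_SERIES, {}).get("last")


# Compare against the best run already in the project
candidates = [t for t in Task.get_tasks(project_name=PROJECT_NAME) if metric_value(t) is not None]
best_value = None
if candidates:
    values = [metric_value(t) for t in candidates]
    best_value = max(values) if SIGN == "max" else min(values)

new_value = metric_value(cloned)
is_better = new_value is not None and (
    best_value is None
    or (new_value >= best_value if SIGN == "max" else new_value <= best_value)
)
if is_better:
    # Publish the produced model(s); the exact publish call may differ between SDK versions
    for model in cloned.models["output"]:
        model.publish()
```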
That's the theory, I still see it is not there
Bad news, there isn't a nice interface to get the table from the Optimizer object (I will make sure we add it, no reason not to).
But you can very easily get all the information you need and more:
```
all_the_tasks = an_optimizer.get_top_experiments(top_k=100)
```
Then for every task in the list you can get all the information:
```
for task in all_the_tasks:
    task_params_as_dict = task.get_parameters()
    task_scalars = task.get_last_scalar_metrics()
```
Basically the Task object enables you to que...
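For example, a small sketch (the metric title/series here are placeholders) of turning that into a table-like structure and picking the best run:
```python
# Build a simple "table" out of the top experiments and sort by the objective.
# "validation"/"accuracy" are placeholder metric names.
rows = []
for task in all_the_tasks:
    params = task.get_parameters()                 # flat dict of hyperparameters
    scalars = task.get_last_scalar_metrics()       # {title: {series: {"last": ..., "min": ..., "max": ...}}}
    score = scalars.get("validation", {}).get("accuracy", {}).get("last")
    rows.append({"task_id": task.id, "score": score, **params})

rows.sort(key=lambda r: (r["score"] is not None, r["score"]), reverse=True)
best_row = rows[0] if rows else None
```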
PanickyMoth78
LockException: [Errno 11] Resource temporarily unavailable
I'm not sure I understand how you got to this error (obviously creating datasets and getting them back works). What is unique in the setup/flow itself?
Ohh, I see now, yes that should be fixed as well 🙂
LudicrousParrot69 you mean post execution or while you are executing the hyperparameter optimizer ?
Could you please add it? I really do not want to miss it 🙂
I can't think of any hack that will satisfy your IT other than an actual vault...
wdyt?
We might need to change the default base docker image, but I remember it was there... Let me check again
Okay, I'll make sure we change the default image to the runtime flavor of nvidia/cuda
HugeArcticwolf77 I think this issue was resolved with the latest version 1.8.0, can you try to rerun the entire pipeline with the latest version?
You mean I can do Epoch001/ and Epoch002/ to split them into groups and get a 100 limit per group?
Yes, then the 100 limit is per "Epoch001" and another 100 limit for "Epoch002", etc. 🙂
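If it helps, a small sketch of what that could look like on the reporting side (assuming the grouping is done through the report title; the titles and paths are placeholders):
```python
from clearml import Logger

logger = Logger.current_logger()
# Each title ("Epoch001", "Epoch002", ...) keeps its own debug-sample history,
# so the 100-sample limit applies per group rather than globally.
logger.report_image(title="Epoch001", series="sample", iteration=0, local_path="out/epoch_001/img.png")
logger.report_image(title="Epoch002", series="sample", iteration=0, local_path="out/epoch_002/img.png")
```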
@<1523710674990010368:profile|GreasyPenguin14> make sure it uses https, not ssh:
edit ~/clearml.conf
force_git_ssh_protocol: false
and that you have both git_user & git_pass set in your clearml.conf
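For reference, a minimal clearml.conf sketch (assuming these keys live under the agent section; the user/token values are placeholders):
```
# ~/clearml.conf
agent {
    # clone over https instead of ssh
    force_git_ssh_protocol: false
    git_user: "my-git-username"
    git_pass: "my-git-token-or-password"
}
```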
The driver script (the one that initializes the models and starts the training sequence) was not in a git repo; everything besides that one is.
Yes, there is an issue when you have both a git repo and a totally uncommitted file: clearml can store either a standalone script or a git repository, and the mix of the two is not actually supported. Does that make sense?
Thanks SubstantialElk6 !
I believe an initial fix was pushed 🙂 A full one (merging the Task --env with the k8s template) will be added soon
Okay, so the error should have been:
trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://<IP>:8008 ?
Not https, and not port 8010?!
However, this one should be a feature to work on, and should be fairly easy to implement.
Feel free to add it as a GitHub issue 🙂
Main challenge is understanding what needs to be added as "uncommitted changes"
I see now, give me a minute I'll check
The easiest would be as an artifact (I think).
Let's assume you put it into a csv file (with pandas or manually).
To upload (from the pipeline Task itself):
```
task.upload_artifact(name='summary', artifact_object='~/my/summary.csv')
```
Then if you want to grab it from anywhere else:
```
task = Task.get_task(task_id='HPO controller Task id here')
my_csv = task.artifacts['summary'].get_local_copy()
```
If you want to store it as a dict it might be even easier:
`task.upload_artifact(name='summary', artifa...`
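Since the snippet above is cut off, here is a separate illustrative sketch (values and the task id are placeholders, not the original snippet) of storing the summary as a dict artifact and reading it back:
```python
from clearml import Task

# From the pipeline/HPO controller Task itself
task = Task.current_task()
summary = {"best_score": 0.93, "best_task_id": "<task id>"}   # placeholder values
task.upload_artifact(name='summary', artifact_object=summary)

# Later, from anywhere else:
controller = Task.get_task(task_id='HPO controller Task id here')
summary_back = controller.artifacts['summary'].get()          # returns the stored dict
```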
curl seems okay, but this is odd https://<IP>:8010
it should be http://<IP>:8008
Could you change and test?
(meaning, change the trains.conf and run trains-agent list)
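For reference, roughly what the relevant part of trains.conf could look like (assuming the layout with separate api/web/files server entries; only the api_server address/port is what this thread is about):
```
# ~/trains.conf
api {
    api_server: "http://<IP>:8008"
    web_server: "http://<IP>:8080"
    files_server: "http://<IP>:8081"
}
```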
Hi SubstantialElk6
I think you are absolutely correct, it seems the glue pops all the arguments, when in fact it should maybe process them and convert the --env/-e
What do you think?
Also, I assume if these are the default arguments they should actually be part of the k8s apply.yaml template, no?
BTW if the plots are too complicated to convert to interactive plotly graphs, they will be rendered to images and the server will show them. This is usually the case with seaborn plots
Could you maybe send a screenshot? This is very strange. Also, what's the trains version?
Hmmm that sounds like a good direction to follow, I'll see if I can come up with something as well. Let me know if you have a better handle on the issue...