Should it be the full path, or just the file name?
just file name, this is basically fname matching
NastySeahorse61 I would try to open in incognito mode (i.e. no cookies etc.), did you also change the address of the server?
I think I understand what the issue is, you have installed the agent on your python 3.8, but it is running and trying to install on python 3.10
To verify,
pip uninstall clearml-agent
python3.10 -m pip install clearml-agent
python3.10 -m clearml-agent daemon...
Can you try to manually install it and see what you are getting?
python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
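One way to confirm the interpreter mismatch described above is to ask each Python directly whether it can see the agent. This is a minimal sketch, not part of clearml-agent: the helper name `describe_interpreter` is hypothetical, and the import name `clearml_agent` is an assumption about how the pip package exposes its module.

```python
import importlib.util
import sys


def describe_interpreter(module_name="clearml_agent"):
    """Report which Python is running and whether it can import module_name.

    The default module name is an assumption; pass another name to check it.
    """
    # find_spec returns None when a top-level module is not installed
    # for the interpreter that is executing this script.
    spec = importlib.util.find_spec(module_name)
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "executable": sys.executable,
        "installed": spec is not None,
    }


if __name__ == "__main__":
    print(describe_interpreter())
```

Saving this as a script and running it once with python3.8 and once with python3.10 should show which of the two interpreters actually has the package.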
replace it with:git+
No need for the repository name, this will ensure you always reinstall it (again pip feature)
Hi VexedCat68
What type of data is it? And what type of annotations?
Streaming data into the training process is great, but is it post quality control?
My main query is do I wait for it to be a sufficient batch size or do I just send each image as soon as it comes to train
This is usually a cost optimization issue. Generally speaking, if GPU uptime is not an issue, the process is stochastic anyhow, so whether you wait for a batch or not is not the most important factor (unless you use a batchnorm layer, in which case batching is basically a must)
I would not be able to split the data into train test splits, and that it would be very expensiv...
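To make the "wait for a batch" option concrete, here is a minimal sketch of buffering incoming samples until a full batch is available. The `BatchBuffer` class and its method names are hypothetical and not part of any library; the training step itself is left out.

```python
class BatchBuffer:
    """Collect streamed samples and hand them out in fixed-size batches."""

    def __init__(self, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self._items = []

    def add(self, item):
        """Store one sample; return a full batch when ready, otherwise None."""
        self._items.append(item)
        if len(self._items) >= self.batch_size:
            batch, self._items = self._items, []
            return batch
        return None

    def flush(self):
        """Return whatever is buffered (possibly an empty list) and reset."""
        batch, self._items = self._items, []
        return batch
```

With `batch_size=1` this degenerates to sending each image as soon as it arrives, so the same loop covers both options from the question.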
Hi LovelyHamster1
As you noted, passing overrides in Args/overrides, for example ['training.max_epochs=1000'], should work when running with the agent.
Could you verify with the latest RC? There was a fix to support the latest hydra version:
pip install clearml==0.17.5rc5
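For readers unfamiliar with the override syntax, the sketch below shows how a dotted `key=value` string maps onto a nested configuration. It is an illustration only, not Hydra's or ClearML's implementation: `apply_overrides` is a hypothetical helper, and values are kept as plain strings rather than parsed the way Hydra would parse them.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with dotted 'a.b=value' overrides applied."""
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value  # kept as a string in this sketch
    return result


if __name__ == "__main__":
    base = {"training": {"max_epochs": 10}}
    print(apply_overrides(base, ["training.max_epochs=1000"]))
```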
I mean the caching will work, but it will reinstall this repository on top of the cached copy.
make sense ?
this
from fastai.callbacks.tensorboard import LearnerTensorboardWriter
doesn't exist anymore in fastai2
Hmm we should definitely update the example to fastai2 API
maybe the fastai bindings in clearml package are outdated
Are you getting any scalars reported to clearml?
they also appear to be relying on the tensorboard callback which seems not to work on distributed training
Yes that is correct, usually the way it works all nodes report back to "master...
Yes, that should work. The only thing is you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses, if you want the automagic to kick in). That said, usually there is no need, they are supposed to report everything back to the main one anyhow
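The master/subprocess split can be sketched as follows. This assumes the launcher exports a `RANK` environment variable to each process, as PyTorch's distributed launchers do; whether that holds for a given launcher should be checked. `is_master_process` is a hypothetical helper, and the ClearML calls appear only in comments so the sketch runs without a server.

```python
import os


def is_master_process(env=None):
    """Return True when this process looks like rank 0.

    Assumption: the launcher sets RANK for every process it starts.
    A missing RANK is treated as a single-process run, i.e. the master.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


if __name__ == "__main__":
    if is_master_process():
        # Master process: this is where Task.init(...) would be called.
        print("master process")
    else:
        # Subprocess: this is where Task.current_task() would be called.
        print("worker process")
```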
basically
` @call_parse
def main(
  gpus:Param("The GPUs to use for distributed training", str)='all',
  script:Param("Script to run", str, opt=False)='',
Β Β args:Param("Args to pass to script", nargs=...
PricklyRaven28 basically this is the issue:
python -m fastai.launch <script>
There are multiple copies of the script running, but they are not aware of one another.
Are you getting any reporting from the different GPUs? I'm assuming there is a hidden OS environment variable that signals the "master" node, so all processes can communicate with it. This is what we should automatically capture. There is a workaround for fastai.launch, that is probably similar to this one:
because fastai's tensorboard doesn't work in multi gpu
keep me posted when this is solved, so we can also update the fastai2 interface,
Thanks SarcasticSparrow10 !
I'll reply on the GitHub issue later (for better visibility)
But my initial thoughts:
(1) I think this was suggested, and hopefully we will get to implementing it, I can definitely see the value. Meanwhile you can achieve some of the functionality with the experiment table and custom columns
(2) "Don't display the performance metric" -> isn't that important? what am I missing?
(3) Hmm you mean just extra columns?
(4) sounds like a bug
(5) is this a plotly issue?...
Seems like Task.create is the correct use-case then, since again this is about testing flows using e.g. pytest,
Make sense
This seems to be fine for now, ...
Sounds good! thanks UnevenDolphin73
Thanks!
FYI: this section is not necessary if you have a clearml.conf file in ~/
Task.set_credentials( api_host="
", web_host="
", files_host="
", key='********************', secret='***********************' )
Let me check the code for a min
Anyhow if the StorageManager.upload was fast, the upload_artifact is calling that exact function. So I don't think we actually have an issue here. What do you think?
OddAlligator72 sure thing
This should sort it out:
Task.init('examples', 'train', continue_last_task=True)
If you want to continue a specific Task:
continue_last_task='task_id_here'
Getting the previous model:
last_checkpoint = task.models['output'][-1]
What do you think?
PungentLouse55 you can find the metrics in the "original" (aka base template) experiment.
Seems like something is not working with the server, i.e. it cannot connect with one of the dockers.
May I suggest to carefully go through all the steps here, make sure nothing was missed
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
Especially number (4)
Sounds good to me, adding it to the to do list, probably should not be very complicated to add
I don't know how I would be able to get the description and name?
Good point, how about doing that in code, then you have all the information and you can store it in jsons / pickle next to the data folder?
wdyt?
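A minimal sketch of that suggestion, using only the standard library: the name and description are written to a JSON file placed next to the data folder. The helper names `write_metadata` and `read_metadata` and the `<folder>.json` naming are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def _sidecar_path(data_dir):
    """Path of the JSON file stored beside (not inside) the data folder."""
    data_dir = Path(data_dir)
    return data_dir.parent / (data_dir.name + ".json")


def write_metadata(data_dir, name, description):
    """Store name and description next to data_dir; return the file path."""
    sidecar = _sidecar_path(data_dir)
    payload = {"name": name, "description": description}
    sidecar.write_text(json.dumps(payload, indent=2))
    return sidecar


def read_metadata(data_dir):
    """Load the metadata previously written for data_dir."""
    return json.loads(_sidecar_path(data_dir).read_text())
```

Keeping the file beside the folder rather than inside it means the data folder contents stay unchanged, which matters if the folder is hashed or uploaded as-is.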
Hi ReassuredTiger98
This should have worked, and seems like conda is not fetching the correct pytorch version (even though the conda env contains the cuda version they specify)
Let's try something, reset the Task, then edit the "Installed packages" and add:
cudatoolkit==11.1.1
Then try again.
Let's see what we get.
(The idea is that I think conda forgets it just installed cudatoolkit and assumes the env is for CPU)
Hurray conda.
Notice it does include cudatoolkit , but conda ignores it
cudatoolkit~=11.1.1
Can you test the same one, only search and replace ~= with == ?
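The search and replace can be sketched as below. In pip's version scheme (PEP 440), `~=11.1.1` accepts any 11.1.x release at or above 11.1.1, while `==11.1.1` requests exactly that version; how conda resolves the same line is what the test above is meant to find out. `pin_compatible_releases` is a hypothetical helper, not a ClearML function.

```python
def pin_compatible_releases(requirements):
    """Replace the '~=' operator with '==' in each requirement line.

    Lines using other operators (>=, ==, none at all) are returned unchanged.
    """
    return [line.replace("~=", "==") for line in requirements]


if __name__ == "__main__":
    for line in pin_compatible_releases(["cudatoolkit~=11.1.1"]):
        print(line)
```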
Well, in that case, just change the order, it should solve it (I'll make sure we have that as the default):
conda_channels: ["pytorch", "conda-forge", "defaults", ]
It should solve the issue
MuddyRobin9 are you sure it was able to spin up the EC2 instance? Which clearml autoscaler version are you running?