ShakyJellyfish91 can you check if version 1.0.6rc2 can find the changes?
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production?
(I'm assuming by event-based you mean triggered by events, not streaming data, i.e. ETL etc.)
I know of at least a few large organizations doing that as we speak, so I cannot see any reason not to.
That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Sure, then use Metaflow. The main issue with Metaflow...
MysteriousBee56 Edit in your ~/trains.conf:
api_server: http://localhost:8008
to
api_server: http://192.168.1.11:8008
and obviously the same for web & files
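For illustration, a minimal fragment of what that section of ~/trains.conf could look like with all three endpoints pointed at the same host (the IP and the default ports here are assumptions; adjust to your setup):

```
api {
    # point all three services at the server machine, not localhost
    api_server: http://192.168.1.11:8008
    web_server: http://192.168.1.11:8080
    files_server: http://192.168.1.11:8081
}
```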
I'll make sure we fix the trains-agent to output an error message instead of trying to silently keep accessing the API server
Getting your machine IP:
just run:
ifconfig | grep 'inet addr:'
Then you should see a bunch of lines; pick the one that does not start with 127 or 172
Then to verify, run:
ping <my_ip_here>
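As a quick sanity check, here is a small Python sketch of that same filtering logic (the sample ifconfig output below is made up for illustration):

```python
import re

def pick_lan_ip(ifconfig_output):
    """Return the first inet address that does not start with 127 (loopback)
    or 172 (typically a docker bridge), mirroring the manual grep above."""
    addrs = re.findall(r"inet (?:addr:)?(\d+\.\d+\.\d+\.\d+)", ifconfig_output)
    for addr in addrs:
        if not addr.startswith(("127.", "172.")):
            return addr
    return None

sample = """
inet addr:127.0.0.1  Mask:255.0.0.0
inet addr:172.17.0.1  Bcast:172.17.255.255
inet addr:192.168.1.11  Bcast:192.168.1.255
"""
print(pick_lan_ip(sample))  # → 192.168.1.11
```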
Hi @<1542316991337992192:profile|AverageMoth57>
Not sure I follow what you have in mind regarding the Gerrit integration None
Sounds interesting ...
wdyt?
Hi @<1539055479878062080:profile|FranticLobster21>
hey, how do I use local files as dependencies?
You mean like a repository ?
Can I specify in task what local files do I use that should be packaged?
In a git repo?
Basically the agent can do two things, either replicate a single script or clone a git repo + uncommitted changes
How are you getting:
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
is this what you had on the Original manual execution ? (i.e. not the one executed by the agent) - you can also look under "org _pip" dropdown in the "installed packages" of the failed Task
Hi @<1734020162731905024:profile|RattyBluewhale45>
What's the clearml agent version? And could you verify with the latest RC?
Lastly, how are you running the agent, docker mode? What's the base container?
Great to hear it got solved. BTW network drives are supported but you have to make sure the mount file system supports locks (NFS does)
@<1734020162731905024:profile|RattyBluewhale45> could you attach the full Task log? Also what do you have under "installed packages" in the original manual execution that works for you?
1724924574994 g-s:gpu1 DEBUG WARNING:root:Could not lock cache folder /root/.clearml/venvs-cache: [Errno 9] Bad file descriptor
You have an issue with your OS / mount. Specifically, "/mnt/clearml/" is the base folder for all the cached stuff, and it fails to create the lock files there. Either use a local folder or try to understand what the issue is with the host machine's /mnt/ mounts (because it looks like a network mount)
Notice the error:
Cannot install albucore==0.0.13 and numpy==1.23.5 because these package versions have conflicting dependencies
what is the pip version you have configured in the clearml.conf? also can you provide the full Task log (i.e. click on Download in the web UI console tab)
SmarmySeaurchin8
args = parse.parse()
task = Task.init(project_name=args.project or None, task_name=args.task or None)
You should probably look at the docstring 🙂
:param str project_name: The name of the project in which the experiment will be created. If the project does
not exist, it is created. If project_name is None, the repository name is used. (Optional)
:param str task_name: The name of Task (experiment). If task_name is None, the Python experiment
...
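To illustrate the fallback described in the docstring, here is a small standalone sketch (the CLI flags are hypothetical and Task.init itself is not called, so this runs anywhere):

```python
import argparse

# Hypothetical CLI mirroring the snippet above
parser = argparse.ArgumentParser()
parser.add_argument("--project", default=None)
parser.add_argument("--task", default=None)
args = parser.parse_args([])  # no flags given

# `args.project or None` also maps empty strings to None, so Task.init
# would fall back to the repository name / default experiment name
project_name = args.project or None
task_name = args.task or None
print(project_name, task_name)  # → None None
```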
Great to hear SourSwallow36, contributions are always appreciated 🙂
Regarding (3), MongoDB was not built for large-scale logging; elastic-search, on the other hand, was built and designed to log millions of reports and give you the possibility to search over them. For this reason we use each DB for what it was designed for: MongoDB to store the experiment documents (a.k.a. env, meta-data etc.) and elastic-search to log the execution outputs.
Also, I would like to add some other plots t...
Yes, that sounds like a good start, DilapidatedDucks58 can you open a github issue with the feature request ?
I want to make sure we do not forget
yes, so you can have a few options 🙂
RipeGoose2
HTML file is not a standalone and has some dependencies that require networking..
Really? I thought that when jupyter converts its own notebook it packages everything into a single html, no?
Hi @<1724960468822396928:profile|CumbersomeSealion22>
As soon as I refactor my project into multiple folders, where on top-level I put my pipeline file, and keep my tasks in a subfolder, the clearml agent seems to have problems:
Notice that you need to specify the git repo for each component. If you have a process (step) with more than a single file, you have to have those files inside a git repository, otherwise the agent will not be able to bring them to the remote machine
Yes, in the UI clone or reset the Task, then you can edit the installed packages section under the Execution tab
Hi @<1571308003204796416:profile|HollowPeacock58>
I'm assuming this is the arm support (i.e. you are running on a new Mac) fix we released in one of the last clearml-agent versions. Could you update to the latest clearml-agent?
pip3 install clearml-agent==1.6.0rc2
BTW: we are now adding "datasets chunks for a more efficient large dataset storage"
Hi @<1603198134261911552:profile|ColossalReindeer77>
Hello! does anyone know how to do HPO when your parameters are in a Hydra
Basically hydra parameters are overridden with "Hydra/param"
(this is equivalent to the "override" option of hydra in CLI)
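A rough sketch of that mapping (the "Hydra/" section prefix follows the message above; the helper itself is illustrative, not ClearML code):

```python
def to_hydra_overrides(params):
    """Turn 'Hydra/...'-style parameter names into hydra CLI overrides,
    e.g. {'Hydra/optimizer.lr': 0.01} -> ['optimizer.lr=0.01']."""
    prefix = "Hydra/"
    return [
        f"{name[len(prefix):]}={value}"
        for name, value in params.items()
        if name.startswith(prefix)
    ]

print(to_hydra_overrides({"Hydra/optimizer.lr": 0.01, "Hydra/epochs": 10}))
# → ['optimizer.lr=0.01', 'epochs=10']
```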
Could you disable the windows anti-virus firewall and test?
You do not need the cudatoolkit package, this is automatically installed if the agent is using conda as package manager. See your clearml.conf for the exact configuration you are running
https://github.com/allegroai/clearml-agent/blob/a56343ffc717c7ca45774b94f38bd83fe3ce1d1e/docs/clearml.conf#L79
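For reference, the relevant part of clearml.conf looks roughly like this (a sketch based on the linked example file; check your own copy for the exact layout):

```
agent {
    package_manager: {
        # supported values: pip, conda, poetry
        type: pip,
    }
}
```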
So the original looks good, could it be you tried to clone a Task that was executed with an agent with pip, and then pushed into an agent running conda?
You should manually remove the cudatoolkit from the installed packages section in the UI, then try to send it to the agent and see if it works. The question is how it ended there in the first place
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 🙂
there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...