Skip to content

qlib.workflow: FileLock and _log_uncommitted_code both unsafe vs CWD #2252

Description

@zhaow-de

🐛 Bug Description

Two related defects in qlib.workflow, both rooted in latent assumptions about what the caller's current working directory looks like.

(1) MLflowExpManager.get_or_create_exp builds its FileLock path relative to CWD (qlib/workflow/expm.py:236). The construction is os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock") after urlparse-ing the experiment URI. For any absolute file:// URI — including the default URI that qlib.config.QSettings constructs as "file:" + str(Path(os.getcwd()).resolve() / "mlruns")urlparse returns netloc="" and path="/absolute/...", the lstrip("/") then turns the absolute path into a relative one, and os.path.join("", "absolute/...", "filelock") yields a CWD-relative path. FileLock.__enter__ mkdirs the lock's parent under whatever directory the process happens to be in, not under the URI's actual location. The intended file://... location does get populated by mlflow itself, but the qlib-side FileLock is now pointing at the wrong path, so the lock no longer serializes runs that share the URI from different working directories.

(2) MLflowRecorder._log_uncommitted_code misbehaves when CWD is not a git work tree (qlib/workflow/recorder.py:362). The natural workaround for (1) — wrapping qlib.init in tempfile.TemporaryDirectory() + contextlib.chdir(tmp) so the misplaced scaffolding lands in a throwaway tree — immediately surfaces this second weakness, which is also reachable on its own from any non-git CWD (containers, CI sandboxes, processes started from /tmp).

The git-controlled CWD that _log_uncommitted_code reaches for is not load-bearing: nothing about the experiment's correctness depends on a git repo being present. The feature is opportunistic — capture uncommitted diff as an mlflow side-artifact so a future reader can reconstruct what code produced the run. The author already anticipated git might not be available (hence the except subprocess.CalledProcessError), but the missing-repo case is handled badly along two independent axes:

  • shell=True with no stderr= capture leaks git's diagnostics to the caller's terminal. subprocess.check_output(..., shell=True) only captures stdout; stderr inherits the parent's fd 2. In a non-repo, git diff emits a multi-line usage: git diff --no-index ... banner straight to the caller's stderr, bypassing qlib's logger entirely. Any downstream consumer with a structured-stderr contract (JSONL logger, CI log parser, anything piping qlib's output through jq) gets corrupted output it cannot suppress via qlib's log-level controls.
  • No pre-check that CWD is a git repo means three failing subprocess calls and three INFO records every R.start(). Each command's CalledProcessError is caught and re-logged as INFO - "Fail to log the uncommitted code of $CWD(...) when run git ...". The wording suggests the user did something wrong, when in many legitimate cases the absence of a git repo is intentional.

Fixing (1) eliminates the most common path into (2) (downstream users no longer need the chdir(tempdir) workaround), but (2) is reachable on its own, so it's worth addressing in the same area.

To Reproduce

Primary bug — FileLock at CWD-relative path

  1. From any directory other than your intended mlruns parent (e.g. cd /tmp/somewhere-unrelated), run:

    import qlib
    qlib.init(
        provider_uri="~/.qlib/qlib_data/cn_data",
        exp_manager={
            "class": "MLflowExpManager",
            "module_path": "qlib.workflow.expm",
            "kwargs": {
                "uri": "file:///Users/<you>/some/abs/path/mlruns",
                "default_exp_name": "Experiment",
            },
        },
    )
    from qlib.workflow import R
    with R.start(experiment_name="repro"):
        pass
  2. ls the CWD afterwards: a new directory tree Users/<you>/some/abs/path/mlruns/ has been created next to where you were, containing filelock (and, depending on the run, mlflow scaffolding around it).

  3. The path logic can also be verified without invoking qlib at all:

    >>> from urllib.parse import urlparse
    >>> import os
    >>> pr = urlparse("file:///Users/me/proj/mlruns")
    >>> pr.netloc, pr.path
    ('', '/Users/me/proj/mlruns')
    >>> os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock")
    'Users/me/proj/mlruns/filelock'           # relative!
    >>> os.path.isabs(_)
    False

    The same happens for the single-slash file:/Users/me/proj/mlruns shape that QSettings.mlflow.uri produces by default.

Secondary bug — _log_uncommitted_code in a non-git CWD

  1. From any non-git directory (e.g. a fresh tempfile.TemporaryDirectory(), or simply cd /tmp), run any qlib.init + R.start() pair, e.g.:

    import qlib
    qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")
    from qlib.workflow import R
    with R.start(experiment_name="repro"):
        pass
  2. Observe two symptoms on every R.start(...):

    • A multi-line usage: git diff --no-index ... banner printed to the process's stderr, bypassing qlib's logger.
    • Three INFO - "Fail to log the uncommitted code of $CWD(...) when run git ..." records emitted via qlib's logger.

Expected Behavior

Primary bug. The FileLock path should resolve to the absolute location encoded in the file:// URI, regardless of the caller's current working directory.

Secondary bug. When CWD is not a git work tree, _log_uncommitted_code should skip silently — no stderr output, no per-R.start INFO log records. The feature is an optional reproducibility hook, not a precondition for running experiments, so it should degrade quietly when its environment isn't available.

Screenshot

Stderr banner emitted by the secondary bug on every R.start(...) when CWD is not a git repo:

warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [<options>] <path> <path> [<pathspec>...]

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    ...

(Plus the corresponding three INFO records Fail to log the uncommitted code of $CWD(...) when run git diff|git status|git diff --cached written via qlib's logger.)

Environment

  • Qlib version: 0.9.7 (both defects also present on main HEAD at the time of writing — expm.py:236 and recorder.py:362 are unchanged in recent history)
  • Python version: 3.12
  • OS: macOS (Darwin) verified; the same logic fails identically on Linux (POSIX urlparse behavior is the same; the secondary git-stderr leak is platform-independent)
  • Commit number (optional, please provide it if you are using the dev version): d5379c5 (main HEAD at time of report)

Additional Notes

Suggested fix for the primary bug

In qlib/workflow/expm.py:234–236, replace:

pr = urlparse(self.uri)
if pr.scheme == "file":
    with FileLock(Path(os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock"))):

with:

from urllib.request import url2pathname
pr = urlparse(self.uri)
if pr.scheme == "file":
    lock_path = Path(url2pathname(pr.path)) / "filelock"
    with FileLock(lock_path):

url2pathname is preferable to a plain Path(pr.path) because it also handles Windows drive-letter file:///C:/... URIs correctly, where the current lstrip("/") would mangle the drive separator.

Suggested fixes for the secondary bug, ordered least → most invasive

For reference, the offending code (qlib/workflow/recorder.py:362):

def _log_uncommitted_code(self):
    # TODO: the sub-directories maybe git repos.
    # So it will be better if we can walk the sub-directories and log the uncommitted changes.
    for cmd, fname in [
        ("git diff", "code_diff.txt"),
        ("git status", "code_status.txt"),
        ("git diff --cached", "code_cached.txt"),
    ]:
        try:
            out = subprocess.check_output(cmd, shell=True)
            self.client.log_text(self.id, out.decode(), fname)
        except subprocess.CalledProcessError:
            logger.info(f"Fail to log the uncommitted code of $CWD({os.getcwd()}) when run {cmd}.")
  1. Capture stderr. subprocess.run(cmd, stdout=PIPE, stderr=PIPE, check=True) (or at minimum subprocess.check_output(..., stderr=subprocess.DEVNULL)). Fixes the worst symptom — the leak that bypasses the logger.
  2. Drop shell=True. Pass ["git", "diff"] as a list. The commands here use no shell features, and removing shell=True eliminates a needless injection footgun.
  3. Pre-probe with git rev-parse --is-inside-work-tree (or check (Path.cwd() / ".git").exists()) and skip silently when CWD isn't a git work tree. Avoids three failing invocations and three INFO records when the answer is obviously no.
  4. Demote the fallback log from INFO to DEBUG. INFO is the wrong level for "we tried an optional reproducibility hook and the env doesn't support it" — the user can't act on it.
  5. Expose a QSettings.mlflow.log_uncommitted_code toggle. Users who deliberately run qlib in non-git environments can disable the feature cleanly, independent of the heuristics above.

Happy to send a PR for either or both fixes if a maintainer agrees on the direction.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions