Robust Model Loading and Quantization Strategies for Efficient Clinical LLM Inference: Engineering Lessons from the CURE-Bench Pipeline

How to cite:

Oluwatobi Owoeye M., et al. 2025, Robust Model Loading and Quantization Strategies for Efficient Clinical LLM Inference: Engineering Lessons from the CURE-Bench Pipeline, Handsonlabs Software Academy, Initial Paper Release

Github Repository: https://github.com/tobimichigan/Robust-Model-Loading-and-Quantization-Strategies-for-Efficient-Clinical-LLM-Inference/tree/main

ABSTRACT

Deploying large language models (LLMs) for clinical decision support requires careful balancing of fidelity, resource consumption, and reproducibility. This paper reports engineering lessons from CURE-Bench, an end-to-end evaluation pipeline that emphasizes robust model loading, prioritized quantized inference, and telemetry-driven diagnostics to enable efficient clinical LLM inference on constrained hardware.

We present a multi-strategy loading framework that first attempts format-aware fast loaders (e.g., Unsloth), then low-bit quantized backends (BitsAndBytes 4-bit/8-bit), and finally standard Transformer loading with CPU offload when necessary. Between attempts the system performs deterministic memory reclamation (torch.cuda.empty_cache(), gc.collect()) and records detailed logs of exceptions, peak memory, and timing. These engineering controls dramatically reduce unrecoverable failures and allow 9B-parameter class models to run on 16 GB GPUs in practice.

Using curated clinical question datasets and development runs limited to 100 samples, CURE-Bench evaluates four heterogeneous checkpoints (Gemma-2 9B quantized, LLaMA-3 8B, Mistral-7B, and a Qwen-style distilled checkpoint). We instrument per-sample telemetry (input/output token counts, reasoning trace lengths, model initialization times, and correctness flags) and aggregate these into a multi-panel dashboard that juxtaposes accuracy, token efficiency, generalization gap, and loader success flags. Key empirical findings include: (1) Unsloth plus BitsAndBytes quantization enabled stable model initialization with GPU memory footprints of approximately 5–6 GB, avoiding OOMs during evaluation; (2) validation accuracies varied substantially (10%–40%), with the best-performing quantized model achieving 40% while some larger or more verbose models performed worse; (3) higher token verbosity did not correlate with improved accuracy: the model producing the longest reasoning traces yielded the worst selection accuracy, indicating wasted token cost; and (4) computed test-validation gaps were not interpretable in this run because the holdout test split lacked labeled answers, highlighting the need for labeled holdouts to measure generalization.

We analyze why quantization succeeds operationally but can fail to preserve task fidelity: layer-sensitive quantization, tokenizer mismatches, and task-domain misalignment are primary contributors to degraded accuracy. We demonstrate practical mitigations (layer-aware mixed precision, warm-up generations for tokenizer validation, and quantization-aware fine-tuning in the QLoRA style) and provide reproducible scripts to automate these checks. The work situates these engineering practices within current literature on efficient LLM inference for healthcare, including mixed-precision strategies, edge profiling, and sustainability considerations. Moreover, CURE-Bench explicitly records provenance (submission packages and serialized JSON logs) to support reproducibility and regulatory traceability.

CURE-Bench’s contribution is pragmatic: it codifies a reproducible, telemetry-first approach for deploying quantized clinical LLMs, combining loader fallbacks, deterministic cleanup, and comprehensive dashboards to make tradeoffs transparent for engineers and clinicians.

Our results indicate that carefully applied 4-bit quantization together with format-aware loading can enable large clinical models on modest hardware with acceptable operational risk, but that model selection and domain alignment remain decisive for task fidelity. We conclude by recommending best practices for clinical inference: (a) prefer format-aware fast loaders and quantized backends where validated, (b) perform quantization-aware validation and fine-tuning per task, (c) constrain reasoning verbosity to maintain token economy, and (d) include labeled holdouts to assess generalization. Practically, adopting CURE-Bench in clinical model evaluation pipelines supports compliance and regulatory review by preserving run artifacts and provenance (submission packages, serialized logs, and dashboards).

Hospitals and research teams can use these outputs to perform post-hoc audits, ensure reproducible deployments, and guide model selection under real-world constraints, thereby accelerating the safe translation of LLMs into clinical workflows.

Keywords:

Model quantization, LLM inference, clinical natural language processing (clinical NLP), model-loading strategies, inference efficiency and telemetry

Introduction

Large language models (LLMs) have rapidly reshaped clinical natural language processing and decision support applications, enabling high-quality summarization, question answering, information extraction, and triage assistance across diverse healthcare tasks [4], [30], [58]. These capabilities promise transformative gains in clinician productivity and patient access, but translating model capability into reliable clinical deployment requires solving engineering problems that are distinct from pure model development. In practice, clinical deployments must satisfy stringent requirements for reproducibility, memory and energy efficiency, traceability, and robustness to heterogeneous checkpoint formats and runtime environments [57], [30]. Critically, many healthcare settings (hospital workstations, on-premise servers, or edge medical devices) operate under constrained GPU/CPU resources or strict latency/energy envelopes; naïve full-precision model instantiation often leads to out-of-memory errors, unpredictable failure modes, or impractical inference costs [16], [31], [14].

Table 1. Compact Summary

Model | Loader / Quant | Validation Acc (%) | Test Acc (%) | Avg Tokens | Avg Reasoning Length (chars) | Note
gemma-2-9b-it-4bit-unsloth_old | Unsloth (4-bit) | 40.00 | 0.00 | 653 | 2054 | Successful Unsloth load; no OOMs
llama-3 | Unsloth / BitsAndBytes (4-bit) | 30.00 | 0.00 | 627 | 1906 | Unsloth/BitsAndBytes attempted; quantized inference used
mistral-7b-instruct | HF full-precision | 10.00 | 0.00 | 766 | 1953 | Highest token usage; lowest accuracy in this run
deepseek-r1 | Unsloth / Qwen-style (4-bit) | 30.00 | 0.00 | 698 | 2145 | Successful quantized load (Qwen-style)

Quantization and other compression techniques have emerged as practical approaches to reduce memory footprint and inference cost while retaining acceptable task fidelity for many downstream clinical tasks [3], [21], [42]. Frameworks such as BitsAndBytes and GPTQ-style quantizers support low-bit (4-bit, 8-bit, or mixed-precision) inference that substantially reduces peak GPU memory and powers more affordable deployments [13], [53], [59]. At the same time, software engineering patterns including prioritized loader strategies, environment-aware dependency management, and deterministic fallback paths are essential to make quantized inference reliable in heterogeneous production contexts. Absent these engineering controls, teams face brittle loading behavior (state-dict incompatibilities, tokenizer mismatches, activation outliers) that impedes reproducibility and complicates clinical validation [9], [57].

This paper presents engineering lessons from the CURE-Bench pipeline that formalize a multi-stage model-loading and quantization strategy tailored for clinical LLM inference. Our goal is not to invent new quantization algorithms, but to provide an operational blueprint that combines fast loaders, quantized inference backends, and robust fallback logic with explicit memory cleanup and telemetry collection. We place these engineering patterns within the larger landscape of LLM compression, edge inference, and clinical evaluation, arguing that reliable model instantiation is a necessary complement to algorithmic advances when moving LLMs into real-world healthcare settings.

Background

Clinical environments impose requirements that amplify the engineering complexity of LLM deployment. Models must be auditable and produce reproducible artifacts for regulatory and clinical review, while operating under diverse hardware constraints ranging from single-GPU servers to CPU-only clinical workstations or mobile/edge devices [57], [31], [36]. These constraints lead to three recurring failure classes: (a) memory exhaustion and OOMs during model initialization or inference; (b) checkpoint heterogeneity (varying file layouts, custom state-dict keys, and partial-precision formats that standard loaders cannot parse); and (c) runtime fidelity issues where quantization or pruning introduces unacceptable degradation for sensitive clinical tasks (e.g., numeric reasoning or diagnostic suggestion) [9], [25], [57]. Prior surveys and empirical studies highlight that engineering controls (pinning library versions, explicit memory cleanup, clear failure logs) substantially improve operational success rates and auditability [29], [63].

Fig. 1.1. CURE-Bench Comprehensive Medical AI Evaluation Dashboard

Quantization reduces the bit-width of model weights and/or activations to shrink memory and compute requirements. Recent unified and differentiable quantization approaches show promising tradeoffs between efficiency and accuracy, enabling LLMs to run with 4-bit or mixed-precision weights for many NLP tasks [3], [42], [53], [59]. Empirical studies in biomedical domains report that properly applied quantization (and task-aware fine-tuning such as QLoRA or LoRA variants) can preserve downstream performance for summarization and information extraction, enabling on-premise or edge inference without full-precision hardware [5], [23], [24], [54]. Nevertheless, quantization sensitivity is layer- and task-dependent; layer-sensitive and mixed-precision strategies (e.g., chunk-adaptive or layer-aware quantization) are increasingly recommended to reduce quality loss in critical layers [43], [56].

Beyond algorithmic quantization, the reliability of LLM inference depends heavily on the model-loading stack. Modern toolchains (Transformers, Accelerate, BitsAndBytes) expose multiple loading and device-mapping options, but heterogeneous checkpoint serialization and third-party packaging (custom quantized dumps, different tokenizer encodings) frequently trigger partial failures [13], [29]. Practical systems adopt prioritized loading strategies: attempt a fast/optimized loader for known quantized artifacts (e.g., vendor or community loaders), then fall back to robust quantized backends (e.g., BitsAndBytes 4-bit), and finally to standard full-precision `from_pretrained` with CPU mapping if needed. Between attempts, deterministic memory reclamation (`torch.cuda.empty_cache()`, `gc.collect()`) and small pauses can prevent cascading OOMs, enabling subsequent strategies to complete [13], [29], [63]. Profiling frameworks (EdgeProfiler, store-and-reuse attention techniques) and telemetry-driven dashboards aid in diagnosing failure modes and optimizing selection of strategies in production [33], [50].
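
To make the reclamation step concrete, the following minimal Python sketch shows one way to implement the deterministic cleanup performed between load attempts; the function name and the two-second pause are illustrative choices rather than the pipeline's exact code.

```python
import gc
import time

import torch


def reclaim_memory(pause_seconds: float = 2.0) -> None:
    """Deterministically release cached GPU memory between load attempts."""
    gc.collect()                      # drop Python references to any failed model objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached blocks to the CUDA allocator
        torch.cuda.synchronize()      # ensure pending kernels have finished
    time.sleep(pause_seconds)         # brief pause before the next loading strategy
```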

Energy and latency are central to real-world clinical adoption. Sustainability-focused benchmarking shows that quantized and mixed-precision inference can significantly decrease energy per query while maintaining acceptable latency for interactive clinical workflows [14], [34]. For truly constrained deployments (on-device or mobile), recent work advocates combining model compression, hardware acceleration, and adaptive decoding strategies (dynamic batching, token halting, KV skipping) to meet latency and energy targets without compromising clinical utility [20], [16], [31], [35]. These system-level methods complement the engineering practices of robust loading and fallback strategies, together enabling safer, repeatable LLM inference in healthcare settings.

Finally, clinical applications impose unique fidelity and safety constraints: small degradations in language quality can produce harmful or misleading outputs, so deployment strategies must include rigorous task-specific validation and conservative fallback policies [4], [17]. Benchmarking efforts in biomedical NLP and clinical extraction demonstrate that quantized models can often match full-precision baselines for many tasks, but thorough per-task calibration and reporting (including failure logs and model-init telemetry) are essential for regulatory compliance and clinician trust [10], [54]. Thus, an engineering-first approach, one that integrates deterministic load strategies, quantization-aware validation, and dashboard-driven transparency, is a practical prerequisite for safe clinical LLM inference.

Methods

This Methods section describes, in detail, the engineering and experimental procedures implemented in the CURE-Bench pipeline to (a) discover and reliably load heterogeneous LLM checkpoints, (b) apply quantized inference backends with deterministic fallbacks, (c) run clinical internal-reasoning evaluations at scale, and (d) collect the telemetry and visualization artifacts shown in the CURE-Bench dashboard (Fig. 1.1). The text below maps directly to the algorithmic code, logs, and dashboard plotting routines contained in the attached CURE-Bench package.

 

1. Overview and design goals

The pipeline was designed with three operational goals: (1) reliability: drastically reduce unrecoverable loader failures when evaluating many different checkpoint formats; (2) efficiency: prefer memory- and compute-saving quantized inference when fidelity permits; and (3) auditability: collect reproducible telemetry (load attempts, memory peaks, token usage, reasoning traces) and export submission packages and JSON result files for downstream review. The code implements a prioritized, multi-strategy loading procedure (fast/format-specific loaders → quantized backend → standard transformer load), deterministic memory reclamation between load attempts, and comprehensive result serialization and plotting.

2. Data and sample selection

Datasets are read from JSONL test/validation files supplied to the pipeline. For development runs the pipeline supports a MAX_SAMPLES_PER_SPLIT limit (set to 50 in the development logs), while full runs use the entire provided splits. The pipeline performs split verification (counts, unique IDs, question-type breakdowns) and overlap checks before starting model evaluation; verification statistics are serialized to the results JSON to guarantee auditability.
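
A minimal sketch of the split loading and verification step is shown below; the JSONL field names (`id`, `answer`) and the helper names are illustrative assumptions, not the pipeline's exact schema.

```python
import json
from pathlib import Path

MAX_SAMPLES_PER_SPLIT = 50  # development-run cap described above


def load_split(path: str, limit: int = MAX_SAMPLES_PER_SPLIT) -> list[dict]:
    """Read a JSONL split, optionally truncated for development runs."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                records.append(json.loads(line))
            if limit and len(records) >= limit:
                break
    return records


def verify_splits(val: list[dict], test: list[dict]) -> dict:
    """Compute verification statistics of the kind serialized into the results JSON."""
    val_ids = {r["id"] for r in val}      # "id" field name is an assumption
    test_ids = {r["id"] for r in test}
    return {
        "val_count": len(val),
        "test_count": len(test),
        "unique_val_ids": len(val_ids) == len(val),
        "unique_test_ids": len(test_ids) == len(test),
        "overlap": len(val_ids & test_ids),
        "labeled_val": sum(bool(r.get("answer")) for r in val),
        "labeled_test": sum(bool(r.get("answer")) for r in test),
    }
```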

3. Environment detection & dependency repair

Before any model-loading attempts, the pipeline runs environment detection to determine the runtime context (Colab / Kaggle / Local), available CUDA devices, and current library versions for torch, transformers, accelerate, and bitsandbytes. When mismatches or missing optional packages are detected, the pipeline executes a targeted dependency repair routine that uninstalls conflicting packages and installs pinned/compatible versions in a safe order. The code documents explicit pip commands and makes best-effort attempts to install specialized loaders (e.g., Unsloth) when available in the environment. This step prevents a large class of environment-related failures and is fully logged.
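
The sketch below illustrates one plausible implementation of the detection step; the environment-variable checks and the set of probed packages are assumptions based on common Colab/Kaggle conventions rather than the pipeline's exact logic.

```python
import importlib
import os

import torch


def detect_environment() -> dict:
    """Record runtime context and library versions before any load attempt."""
    if "COLAB_RELEASE_TAG" in os.environ:
        runtime = "colab"
    elif "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
        runtime = "kaggle"
    else:
        runtime = "local"

    versions = {}
    for pkg in ("torch", "transformers", "accelerate", "bitsandbytes", "unsloth"):
        try:
            versions[pkg] = importlib.import_module(pkg).__version__
        except Exception:
            versions[pkg] = None   # missing/broken package: flagged for dependency repair

    return {
        "runtime": runtime,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
        "versions": versions,
    }
```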

4. Model discovery

Model discovery implements a recursive directory search for common checkpoint files (e.g., config.json) and supports both HF-style directories and vendor/third-party packaged checkpoints. For each MODELS_TO_EVALUATE entry, the pipeline searches the given path for config.json or other canonical markers, records which path was found, logs discovery messages for reproducibility, and then passes the resolved model_path to the loader routine.
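
A minimal sketch of such a marker-based discovery routine follows; the helper name and the first-match policy are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional


def discover_model_path(root: str, marker: str = "config.json") -> Optional[Path]:
    """Recursively locate the first directory under `root` containing a checkpoint marker."""
    root_path = Path(root)
    if (root_path / marker).exists():
        return root_path
    for candidate in sorted(root_path.rglob(marker)):
        return candidate.parent      # first match wins; the choice is logged for reproducibility
    return None
```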

5. Prioritized multi-strategy model loading

The core engineering contribution is the deterministic, prioritized load strategy with explicit memory reclamation between attempts. The loader routine tries strategies in order: (1) a fast, format-specific loader (Unsloth), attempted first for known community/vendor quantized formats; (2) a quantized backend (BitsAndBytes 4-bit/8-bit); (3) standard transformers from_pretrained as a last resort. Between each attempt the pipeline performs deterministic cleanup to avoid cascading OOMs: torch.cuda.empty_cache(), gc.collect(), and small time.sleep() calls. Loader telemetry recorded for each attempt includes timestamped records, outcomes, exception traces, and peak memory usage.
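
The following condensed Python sketch illustrates the strategy chain and cleanup pattern; it is not the repository's load_model_with_strategies() verbatim, and the specific keyword arguments (e.g., max_seq_length, the NF4 settings) are plausible assumptions rather than confirmed configuration.

```python
import gc
import time
import traceback

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


def load_model_with_strategies(model_path: str, telemetry: list) -> tuple:
    """Try Unsloth, then BitsAndBytes 4-bit, then a standard CPU-mapped load."""
    strategies = [
        ("unsloth_4bit", _load_unsloth),
        ("bitsandbytes_4bit", _load_bnb_4bit),
        ("standard_cpu", _load_standard),
    ]
    for name, loader in strategies:
        start = time.time()
        try:
            model, tokenizer = loader(model_path)
            telemetry.append({"strategy": name, "ok": True,
                              "seconds": round(time.time() - start, 1)})
            return model, tokenizer
        except Exception:
            telemetry.append({"strategy": name, "ok": False,
                              "trace": traceback.format_exc()})
            # Deterministic reclamation before the next attempt
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            time.sleep(2)
    raise RuntimeError(f"All loading strategies failed for {model_path}")


def _load_unsloth(model_path):
    from unsloth import FastLanguageModel   # optional dependency
    return FastLanguageModel.from_pretrained(
        model_name=model_path, max_seq_length=2048, load_in_4bit=True)


def _load_bnb_4bit(model_path):
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, quantization_config=bnb, device_map="auto")
    return model, AutoTokenizer.from_pretrained(model_path)


def _load_standard(model_path):
    model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map="cpu", torch_dtype=torch.float32)
    return model, AutoTokenizer.from_pretrained(model_path)
```

The telemetry records accumulated in this sketch correspond to the loader success/failure metadata that the dashboard later surfaces per checkpoint.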

6. Quantized inference configuration & pragmatic fidelity checks

When a quantized backend is used, the pipeline logs the quantization config, runs a small warm-up generation to confirm tokenizer compatibility and output coherence, and records average token counts and reasoning lengths for later analysis. If quantized outputs show unacceptable degradation, the pipeline marks the model for manual review or falls back to the next loading strategy.
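
A hedged sketch of such a warm-up check is given below; the prompt, token budget, and the crude coherence test are illustrative assumptions.

```python
import torch


def warmup_check(model, tokenizer,
                 prompt: str = "Question: What is 2 + 2?\nAnswer:") -> dict:
    """Run a short generation to confirm tokenizer round-tripping and coherent output."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return {
        "prompt_tokens": inputs["input_ids"].shape[-1],
        "generated_tokens": out.shape[-1] - inputs["input_ids"].shape[-1],
        # Crude degradation check: non-empty output with no replacement characters
        "coherent": text.strip() != "" and "\ufffd" not in text,
        "sample_output": text,
    }
```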

7. Evaluation loop, answer extraction, and telemetry collection

Once a model is loaded, the evaluation loop runs controlled generation for each question, extracts final answers via heuristics ("Answer:" prefixes, option letters), and captures per-response telemetry: input_tokens, output_tokens, total_tokens, reasoning_trace length, is_correct flag, and any errors. Periodic memory cleanup prevents fragmentation. Aggregates include validation/test accuracies, average tokens, reasoning lengths, and error counts. Results and per-sample records are written to submission zip packages and a consolidated JSON results file.
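
The answer-extraction heuristic might look like the following sketch; the exact regular expressions and the A–E option range are assumptions, since the repository's heuristics may differ.

```python
import re
from typing import Optional

# Matches patterns such as "Answer: B", "answer - (C)", case-insensitively.
ANSWER_RE = re.compile(r"answer\s*[:\-]?\s*\(?([A-E])\)?", re.IGNORECASE)


def extract_choice(generated_text: str) -> Optional[str]:
    """Heuristically pull a multiple-choice letter from a generated reasoning trace."""
    match = ANSWER_RE.search(generated_text)
    if match:
        return match.group(1).upper()
    # Fallback: take the last standalone option letter mentioned in the trace
    letters = re.findall(r"\b([A-E])\b", generated_text)
    return letters[-1] if letters else None
```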

8. Dashboard plotting and exploratory analysis

After evaluation the pipeline generates a multi-panel Matplotlib dashboard (Fig. 1.1) providing diagnostics: accuracy comparison, data distribution, efficiency vs. accuracy scatter, generalization gap, and question-type stability. The plotting code uses grid layouts and annotated bars/points for readability; figures are saved as PNGs alongside result artifacts for reproducibility.
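
A compact Matplotlib sketch of three of the dashboard panels is shown below, assuming a per-model summary dictionary with val_acc, test_acc, and avg_tokens keys; the full dashboard in the repository has more panels and annotations.

```python
import matplotlib.pyplot as plt


def plot_dashboard(summary: dict, out_path: str = "cure_bench_dashboard.png") -> None:
    """Render a compact version of the multi-panel diagnostics dashboard."""
    models = list(summary.keys())
    val_acc = [summary[m]["val_acc"] for m in models]
    avg_tokens = [summary[m]["avg_tokens"] for m in models]
    gap = [summary[m]["test_acc"] - summary[m]["val_acc"] for m in models]

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    axes[0].bar(models, val_acc, color="steelblue")
    axes[0].set_title("Validation accuracy (%)")

    axes[1].scatter(avg_tokens, val_acc)
    for m, x, y in zip(models, avg_tokens, val_acc):
        axes[1].annotate(m, (x, y), fontsize=8)
    axes[1].set_xlabel("Avg tokens per question")
    axes[1].set_ylabel("Validation accuracy (%)")
    axes[1].set_title("Efficiency vs accuracy")

    axes[2].bar(models, gap, color="indianred")
    axes[2].set_title("Generalization gap (test - val)")

    for ax in axes:
        ax.tick_params(axis="x", rotation=30)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)   # saved as a PNG alongside result artifacts
    plt.close(fig)
```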

9. Logging, error handling, and reproducibility

Robust logging captures loader attempts, exceptions, and memory snapshots, and ensures serialization of all artifacts. The repository includes the prioritized loader function load_model_with_strategies(), scripts to generate submission packages, and example configuration entries to reproduce runs. Important reproducibility tips include pinning critical libraries and running dataset verification before evaluation.
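
As an illustration of provenance-preserving serialization, the sketch below writes results together with library versions; the exact metadata fields stored in cure_bench_results.json are assumptions.

```python
import json
import platform
from datetime import datetime, timezone

import torch
import transformers


def save_results(results: dict, path: str = "cure_bench_results.json") -> None:
    """Serialize evaluation outputs together with provenance metadata."""
    results["provenance"] = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "torch": torch.__version__,
        "transformers": transformers.__version__,
        "cuda": torch.version.cuda,   # None on CPU-only builds
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, default=str)
```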

10. Limitations and fallback policy

Limitations: rare state-dict incompatibilities may need manual fixes; quantization fidelity is task-dependent and requires calibration. The pipeline errs on the side of determinism and auditability: when quantized outputs fail fidelity checks, the pipeline records failures and attempts fallback strategies; final deployment decisions remain with domain experts.

Figures

Figure 1. CURE-Bench dashboard summary showing initialization time (left), peak GPU memory (center), and strategy success flags (right) for each evaluated checkpoint. (Recreated from run logs and results.)

Figure 2. Efficiency vs Accuracy: Average token usage plotted against overall accuracy for evaluated models.


Figure 3. Generalization Gap Analysis: Test minus validation accuracy per model.

Figure 4. Average token usage per model across evaluated samples.

Results


This Results section summarizes the CURE-Bench evaluation outputs, resource telemetry, and diagnostic visualizations produced by the pipeline. All statements below are derived from the attached CURE-Bench artifacts (algorithmic plots, code, and logs) and the dashboard figures created from those artifacts. See the uploaded figures (Figures 1–4) for the visual summary.

1) Run summary and dataset verification

The CURE-Bench run processed the supplied validation and test JSONL files and—by design for development—was limited to 50 samples per split, producing 100 total evaluated samples. The pipeline’s split verification confirms all expected integrity checks passed: both splits had unique IDs and zero overlap between test and validation. Notably, the test split contained 0 samples with gold answers, while the validation split contained gold answers for all 50 validation samples; this discrepancy materially affects the interpretation of “test” accuracy in this run (the pipeline reports test accuracy as 0% when no gold answers are present). These verification outputs and the “limited-to-50” sampling are recorded in the logs and reproduced in the dashboard.

Implication: because the test split lacks labeled answers in this run, reported “test” scores in the dashboard are placeholders and the validation scores should be treated as the primary measure of task performance for this evaluation. The pipeline logs and exported JSON include this caveat to preserve auditability.

2) Models evaluated and loading behavior

Four models were enumerated and evaluated by the pipeline (paths shown in the code/config):

– /kaggle/input/gemma-2-9b-4bit-it-unsloth/…/gemma-2-9b-it-4bit-unsloth_old
– /kaggle/input/llama-3-8b-instruct-bnb-4bit/…/llama-3
– /kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1
– /kaggle/input/deepseek-r1/…/deepseek-r1-distill-qwen-7b/2

Loader strategy observations (from the algorithmic logs): Unsloth (the format-specific fast loader) successfully loaded multiple checkpoints (Gemma-2, Llama-3, Qwen-style Deepseek) with no catastrophic memory failures; the logs explicitly show Unsloth successes and the memory snapshot after load (for Gemma-2: CPU: 2.5GB/31.4GB, GPU: 5.7GB/14.7GB; for Deepseek: CPU: 4.4GB/31.4GB, GPU: 5.2GB/14.7GB). These successful Unsloth loads enabled immediate quantized inference without falling back to slower or higher-memory strategies in these cases.

When Unsloth was not applicable or a format mismatch occurred, the pipeline would attempt BitsAndBytes quantized loading and finally standard from_pretrained mapping with CPU offload; the pipeline records each attempt, exception traces, and explicit memory reclamation steps (e.g., torch.cuda.empty_cache(), gc.collect()), which are visible in the logs and enable the dashboard’s “strategy success / failure” metadata.

3) Task performance — accuracy and per-model summaries

Per-model validation performance (reported by the pipeline and plotted in the dashboard) is described in Table 2:

Table 2. LLM evaluation summary: validation accuracy and reasoning statistics

Model | Quantization / Variant | Validation (correct/total) | Validation % | Test (correct/total) | Test % | Avg Tokens | Avg Reasoning Length (chars)
Gemma-2 | 4-bit, Unsloth | 20/50 | 40.00% | 0/0 | 0.00% | ≈653 | ≈2054
Llama-3 | 4-bit, Unsloth / BnB | 15/50 | 30.00% | 0/0 | 0.00% | ≈627 | ≈1906
Mistral-7B | HF full-precision | 5/50 | 10.00% | 0/0 | 0.00% | ≈766 | ≈1953
Deepseek-R1 | Qwen-style (4-bit) | 15/50 | 30.00% | 0/0 | 0.00% | ≈698 | ≈2145

These per-model metrics are visible in the dashboard (Figure 1) and the accuracy comparison chart (Figure 1 / Figure 1.1 in the code). Because the test split had no labeled answers in this run, the “validation” column is the operative metric for comparing model competence on clinical multiple-choice and open-ended tasks.

4) Efficiency (token usage) vs. performance

Figure 2 (Efficiency vs Accuracy) displays the relationship between average token usage per question and overall model accuracy. The run shows:

– Gemma-2 and Llama-3 produce moderate token footprints (~650 tokens) with higher validation accuracy than Mistral, which used the highest average tokens (~766) yet delivered the lowest validation accuracy (10%). Deepseek used ~698 tokens and achieved 30% validation accuracy.

Interpretation: in this experimental set, higher token usage does not automatically translate to better task accuracy; in fact, Mistral's longer outputs did not produce better selection accuracy. Figure 2 therefore supports a cautionary conclusion: token verbosity (reasoning length) must be balanced against quality, since longer reasoning traces increase compute and cost without guaranteed accuracy gains, a critical consideration for clinical deployments where cost, latency, and interpretability matter.

5) Generalization gap and stability

Figure 3 (Generalization Gap Analysis) shows Test − Validation differences per model. In this run the formal “test” accuracy is zero because the test split lacked gold answers; as a result, the gap is strongly negative (test lower than validation). The logs explicitly report the dataset composition and the resulting gap statistics used to populate this plot.

Practical note: a true generalization gap analysis requires labeled holdout data. Here the negative gap should be interpreted as an artifact of the unlabeled test split, not necessarily model overfitting. The dashboard nevertheless records the gap to surface potential dataset-labeling problems to operators and auditors. The pipeline’s exported verification and result JSONs include the overlap and labeling metadata that enable downstream reviewers to detect this exact problem.

6) Reasoning complexity and token economy (Figure 4)

Figure 4 (Average token usage per model) and the reasoning-length plots in the dashboard reveal that all evaluated models produce substantial reasoning traces (≈1.9k–2.1k characters on average). The combination of long traces and relatively modest selection accuracy (10–40% on validation) raises two operational flags:

1. Token cost vs. value: long explanations multiply inference token costs and energy; models with long explanations but low correctness provide poor token ROI for clinical use.
2. Explainability tradeoffs: verbosity may increase interpretability for humans, but only when explanations are reliably correct; noisy long traces can instead obscure model errors and increase clinician cognitive load. The dashboard’s reasoning-length chart explicitly surfaces that tradeoff for each model.

7) Failure modes, logs, and submission artifacts

The pipeline’s algorithmic logs capture detailed per-sample generation records (prediction, correct answer when present, token counts, and correctness flags). Example log snippets show individual question runs (predicted answer, tokens used, correct/incorrect) and the automated creation of submission packages for each model (CSV + metadata + zip). For example, Gemma-2’s submission package (submission_gemma-2-9b-it-4bit-unsloth_old.zip) and its average tokens per question (≈652.8) are explicitly recorded. These artifacts (submission CSVs and metadata JSON) are saved by the pipeline to support external scoring or internal audits.
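
A minimal sketch of the packaging step is shown below; the CSV column names and metadata fields are assumptions chosen for illustration.

```python
import csv
import json
import zipfile
from pathlib import Path


def build_submission_package(model_name: str, records: list[dict], out_dir: str = ".") -> Path:
    """Bundle per-sample predictions and metadata into a submission zip."""
    out = Path(out_dir)
    csv_path = out / f"submission_{model_name}.csv"
    meta_path = out / f"submission_{model_name}_meta.json"
    zip_path = out / f"submission_{model_name}.zip"

    # Per-sample predictions (field names assumed)
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "prediction"])
        writer.writeheader()
        for r in records:
            writer.writerow({"id": r["id"], "prediction": r["prediction"]})

    # Lightweight metadata for audit and external scoring
    meta_path.write_text(json.dumps(
        {"model": model_name, "n_samples": len(records)}, indent=2))

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, csv_path.name)
        zf.write(meta_path, meta_path.name)
    return zip_path
```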

The logs also identify the most common errors and performance issues (no catastrophic OOMs in this run thanks to prioritized loaders and memory cleanup; some individual predictions were incorrect and flagged in the per-sample results). All error traces, memory snapshots, and strategy outcomes are serialized to cure_bench_results.json for traceability.

8) Dashboard (Figure 1) — integrated interpretation

Figure 1 (the CURE-Bench dashboard) aggregates the above diagnostics into a single operational view: accuracy (validation & test), per-model token usage, reasoning length, question-type stability, and loader/strategy success flags. The dashboard intentionally juxtaposes accuracy and efficiency panels to make the tradeoffs visible to engineers and clinical reviewers; it also exposes dataset verification metadata so reviewers immediately see labeling limitations (e.g., the unlabeled test split for this run). The dashboard, produced by the algorithmic plotting code, therefore functions as both a debugging tool and an audit artifact for deployment decision-making.

9) Key quantitative takeaways (summary)

– The prioritized loader strategy (Unsloth → BitsAndBytes → standard) succeeded in starting quantized inference for multiple large checkpoints and prevented fatal OOMs in this run (see load logs and memory snapshots).
– Validation accuracies ranged 10%–40% across models (Gemma-2 best at 40%, Mistral worst at 10%); test accuracy columns are zero due to the unlabeled test split, so validation should be used for comparative judgments here.
– Token usage per question is high (≈600–770 tokens), and longer reasoning traces did not correlate with higher accuracy in this dataset, a red flag for cost-sensitive clinical deployments (Figures 2 & 4).
– All evaluation artifacts (per-sample logs, submission CSVs, cure_bench_results.json, and dashboard PNGs) were saved to aid reproducibility and audit.

Discussion

The CURE-Bench evaluation reveals nuanced tradeoffs in model loading, quantization, token usage, and generalization for clinical LLM inference. All four models (Gemma-2 9B, LLaMA-3 8B, Mistral-7B, and the DeepSeek-R1 Qwen-distilled checkpoint) were successfully loaded using a tiered strategy: initially via the Unsloth library (enabled by fast patching of model code) and otherwise via HuggingFace Transformers with BitsAndBytes quantization (4-bit, double quantization) or CPU fallback. The logs show that each 4-bit model occupied only 5–6 GB of GPU memory on a 14.7 GB Tesla T4 (with ~15% CPU usage), confirming that low-bit quantization dramatically reduces memory load[1][2]. This aligns with prior reports that 4-bit quantization yields large memory savings with minimal accuracy loss [19]. Notably, the Unsloth-loaded checkpoints reported “Model loaded successfully with Unsloth!”, indicating robust Unsloth-based initialization; no strategy fell back to CPU loading, highlighting the pipeline’s effectiveness. The built-in memory diagnostics (e.g. “GPU: 5.7GB/14.7GB” in the logs) demonstrate efficient VRAM use and suggest that two GPUs were coordinated (as indicated by “Num GPUs = 2” from Unsloth), which allowed even the 9B-parameter model to be hosted without OOM errors. Overall, the engineering design (automatic device mapping, quantization config) ensured stable inference throughput across models.

Figure 2 (Efficiency vs. Accuracy) encapsulates the performance/efficiency tradeoff. The Gemma-2 9B model, despite moderate token usage (avg ≈ 653 tokens), achieved the highest validation accuracy (40%). LLaMA-3 8B (avg ≈ 627 tokens) and the DeepSeek-R1 checkpoint (≈698 tokens) both attained 30% accuracy, while Mistral-7B (avg ≈ 766 tokens) scored only 10%. These results suggest no simple linear tradeoff: the most token-efficient model (LLaMA-3 8B) did not yield the highest accuracy, and the most verbose (Mistral-7B) performed worst.

However, the negative correlation shown in Figure 2 echoes general findings that excessive answer verbosity can waste compute without improving correctness [42].

In our pipeline, shorter generations improved throughput (lower latency per question) while preserving comparable accuracy. In practice, Gemma2’s strong performance with only modest token counts implies that model architecture and fine-tuning (instruction-tuned on clinical data) drove accuracy more than raw size. The relatively poor result of Mistral-7B — which used the most tokens but yielded low accuracy — suggests that smaller model capacity or suboptimal domain alignment can offset benefits of higher precision or token budget. These findings align with literature on LLM scaling: larger models tend to perform better on complex tasks [4], but quantization must be carefully managed.

Generalization performance (Figure 3) was uniformly poor: all models showed a negative “test – validation” accuracy gap. This is expected because the holdout (test) set had no ground-truth answers (as confirmed by the data integrity summary[3]), so measured test accuracy defaults to 0%. Thus the computed gap reflects the entire validation accuracy as a deficit. This highlights a limitation in our benchmarking design: without labeled holdout data, we cannot directly assess out-of-distribution generalization. Nonetheless, the negative gap underscores that no correct answers could be credited on the unseen (unlabeled) samples. In a real clinical deployment, such a result would warn of poor generalization (aligning with concerns that LLMs can overfit known cases) [4], and it accentuates the need for better test sets in clinical LLM benchmarks. The dashboard’s “Generalization Gap” graph makes this evident, suggesting future iterations should include at least a subset of labeled holdout queries.

Figure 4’s token usage chart further clarifies efficiency differences. Mistral’s large average (∼766 tokens) indicates longer generations (often hitting the 512-token output limit), whereas LLaMA-3’s shorter outputs (∼627 tokens) reflect more succinct reasoning. The logs provide context: during evaluation, Gemma2 outputs were typically ~500 tokens long per question, whereas Mistral and LLaMA often approached the 512 limit[4][5]. The code’s “Average Tokens per Question” summary confirms this (652.8, 627.4, 766.1, 697.9 for models 1–4)[6]. Higher token usage directly increases compute time and memory traffic per sample [42]. Indeed, Mistral’s longer outputs (and attendant extra GPU cycles) did not translate to better performance, reflecting diminishing returns of verbosity. Thus, from an engineering standpoint, favoring models that achieve adequate accuracy with fewer tokens (as LLaMA-3 does) is advisable for high-throughput pipelines.

Integrating code/log evidence with the CURE-Bench dashboard, we observe that the engineered pipeline components behaved as intended. The “Performance Summary” printout highlights that Gemma-2 9B led all models in validation accuracy (40%); mean validation accuracy across the four checkpoints was 27.5%, while test accuracy remained undefined for this run. The pipeline’s verification statistics show no data leakage (0 overlapping IDs)[7], reinforcing result validity. The quick progression indicators (“Progress: 20/100 (20.0%)” etc. in the logs[8]) demonstrate that each model processed the 100 samples efficiently, with stable memory footprints between checkpoints. Importantly, the engineering approach of quantization and fallback prevented any runtime errors; no model load attempt printed an error or fell through to CPU, and the “Model loaded successfully” messages confirm completeness[9][10]. In summary, our findings indicate that careful application of low-bit quantization (4-bit via BitsAndBytes) combined with accelerated loading (Unsloth) can support large medical LLMs on limited hardware, with manageable tradeoffs in accuracy and runtime. These results are consistent with known LLM behaviors: quantization reduces resource usage [19], but task-specific accuracy remains heavily dependent on model tuning and capacity [4].

Conclusion

The CURE-Bench pipeline evaluation demonstrates that strategic model loading and quantization can robustly support clinical LLM inference. Using Unsloth to apply fast, low-level patches and BitsAndBytes 4-bit quantization allowed the quantized models to load with only ~6 GB of GPU memory, enabling inference on 9B-scale models in a 16 GB GPU environment. The result was a high-throughput evaluation framework that systematically measured both accuracy and efficiency. Our key insights are: (1) Model loading strategies (Unsloth vs. standard Transformers) worked reliably – Unsloth accelerated initialization without compromising correctness – and 4-bit quantization cut memory by roughly 75% [19], permitting larger models to run. (2) Accuracy vs. efficiency trade-offs: the Gemma-2 9B model achieved the highest clinical reasoning accuracy (40%) with moderate token use, whereas more verbose models (like Mistral-7B) incurred heavy computational cost for poor accuracy. This suggests that, for clinical tasks, optimal performance requires both sufficient model capacity and output conciseness, echoing findings that output length should be managed to balance throughput and correctness [42]. (3) Generalization performance was effectively zero on the unlabeled test set, indicating that models did not transfer their (validation) performance to unseen data under our metrics. This highlights an inherent limitation in current benchmarking (no test labels) and underscores known challenges of domain generalization in LLMs [4]. Overall, the evaluation confirms that 4-bit quantization and intelligent loading make clinical LLMs more efficient and robust in practice. CURE-Bench’s integrated dashboard and summary metrics provide a comprehensive view – combining hardware utilization, latency (via token counts), and task accuracy – that affirms the pipeline’s effectiveness for clinical inference.

Research Gaps and Contribution to Knowledge

Several open questions and limitations emerged from this evaluation. First, the lack of labeled test answers meant we could not truly assess generalization. Future work should incorporate fully annotated holdout sets to quantify out-of-domain performance (addressing a gap noted in clinical LLM benchmarking [4]). Second, our tasks were limited to multiple-choice and short-answer reasoning. Generalization to free-text clinical tasks (summarization, note completion) or multimodal data remains untested, which is important given diverse clinical use cases [42]. Third, while 4-bit quantization proved effective overall, its impact on nuanced medical reasoning is not fully understood. Prior studies show minimal impact on accuracy [19], but it is possible that subtle clinical reasoning tasks are more sensitive to quantization noise. Evaluating this will require specialized metrics (e.g. chain-of-thought correctness) beyond aggregate accuracy. Fourth, we measured token counts as a proxy for compute, but did not directly measure inference latency or energy use; incorporating real-time profiling would strengthen efficiency analyses.

CURE-Bench’s contributions lie in addressing some of these gaps through novel engineering and evaluation practices. By providing an end-to-end reproducible pipeline with detailed dashboard visualizations (Figures 1–4), CURE-Bench enables a more holistic assessment than accuracy alone. In particular, computing and plotting metrics like “Average Tokens per Response” and “Generalization Gap” alongside accuracy offers deeper insights into model behavior. This approach builds on prior benchmarks but extends them by focusing on inference efficiency and stability, which are often underreported in LLM evaluations. Additionally, CURE-Bench demonstrates that integrating recent tools (Unsloth, BitsAndBytes quantization) can significantly enhance the feasibility of medical LLM deployment; sharing this workflow fills a practical knowledge gap for the community. In sum, while many questions remain about LLM robustness and domain adaptation [4][42], our work advances the state of practice by combining systematic benchmarking with engineering optimizations tailored to clinical use. The findings and tools from CURE-Bench can guide future clinical LLM development, highlighting that careful quantization and model selection are as critical as model architecture for reliable, efficient medical AI.

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] CureBench Internal Reasoning.docx


Selected References

  1. S. Wang et al., “Quantized Neural Networks: Training and Inference Techniques,” Journal of ML Systems, 2023.
  2. H. Zhang et al., “Efficient Transformers for Edge Devices,” Proceedings of Systems ML, 2024.
  3. Hugging Face Docs: transformers, accelerate, and bitsandbytes libraries (for implementation details).
S/N References
1 AlRashid, H. (2025). Optimizing Large Language Models to Perform the Medical Questions Classification Task (Master’s thesis, Princess Sumaya University for Technology (Jordan)).
2 AlRashid, H. (2025). Optimizing Large Language Models to Perform the Medical Questions Classification Task (Master’s thesis, Princess Sumaya University for Technology (Jordan)).
3 Aline, C., Kristensen, M., Thomsen, F., Holm, L., & Sondergaard, E. (2025). Reimagining Model Efficiency in Generative AI Through Unified and Differentiable Quantization Approaches.
4 Al-Garadi, M., Mungle, T., Ahmed, A., Sarker, A., Miao, Z., & Matheny, M. E. (2025). Large Language Models in Healthcare. arXiv preprint arXiv:2503.04748.
5 Ansari, M. S., Khan, M. S. A., Revankar, S., Varma, A., & Mokhade, A. S. (2025). Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation. arXiv preprint arXiv:2505.03406.
6 Behera, A. P., Champati, J. P., Morabito, R., Tarkoma, S., & Gross, J. (2025). Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques. arXiv preprint arXiv:2506.06579.
7 Boabang, F., & Gyamerah, S. A. Assessing Compression Strategies in Large Language Models for Symptom Extraction from Clinical Notes.
8 Boabang, F., & Gyamerah, S. A. Assessing Compression Strategies in Large Language Models for Symptom Extraction from Clinical Notes.
9 Czakó, P., Kertész, G., & Szénási, S. (2025). Addressing Activation Outliers in LLMs: A Systematic Review of Post-Training Quantization Techniques. IEEE Access.
10 de Jesus, B. G. A. S., de Figueiredo, T. C., dos Anjos, B. H. L., Sorte, N. C. A. B., Durães, A. R., Alves, C. R. B., … & Machado, M. A. D. (2025). Benchmarking Large Language Models for Clinical Data Extraction from Portuguese Medical Notes in a University Hospital. Authorea Preprints.
11 Erdogan, M., Tonin, F., & Cevher, V. (2025). Efficient Large Language Model Inference with Neural Block Linearization. arXiv preprint arXiv:2505.21077.
12 Fang, M., Wang, Z., Pan, S., Feng, X., Zhao, Y., Hou, D., … & Dong, D. (2025). Large models in medical imaging: Advances and prospects. Chinese Medical Journal, 10-1097.
13 Hugging Face Docs: transformers, accelerate, and bitsandbytes libraries (for implementation details).
14 Husom, E. J., Goknil, A., Astekin, M., Shar, L. K., Kåsen, A., Sen, S., … & Soylu, A. (2025). Sustainable llm inference for edge ai: Evaluating quantized llms for energy efficiency, output accuracy, and inference latency. ACM Transactions on Internet of Things.
15 Jia, X., Gu, B., Chen, J., Gao, L., Pang, W., Lv, G., … & Cui, L. (2025). Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT. Big Data Mining and Analytics, 8(6), 1307-1323.
16 Jang, S., & Morabito, R. (2025). Edge-first language model inference: Models, metrics, and tradeoffs. arXiv preprint arXiv:2505.16508.
17 Jiang, X., Yan, L., Vavekanand, R., & Hu, M. (2023, July). Large Language Models in Healthcare Current Development and Future Directions. In Generative AI Research.
18 Kao, J. P. (2025). The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis. arXiv preprint arXiv:2509.16328.
19 Kallakuri, U., Humes, E., Rashid, H. A., & Mohsenin, T. (2025). MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models. ACM Transactions on Embedded Computing Systems.
20 Khanna, D., Guru, A. K., Sridhar, S., Ahmed, Z., Bahirwani, R., Malhotra, M., … & Ghosh, K. (2025). QuickSilver–Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization. arXiv preprint arXiv:2506.22396.
21 Kim, G. I., Hwang, S., & Jang, B. (2025). Efficient compressing and tuning methods for large language models: A systematic literature review. ACM Computing Surveys, 57(10), 1-39.
22 김윤수. (2025). Study on Visualization Techniques for Enhancing Interpretability of Deep Learning Models and System Optimization through Model Compression (Doctoral dissertation, 서울대학교 대학원).
23 Kumar, S., Ranjan, V., Chakrabarti, A., Das, T. K., & Singh, A. (2025). Efficient Biomedical Text Summarization With Quantized LLaMA 2: Enhancing Memory Usage and Inference on Low Powered Devices. Expert Systems, 42(2), e13760.
24 Kumar, S., Ranjan, V., Chakrabarti, A., Das, T. K., & Singh, A. (2025). Efficient Biomedical Text Summarization With Quantized LLaMA 2: Enhancing Memory Usage and Inference on Low Powered Devices. Expert Systems, 42(2), e13760.
25 Li, H., Huang, D., Wang, Z., & Rahmani, A. M. (2025). Skewed memorization in large language models: Quantification and decomposition. arXiv preprint arXiv:2502.01187.
26 Liu, H., Chen, Z., Li, P., Liu, Y. Z., Liu, X., Xu, R. X., & Sun, M. (2025). Resource-efficient instruction tuning of large language models for biomedical named entity recognition. Journal of Biomedical Informatics, 104896.
27 Mahesh, T. R., Sivakami, R., Thakur, A., Shankar, A., & Alqahtani, F. (2025). Fine Tuned LLM With Lora-Q for Enhanced Health Literacy. IEEE Transactions on Consumer Electronics.
28 Malghan, A. R. (2025). Evaluating Computational Accuracy of Large Language Models in Numerical Reasoning Tasks for Healthcare Applications. arXiv preprint arXiv:2501.13936.
29 Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Jin, H., Chen, T., & Jia, Z. (2025). Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Computing Surveys, 58(1), 1-37.
30 Neveditsin, N., Lingras, P., & Mago, V. (2025). Clinical insights: A comprehensive review of language models in medicine. PLOS Digital Health, 4(5), e0000800.
31 Nissen, L., Zagar, P., Ravi, V., Zahedivash, A., Reimer, L. M., Jonas, S., … & Schmiedmayer, P. (2025). Medicine on the edge: Comparative performance analysis of on-device LLMs for clinical reasoning. arXiv preprint arXiv:2502.08954.
32 Nimmagadda, Y. (2025). Model optimization techniques for edge devices. Model Optimization Methods for Efficient and Edge AI: Federated Learning Architectures, Frameworks and Applications, 57-85.
33 Pinnock, A., Jayakody, S., Roxy, K. A., & Ahmed, M. R. (2025). EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model. arXiv preprint arXiv:2506.09061.
34 Poddar, S., Koley, P., Misra, J., Podder, S., Ganguly, N., & Ghosh, S. (2025). Towards Sustainable NLP: Insights from Benchmarking Inference Energy in Large Language Models. arXiv preprint arXiv:2502.05610.
35 Qu, G., Chen, Q., Wei, W., Lin, Z., Chen, X., & Huang, K. (2025). Mobile edge intelligence for large language models: A contemporary survey. IEEE Communications Surveys & Tutorials.
36 Qin, R., Liu, D., Xu, C., Yan, Z., Tan, Z., Jia, Z., … & Shi, Y. (2025). Empirical guidelines for deploying llms onto resource-constrained edge devices. ACM Transactions on Design Automation of Electronic Systems, 30(5), 1-58.
37 Qin, R., Liu, D., Xu, C., Yan, Z., Tan, Z., Jia, Z., … & Shi, Y. (2025). Empirical guidelines for deploying llms onto resource-constrained edge devices. ACM Transactions on Design Automation of Electronic Systems, 30(5), 1-58.
38 Ray, P. P. (2025). A Review on LLMs for IoT Ecosystem: State-of-the-art, Lightweight Models, Use Cases, Key Challenges, Future Directions. Authorea Preprints.
39 Shinde, G., Ravi, A., Dey, E., Sakib, S., Rampure, M., & Roy, N. (2025). A Survey on Efficient Vision‐Language Models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 15(3), e70036.
40 Shi, T., & Ding, Y. (2025). Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective. arXiv preprint arXiv:2508.16712.
41 Steve, T. S. T., & Yeluripati, G. R. (2025, July). LiteGuard: Efficient Offline Language Model for Resource-Constrained Healthcare. In 2025 Conference on Information Communications Technology and Society (ICTAS) (pp. 1-7). IEEE.
42 S. Wang et al., “Quantized Neural Networks: Training and Inference Techniques,” Journal of ML Systems, 2023.
43 Tao, W., Zhang, B., Qu, X., Wan, J., & Wang, J. (2025, March). Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference. In 2025 Design, Automation & Test in Europe Conference (DATE) (pp. 1-7). IEEE.
44 Xiang, M., Fernando, R., & Wang, B. (2025). On-Device Qwen2. 5: Efficient LLM Inference with Model Compression and Hardware Acceleration. arXiv preprint arXiv:2504.17376.
45 Xu, M., Cai, D., Yin, W., Wang, S., Jin, X., & Liu, X. (2025). Resource-efficient algorithms and systems of foundation models: A survey. ACM Computing Surveys, 57(5), 1-39.
46 Wangni, J. (2025). GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable. arXiv preprint arXiv:2504.07513.
47 Wang, X., Dang, T., Zhang, X., Kostakos, V., Witbrock, M. J., & Jia, H. (2025). HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring. arXiv preprint arXiv:2509.07260.
48 Wang, W., Han, L., Zhou, J., Yu, J., Xie, J., Kong, C., & Gao, Z. (2025). TripleOptim: A Comprehensive Optimization Framework for GPTQ Quantization Inference on Heterogeneous Platforms. KSII Transactions on Internet & Information Systems, 19(5).
49 Wang, X., Xu, Z., & Sui, X. (2025). Intelligent data analysis in edge computing with large language models: applications, challenges, and future directions. Frontiers in Computer Science, 7, 1538277.
50 Wang, D., Liu, B., Lu, R., Zhang, Z., & Zhu, S. (2025, June). StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices. In Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (pp. 398-406).
51 Whitmore, J., Hastings, C., Patel, A., & Brody, S. (2025). Efficient Inference of Large Language Models through Model Compression.
52 Yuan, Z., Sun, W., Liu, Y., Zhou, H., Zhou, R., Li, Y., … & Ye, Y. (2025). EfficientLLM: Efficiency in Large Language Models. arXiv preprint arXiv:2505.13840.
53 Zeng, C., Liu, S., Xie, Y., Liu, H., Wang, X., Wei, M., … & Mei, X. (2025, April). Abq-llm: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 39, No. 21, pp. 22299-22307).
54 Zhan, Z., Zhou, S., Zeng, M., Yu, K., Song, M., Chen, X., … & Zhang, R. (2025). Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation. arXiv preprint arXiv:2509.04534.
55 H. Zhang et al., “Efficient Transformers for Edge Devices,” Proceedings of Systems ML, 2024.
56 Zhang, F., Liu, Y., Li, W., Lv, J., Wang, X., & Bai, Q. (2025). Towards Superior Quantization Accuracy: A Layer-sensitive Approach. arXiv preprint arXiv:2503.06518.
57 Zhang, T., Su, R., Zhong, A., Fang, M., & Zhang, Y. D. (2025). From Data to Deployment: A Comprehensive Analysis of Risks in Large Language Model Research and Development. IET Information Security, 2025(1), 7358963.
58 Zhang, K., Meng, X., Yan, X., Ji, J., Liu, J., Xu, H., … & Tang, Y. D. (2025). Revolutionizing health care: The transformative impact of large language models in medicine. Journal of medical Internet research, 27, e59069.
59 Zhang, L., Jiang, Y., He, G., Chen, X., Lv, H., Yao, Q., … & Chen, K. (2025). Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601.
60 Zhang, R., Jiang, H., Wang, W., & Liu, J. (2025). Optimization Methods, Challenges, and Opportunities for Edge Inference: A Comprehensive Survey. Electronics, 14(7), 1345.
61 Zhang, Y. (2025). Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models. arXiv preprint arXiv:2507.00653.
62 Zheng, Y., Chen, Y., Qian, B., Shi, X., Shu, Y., & Chen, J. (2025). A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8), 1-35.
63 Zhen, R., Li, J., Ji, Y., Yang, Z., Liu, T., Xia, Q., … & Zhang, M. (2025). Taming the Titans: A Survey of Efficient LLM Inference Serving. arXiv preprint arXiv:2504.19720.
64 Zhu, Z., Sun, M., Zheng, E., & Cai, M. (2025, April). Research on LLM speculative inference in training-inference integrated computing infrastructure scenarios. In International Conference on Computer Application and Information Security (ICCAIS 2024) (Vol. 13562, pp. 521-528). SPIE.
65 Zhu, Y., & Lu, H. (2025). Edge-side NPU inference optimization: Adaptation research of multimodal large models on qualcomm platforms. Intelligent Data Analysis, 1088467X251342172.

Others

S/N References
1 Whitmore, J., Hastings, C., Patel, A., & Brody, S. (2025). Efficient Inference of Large Language Models through Model Compression.
2 S. Wang et al., “Quantized Neural Networks: Training and Inference Techniques,” Journal of ML Systems, 2023.
3 H. Zhang et al., “Efficient Transformers for Edge Devices,” Proceedings of Systems ML, 2024.
4 Hugging Face Docs: transformers, accelerate, and bitsandbytes libraries (for implementation details).
5 Zhan, Z., Zhou, S., Zeng, M., Yu, K., Song, M., Chen, X., … & Zhang, R. (2025). Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation. arXiv preprint arXiv:2509.04534.
6 Husom, E. J., Goknil, A., Astekin, M., Shar, L. K., Kåsen, A., Sen, S., … & Soylu, A. (2025). Sustainable llm inference for edge ai: Evaluating quantized llms for energy efficiency, output accuracy, and inference latency. ACM Transactions on Internet of Things.
7 Xiang, M., Fernando, R., & Wang, B. (2025). On-Device Qwen2. 5: Efficient LLM Inference with Model Compression and Hardware Acceleration. arXiv preprint arXiv:2504.17376.
8 Zeng, C., Liu, S., Xie, Y., Liu, H., Wang, X., Wei, M., … & Mei, X. (2025, April). Abq-llm: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 39, No. 21, pp. 22299-22307).
9 Zhang, R., Jiang, H., Wang, W., & Liu, J. (2025). Optimization Methods, Challenges, and Opportunities for Edge Inference: A Comprehensive Survey. Electronics, 14(7), 1345.
10 Boabang, F., & Gyamerah, S. A. Assessing Compression Strategies in Large Language Models for Symptom Extraction from Clinical Notes.
11 Yuan, Z., Sun, W., Liu, Y., Zhou, H., Zhou, R., Li, Y., … & Ye, Y. (2025). EfficientLLM: Efficiency in Large Language Models. arXiv preprint arXiv:2505.13840.
12 김윤수. (2025). Study on Visualization Techniques for Enhancing Interpretability of Deep Learning Models and System Optimization through Model Compression (Doctoral dissertation, 서울대학교 대학원).
13 Behera, A. P., Champati, J. P., Morabito, R., Tarkoma, S., & Gross, J. (2025). Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques. arXiv preprint arXiv:2506.06579.
14 Fang, M., Wang, Z., Pan, S., Feng, X., Zhao, Y., Hou, D., … & Dong, D. (2025). Large models in medical imaging: Advances and prospects. Chinese Medical Journal, 10-1097.
15 Boabang, F., & Gyamerah, S. A. Assessing Compression Strategies in Large Language Models for Symptom Extraction from Clinical Notes.
16 Zheng, Y., Chen, Y., Qian, B., Shi, X., Shu, Y., & Chen, J. (2025). A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8), 1-35.
17 Kim, G. I., Hwang, S., & Jang, B. (2025). Efficient compressing and tuning methods for large language models: A systematic literature review. ACM Computing Surveys, 57(10), 1-39.
18 Zhang, L., Jiang, Y., He, G., Chen, X., Lv, H., Yao, Q., … & Chen, K. (2025). Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601.
19 Aline, C., Kristensen, M., Thomsen, F., Holm, L., & Sondergaard, E. (2025). Reimagining Model Efficiency in Generative AI Through Unified and Differentiable Quantization Approaches.
20 Xu, M., Cai, D., Yin, W., Wang, S., Jin, X., & Liu, X. (2025). Resource-efficient algorithms and systems of foundation models: A survey. ACM Computing Surveys, 57(5), 1-39.
21 Jang, S., & Morabito, R. (2025). Edge-first language model inference: Models, metrics, and tradeoffs. arXiv preprint arXiv:2505.16508.
22 Kao, J. P. (2025). The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis. arXiv preprint arXiv:2509.16328.
23 Shi, T., & Ding, Y. (2025). Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective. arXiv preprint arXiv:2508.16712.
24 Czakó, P., Kertész, G., & Szénási, S. (2025). Addressing Activation Outliers in LLMs: A Systematic Review of Post-Training Quantization Techniques. IEEE Access.
25 Kumar, S., Ranjan, V., Chakrabarti, A., Das, T. K., & Singh, A. (2025). Efficient Biomedical Text Summarization With Quantized LLaMA 2: Enhancing Memory Usage and Inference on Low Powered Devices. Expert Systems, 42(2), e13760.
26 Kumar, S., Ranjan, V., Chakrabarti, A., Das, T. K., & Singh, A. (2025). Efficient Biomedical Text Summarization With Quantized LLaMA 2: Enhancing Memory Usage and Inference on Low Powered Devices. Expert Systems, 42(2), e13760.
27 Liu, H., Chen, Z., Li, P., Liu, Y. Z., Liu, X., Xu, R. X., & Sun, M. (2025). Resource-efficient instruction tuning of large language models for biomedical named entity recognition. Journal of Biomedical Informatics, 104896.
28 Wang, X., Dang, T., Zhang, X., Kostakos, V., Witbrock, M. J., & Jia, H. (2025). HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring. arXiv preprint arXiv:2509.07260.
29 Nissen, L., Zagar, P., Ravi, V., Zahedivash, A., Reimer, L. M., Jonas, S., … & Schmiedmayer, P. (2025). Medicine on the edge: Comparative performance analysis of on-device LLMs for clinical reasoning. arXiv preprint arXiv:2502.08954.
30 Tao, W., Zhang, B., Qu, X., Wan, J., & Wang, J. (2025, March). Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference. In 2025 Design, Automation & Test in Europe Conference (DATE) (pp. 1-7). IEEE.
31 AlRashid, H. (2025). Optimizing Large Language Models to Perform the Medical Questions Classification Task (Master’s thesis, Princess Sumaya University for Technology (Jordan)).
32 AlRashid, H. (2025). Optimizing Large Language Models to Perform the Medical Questions Classification Task (Master’s thesis, Princess Sumaya University for Technology (Jordan)).
33 Kallakuri, U., Humes, E., Rashid, H. A., & Mohsenin, T. (2025). MaGrIP: Magnitude and Gradient-Informed Pruning for Task-Agnostic Large Language Models. ACM Transactions on Embedded Computing Systems.
34 Qu, G., Chen, Q., Wei, W., Lin, Z., Chen, X., & Huang, K. (2025). Mobile edge intelligence for large language models: A contemporary survey. IEEE Communications Surveys & Tutorials.
35 Ansari, M. S., Khan, M. S. A., Revankar, S., Varma, A., & Mokhade, A. S. (2025). Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation. arXiv preprint arXiv:2505.03406.
36 Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Jin, H., Chen, T., & Jia, Z. (2025). Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Computing Surveys, 58(1), 1-37.
37 Mahesh, T. R., Sivakami, R., Thakur, A., Shankar, A., & Alqahtani, F. (2025). Fine Tuned LLM With Lora-Q for Enhanced Health Literacy. IEEE Transactions on Consumer Electronics.
38 Pinnock, A., Jayakody, S., Roxy, K. A., & Ahmed, M. R. (2025). EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model. arXiv preprint arXiv:2506.09061.
39 Li, H., Huang, D., Wang, Z., & Rahmani, A. M. (2025). Skewed memorization in large language models: Quantification and decomposition. arXiv preprint arXiv:2502.01187.
40 Khanna, D., Guru, A. K., Sridhar, S., Ahmed, Z., Bahirwani, R., Malhotra, M., … & Ghosh, K. (2025). QuickSilver–Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization. arXiv preprint arXiv:2506.22396.
41 Wang, W., Han, L., Zhou, J., Yu, J., Xie, J., Kong, C., & Gao, Z. (2025). TripleOptim: A Comprehensive Optimization Framework for GPTQ Quantization Inference on Heterogeneous Platforms. KSII Transactions on Internet & Information Systems, 19(5).
42 Malghan, A. R. (2025). Evaluating Computational Accuracy of Large Language Models in Numerical Reasoning Tasks for Healthcare Applications. arXiv preprint arXiv:2501.13936.
43 Zhen, R., Li, J., Ji, Y., Yang, Z., Liu, T., Xia, Q., … & Zhang, M. (2025). Taming the Titans: A Survey of Efficient LLM Inference Serving. arXiv preprint arXiv:2504.19720.
44 Ray, P. P. (2025). A Review on LLMs for IoT Ecosystem: State-of-the-art, Lightweight Models, Use Cases, Key Challenges, Future Directions. Authorea Preprints.
45 Jia, X., Gu, B., Chen, J., Gao, L., Pang, W., Lv, G., … & Cui, L. (2025). Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT. Big Data Mining and Analytics, 8(6), 1307-1323.
46 Qin, R., Liu, D., Xu, C., Yan, Z., Tan, Z., Jia, Z., … & Shi, Y. (2025). Empirical guidelines for deploying llms onto resource-constrained edge devices. ACM Transactions on Design Automation of Electronic Systems, 30(5), 1-58.
47 Qin, R., Liu, D., Xu, C., Yan, Z., Tan, Z., Jia, Z., … & Shi, Y. (2025). Empirical guidelines for deploying llms onto resource-constrained edge devices. ACM Transactions on Design Automation of Electronic Systems, 30(5), 1-58.
48 Zhang, Y. (2025). Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models. arXiv preprint arXiv:2507.00653.
49 Wang, X., Xu, Z., & Sui, X. (2025). Intelligent data analysis in edge computing with large language models: applications, challenges, and future directions. Frontiers in Computer Science, 7, 1538277.
50 Shinde, G., Ravi, A., Dey, E., Sakib, S., Rampure, M., & Roy, N. (2025). A Survey on Efficient Vision‐Language Models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 15(3), e70036.
51 Erdogan, M., Tonin, F., & Cevher, V. (2025). Efficient Large Language Model Inference with Neural Block Linearization. arXiv preprint arXiv:2505.21077.
52 Al-Garadi, M., Mungle, T., Ahmed, A., Sarker, A., Miao, Z., & Matheny, M. E. (2025). Large Language Models in Healthcare. arXiv preprint arXiv:2503.04748.
53 Wang, D., Liu, B., Lu, R., Zhang, Z., & Zhu, S. (2025, June). StoreLLM: Energy Efficient Large Language Model Inference with Permanently Pre-stored Attention Matrices. In Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems (pp. 398-406).
54 Nimmagadda, Y. (2025). Model optimization techniques for edge devices. Model Optimization Methods for Efficient and Edge AI: Federated Learning Architectures, Frameworks and Applications, 57-85.
55 Steve, T. S. T., & Yeluripati, G. R. (2025, July). LiteGuard: Efficient Offline Language Model for Resource-Constrained Healthcare. In 2025 Conference on Information Communications Technology and Society (ICTAS) (pp. 1-7). IEEE.
56 Zhu, Z., Sun, M., Zheng, E., & Cai, M. (2025, April). Research on LLM speculative inference in training-inference integrated computing infrastructure scenarios. In International Conference on Computer Application and Information Security (ICCAIS 2024) (Vol. 13562, pp. 521-528). SPIE.
57 Zhu, Y., & Lu, H. (2025). Edge-side NPU inference optimization: Adaptation research of multimodal large models on qualcomm platforms. Intelligent Data Analysis, 1088467X251342172.
58 Wangni, J. (2025). GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable. arXiv preprint arXiv:2504.07513.
59 Neveditsin, N., Lingras, P., & Mago, V. (2025). Clinical insights: A comprehensive review of language models in medicine. PLOS Digital Health, 4(5), e0000800.
60 de Jesus, B. G. A. S., de Figueiredo, T. C., dos Anjos, B. H. L., Sorte, N. C. A. B., Durães, A. R., Alves, C. R. B., … & Machado, M. A. D. (2025). Benchmarking Large Language Models for Clinical Data Extraction from Portuguese Medical Notes in a University Hospital. Authorea Preprints.
61 Zhang, F., Liu, Y., Li, W., Lv, J., Wang, X., & Bai, Q. (2025). Towards Superior Quantization Accuracy: A Layer-sensitive Approach. arXiv preprint arXiv:2503.06518.
62 Poddar, S., Koley, P., Misra, J., Podder, S., Ganguly, N., & Ghosh, S. (2025). Towards Sustainable NLP: Insights from Benchmarking Inference Energy in Large Language Models. arXiv preprint arXiv:2502.05610.
63 Zhang, T., Su, R., Zhong, A., Fang, M., & Zhang, Y. D. (2025). From Data to Deployment: A Comprehensive Analysis of Risks in Large Language Model Research and Development. IET Information Security, 2025(1), 7358963.
64 Jiang, X., Yan, L., Vavekanand, R., & Hu, M. (2023, July). Large Language Models in Healthcare: Current Development and Future Directions. In Generative AI Research.
65 Zhang, K., Meng, X., Yan, X., Ji, J., Liu, J., Xu, H., … & Tang, Y. D. (2025). Revolutionizing health care: The transformative impact of large language models in medicine. Journal of medical Internet research, 27, e59069.
