Transformer Learns to Schedule Workshops Without Retraining
Researchers have posted to arXiv a Transformer model, trained with deep reinforcement learning (DRL), that tackles the open shop scheduling problem (OSSP) and stays within 12-15% of the theoretical lower bound on larger instances without retraining.
A Transformer model trained with deep reinforcement learning (DRL) on benchmark instances ranging from 4×4 to 10×10 (jobs × machines) can generate valid schedules for workshops of up to 100×100 without being retrained. On those larger instances, the average deviation from the theoretical lower bound falls between 12.89% and 15.12%, according to the paper posted to arXiv on June 15, 2026 (arXiv:2606.13682).
This finding is not trivial: the open shop scheduling problem (OSSP) is a classic NP-hard problem that appears in production lines, diagnostic laboratories, and IT services. As the number of jobs and machines grows, exact methods become computationally infeasible and traditional heuristics require considerable manual tuning.
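To make the "deviation from the lower bound" figure concrete, here is a minimal sketch of how such a gap can be computed. It assumes the standard open shop bound (the larger of the heaviest job and the most loaded machine); the article does not state which bound the authors use, and the function names `ossp_lower_bound` and `gap_percent` are illustrative, not taken from the paper.

```python
def ossp_lower_bound(p):
    """Trivial open shop lower bound (assumed here, not confirmed by the paper).

    p[j][m] is the processing time of job j on machine m. No schedule can
    finish before the longest job or the most loaded machine is done.
    """
    longest_job = max(sum(row) for row in p)
    busiest_machine = max(sum(col) for col in zip(*p))
    return max(longest_job, busiest_machine)


def gap_percent(makespan, bound):
    """Relative deviation of a makespan from a lower bound, in percent."""
    return 100.0 * (makespan - bound) / bound


# Hypothetical 3x3 instance, used only for illustration.
times = [
    [3, 2, 4],
    [1, 5, 2],
    [4, 1, 3],
]
bound = ossp_lower_bound(times)
```

A reported deviation of 12.89% would then mean `gap_percent(makespan, bound)` averages 12.89 over the test instances, under the assumption that the bound is computed this way.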
What the Study Proposes
The authors build a scheduling policy based on an encoder-decoder architecture with multi-head attention, the same fundamental block that underpins large language models. However, the domain here is radically different: the only input is the matrix of processing times for each job on each machine, and the output is a sequence of assignments chosen to keep the makespan (total completion time) as low as possible.
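The relationship between an output sequence and its makespan can be sketched as follows. This is an illustrative decoder, not the authors' implementation: it assumes each (job, machine) pair in the sequence is started as early as both the job and the machine are free, and the name `makespan_of` is hypothetical.

```python
def makespan_of(sequence, p):
    """Decode a sequence of (job, machine) assignments into a makespan.

    p[j][m] is the processing time of job j on machine m. Each operation
    starts as soon as both its job and its machine are free (an assumed
    decoding rule; the paper may decode differently).
    """
    n_jobs, n_machines = len(p), len(p[0])
    job_free = [0] * n_jobs        # time at which each job becomes idle
    mach_free = [0] * n_machines   # time at which each machine becomes idle
    seen = set()
    for j, m in sequence:
        if (j, m) in seen:
            raise ValueError(f"operation {(j, m)} scheduled twice")
        seen.add((j, m))
        start = max(job_free[j], mach_free[m])
        end = start + p[j][m]
        job_free[j] = mach_free[m] = end
    if len(seen) != n_jobs * n_machines:
        raise ValueError("sequence does not cover every operation")
    return max(mach_free)


# Hypothetical 2x2 instance: two orderings of the same four operations.
p = [[2, 3],
     [4, 1]]
good = makespan_of([(0, 0), (1, 1), (0, 1), (1, 0)], p)  # 6
poor = makespan_of([(0, 0), (0, 1), (1, 0), (1, 1)], p)  # 7
```

The same processing-time matrix yields a makespan of 6 or 7 depending only on the order of assignments, which is the degree of freedom a learned policy is trained to exploit.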
Training is performed exclusively on the Taillard instances, a standard reference set in scheduling research, at sizes 4×4, 5×5, 7×7, and 10×10. With this data, the model learns to generate feasible schedules whose makespans generally fall between 15% and 30% above the best known values for those sizes.
Why Generalization Without Retraining Matters
The most relevant part of the work is not performance on the training sizes, but what happens when the policy is applied to random instances of 40×40, 60×60, 80×80, and 100×100, dimensions the model has never seen. There, the authors compare it against four classic dispatching rules:
- SPT (Shortest Processing Time): schedules the shortest operation first.
- LPT (Longest Processing Time): schedules the longest operation first.
- MWKR (Most Work Remaining): prioritizes the job with the largest pending workload.
- EST (Earliest Start Time): schedules the operation that can start earliest.
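The four rules above can be sketched as priority keys inside a simple greedy scheduler. This is a minimal illustration under stated assumptions: the article does not describe how the baselines were implemented, so the tie-breaking (by job, then machine index) and the function name `dispatch` are choices made here, not the authors'.

```python
RULES = ("SPT", "LPT", "MWKR", "EST")


def dispatch(p, rule):
    """Build an open shop schedule greedily and return its makespan.

    p[j][m] is the processing time of job j on machine m. At every step the
    unscheduled operation with the best priority under `rule` is appended
    and started as soon as its job and machine are both free.
    """
    if rule not in RULES:
        raise ValueError(f"unknown rule: {rule}")
    n_jobs, n_machines = len(p), len(p[0])
    job_free = [0] * n_jobs
    mach_free = [0] * n_machines
    work_left = [sum(row) for row in p]  # unscheduled work per job
    remaining = {(j, m) for j in range(n_jobs) for m in range(n_machines)}

    def priority(op):
        j, m = op
        if rule == "SPT":
            primary = p[j][m]                       # shortest first
        elif rule == "LPT":
            primary = -p[j][m]                      # longest first
        elif rule == "MWKR":
            primary = -work_left[j]                 # most remaining work first
        else:  # EST
            primary = max(job_free[j], mach_free[m])  # earliest start first
        return (primary, j, m)  # indices break ties deterministically

    while remaining:
        j, m = min(remaining, key=priority)
        remaining.remove((j, m))
        end = max(job_free[j], mach_free[m]) + p[j][m]
        job_free[j] = mach_free[m] = end
        work_left[j] -= p[j][m]
    return max(mach_free)


# Hypothetical 3x3 instance, used only for illustration.
times = [
    [3, 2, 4],
    [1, 5, 2],
    [4, 1, 3],
]
results = {rule: dispatch(times, rule) for rule in RULES}
```

Each rule needs only a local comparison per step, which is why such baselines are cheap to run on 100×100 instances but also why they cannot adapt to the structure of a particular instance.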
Who This Is Useful For
The approach is mainly of interest to three groups:
1. Operations engineers in manufacturing plants or logistics centers seeking an alternative to manual heuristics without the cost of an exact solver.
2. Researchers in combinatorial optimization exploring how far transfer of policies learned with DRL can reach across problem scales.
3. MLOps teams integrating inference models into scheduling pipelines and valuing the stability of a single model over the need to maintain multiple heuristics.
The article is honest about the limitations: the 15-30% deviation from best known values on training instances does not compete with exact solvers or well-tuned metaheuristics at small scales. The advantage appears when scale increases and available computation time is limited.
Broader Context
This work is part of an active research line that has spent several years exploring whether Transformers, originally designed for text sequences, can learn combinatorial search policies. Earlier work on TSP (Travelling Salesman Problem) or VRP (Vehicle Routing Problem) has shown similar results: reasonable generalization, performance inferior to the best specialized solvers, but notably faster inference once the model is trained.
OSSP adds specific complexity: unlike job shop scheduling, the order in which each job visits the machines is not predefined, which expands the search space and makes it harder to define solid baseline heuristics.
---
EP: A 12-15% deviation from the lower bound on 100×100 instances without retraining is a solid result, not a spectacular one. What would really be worth seeing in a next iteration is a direct comparison against reference metaheuristics, such as genetic algorithms or tabu search, on the same large instances, with controlled computation time. Until that comparison exists, the model's role remains that of a fast alternative, not the best possible option.