-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathMajor_project_proposal.tex
More file actions
1619 lines (1196 loc) · 156 KB
/
Copy pathMajor_project_proposal.tex
File metadata and controls
1619 lines (1196 loc) · 156 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[12pt]{extarticle}
\usepackage[utf8]{inputenc}
\usepackage{enumitem}
\usepackage{cite}
\usepackage{graphicx}
\usepackage{float}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tabularx}
\usepackage{hyperref}
\usepackage{siunitx}
\usepackage{microtype}
\usepackage{booktabs}
\usepackage{makecell}
\usepackage{amsmath}
\usepackage{float}
\graphicspath{{images/}}
\colorlet{punct}{red!60!black}
\definecolor{background}{HTML}{EEEEEE}
\definecolor{delim}{RGB}{20,105,176}
\colorlet{numb}{magenta!60!black}
\lstdefinelanguage{json}{
basicstyle=\normalfont\ttfamily,
numbers=left,
numberstyle=\scriptsize,
stepnumber=1,
numbersep=8pt,
showstringspaces=false,
breaklines=true,
frame=lines,
backgroundcolor=\color{background},
literate=
*{0}{{{\color{numb}0}}}{1}
{1}{{{\color{numb}1}}}{1}
{2}{{{\color{numb}2}}}{1}
{3}{{{\color{numb}3}}}{1}
{4}{{{\color{numb}4}}}{1}
{5}{{{\color{numb}5}}}{1}
{6}{{{\color{numb}6}}}{1}
{7}{{{\color{numb}7}}}{1}
{8}{{{\color{numb}8}}}{1}
{9}{{{\color{numb}9}}}{1}
{:}{{{\color{punct}{:}}}}{1}
{,}{{{\color{punct}{,}}}}{1}
{\{}{{{\color{delim}{\{}}}}{1}
{\}}{{{\color{delim}{\}}}}}{1}
{[}{{{\color{delim}{[}}}}{1}
{]}{{{\color{delim}{]}}}}{1},
}
\begin{document}
% --- COVER PAGE ---
\begin{titlepage}
\centering
% Title
{\Large \textbf{Final Project II}\par}
\vspace{1cm}
{\Large \textbf{Planning AI Robot Arm Assistant for Tool Handling in Engineering Projects}\par}
\vspace{1cm}
% Faculty Logo
\includegraphics[width=0.25\textwidth]{cu_eng}\par\vspace{1cm}
% Submitted to
Submitted to the\\
Project Committee appointed by the\\
International School of Engineering (ISE)\\
Faculty of Engineering, Chulalongkorn University
\vspace{1cm}
{\large \textbf{Project Advisor}} \\
Asst.Prof.Paulo Fernando Rocha Garcia, Ph.D.
\vspace{1cm}
% Submitted by
{\large \textbf{Submitted By}} \\
\begin{tabular}{l l}
Kanisorn Sangchai & 6538020621 \\
Methasit Boonpun & 6538165021 \\
Withawin Kraipetchara & 6538191221 \\
Krittin Kitjaruwannakul & 6538007521 \\
\end{tabular}
\vspace{1cm}
% Course, Faculty, University info
2/2025: 2147417 Final Project II\\
Robotics and Artificial Intelligence Engineering (International Program)\\
International School of Engineering (ISE) Faculty of Engineering, Chualongkorn University
\end{titlepage}
\begin{abstract}
Frequent tool switching in complex engineering projects disrupts workflows and reduces efficiency. Robotic assistants have been proposed as a solution for tool handling, yet many existing approaches rely heavily on end-to-end neural networks that lack interpretability and robustness in dynamic, unstructured environments. Recent research highlights the benefits of combining neural perception with symbolic reasoning, a paradigm known as neuro-symbolic artificial intelligence (NSAI).
Togther with large language model (LLMs) and motion planning (TAMP) frameworks. These hybrid methods have shown function in enabling robots to understand natural language, reason about tasks, and execute reliable action in real-world scenario
The objective of this project is to develop a prototype of a Planning AI Robot Arm Assistant capable of supporting engineers task. The system aims to
\begin{enumerate}
\item interpret natural language voice commands
\item translate command into interpretable symbolic task plans
\item detect, hold, and deliver engineering tool safely
\item provide transparent reasoning process
\end{enumerate}
By focusing on interpretability, safety, and adaptability, the project addresses current limitations of existing robotic assistants and advances the development of collaborative human-robot systems.
The methodology involves integrating neural and symbolic AI within a system architecture. Speech recognition and natural language processing are used to understand user commands. These inputs are processed by a neuro-symbolic planning algorithm that generates task sequences. The system will be implemented and tested on a Universal Robots (UR) arm equipped with a gripper, depth camera, and microphone. Simulations and real-world experiment will validate the prototype to generalize to unseen scenarios and safely collaborate with human users in engineering environment
\end{abstract}
\newpage
\tableofcontents
\newpage
\section{Background}
\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{images/background_graph.png}
\caption{Relationship between TAMP, LLMs, and NSAI.}
\label{fig:background-graph}
\end{figure}
Robotic assistants are increasingly seen as partners in human–robot collaboration settings, with the long-term goal of making them part of everyday work so that interacting with a robot feels as natural as working with another person. The present work takes a proof-of-concept perspective, focusing on how such systems might begin to support human tasks in controlled scenarios, while pointing toward the broader vision of seamless integration. Reaching this vision requires advances in robot manipulation and planning, supported by developments in task and motion planning (TAMP), large language models (LLMs), and neuro-symbolic artificial intelligence (NSAI). TAMP forms the classical foundation by linking high-level symbolic decisions with the physical feasibility of actions. Building on this, LLMs extend planning capabilities by using natural language to model tasks, decompose goals, and connect human instructions to symbolic representations. NSAI then combines these approaches, integrating the perceptual strengths of neural models with the structured reasoning of symbolic methods, offering a unifying paradigm that encompasses both TAMP’s grounding in feasibility and LLMs’ language-driven flexibility, as shown in Figure~\ref{fig:background-graph}.
\subsection{Robot Manipulation and Task Planning}
Historically, robot task execution was divided into two paradigms: AI task planning and robotics motion planning. Task planning generated abstract sequences of discrete actions using symbolic frameworks such as STRIPS or PDDL, while motion planning computed collision-free paths in continuous configuration spaces. This division proved sufficient in structured factory settings, where tasks and motions could be predefined, but inadequate in unstructured, human-centric environments~\cite{tamp},\cite{optimizatoin-and-motion-planning}. Early systems, such as Shakey the Robot, assumed high-level plans could always be refined into feasible motions, an assumption that rarely holds in practice~\cite{recent-trends-in-tamp}. The core difficulty was that symbolic plans often ignored geometric and kinematic constraints, producing strategies that were logically valid but physically infeasible. Task and Motion Planning (TAMP) overcame these limitations by formulating robot planning as a hybrid discrete–continuous search problem. By combining symbolic reasoning with geometric feasibility checks and introducing an intermediate level for selecting real-valued parameters such as grasps or placements, TAMP effectively bridged high-level planning with low-level execution, employing strategies such as backtracking refinement or interleaved search to prune infeasible plans~\cite{tamp}. Beyond feasibility, optimization-based formulations yield efficient trajectories, while learning-based methods accelerate feasibility checks, guide sampling, and adapt to novel settings~\cite{optimizatoin-and-motion-planning}. Crucially, TAMP allows robots to revise task strategies when geometric constraints block execution, enabling scalable planning in complex domains~\cite{recent-trends-in-tamp}.
\subsection{Large Language Models (LLMs) in Robot Manipulation}
Large Language Models (LLMs), particularly when augmented with visual reasoning as Vision-Language Models (VLMs), are increasingly central to robot manipulation, navigation, and broader embodied AI tasks, transforming complex initial states into desired goal states through sophisticated planning~\cite{plangenllm},\cite{eval-application-challenges-llms}. Beyond directly generating action sequences, LLMs also serve as Modelers, extracting and refining structured planning models such as Planning Domain Definition Language (PDDL) specifications from natural language, which mitigates challenges in long-horizon reasoning and enhances plan reliability~\cite{llm-as-planning-formalizers}. Leveraging world knowledge, reasoning, and decision-making, LLMs facilitate Task Modeling by converting human goals into initial and goal states and decomposing complex tasks into sequential, parallel, or recursive sub-goals. In robot manipulation, LLMs/VLMs improve handling of novel objects and generate motor or programmatic actions for perception, planning, and execution~\cite{plangenllm}. Furthermore,
their impact extends to human-robot interaction (HRI), enhancing intent interpretation and adaptability~\cite{eval-application-challenges-llms}.
Functionally, LLMs enable Domain Modeling, defining essential components like actions, preconditions, and effects, sometimes through demonstrations or iterative refinement~\cite{llm-as-planning-formalizers}, and integrate with sensory data to ground language understanding in spatial contexts, ensuring executable and robust plans. Integration with classical planners, hierarchical planning, and search algorithms allows LLMs to translate abstract goals into reliable action sequences guided by world models. Closed-loop feedback systems further enable dynamic adaptation, reducing hallucinations and improving task execution, while fine-tuning on planning-specific objectives enhances correctness and generalization~\cite{plangenllm},\cite{llm-as-planning-formalizers}.
\subsection{Neuro-Symbolic Artificial Intelligence (NSAI) in Robot Manipulation}
Neuro-Symbolic Artificial Intelligence (NSAI) is an approach in robot manipulation that combines neural networks’ perceptual strengths with symbolic AI’s reasoning capabilities to create interpretable, interactive, and generalizable robotic agents in human-centric environments~\cite{enhancing-interpret},\cite{nsai}. NSAI enables robots to follow unconstrained natural language instructions for tasks like object picking, grasping, and multi-step manipulation such as pick-and-place or sorting~\cite{learning-neuro-symbolic}. Its architecture typically includes a hybrid scene encoder, neural language parser, reasoning/action primitives, symbolic program executor, and concept grounding modules. Language parsers translate instructions into executable symbolic programs, while scene encoders construct object-centric representations~\cite{enhancing-interpret},\cite{learning-neuro-symbolic}. Visual and spatial grounders match language concepts to objects, supporting generalization to unseen scenarios. Symbolic executors perform operations like filtering, querying, and executing manipulation actions~\cite{enhancing-interpret}, and simulators predict target locations for execution~\cite{learning-neuro-symbolic}. Training often uses weakly supervised, end-to-end curriculum learning with policy gradient methods~\cite{learning-neuro-symbolic},\cite{ns-vqa}. This modular design offers interpretability and enables interactive feedback. Challenges remain in integrating continuous neural models with discrete symbolic reasoning~\cite{nsai}, with future directions focusing on large language models for zero-shot parsing, vision-language models for grounding, and more complex logical structures for advanced planning~\cite{enhancing-interpret}.
\newpage
\section{Objectives}
\label{sec: objectives}
Drawing from the prior examination of robotic manipulation through natural language command systems, there are multiple challenges and limitations, such as generalization, lack of interpretability, and handling complexity, that pose room for improvements. Thus, the project aims to tackle and explore these challenges and limitations by building a prototype of a personal AI assistant with a robotic arm that can:
\begin{enumerate}
\item Understand natural language voice commands
A primary challenge is enabling robots to effectively ground abstract semantic concepts in precise spatial reasoning from natural language. Current end-to-end neural networks, while capable of learning dexterous skills, often fail to generalize to new goals or quickly learn transferable concepts across tasks when confronted with language that introduces new concepts or slight variations (eg, distinguishing between "red pens" and "blue pens" without extensive new training data)~\cite{enhancing-interpret}. This highlights a core difficulty in how robots interpret the semantics underlying tasks from human instructions. Furthermore, past language-grounding methods for manipulation were often limited by object-centric representations and struggled to integrate perception and action cohesively based on linguistic input~\cite{cliport}. Overall, robots need to move beyond simple keyword recognition to a deeper, more generalizable understanding of human language in diverse, real-world contexts~\cite{enhancing-interpret},\cite{learning-neuro-symbolic}.
\item Translate commands into symbolic task sequences
Traditional Task and Motion Planning (TAMP) approaches specify tasks using formal representations like PDDL, but these require significant expertise and lack generalizability across different problems~\cite{code-as-symbolic-planner}. A promising solution involves neuro-symbolic AI, which combines the pattern recognition strengths of neural networks with the logical reasoning and structured knowledge of symbolic AI~\cite{nsai}. This hybrid approach allows for the explicit representation of the underlying reasoning process as symbolic programs, often using a Domain-Specific Language (DSL). Such disentanglement of perception (neural) from reasoning (symbolic) leads to systems that are more sample-efficient , can generalize to unseen concept-task combinations, and enable deeper, compositional, and hierarchical reasoning over abstract concepts derived from instructions~\cite{enhancing-interpret}. Recent advancements has highlighted an alternative approach by exploring guiding LLMs to directly generate code that serves as the robot's TAMP planner and checker , integrating symbolic computation into the planning process while maintaining broad generalizability~\cite{code-as-symbolic-planner}.
\item Detect, grasp, and deliver engineering tools safely to the user
Accomplishing this requires overcoming challenges in precise spatial reasoning and robust interaction with objects in dynamic, unstructured environments. Current systems, even those showing promise in semantic understanding, may still struggle with fine-grained manipulation that demands high spatial precision, such as handling deformable objects or specific placements. Generalization to novel object instances that were not part of training data remains a hurdle, with models sometimes exploiting biases rather than truly grounding instructions~\cite{cliport}. The process involves object detection, instance segmentation, and grasp synthesis to identify and physically interact with items~\cite{enhancing-interpret}. Furthermore, ensuring safety during physical interaction is paramount, requiring rigorous validation and addressing issues like collision avoidance and potential biases from pre-trained models that could lead to harmful actions. The "sim-to-real" gap also means that models trained in simulation may require substantial fine-tuning for robust real-world performance~\cite{enhancing-interpret},\cite{cliport}.
\item Provide interpretable task plans that users can inspect, adjust, or correct
This addresses the critical problem of "black box" AI, where purely neural systems lack transparency and the ability to explain their decisions. This opacity makes it challenging for non-expert users to diagnose and correct errors in a robot's behavior. Neuro-symbolic AI explicitly tackles this by disentangling reasoning from visual perception and language understanding, leading to fully transparent and interpretable reasoning processes. By generating explicit symbolic programs that represent the robot's plan as a sequence of logical steps, the system provides a formal, interpretable representation of its decision-making. This interpretability is crucial for human-in-the-loop interaction , allowing users to understand why a robot performs certain actions, inspect the generated task plans, and provide targeted feedback or corrections when a failure occurs. The ability to communicate about failures and ambiguities in a dialogue setting significantly enhances usability and trust in autonomous systems~\cite{enhancing-interpret},\cite{nsai}.
\subsection*{Evaluation Metrics and Benchmarks}
To demonstrate successful end-to-end handover, the system must integrate language understanding, symbolic planning, perception, and grasping into a reliable pipeline. Recent studies show that robot-to-human tool handover can reach around 92.5\% success in simulation for construction tools~\cite{iaarc2025_handover}, while sim-to-real grasping frameworks report 90--97\% success depending on object familiarity~\cite{mogpe2022},\cite{grasping2023}. Performance is typically lower for novel or cluttered settings, highlighting the importance of generalization.
Timing benchmarks suggest that simple object handovers can be completed in 8--10 seconds~\cite{handover2024_fast}, though more complex engineering tools may reasonably require up to 15 seconds. Based on these findings, this project defines its main performance goal as achieving \textbf{$\geq90\%$ success over 30 trials}, with \textbf{completion within 15 seconds} and \textbf{minimal disturbance} $\leq\SI{2}{\centi\meter}$ to non-target tools.
\end{enumerate}
\newpage
\section{Literature Survey and Review}
\subsection{Existing Solutions for Robot Manipulation through Natural Language Instructions}
\subsubsection{Voice Control and Multimodal Speech Recognition for Robot Manipulation}
Voice-based interaction provides a natural interface for instructing robots, and Automatic Speech Recognition (ASR) forms the foundation of such systems. Early approaches relied on keyword spotting and grammar-based recognition, which worked reliably for simple commands in controlled environments~\cite{smith2015voice}. However, these systems often fail under noisy conditions or when commands are ambiguous, motivating the shift toward data-driven neural methods.
Recent studies have addressed ASR robustness by incorporating sequence-to-sequence parsing with noise injection, allowing semantic parsers to tolerate ASR errors commonly encountered in service robotics~\cite{tada2020robust}. Similarly, distributed architectures have been proposed, such as using an Android device for speech recognition coupled with a lightweight microcontroller (ESP32) for actuation, achieving low-latency, real-time control even in noisy environments~\cite{gupta2025speech}.
Beyond speech-only systems, multimodal ASR approaches integrate visual, gestural, or haptic information to improve recognition and grounding. For instance, the Vision-Language-Action (VLAS) framework combines speech and visual inputs to interpret commands in dynamic environments, enabling context-aware robot manipulation~\cite{yu2020vlas}. Other multimodal methods leverage gaze tracking or environmental cues to disambiguate spoken instructions, reducing errors and improving task performance~\cite{liu2021multimodal}.
Despite these advances, limitations persist. ASR systems still degrade in real-world noisy conditions, while multimodal approaches often require expensive sensors or large-scale datasets, limiting general deployment~\cite{liu2021multimodal}. Moreover, most works focus on recognition accuracy, with fewer studies addressing usability, user experience, or trust in human-robot interaction. These gaps highlight the need for solutions that are both technically robust and practically deployable in everyday environments.
\subsubsection{Robot Planning}
Recent advances in robot planning and manipulation have increasingly focused on leveraging large vision-language-action (VLA) models to enable robots to interpret high-level instructions and execute complex tasks. Rather than treating perception, reasoning, and control as isolated problems, the field has shifted towards integrated approaches where natural language commands can directly guide manipulation policies. These models combine multimodal perception, grounded reasoning, and action generation, allowing robots to handle tasks ranging from simple pick-and-place to multi-step assembly.
Two principal architectural paradigms have emerged: monolithic models and hierarchical models~\cite{vla-in-robot}. Monolithic models aim to jointly optimize perception, reasoning, and control in a single pipeline. Within this category, single-system designs treat robot actions as autoregressively generated tokens, enabling strong semantic generalization from large-scale pretraining but suffering from slow inference and limited interpretability. Dual-system designs mitigate these issues by pairing a slower, deliberative planner with a fast execution module, though they introduce challenges in synchronization and integration.
Hierarchical models, by contrast, explicitly decouple high-level planning from low-level execution. They produce interpretable intermediate representations, such as subtasks, spatial keypoints, or structured programs, that bridge natural language instructions and executable control policies. This modularity improves transparency and supports long-horizon reasoning, enabling decomposition of complex tasks into manageable steps. Approaches like program-based planning connect natural language to symbolic structures or robot APIs, balancing interpretability with execution fidelity.
Across both paradigms, several strategies address the challenges of efficiency, generalization, and robustness~\cite{vla-in-robot}. To reduce inference latency, work has explored parallel decoding, compressed action tokenization, and lightweight architectures. To improve adaptability, VLA models increasingly incorporate predictive world models, richer perception modalities (e.g., depth, tactile, and temporal cues), reinforcement learning for dense feedback, and human video data for cross-domain knowledge transfer. Safety mechanisms such as adaptive planning and dynamic risk assessment further strengthen reliability in unstructured environments.
Within this broader landscape, hybrid neuro-symbolic approaches demonstrate how symbolic reasoning can be combined with learned perception and language grounding. By integrating interpretable intermediate structures with robust neural representations, these systems can generalize to novel tasks while maintaining transparency and interactive error correction~\cite{enhancing-interpret},\cite{learning-neuro-symbolic}. Program-synthesis-based methods extend this paradigm, treating generated code as an evolving hypothesis that can be validated, repaired, and refined through execution feedback~\cite{hycodepolicy},\cite{code-as-policies},\cite{code-as-symbolic-planner}. Other efforts leverage large language models to interface directly with planning and control APIs, reducing reliance on end-to-end policy generation while increasing flexibility and autonomy~\cite{audere}.
Taken together, these developments highlight a trend toward unifying natural language, perception, and action in robotic planning. The field is moving from narrow, task-specific systems toward general-purpose agents capable of robust, interpretable, and scalable manipulation across diverse real-world settings.
\subsection{Existing Technology for Robot Manipulation through Natural Language Instructions}
\subsubsection{Voice Control and Multimodal Speech Recognition for Robot Manipulation}
Voice Control Systems (VCS) have transformed human-computer interaction by enabling intuitive, hands-free communication through natural language commands. Traditionally, VCS relies on Automatic Speech Recognition (ASR) to transcribe speech into text and Natural Language Processing (NLP) to interpret and execute instructions. These systems are widely used in digital assistants, smart homes, and IoT devices, enhancing accessibility, convenience, and efficiency in everyday tasks. In robotics, VCS lowers barriers to interaction by allowing users, including children and individuals with limited mobility, to control robots naturally through speech~\cite{vcs}. Despite these advantages, traditional ASR systems remain limited in noisy, ambiguous, or dynamic environments. They often struggle with fine-grained contextual information or visually grounded references, which are critical for successful robot manipulation. To address these challenges, multimodal ASR has emerged as an extension of VCS, incorporating both speech signals and visual context to ground linguistic input in the robot’s perceptual environment~\cite{Chang2023},\cite{multi}.
\textbf{Core Architecture of Multimodal ASR:}
\begin{itemize}
\item \textbf{Speech Encoder:} Extracts audio features from spoken instructions, often using pre-trained models such as wav2vec 2.0.
\item \textbf{Visual Encoder:} Processes visual observations from the robot’s environment (e.g., CLIP-ViT).
\item \textbf{Language Decoder:} A Transformer-based decoder that jointly attends to both speech and visual features to generate accurate transcriptions.
\end{itemize}
By combining speech with visual context, multimodal ASR achieves higher transcription accuracy, especially for visually salient words such as object names or spatial instructions. It also improves task success rates by reducing errors from misheard commands and generalizes better across new environments and diverse speakers. Benchmarks such as ALFRED have demonstrated that multimodal ASR enables robots to more reliably follow natural language instructions in simulated household environments~\cite{vcs},\cite{Chang2023},\cite{multi}. By uniting the accessibility of traditional VCS with the robustness of multimodal ASR, robots gain the ability to interpret and execute commands more effectively in uncertain and dynamic conditions. This evolution marks a critical step toward human-robot interaction that is both natural and contextually grounded~\cite{vcs},\cite{Chang2023},\cite{multi}.
\subsubsection{Robot Planning}
Robot manipulation tasks, particularly in unstructured human environments, necessitate that robots can understand and execute natural language instructions from non-expert users~\cite{cliport}. Historically, robot planning relied on segregated high-level artificial intelligence (AI) task planning and low-level motion planning, with traditional methods often using hand-coded symbols or relying on strict hierarchical decompositions~\cite{learning-neuro-symbolic},\cite{code-as-symbolic-planner}. However, these approaches struggled to effectively combine discrete task decisions with continuous geometric and kinematic considerations, limiting their applicability in dynamic, human-centric settings~\cite{cliport},\cite{code-as-symbolic-planner}.
Recent advancements have led to diverse implementations and approaches, broadly categorized as:
\begin{enumerate}[label=\Roman*.]
\item \textbf{Traditional / Early Language Grounding} \\
Traditional methods typically map natural language phrases to pre-defined symbolic representations of robot states and actions, assuming a symbolic description of the environment and actions. They often rely on symbols hand-coded by domain experts~\cite{optimizatoin-and-motion-planning} or logical parses that translate language into motion constraints or control actions, and generally lack the flexibility to autonomously learn task semantics~\cite{learning-neuro-symbolic}. While these methods provide clear symbolic interpretability, they are inherently limited in their generalizability, new goals (e.g., switching from red pens to blue pens) require new training or explicit user input~\cite{ns-vqa}.
Interpretability is high since the reasoning is explicit and symbolic, but they often fail to integrate discrete task planning with continuous motion feasibility~\cite{tamp}, restricting use in dynamic, real-world settings. Data efficiency is poor, as pre-annotated datasets are needed to map phrases to symbols. Additionally, scalability is limited since symbolic operators are predefined, and the frameworks cannot easily adapt to unseen tasks.
\item \textbf{Language-Conditioned End-to-End Systems (e.g., CLIPORT)} \\
These systems combine large-scale pre-trained vision-language models (like CLIP, for broad semantic grounding, the “what”) with architectures specialized for spatial precision (like Transporter, the “where”). Transporter Networks are highly sample-efficient, end-to-end models for robotic manipulation that avoid object-centric assumptions, exploiting spatial symmetries and generalizing well to unseen objects and configurations~\cite{transporter}. CLIPORT builds on this by integrating language-conditioned policies and semantic understanding via a two-stream design, enabling grounding of categories, shapes, colors, and text attributes without requiring hand-designed perception pipelines. They show strong data efficiency, can learn multi-task policies, and generalize across seen and unseen semantic concepts, transferring learned attributes to new tasks~\cite{cliport}.
However, challenges remain. CLIPORT struggles with fine-grained reasoning about relationships (e.g., “middle square hole” with unseen shapes), counting, or verb-noun generalization beyond training data. Its execution is limited to open-loop pick-and-place primitives, making it brittle in dynamic or dexterous tasks. Interpretability is partial: while the semantic prior is clear, the underlying reasoning process remains mostly opaque. Calibration errors or biases in training data can degrade performance, and extending to high-DOF manipulation is non-trivial~\cite{cliport}.
\item \textbf{Neuro-Symbolic Approaches} \\
Neuro-symbolic methods integrate neural perception modules with symbolic reasoning engines. They often translate natural language into executable symbolic programs composed of both neural (e.g., visual grounding) and symbolic (e.g., counting, logic) primitives. This enables them to combine robust perception with transparent, structured reasoning. Examples include the DeepSym framework for autonomously discovering symbols, neurosymbolic architectures for coupling vision and reasoning, and program-generating models for manipulation~\cite{nsai}.
NSAI systems enhance interpretability, robustness, and trustworthiness, while also facilitating learning from less data. By disentangling perception and language understanding (neural) from reasoning (symbolic), neurosymbolic systems offer a formal, interpretable representation of the underlying reasoning process~\cite{enhancing-interpret}. They are also highly sample-efficient, capable of learning from weak or “natural” supervision (initial/final states rather than dense annotations)~\cite{nsai},\cite{learning-neuro-symbolic}.
However, limitations include reliance on fixed domain-specific languages (limiting concept coverage), susceptibility to perception errors in cluttered scenes, and difficulty scaling to more complex logic (loops, conditionals)~\cite{enhancing-interpret},\cite{learning-neuro-symbolic}. Integration of neural and symbolic modules introduces computational challenges such as state-space explosion and longer training times, and standardization across frameworks remains an open problem~\cite{nsai}.
\item \textbf{Large Language Model (LLM)-Based Approaches} \\
LLM-based methods leverage the broad commonsense reasoning of large pre-trained language models for robot planning. Some approaches map natural language into symbolic planning representations (e.g., Text2Motion~\cite{text-2-motion}), while others use LLMs to choose from pre-defined primitives (e.g., Code-as-Policies~\cite{code-as-policies}). A recent paradigm, Code-as-Symbolic-Planner, queries the LLM to generate executable code that acts as both the planner and verifier, explicitly integrating symbolic computation~\cite{code-as-symbolic-planner}.
Their generalization is strong due to pretraining on vast text/code datasets. LLMs provide natural language transparency and can even generate interpretable code that reflects explicit reasoning chains~\cite{code-as-symbolic-planner},\cite{code-as-policies}. Data efficiency benefits from pretraining, though TAMP-specific fine-tuning or self-checking guidance frameworks are often required~\cite{plangenllm}.
However, challenges include brittleness in complex optimization scenarios, inconsistency in generated code, and performance degradation with increasing task complexity. Earlier LLM methods that lacked symbolic computation were unreliable for tasks with numeric constraints~\cite{code-as-symbolic-planner}. Even with code generation, outputs may be suboptimal or erroneous, requiring iterative refinement. Furthermore, real-world translation of problems into solvable code remains difficult, and generated code can introduce execution risks~\cite{plangenllm}.
\item \textbf{Vision-Language-Action (VLA) Models} \\
VLA models couple perception, language understanding, and control to enable robots to follow high-level human instructions. They leverage large vision-language models for open-world generalization, hierarchical planning, and knowledge-augmented reasoning.
Two main paradigms exist. Monolithic models integrate perception, reasoning, and control in a single pipeline, offering strong generalization but facing slow inference and limited interpretability. Hierarchical models decouple planning from execution by producing explicit intermediate outputs (e.g., subtasks, keypoints, or code), which improves modularity, transparency, and long-horizon reasoning but increases system complexity.
Efficiency and robustness are enhanced through parallel decoding, lightweight architectures, action token compression, world-model integration, advanced multimodal perception, reinforcement learning, and safety-aware planning. Remaining challenges include inference bottlenecks, synchronization between modules, and robustness across embodiments~\cite{vla-in-robot},\cite{vla}.
\end{enumerate}
\subsection{Conclusion}
The field of robot manipulation and planning through natural language is moving towards the main goal of enabling robots to seamlessly translate human intent into precise, executable actions. Current progress demonstrates the promise of LLMs and MLLMs in bridging abstract linguistic instructions with the concrete demands of real-world robotics. By combining high-level semantic understanding with fine-grained spatial reasoning, generating interpretable code as flexible policies, and leveraging neuro-symbolic reasoning for systematic generalization, researchers are building the foundations for robots that can adapt to diverse tasks and environments. These approaches, when coupled with modular design and iterative feedback loops, highlight a path toward truly intelligent and collaborative robotic assistants.
For this project, we aim to contribute to this progress by addressing some of the key limitations in the field. In particular, we will explore extending manipulation capabilities toward more complex object handling based on high-level commands, while also investigating methods for improving robustness and adaptability under real-world uncertainty. We see opportunities in integrating neural and symbolic reasoning, advancing perception systems for richer grounding, and experimenting with human-in-the-loop strategies for preference alignment and safe deployment. By targeting these areas, We hope to push the field closer to building robots that are not only capable and efficient but also trustworthy collaborators in human-centered environments.
\newpage
\section{System design and implementation}
\subsection{System Overview}
The proposed system is an end-to-end robotic manipulation pipeline that integrates perception, planning, and execution within a unified ROS2-based framework. It enables the robot to transform high-level natural language instructions into executable actions.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/System_Architecture_Overview.png}
\caption{High-level system workflow illustrating the flow of visual and language inputs through the pipeline}
\label{fig:system_workflow}
\end{figure}
Figure~\ref{fig:system_workflow} presents the overall workflow of the system. The user provides instructions via natural language, which are processed through an Automatic Speech Recognition (ASR) pipeline. In parallel, the environment is perceived through a vision module performing scene understanding, object detection, segmentation, and grasp synthesis. Both modalities are fused into symbolic representations using large language models (LLMs), enabling task understanding and reasoning.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/ros2_graph_updated.png}
\caption{ROS2-based system architecture showing key nodes and their interactions}
\label{fig:project_ros2_architecture}
\end{figure}
As shown in Figure~\ref{fig:project_ros2_architecture}, the system follows a hierarchical planning architecture. The high-level planner decomposes user instructions into subtasks, the medium-level planner maps them to executable robot capabilities, and the low-level executor performs motion execution on the hardware. These components interact closely with perception and user interface nodes within the ROS2 ecosystem.
A human-in-the-loop mechanism allows users to review and confirm generated plans before execution, while an emergency channel ensures immediate interruption of robot actions when required. This design results in a closed-loop system that supports intuitive interaction, reliable planning, and safe execution in dynamic environments.
\subsection{Vision}
The vision system is responsible for providing the robot arm with the spatial and semantic understanding required to locate, identify, and orient objects in the workspace. This section describes the design and implementation of each component in the vision pipeline: grasp pose estimation, pixel-to-real-world coordinate mapping, scene understanding via a Vision-Language Model, and open-vocabulary object detection.
\subsubsection{Grasp Pose Estimation}
The initial hardware deployment evaluated \textbf{GraspNet}~\cite{fang2020graspnet} as the primary grasp pose estimator. GraspNet produces 6-DOF grasp candidates from point clouds; however, real-hardware testing revealed that the predicted poses were excessively noisy for simple planar objects in our workspace, rendering them unreliable for closed-loop execution.
Since the majority of manipulation tasks require top-down grasping constrained to the $xy$-plane, we adopted a deterministic \textbf{Oriented Bounding Box (OBB)} approach as the primary grasp pose estimator, ensuring reliable parallel-jaw gripper alignment within the current hardware baseline. Two ROS2 services are exposed by this module:
\begin{itemize}
\item \texttt{/obb/find\_object\_angle\_bb} returns the object centroid $(u, v)$, rotation angle $\theta$, and bounding box dimensions.
\item \texttt{/obb/find\_object\_angle} returns only the centroid $(u, v)$ and rotation angle $\theta$.
\end{itemize}
The rotation angle $\theta$ is derived from the principal axis of the object mask, enabling the gripper to align with the object's dominant orientation and execute a collision-free parallel-jaw grasp. This design choice provides a deterministic and geometrically interpretable grasp strategy for planar objects. Revisiting GraspNet for complex multi-DOF grasping scenarios is planned once the hardware baseline is fully stabilised.
\subsubsection{Pixel-to-Real-World Coordinate Mapping}
A critical component of the vision pipeline is the \texttt{pixel\_to\_real\_world} module, which maps a 2D image coordinate $(u, v)$ to its corresponding real-world position $(X, Y)$ in the robot's base frame. The reliability of the OBB-based grasping strategy is fundamentally dependent on the accuracy of this mapping.
\paragraph{Baseline Evaluation.}
Initial baseline tests revealed significant spatial deviations when the operational height of the camera was altered. Relying exclusively on an extrinsic matrix derived from a single reference point without comprehensive spatial calibration violates geometric best practices: when only a localised camera coordinate is available, the transformation implicitly assumes an ideal parallel plane, failing to account for physical mounting misalignments or depth-dependent lens distortions.
Baseline discrepancies were measured at two mounting heights $H_b$. At $H_b = 54.41\,\text{cm}$, the total distance RMSE was $4.170\,\text{cm}$; at $H_b = 67.11\,\text{cm}$, it dropped to $1.244\,\text{cm}$. The substantial increase in error at the lower height, without dynamic recalibration, indicates that the root cause of coordinate displacement is \textbf{systematic} (calibration-based) rather than a quantization or resolution artefact.
\paragraph{Pinhole Camera Model.}
The intrinsic projection pipeline follows the standard Pinhole Camera Model, which maps a 3D world point $\mathbf{P} = [X,\,Y,\,Z]^T$ to a 2D pixel $\mathbf{p} = [u,\,v]^T$ via the intrinsic matrix $\mathbf{K}$ and extrinsic parameters $[\mathbf{R}\,|\,\mathbf{t}]$:
\begin{equation}
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} \mathbf{R} \mid \mathbf{t} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\label{eq:pinhole}
\end{equation}
Any misalignment in the rotation matrix $\mathbf{R}$ such as the camera sensor not being perfectly parallel to the workspace introduces an error that scales non-linearly with $Z$, limiting the model's robustness across varying height configurations.
\paragraph{Hybrid Intrinsic-Empirical Calibration.}
To mitigate systematic distortion, we benchmark the pure-intrinsic pipeline against an \textbf{empirical affine correction model} that adjusts for localised systematic deviations (scaling, shear, and translation) on the 2D output plane:
\begin{equation}
\begin{bmatrix} X_{\text{corr}} \\ Y_{\text{corr}} \end{bmatrix} = \begin{bmatrix} s_x & k \\ -k & s_y \end{bmatrix} \begin{bmatrix} X_{\text{raw}} \\ Y_{\text{raw}} \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}
\label{eq:empirical}
\end{equation}
where $s_x$, $s_y$ are axis-wise scale factors, $k$ is a shear coefficient, and $t_x$, $t_y$ are translational offsets fitted from calibration measurements.
To utilise both models simultaneously without degrading system latency, a \textbf{weighted hybrid architecture} was developed. The final estimate blends the intrinsic and empirical predictions based on each pixel's radial distance from the image centre: a higher weight is assigned to the empirical model near the calibration region (centre), while the intrinsic model is prioritised toward image edges where empirical fitting generalises less reliably.
\paragraph{Calibration Evaluation.}
For each test configuration, 20 manually marked pixel points were sampled and compared against their ground-truth real-world coordinates. Results are summarised in Table~\ref{tab:error_isolation}.
\begin{table}[htbp]
\centering
\caption{RMSE comparison of pure intrinsic versus hybrid intrinsic-empirical models at two camera heights. At $0.67\,\text{m}$, the intrinsic model performs best; the empirical model overfits and adds noise. At $0.55\,\text{m}$, empirical correction successfully mitigates systematic distortion.}
\label{tab:error_isolation}
\begin{tabularx}{\textwidth}{@{} >{\centering\arraybackslash}X >{\centering\arraybackslash}X >{\centering\arraybackslash}X @{}}
\toprule
\textbf{Camera Height} & \textbf{Intrinsic Only (RMSE)} & \textbf{Intrinsic and Empirical (RMSE)} \\
\midrule
$0.67\,\text{m}$ & $1.28\,\text{cm}$ & $1.56\,\text{cm}$ \\
$0.55\,\text{m}$ & $2.35\,\text{cm}$ & $1.35\,\text{cm}$ \\
\bottomrule
\end{tabularx}
\end{table}
The results mathematically isolate and confirm that the dominant error source is \textbf{systematic} rather than quantization-based. If the error were driven by limited pixel density, the metric error should decrease or remain constant as the camera moves closer to the workspace (from $0.67\,\text{m}$ to $0.55\,\text{m}$) due to an increase in spatial resolution per pixel. The observed \textit{increase} in intrinsic-only RMSE at the lower height contradicts this hypothesis, validating the use of the empirical affine correction in that regime.
\subsubsection{Scene Understanding via VQA}
To support open-ended semantic scene understanding, we implemented the \texttt{/vision/vqa} service. This component uses the Vision-Language Model \texttt{qwen3-vl:8b}, running locally via Ollama, to answer natural-language queries about the current workspace state. The service accepts a natural-language prompt and returns the model response based on the live RGB camera feed.
\subsubsection{Object Detection and Segmentation Pipeline}
The core object detection pipeline combines Segment Anything Model (\textbf{SAM})~\cite{kirillov2023segment} with \textbf{CLIP}~\cite{radford2021clip} to achieve open-vocabulary, zero-shot object localisation.
\paragraph{Single-Object Detection (\texttt{/vision/find\_object}).}
SAM generates class-agnostic segmentation masks over the entire scene. Each mask is cropped and encoded by CLIP, and the resulting embedding is compared against the text embedding of the query label via cosine similarity. The mask region achieving the highest similarity score is returned as a single bounding box. A strict rejection threshold of $\text{similarity} < 0.2$ is enforced to suppress false positives.
\paragraph{Multi-Object Detection (\texttt{/vision/find\_multi\_object}).}
An extended service returns the top-$k$ bounding boxes ranked by CLIP similarity score, where $k$ is specified in the request. This enables the system to detect multiple instances of the same object class within a single scene, supporting tasks that require manipulating more than one object of matching appearance.
The cosine similarity used for both ranking and rejection is defined as:
\begin{equation}
\text{Similarity} = \left(\frac{\mathbf{f}_{\text{image}}}{\|\mathbf{f}_{\text{image}}\|}\right) \cdot \left(\frac{\mathbf{f}_{\text{text}}}{\|\mathbf{f}_{\text{text}}\|}\right)^T
\label{eq:cosine}
\end{equation}
where $\mathbf{f}_{\text{image}}$ and $\mathbf{f}_{\text{text}}$ are the CLIP image and text feature vectors, respectively. Similarity scores range from $-1.0$ to $1.0$.
\subsection{Speech}
The speech module enables natural language interaction between the user and the robotic system through a real-time Automatic Speech Recognition (ASR) pipeline. Its primary function is to convert spoken commands into structured textual instructions that can be consumed by the planning subsystem within the ROS2 architecture.
As illustrated in Figure~\ref{fig:speech-pipeline}, the module follows a sequential processing pipeline consisting of audio capture, speech-to-text transcription, and command validation.
\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{images/speech_pipeline.png}
\caption{Speech processing pipeline from audio input to planning and emergency control}
\label{fig:speech-pipeline}
\end{figure}
\paragraph{Audio Capture}
A microphone device continuously captures user speech input. The audio stream is processed in real time and forwarded to the ASR engine. The system is designed to operate in an event-driven manner, allowing immediate processing once speech is detected.
\paragraph{Speech-to-Text Processing}
The captured audio is transcribed into text using the AssemblyAI ASR service. The transcription process converts raw audio signals into natural language text with low latency, enabling near real-time interaction. The resulting text output is then published to a dedicated ROS2 topic, allowing other modules, particularly the planning node, to subscribe and process incoming commands.
\paragraph{Command Validation and Gating}
To ensure safe and intentional execution, a keyword-based gating mechanism is implemented. A command is only forwarded to the planning module if it contains the predefined activation keyword \texttt{``EXECUTE''}. This prevents incomplete or unintended speech from triggering robot actions.
Formally, let $S$ denote the transcribed sentence. The command is considered valid if:
\[
\text{Valid}(S) =
\begin{cases}
1, & \text{if } \texttt{``EXECUTE''} \in S \\
0, & \text{otherwise}
\end{cases}
\]
Only when $\text{Valid}(S) = 1$ will the command be published to the planning node.
\paragraph{Emergency Stop Mechanism}
In addition to command execution, the module incorporates a safety-critical interrupt mechanism. When the keyword \texttt{``STOP''} is detected, an emergency signal is immediately published to a dedicated ROS2 topic. This topic is continuously monitored by the low-level controller, which halts all robot motion upon receiving the signal.
The system remains in a halted state until the recovery keyword \texttt{``OKAY''} is detected, at which point normal operation resumes.
\paragraph{System Integration}
The speech module interfaces directly with the ROS2 ecosystem through topic-based communication. Transcribed commands are published to the planning node, while emergency signals are sent to the motion controller. This modular design ensures clear separation between perception, decision-making, and execution layers.
\subsubsection{Chat GUI Fallback Interface}
As a practical fallback for the currently missing TTS module, we implemented a lightweight chat-style graphical user interface (GUI) to preserve user-system interaction quality. The interface allows users to send typed commands, receive system responses in a conversational format, and explicitly confirm execution decisions. The main interface layout is shown in Figure~\ref{fig:chat-gui-main}.
At the command layer, the GUI dispatches high-level requests through \texttt{ros2 action send\_goal /prompt\_high\_level ...}, receives planner outputs from \texttt{/response}, and triggers confirmation through \texttt{/confirm}. To keep the interface responsive, ROS spinning and command-line invocations run in background worker threads, while the main Qt thread performs timer-driven queue polling and incremental message rendering.
From a usability perspective, the interface introduces role-separated chat bubbles (System/User/Bot), response streaming simulation, and sender-grouped messages. The emergency workflow is also integrated directly in the GUI: pressing STOP publishes \texttt{True} to \texttt{/emergency}, locks normal controls, and displays a modal safety overlay; recovery requires a deliberate full-slider unlock that publishes \texttt{False} to \texttt{/emergency}. The emergency-locked state is illustrated in Figure~\ref{fig:chat-gui-emergency}.
\IfFileExists{images/chat_gui_main.png}{
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/chat_gui_main.png}
\caption{Chat GUI fallback interface used as a temporary substitute for TTS feedback.}
\label{fig:chat-gui-main}
\end{figure}
}{}
\IfFileExists{images/chat_gui_emergency.png}{
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/chat_gui_emergency.png}
\caption{Emergency-stop GUI state with modal lock and deliberate slider-based recovery.}
\label{fig:chat-gui-emergency}
\end{figure}
}{}
Overall, the speech subsystem provides a reliable and structured interface for human–robot interaction, enabling safe command execution while maintaining responsiveness and modularity for future extensions.
\subsection{Planning}
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/ros2_graph_updated.png}
\caption{ROS2-based architecture of the hierarchical planning system}
\label{fig:planning_ros2}
\end{figure}
The planning system is implemented as a hierarchical pipeline that decomposes natural language instructions into executable robot actions. As shown in Figure~\ref{fig:planning_ros2}, the architecture separates planning into three layers: high-level reasoning, medium-level task execution, and low-level motion control. This design addresses the challenges of ambiguous language, strict execution constraints, and noisy perception, while enabling modular development and debugging.
At the \textbf{high level}, natural language instructions are translated into structured task plans. Two complementary approaches are supported. In \textbf{LLM-Direct Planning}, a large language model (LLM) directly generates a sequence of subtasks and corresponding actions for downstream execution. In \textbf{Symbolic Planning (PDDL-based)}, the instruction is transformed into a formal planning problem and solved by Fast Downward~\cite{the-fast-downward-planning-system}, producing a logically grounded action sequence. In both approaches, the generated plan is presented to the user before execution so that steps can be confirmed, revised, or rejected through natural language feedback.
Within the PDDL-based branch, we evaluate two problem-generation strategies. In \textbf{Direct Generation}, the LLM generates the full PDDL problem text from the input description. In \textbf{IR-Based Generation}, the model first generates a structured intermediate representation (IR) in JSON format, and a deterministic conversion step then renders the final PDDL problem file.
The JSON IR contains three core elements: \texttt{objects}, \texttt{init} predicates, and \texttt{goals}. This design reduces the burden of producing strict parenthesized PDDL syntax directly while preserving the symbolic content required for planning. It also aligns with API-level structured output constraints, improving parseability and reducing post-processing effort.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{images/pddl_generation.png}
\caption{Comparison of direct PDDL generation and the proposed IR-based pipeline}
\label{fig:pddl-pipeline}
\end{figure}
As illustrated in Figure~\ref{fig:pddl-pipeline}, the structured IR is deterministically transformed into a valid PDDL problem file using Jinja2 templating. Object identifiers are mapped into the \texttt{(:objects ...)} block, initial predicates are rendered in \texttt{(:init ...)}, and goal predicates are rendered in \texttt{(:goal (and ...))}. Because this mapping is rule-based, variability is confined to IR generation while syntax rendering remains reproducible. The domain file remains fixed and manually authored, ensuring strict control over available predicates and actions.
The \textbf{medium-level planner} bridges high-level plans and execution by decomposing subtasks into sequences of callable robot capabilities. Implemented with LangChain~\cite{langchain-docs}, it exposes the \texttt{/medium\_level} action server and treats low-level skills as callable tools, enabling dynamic composition of motion steps according to current task context and feedback.
At the \textbf{low level}, the \texttt{low\_level\_planner\_executor} node executes motion commands using constrained Cartesian path planning and joint-space safety constraints to ensure smooth and predictable operation. The node provides dedicated action servers including \texttt{/plan\_cartesian\_relative}, \texttt{/get\_current\_pose}, \texttt{/get\_joint\_angles}, \texttt{/set\_joint\_angles}, and \texttt{/plan\_complex\_cartesian\_steps}. Safety is reinforced through subscription to an \texttt{/emergency} topic that can immediately interrupt motion when a critical keyword is detected.
The planning stack is tightly integrated with perception and speech. The \texttt{vision} node is available as a reasoning source for high- and medium-level planners, and the \texttt{asr} node publishes recognized commands to \texttt{/transcript} after activation-keyword detection, supporting event-driven execution flow across the hierarchy.
Overall, the hierarchical and hybrid planning framework enables reliable transformation of natural language instructions into executable robot behavior. The combined architecture preserves flexibility through language-conditioned planning, improves formal reliability through IR-based symbolic generation, and maintains safety and modularity through layered execution and emergency interruption handling.
\subsection{Simulation}
\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{images/gazebo_setup.png}
\caption{UR5 with Robotiq 2F-85 gripper and depth camera in Gazebo simulation environment}
\label{fig:simulation-gazebo-setup}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{images/ros2_sim_graph_full.png}
\caption{ROS2 graph for the Gazebo simulation environment}
\label{fig:simulation-ros2-graph}
\end{figure}
The simulation environment is implemented in Gazebo Classic and provides a complete, fully integrated setup for testing perception, motion planning, and gripper control. The system models the UR5 robotic arm equipped with a Robotiq 2F-85 gripper and a downward-facing depth camera, together with a table environment and a collection of engineering objects. All components are interfaced through ROS2, enabling real-time communication and execution of motion and perception tasks within a unified simulation framework.
At the \textbf{robot and sensor level}, the UR5, gripper, and camera are simulated using standard Gazebo ROS2 interfaces. The robot exposes joint states, controllers, and transforms exactly as in a physical setup, while the RGB-D camera publishes depth and color images that integrate directly with the perception pipeline. Using \texttt{cv\_bridge}~\cite{cv_bridge}, raw Gazebo image messages are converted into OpenCV-compatible formats, allowing the vision node to operate identically to its real-world counterpart.
The simulation world includes several engineered parts used for testing perception-driven tasks. While some models contain simplified collision meshes, making precise grasping less reliable, the environment is sufficient for validating perception, planning, and trajectory execution. In line with guidance from our advisor, detailed grasp evaluation will be carried out on the real hardware, while Gazebo is used primarily for motion, planning, and system integration testing.
Overall, the Gazebo Classic simulation delivers a robust and realistic platform for evaluating the majority of our robotic pipeline. The ROS2 graph in Figure~\ref{fig:simulation-ros2-graph} highlights the modular structure that supports coordinated operation of all simulated components.
\subsection{Hardware}
The hardware subsystem is centered around a UR7 collaborative robotic manipulator equipped with a parallel gripper and an external depth camera. The system is designed to provide a stable and consistent physical platform that mirrors the interfaces and abstractions used in simulation, enabling seamless deployment of planning and perception pipelines. The deployed workstation layout and robot placement are shown in Figures~\ref{fig:hardware-setup-overall-1} and~\ref{fig:hardware-setup-overall-2}.
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{images/setup_1.jpg}
\caption{Overall hardware setup showing the UR7 platform in the experimental workspace}
\label{fig:hardware-setup-overall-1}
\end{figure}
The UR7 robot provides six degrees of freedom, allowing flexible manipulation within a typical tabletop workspace. It is interfaced with a workstation via a dedicated Ethernet connection, enabling real-time communication with the robot controller through the \texttt{ur\_robot\_driver}. Motion execution is handled using MoveIt2, which performs inverse kinematics, trajectory generation, and collision-aware planning before streaming commands to the robot.
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{images/setup_2.jpg}
\caption{Hardware setup with mounted acrylic sheet}
\label{fig:hardware-setup-overall-2}
\end{figure}
From a kinematic perspective, the pose of the robot end-effector is represented using a homogeneous transformation matrix:
\begin{equation}
T_{ee} =
\begin{bmatrix}
R & p \\
0 & 1
\end{bmatrix}
\end{equation}
\noindent
where $R$ is a $3 \times 3$ rotation matrix representing orientation, and $p = [x \; y \; z]^T$ is the position vector of the end-effector. This representation is used consistently across perception, planning, and control modules.
To ensure safe and stable motion, constraints are applied in both joint space and Cartesian space. Joint velocities are bounded as:
\begin{equation}
|\dot{q}_i| \leq \dot{q}_{\max}, \quad \forall i \in \{1,\dots,6\}
\end{equation}
\noindent
while the Cartesian velocity of the end-effector is limited by:
\begin{equation}
\|v_{ee}\| \leq v_{\max}
\end{equation}
\noindent
These constraints ensure smooth and predictable robot motion during execution.
The external depth camera is rigidly mounted relative to the robot base frame. The transformation between the camera frame and the robot base frame is defined as:
\begin{equation}
T_{base}^{cam}
\end{equation}
\noindent
which is obtained through calibration and used for coordinate transformation. Given a pixel coordinate $(u, v)$ and an associated depth value $d$, the corresponding 3D point in the camera frame is computed as:
\begin{equation}
p_{cam} = d \cdot K^{-1}
\begin{bmatrix}
u \\ v \\ 1
\end{bmatrix}
\end{equation}
\noindent
where $K$ is the camera intrinsic matrix. The point is then transformed into the robot base frame:
\begin{equation}
p_{base} = T_{base}^{cam} \cdot p_{cam}
\end{equation}
\noindent
This formulation enables consistent integration between perception outputs and motion planning inputs.
To improve perception reliability, a uniform acrylic plate is placed within the camera’s field of view to act as a controlled visual background, as shown in Figure~\ref{fig:hardware-setup-overall-2}. This reduces the impact of environmental noise and improves the robustness of object detection and segmentation.
In addition, structured cable management is implemented using a custom mounting solution to secure wiring and prevent interference with the robot’s motion (Figure~\ref{fig:hardware-cable-management}). These design considerations improve system stability and ensure safe operation within the robot workspace.
\begin{figure}[H]
\centering
\includegraphics[width=0.9\linewidth]{images/cable_management.jpg}
\caption{Cable management system using custom 3D-printed mounting components}
\label{fig:hardware-cable-management}
\end{figure}
The hardware architecture is designed to maintain full compatibility with the simulation environment, ensuring that motion plans, coordinate frames, and control interfaces remain consistent across both domains. This allows high-level plans developed and validated in simulation to be deployed directly onto the physical robot without modification.
\newpage
\section{Project Planning}
The project is structured around four primary technical area of concentration exluding any relevant paperwork tasks. These align seamlessly with the core enhancements, which will all be implemented in parallel: \textbf{image processing, speech processing, planning algorithms, and robotic arm integration}. The planned timeline is structured into weekly milestones to ensure iterative progress and early validation
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/gantt_chart_sem_2.png}
\caption{A proposed project timeline scheduled for the project’s implementation in this semester}
\label{fig:gantt-chart}
\end{figure}
\subsection{Initial Timeline}
\subsubsection{Image Processing}
[2 Weeks] Procure depth camera and camera stand
\begin{itemize}
\item Identify suitable depth camera specifications based on workspace size, required accuracy, and compatibility with the existing system.
\item Procure and assemble the depth camera and adjustable camera stand for stable mounting.
\end{itemize}
[2 Weeks] Integrate depth camera with system
\begin{itemize}
\item Install and configure camera drivers and SDKs within the system environment.
\item Validate depth and RGB data acquisition and ensure reliable data streaming.
\end{itemize}
[9 Weeks] Testing and troubleshooting
\begin{itemize}
\item Develop and refine perception pipelines for object detection, localization, and pose estimation.
\item Perform iterative testing under varying lighting and workspace conditions to improve robustness.
\end{itemize}
[4 Weeks] Evaluation and benchmarking
\begin{itemize}
\item Quantitatively evaluate perception accuracy and latency.
\item Benchmark performance against task requirements for real-time robotic manipulation.
\end{itemize}
\subsubsection{Speech Processing}
[2 Weeks] Procure microphone and needed hardware
\begin{itemize}
\item Select and procure a microphone suitable for speech capture in a lab environment.
\item Verify hardware compatibility and audio input quality.
\end{itemize}
[4 Weeks] Implement TTS and enhancement features
\begin{itemize}
\item Implement text-to-speech pipeline for bidirectional interaction.
\item Apply noise reduction and audio preprocessing to improve recognition accuracy.
\end{itemize}
[9 Weeks] Testing and troubleshooting
\begin{itemize}
\item Test speech recognition across different speakers and command variations.
\item Improve error handling, command disambiguation, and response consistency.
\end{itemize}
[4 Weeks] Evaluation and benchmarking
\begin{itemize}
\item Measure command recognition accuracy and response latency.
\item Evaluate system usability in continuous human–robot interaction scenarios.
\end{itemize}
\subsubsection{Planning Algorithm}
[2 Weeks] Complete integration of system with hardware
\begin{itemize}
\item Establish communication interfaces between perception modules, planning modules, and hardware components.
\item Validate end-to-end data flow.
\end{itemize}
[4 Weeks] Enhance planning algorithm
\begin{itemize}
\item Extend and improve the planning framework.
\end{itemize}
[9 Weeks] Testing and troubleshooting
\begin{itemize}
\item Test the planner across diverse task scenarios and edge cases.
\item Refine decision-making logic to handle incomplete or uncertain inputs.
\end{itemize}
[4 Weeks] Evaluation and benchmarking
\begin{itemize}
\item Evaluate planning success rate, adaptability, and execution efficiency.
\item Benchmark planning performance in conjunction with perception and speech modules.
\end{itemize}
\subsubsection{Interface with Robot Arm}
[5 Weeks] Integrating with actual hardware and UR arm
\begin{itemize}
\item Improve interfaces for commanding the UR robotic arm.
\end{itemize}
[9 Weeks] Testing and troubleshooting
\begin{itemize}
\item Perform controlled physical experiments to ensure safe and reliable operation.
\end{itemize}
[4 Weeks] Evaluation and benchmarking
\begin{itemize}
\item Evaluate precision, repeatability, and task completion time.
\item Assess overall system performance in real-world tool-handling scenarios.
\end{itemize}
\subsection{Current Timeline}
\subsubsection{Image Processing}
The vision subsystem development was completed on schedule, with all foundational tasks and benchmarking successfully delivered. To better suit empirical deployment, some simulation components are being adapted, notably transitioning from GraspNet to Oriented Bounding Box (OBB) representations. The pipeline is now stable. Future work will finalize these architectural shifts and evaluate system robustness across diverse physical scenarios.
\subsubsection{Speech Processing}
The main functionality of the speech module has been completed on time according to the schedule. In particular, the transcription pipeline that converts spoken commands into text has been implemented, and keyword detection for \texttt{EXECUTE}, \texttt{STOP}, and \texttt{OKAY} is now working, tested, and benchmarked.
The text-to-speech (TTS) component was not fully implemented and integrated into ROS2 within the planned timeframe. However, this does not affect the system's core functionality and does not hinder the primary user experience, since command understanding and execution remain fully operational.
As an operational substitute for missing TTS feedback, a chat GUI interface has been implemented and integrated with the ROS2 pipeline. This interface provides command entry, streamed system responses, explicit confirmation controls, and emergency-stop interaction, allowing practical day-to-day human--robot interaction while full voice feedback remains future work.
\subsubsection{Planning Algorithm}
The project is completed on time according to the schedule. All tasks listed in the Initial Timeline for the Planning Algorithm section have been completed: system integration with hardware, planning algorithm enhancement, testing and troubleshooting, and evaluation and benchmarking. In particular, the automatic PDDL generation pipeline was improved by introducing a structured intermediate representation (IR) instead of directly generating PDDL code, which reduced syntax errors, improved interpretability, and lowered token usage.
With these milestones finished, the planning pipeline is now in a stable state, and future work can focus on refining the system further and benchmarking it across more diverse task scenarios to evaluate robustness and efficiency.
\subsubsection{Interface with Robot Arm}
Progress on the hardware subsystem is on schedule, with all major integration tasks completed. The UR7 robotic manipulator has been successfully deployed and interfaced with the workstation, and stable communication through the control pipeline has been established.
The depth camera has been mounted at the base of the robot and rigidly fixed to maintain a consistent spatial relationship with the robot frame. A fixed transformation between the camera and robot base has been defined, enabling reliable perception-to-action mapping without requiring repeated calibration across sessions.
The experimental workspace has been prepared with a controlled visual setup. An acrylic plate has been installed within the camera’s field of view to provide a uniform background, improving perception consistency during operation.
Mechanical integration has also been finalized. A structured cable management solution has been implemented using a custom mounting design to secure and guide wiring, ensuring that cables do not interfere with the robot’s motion. The camera mounting system has been reinforced to minimize vibration during operation, improving overall system stability.
With these components in place, the hardware platform is now in a stable and operational state, supporting reliable execution of perception and planning tasks. Future work will focus on further improving mechanical robustness and evaluating system performance under more dynamic real-world conditions.
\newpage
\section{Theory and Technical Backup}
\subsection{Voice Control Systems (VCS)}
Voice control systems have significantly transformed human-computer interaction by enabling users to interact with interfaces using voice commands and natural language. These systems leverage automatic speech recognition (ASR) to convert voice recordings into text, and advancements in deep learning have greatly enhanced their capabilities, alongside natural language processing. VCS offers numerous benefits, including saving users time and resources by providing quick services through an efficient interface without requiring extensive additional devices. They foster natural voice communication between users and devices, enhancing user comfort and creating a more intuitive interface. The adoption rate of voice-activated digital assistants has increased substantially, leading to a wide array of Internet of Things (IoT) devices incorporating voice-activated user interfaces for tasks ranging from creating grocery lists to controlling smart homes. Beyond smart gadgets and IoT, VCS technology is also applied in automotive systems to prevent accidents by allowing drivers to focus on their surroundings. VCS significantly improves the quality of life for various groups, including elderly people, children under four, and individuals with physical disabilities. They can provide complete control over home or office appliances, facilitate traffic regulation, and enhance healthcare systems~\cite{vcs}.
\subsection{Multimodal Automatic Speech Recognition}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{asr}
\caption{Multimodal ASR model architecture.
(from~\cite{Chang2023})}
\label{fig:asr}
\end{figure}
Multimodal Automatic Speech Recognition (ASR) enhances speech transcription by integrating visual context alongside audio signals, which is crucial for tasks such as embodied agents and human-robot interaction. By leveraging both modalities, multimodal ASR improves robustness and accuracy over traditional unimodal systems, achieving lower Word Error Rate (WER) and higher Recovery Rate (RR), especially under noisy or masked audio conditions. It is particularly effective at recovering visually salient words, such as object names, thereby increasing task completion rates in instruction-following scenarios~\cite{Chang2023},\cite{multi}.
Typical architectures combine a speech encoder, such as wav2vec 2.0, with a visual encoder, often based on CLIP-ViT, feeding both into a Transformer-based decoder that attends jointly to audio and visual features. Unlike unimodal ASR, which relies only on speech input, multimodal ASR grounds transcription in environmental context, reducing ambiguity and improving generalization to unseen speakers and settings~\cite{Chang2023},\cite{multi}.
Applications include benchmarks like ALFRED, where agents must interpret spoken instructions to navigate and manipulate objects, as well as real-world human-robot communication. Compared with unimodal baselines, multimodal ASR achieves superior masked word recovery (up to 30\% more) and mitigates performance degradation in noisy conditions~\cite{Chang2023},\cite{multi}. Challenges remain regarding generalization to real-world environments, as many studies rely on synthetic data and simplified noise models.
\subsection{Neuro-Symbolic Artificial Intelligence (NSAI)}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{nsai-types}
\caption{Neuro-Symbolic AI types
(from~\cite{nsai})}
\label{fig:nsai-types}
\end{figure}
Neuro-Symbolic Artificial Intelligence (NSAI) is an emerging paradigm that combines the strengths of neural networks with symbolic reasoning. Neural networks excel in pattern recognition and data-driven learning but often lack explainability and explicit reasoning capabilities, operating as "black boxes". In contrast, symbolic AI is inherently interpretable and excels in logical reasoning but struggles with perception from raw data and requires extensive manual knowledge engineering. NSAI aims to bridge these gaps by integrating the robust statistical learning of neural networks with the structured knowledge and logic of symbolic AI, enabling systems to reason, make decisions, and generalize knowledge more effectively from large datasets. This hybrid approach enhances interpretability, robustness, and trustworthiness, while also facilitating learning from less data~\cite{nsai}.\\
NSAI systems can be categorized into various architectures based on the connection between their neural and symbolic components, as shown in Figure~\ref{fig:nsai-types}. These include:
\begin{itemize}
\item \textbf{Symbolic Wrapper (Type 1)}: Symbolic systems guide both input and output, leveraging neural networks internally for data-driven learning (e.g., DeepProbLog, Neuro-Symbolic Concept Learner (NSCL), hybrid neuro-symbolic robotics).
\item \textbf{Symbolic [Neuro] (Type 2)}: Symbolic systems dominate, with neural modules used internally for perception and pattern recognition (e.g., in autonomous communication or scene interpretation).
\item \textbf{Bidirectional Interaction (Type 3)}: Neural and symbolic components tightly interact, with neural models adjusting outputs based on symbolic constraints and symbolic modules evolving reasoning based on neural feedback (e.g., AlphaGo, AlphaZero, in robotics).
\item \textbf{Sequential Connection (Type 4)}: Neural networks connect sequentially to symbolic systems via an explicit mapping layer, converting continuous neural outputs into structured symbolic knowledge (e.g., Neural Symbolic Machines (NSM)).
\item \textbf{Embedded Symbolic Information (Type 5)}: Symbolic information is directly embedded into tensor representations, enhancing neural network reasoning in a differentiable manner (e.g., Logic Tensor Networks (LTNs)).
\item \textbf{Deeply Integrated (Type 6)}: Symbolic structures are directly processed by neural networks, blending symbolic reasoning and neural computation seamlessly (e.g., Neural Theorem Provers (NTP), Graph Neural Networks (GNNs)).
\end{itemize}
\subsection{Large Language Model (LLM)}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.5\linewidth]{transformer}
\caption{The Transformer-model architecture
(from~\cite{attention-is-all-you-need})}
\label{fig:transformer}
\end{figure}
Large Language Models (LLMs), such as BERT and GPT-5, are pretrained neural networks that serve as powerful foundation models for language understanding and generation. They build on the Transformer architecture, shown in Figure~\ref{fig:transformer}, a sequence model that relies entirely on attention mechanisms and avoids recurrence or convolutions. The Transformer employs an encoder-decoder structure with stacked layers of multi-head self-attention and feed-forward networks, enabling the model to relate different positions within a sequence and capture diverse representation subspaces. Positional encodings are added to input embeddings to provide information about token order~\cite{attention-is-all-you-need}. LLMs represent words as vectors and acquire robust language and world knowledge through large-scale self-supervised learning, such as predicting masked words or the next word in a sequence. These models excel at tasks including machine translation, question answering, text classification, and text generation, leveraging vast amounts of text data to learn meanings and factual knowledge. Despite their capabilities, LLMs have limitations in deep understanding, reasoning, and completeness of knowledge, and raise societal concerns regarding bias and centralized model control~\cite{human-language-understanding}.
The reason why it is preferred over traditional Natural Language Processing (NLP) models is due to
the impressive ability to grasp broad contexts with little to no training. Mostly outperforms various
Natural Language Processing (NLP) techniques without having to fine-tune with execution cost.
Among the algorithms capable of handling a range of Natural Language Processing tasks, Large
Language Model (LLM) has not only frequently appeared in numerous past studies but has also
showcased immense potential for generating plans, transforming an initial world state into a desired state~\cite{plangenllm}.
\subsection{LangChain}
The open-source framework LangChain is designed to simplify the construction of applications powered by large language models (LLMs). In essence, LangChain supplies modular components, standardised abstractions, and orchestration utilities so that developers can build end-to-end LLM-based pipelines, agents, chatbots, retrieval-augmented generation (RAG) systems and more~\cite{langchain-docs}.
\subsubsection*{Overall architecture}
At a high level, LangChain can be thought of as layering on top of LLM providers (OpenAI, Anthropic, Google, etc.), providing a standard model-interface, a message history abstraction, tool-invocation support, memory/RAG support, and agent orchestration. Its goals include:
\begin{itemize}
\item A “standard model interface” so that different model providers can be swapped without deeply rewriting code.
\item Pre-built abstractions such as “chains”, “agents” and “tools” so the developer doesn’t start from scratch.
\item Capacity for persistence, human-in-the-loop control, durable execution and streaming, in part via its underlying runtime platform (LangGraph) though for many basic usages developers need not explicitly interface with it.
\end{itemize}
\subsubsection*{Core concepts}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.5\linewidth]{images/langchain-agent.png}
\caption{Langchain Agent (from~\cite{langchain-docs})}
\label{fig:langchain-agent}
\end{figure}
\paragraph{Models}
Models correspond to LLMs (or chat models) that produce text (or chat) responses. LangChain exposes a standard model interface so that developers can change the underlying provider (e.g., switching from OpenAI to Anthropic) without rewriting large parts of code.
\paragraph{Messages}
Messages in LangChain represent the conversational context and follow roles like “system”, “user”, “assistant”. The message history can be passed to a model invocation so that it is aware of prior exchanges. This abstraction enables chat-style interactions and helps maintain context.
\paragraph{Tools}
Tools are external functions or actions that an agent can invoke. They might include retrieval of information, database queries, web APIs, or custom business logic. By combining the model’s generative power and tool invocation, agents built with LangChain can go beyond simple completion tasks into action-oriented workflows.
\paragraph{Short-Term Memory (and longer-term memory) }
Memory abstractions allow an agent to keep track of prior information and recall it during the interaction. Short-term memory might simply buffer recent messages; longer-term memory or retrieval-augmented memory might use vector stores or knowledge bases to fetch relevant prior context. This capability is critical for multi-turn dialogues or when the system must reference earlier interactions.
\paragraph{Agents}
Agents are orchestrators combining models, tools, messages, and memory into behaviour-enabled entities. A LangChain agent uses the model as the reasoning engine, deciding whether to call a tool (and which one), how to interpret responses, and iteratively work towards solution, as shown in Figure~\ref{fig:langchain-agent}.
\subsection{Vision Language Model (VLM)}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{images/vlm-structure.png}
\caption{Structure of a Typical Vision Language Model (from~\cite{huggingface_vlms_2024})}
\label{fig:vlm-structure}
\end{figure}
Vision Language Models (VLMs) are multimodal generative AI models that combine image and text understanding to perform a wide range of tasks, including visual question answering, image captioning, document analysis, and instruction-based image recognition. Many VLMs can also reason about spatial relationships in images, producing outputs like bounding boxes or segmentation masks. Architecturally, they usually consist of an image encoder, a projection layer that aligns image and text embeddings, and a text decoder for generating output, as shown in Figure~\ref{fig:vlm-structure}. Popular open-source VLMs include LLaVA, KOSMOS-2, Qwen-VL, CogVLM, and Fuyu-8B, each varying in model size, supported features such as chat or grounding, and image resolution~\cite{huggingface_vlms_2024}.
VLMs are powerful in zero-shot generalization, allowing them to handle unseen images and prompts without additional training. Furthermore, fine-tuning allows adaptation to specialized applications, such as educational tools, interactive assistants, or automated document evaluation. Their limitations include susceptibility to hallucinations, biases from training data, and high computational requirements for end-to-end training or large models~\cite{huggingface_vlms_2024}.
\subsection{Vision-Language-Action (VLA) Models}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{images/vla.png}
\caption{End-to-end Tokenization and Representation Process in VLA Models (from~\cite{vla})}
\label{fig:vla}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{images/vla-in-robot.png}
\caption{Comparison between Monolithic and Hierarchical models (from~\cite{vla-in-robot})}
\label{fig:vla-in-robot}
\end{figure}
Vision-Language-Action (VLA) models are multimodal AI systems that unify visual perception, natural language understanding, and embodied action into a single framework. Originally developed to overcome the fragmentation of traditional robotic pipelines, VLAs integrate pretrained Vision-Language Model (VLM) backbones with action policies, enabling robots to interpret high-level human instructions, reason about spatial relationships, and execute manipulation tasks in dynamic, real-world environments~\cite{vla}. For example, a VLA can follow commands such as “place the red mug next to the laptop onto the top shelf” by grounding visual perception, spatial reasoning, and motor control in a shared computational space. Architecturally, VLAs often rely on transformer-based models that fuse visual, linguistic, and state inputs into a common embedding~\cite{vla},\cite{vla-in-robot}.
A core innovation of VLAs lies in treating robot control as a sequence generation problem. Continuous actions, such as joint angles or end-effector poses, are discretized into tokens, allowing the model to autoregressively predict action tokens in the same way language models generate text. Alongside these, prefix tokens encode context from images and text, while state tokens represent the robot’s internal configuration, as shown in Figure~\ref{fig:vla}. This unified tokenization framework enables reasoning and execution to occur in the same latent space, with a de-tokenizer translating action tokens back into executable motor commands~\cite{vla},\cite{vla-in-robot}.
Two main architectural paradigms have emerged in robotic manipulation, as show in Figure~\ref{fig:vla-in-robot}. Monolithic models directly map multimodal inputs to low-level actions, sometimes combining a “slower” VLM for reasoning with a “faster” policy module for reactive control. In contrast, hierarchical models decouple high-level planning from low-level execution, with a planner producing intermediate representations (e.g., subtasks or keypoints) for downstream policies~\cite{vla-in-robot}. These designs balance efficiency, interpretability, and modularity, depending on task complexity.
VLAs represent a shift from handcrafted pipelines toward end-to-end, language-driven policy generation. They leverage the semantic grounding of VLM pretraining while extending it to action, offering broad potential in robotics, embodied assistants, and human-robot interaction. However, they face limitations such as high computational requirements, difficulty in precisely grounding abstract instructions, and restricted generalization beyond training distributions~\cite{vla},\cite{vla-in-robot}.
\subsection{Planning Domain Definition Language (PDDL)}
\label{sec: pddl}
Planning Domain Definition Language (PDDL) is a standardized language for modeling automated planning problems in artificial intelligence. It separates a domain (the rules of the world: types, predicates, and actions) from a problem (a specific initial state and goal). PDDL allows users to specify objects, predicates, actions, and goals in a structured way, enabling automated planners to generate sequences of actions that achieve desired outcomes. However, PDDL is deliberately neutral: it encodes the physics of a domain without embedding heuristics or solving strategies, meaning many planners only support subsets of the language~\cite{pddl-1.2}.
A classic illustration is the Briefcase World. Here, moving the briefcase also moves everything inside it, while objects can be put in or taken out. Below, the domain defines actions and the problem specifies the goal of moving a dictionary and briefcase to the office while leaving a paycheck at home:
\begin{lstlisting}[language=PDDL]
(define (domain briefcase-world)
(:requirements :strips :typing :conditional-effects)
(:types location physob)
(:constants B - physob)
(:predicates (at ?x - physob ?l - location) (in ?x ?y - physob))
(:action mov-b
:parameters (?m ?l - location)
:precondition (and (at B ?m) (not (= ?m ?l)))
:effect (and (at B ?l) (not (at B ?m))
(forall (?z)
(when (and (in ?z) (not (= ?z B)))
(and (at ?z ?l) (not (at ?z ?m)))))))))
(:action put-in
:parameters (?x - physob ?l - location)
:precondition (not (= ?x B))
:effect (when (and (at ?x ?l) (at B ?l))
(in ?x)))
(:action take-out
:parameters (?x - physob)
:precondition (not (= ?x B))
:effect (not (in ?x)))
(define (problem get-paid)
(:domain briefcase-world)
(:init (place home) (place office)
(object p) (object d) (object b)
(at B home) (at P home) (at D home) (in P))
(:goal (and (at B office) (at D office) (at P home))))
\end{lstlisting}
PDDL has evolved from its early versions, like PDDL 1.2, which used predicate logic with true/false properties and actions, to more expressive forms. PDDL 2.1 introduced durative actions and numeric fluents for modeling time and resources, while PDDL 2.2 added derived predicates and timed literals. PDDL 3.0 incorporated soft constraints and preferences with costs, and PDDL+ enabled modeling of processes and uncontrollable events. Specialized variants such as PPDDL, NDDL, and MADDL were later developed for domain-specific planning needs~\cite{pddl}.
\subsection{Fast Downward}
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/fast_downward_execution.png}
\caption{The three phases of Fast Downward’s execution (from~\cite{the-fast-downward-planning-system})}
\label{fig:fd_execution}
\end{figure}
The Fast Downward planning system is a domain-independent classical planner built for deterministic planning tasks encoded in PDDL~\cite{the-fast-downward-planning-system}. It employs a forward search strategy, but crucially it does not operate directly on the propositional PDDL as given; instead it transforms the input into a multi-valued planning task (MPT) representation, and then compiles further structural knowledge (causal graphs, domain‐transition graphs, etc.) to support heuristic search~\cite{the-fast-downward-planning-system}.
\bigskip
The Fast Downward planning system operates in three main phases, as illustrated in Figure~\ref{fig:fd_execution}:
\begin{itemize}
\item \textbf{Translation.} The input PDDL task (with propositional facts, operators, axioms) is first normalized, grounded, invariants are synthesized, and an equivalent multi‐valued planning task is generated. This step is designed to expose structure (e.g., mutual exclusion of propositions) that is otherwise implicit in the propositional encoding.
\item \textbf{Knowledge Compilation.} From the MPT representation the planner builds data structures such as domain‐transition graphs (for each variable, how its values can change under operators), the causal graph (showing dependencies between variables), successor‐generators (efficient enumeration of applicable operators), and axiom evaluators (to compute derived variable values).
\item \textbf{Search.} Given the compiled knowledge structures, the planner uses heuristic best‐first search variants. Its signature heuristic is the causal-graph heuristic, which uses the MPT and the causal graph to estimate goal distances by solving subproblems induced by individual variables (in a hierarchical manner). The planner also supports multi‐heuristic best‐first search (combining the causal‐graph heuristic with the FF heuristic) and a non-heuristic method called focused iterative‐broadening search. Preferred operators (analogous to helpful actions) and deferred heuristic evaluation further enhance search efficiency.
\end{itemize}
\bigskip
The planner is open‐source, publicly available via its website~\cite{fast-downward-website}. The typical workflow includes modeling the domain in PDDL, running Fast Downward to translate and compile, then searching for a plan. Many derivative planners and portfolios build on its infrastructure. For academic and practical evaluation tasks, the strong support for the MPT translation and causal heuristic make it particularly useful when one can afford the preprocessing overhead.
\subsection{ROS2}
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/ros2_architecture.png}
\caption{ROS2 conceptual architecture, illustrating nodes, topics, services, and actions.}
\label{fig:ros2_architecture}
\end{figure}
ROS2 (Robot Operating System 2) is an open-source middleware framework designed to support the development of distributed, real-time robotic applications. Building upon the foundation of ROS1, ROS2 introduces improved communication paradigms, enhanced reliability, real-time capability, and cross-platform support, making it more suitable for both research and industrial deployment.
At its core, ROS2 is structured around nodes, which are independent processes that communicate through a decentralized publish/subscribe architecture. Nodes exchange messages over topics, invoke synchronous operations through services, and trigger asynchronous long-running tasks via actions. This modular design ensures scalability and flexibility for complex robotic systems. The underlying communication is handled through the Data Distribution Service (DDS), providing discovery, quality-of-service (QoS) policies, and secure data exchange between distributed components.
The ROS2 execution model is further supported by concepts such as parameters for runtime configuration, launch files for orchestrating multiple nodes, and lifecycle management for robust system control. Additionally, the framework introduces tools such as \texttt{ros2cli} for command-line interaction, \texttt{rclcpp}/\texttt{rclpy} for C++ and Python client libraries, and \texttt{rviz2} for visualization. Together, these components form an ecosystem that simplifies robotic software integration while maintaining performance and portability across platforms.
ROS2 represents a shift from experimental middleware to a production-ready robotics framework. Its modularity, DDS-based communication, and lifecycle-aware design provide strong technical backing for robotics research and application development, though challenges such as system complexity and tuning persist~\cite{ros2-doc}.
\subsection{MoveIt2}
MoveIt2 is a motion planning framework for ROS2 that integrates kinematics, collision checking, planning, and execution into a unified pipeline. It enables robots to interpret high-level goals, such as "pick the object and place it in the bin", and compute collision-free, dynamically feasible trajectories to accomplish them.
At its core, MoveIt2 is structured around modular components, as shown in Figure~\ref{fig:moveit_pipeline}. The Kinematics module translates between joint states and end-effector poses, while the Planning Scene Monitor maintains an up-to-date model of the environment. The Motion Planning system leverages planners such as OMPL or TrajOpt to generate valid paths, which are refined in the Trajectory Processing stage into smooth, executable trajectories.
The central move\_group node provides a user-facing API for motion planning and execution, while advanced features like Hybrid Planning combine long-horizon strategies with local reactive control. The MoveIt Task Constructor further extends this capability by composing multi-step task sequences, making it suitable for applications such as bin picking and assembly.
MoveIt2 shifts motion planning from isolated libraries toward a flexible, extensible middleware. Its modular design promotes reusability and adaptability across robotic domains, though challenges remain in accurate modeling, computational cost, and system tuning~\cite{moveit-doc},\cite{moveit2}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/moveit_pipeline.png}
\caption{MoveIt2 motion planning pipeline.}
\label{fig:moveit_pipeline}
\end{figure}
\subsection{Gazebo}
Gazebo is an open-source robotics simulation platform that enables realistic, high-fidelity simulation of robots, sensors, and environments. It allows developers to test algorithms, design robots, and train AI models in a virtual setting before deploying on hardware.
Gazebo uses a modular architecture, separating physics, rendering, sensor simulation, and communication. The server handles physics and sensor updates, while the client provides visualization and user interaction. Robots and environments are described using the Simulation Description Format (SDF), and plugins extend functionality for custom behaviors and controllers.
Integrated with ROS, Gazebo allows seamless simulation-to-hardware transitions. Its strengths include realistic physics, sensor modeling, and extensibility through plugins, making it suitable for research, education, and development of both single and multi-robot systems~\cite{gazebo}
\subsection{Artificial Intelligence in Image Processing}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{images/CNN.png}
\caption{Convolutional Neural Network (CNN) Architecture for Image Classification}
\label{fig:cnn}
\end{figure}
Artificial Intelligence (AI) has significantly transformed image processing, introducing cutting-edge methods and applications that streamline processes for enhanced speed and accuracy. Image processing itself involves understanding digital images as pixel-based visual information, along with various formats (e.g., JPEG, PNG), enhancement techniques (such as adjusting brightness or reducing noise), and filtering/restoration procedures. Within this framework, AI, especially through deep learning architectures like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), enables systems to extract crucial features, recognize objects, segment images, and even generate realistic visual content. This capability allows robots to interpret visual information similarly to humans, finding practical uses in areas like autonomous vehicles for navigation and surveillance systems for security~\cite{ai-in-image-processing}.
The capabilities of AI image processing are broad and impactful, including accuracy enhancements, efficiency improvements, and the capacity to manage large datasets. AI has demonstrated significant success in improving diagnostic accuracy and early disease detection in healthcare, optimizing industrial packing strategies for greater efficiency and reduced waste, and enabling real-time applications and deployment on devices with limited resources through crucial optimization techniques like model compression. Advanced functions such as object detection, image segmentation, and content-based image retrieval are also facilitated by these methods. Despite these advantages, the application of AI in image processing faces notable limitations and ethical challenges. These include critical concerns regarding data privacy, interpretability of AI decisions, and the potential for biases embedded in training data to produce prejudiced results. Therefore, ensuring responsible, transparent AI systems that respect cultural diversity and address potential impacts on employment are vital for the technology's continued development~\cite{ai-in-image-processing}.
\subsection{Universal Robots Arm (UR ARM)}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\linewidth]{images/UR_Arm_pic.png}
\caption{the Move Screen interface of a Universal Robots (UR) robotic arm's ~\cite{urarm}}
\label{fig:urarm}
\end{figure}
The Universal Robots (UR) robotic arm is a widely adopted collaborative robotic manipulator (cobot) designed for flexible deployment in industrial, engineering, and research settings. Unlike traditional industrial robots, UR arms are specifically engineered to operate safely alongside humans without requiring extensive physical barriers, thanks to their integrated safety features, lightweight design, and force-limiting mechanisms. These properties make UR arms particularly suitable for human-robot collaboration in dynamic environments such as manufacturing, assembly, and tool-handling applications.
From a technical perspective, UR arms employ high-precision servo motors and torque sensors at each joint, enabling six degrees of freedom for dexterous manipulation tasks. Their kinematic flexibility allows execution of both structured and unstructured manipulation tasks, such as grasping, tool usage, and assembly operations. Additionally, UR arms are equipped with programmable interfaces, including graphical user interfaces (GUI), teach pendants, and increasingly, APIs compatible with external AI systems~\cite{urarm}. This enables integration with higher-level control architectures, such as neuro-symbolic AI (NSAI) or large language models (LLMs), for advanced planning, reasoning, and adaptive execution.
\subsection{URScript}
URScript is the native scripting language developed by Universal Robots for programming and controlling their collaborative robotic arms. It provides a textual programming interface that allows users to control robot motion, interact with sensors and I/O devices, and integrate the robot with external software systems. In the Universal Robots ecosystem, programs created through the graphical programming interface \textit{PolyScope} are internally translated into URScript commands, which are then executed by the robot controller~\cite{urscript-docs}. Consequently, URScript represents the fundamental low-level interface used by the controller to perform robot operations.
\subsubsection*{Overall architecture}
Universal Robots systems can be controlled at two main levels: the graphical user interface level through PolyScope, and the script level through URScript~\cite{urscript-manual}. At the script level, programs are written as text-based instructions that are interpreted by the robot controller (URControl). URControl runs as a service on the robot's control computer and executes commands sent either from PolyScope or from an external client over a TCP/IP connection.
URScript programs can therefore be executed in multiple ways: