<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="NExT-GPT: Any-to-Any Multimodal Large Language Model">
<meta name="keywords" content="text-to-image generation, Large Language Models, scene synthesis">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>NExT-GPT</title>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-PYVRSFMDRL');
</script>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="icon" href="./static/images/logo.png">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="stylesheet" href="./static/css/index-gradio.css">
<link rel="stylesheet" href="./static/css/live_theme.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<!-- <script src="./static/js/index.js"></script> -->
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title"
style="display: flex;flex-direction: row;align-items: center;justify-content: center;margin-bottom: 5px;"><img
src="./static/images/logo.png" width="60" height="60" style="margin-right: 10px;">NExT-GPT:</h1>
<h1 class="title is-2 publication-title">Any-to-Any Multimodal LLM</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://chocowu.github.io/">Shengqiong Wu</a>,</span>
<span class="author-block">
<a href="https://haofei.vip/">Hao Fei</a><sup>*</sup>,
</span>
<span class="author-block">
<a href="#">Leigang Qu</a>,</span>
<span class="author-block">
<a href="https://jiwei0523.github.io/">Wei Ji</a>,</span>
<span class="author-block">
<a href="https://www.chuatatseng.com/">Tat-Seng Chua</a>
</span>
</div>
<div class="is-size-5 publication-authors" style="margin-top: 10px;">
<span class="author-block"><a href="https://www.nextcenter.org/">NExT++ Research Center</a>, National University of Singapore</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block" style="font-size: 15px;">ICML 2024, Oral (<sup>*</sup>Correspondence)</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2309.05519.pdf"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://acc414b22d6839d28f.gradio.live" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa fa-laugh"></i>
</span>
<span>Demo</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/NExT-GPT/NExT-GPT"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://github.com/NExT-GPT/NExT-GPT" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa fa-database"></i>
</span>
<span>Dataset</span>
</a>
</span>
<!-- Video Link. -->
<span class="link-block">
<a href="https://www.youtube.com/watch?v=aqw2SCWeWD0"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-youtube"></i>
</span>
<span>Video</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Paper video. -->
<div class="columns is-centered has-text-centered">
<div class="column is-full-width">
<h2 class="title is-2">Video Presentation</h2>
<div class="publication-video">
<iframe src="https://www.youtube.com/embed/aqw2SCWeWD0?si=jIKgOkew2rmRROUy"
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</div>
</div>
</div>
<!--/ Paper video. -->
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-2">Abstract</h2>
<div class="content has-text-justified">
<p>
Although Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, most are limited to input-side
multimodal understanding and cannot produce content in multiple modalities.
As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs
capable of accepting and delivering content in any modality becomes essential to human-level AI.
To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, <b>NExT-GPT</b>.
We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate
outputs in arbitrary combinations of text, images, videos, and audio.
By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned by updating only a small fraction of
parameters (1%), those of certain projection layers, which not only keeps training low-cost but also makes it convenient to
expand to more modalities.
Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT,
based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation.
Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities,
paving the way for more human-like AI research in the community.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
<br>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<h2 class="title is-2">Technical Description</h2>
<br>
</div>
<!-- Architecture -->
<div class="columns is-centered">
<div class="column is-full-width">
<h4 class="title is-3">• Architecture</h4>
<div class="content has-text-justified">
<img class="columns is-centered has-text-centered" src="./static/images/framework.png" alt="Teaser" width="95%"
style="margin:0 auto">
<br>
<figcaption>
<p style="text-align: center; color: #061E61;">
<b>Figure 1:</b> By connecting an LLM with multimodal adaptors and diffusion decoders, NExT-GPT achieves universal
multimodal understanding and any-to-any modality input and output.
</p>
</figcaption>
<br>
<p>
<ul>
<li>
<b>Multimodal Encoding Stage.</b> We leverage existing well-established models to encode inputs of various modalities.
Here we use ImageBind, a unified high-performance encoder across six modalities.
Then, via a linear projection layer, the different input representations are mapped into language-like representations that
the LLM can comprehend.
</li>
<li>
<b>LLM Understanding and Reasoning Stage.</b>
An LLM serves as the core agent of NExT-GPT.
Technically, we employ Vicuna.
The LLM takes as input the representations from different modalities and carries out semantic understanding and reasoning over
them.
It outputs 1) textual responses directly, and 2) signal tokens for each modality, which instruct the decoding layers
whether to generate multimodal content and, if so, what content to produce.
</li>
<li>
<b>Multimodal Generation Stage.</b>
On receiving multimodal signals with specific instructions from the LLM (if any), the Transformer-based output projection
layers map the signal token representations into ones that the subsequent multimodal decoders can understand.
Technically, we employ off-the-shelf latent conditioned diffusion models for the different modalities, i.e.,
Stable Diffusion (SD) for image synthesis, Zeroscope for video synthesis, and AudioLDM for audio synthesis.
</li>
</ul>
</p>
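<p>
The encoding-side mapping described above can be illustrated with a minimal sketch of a linear projection, written in plain
Python for readability. The dimensions, the random initialisation, and the names <code>make_linear_projection</code> and
<code>project</code> are assumptions made for illustration only; they are not taken from the NExT-GPT code base, and the
input vector is only a stand-in for a real ImageBind embedding.
</p>

```python
import random


def make_linear_projection(in_dim, out_dim, seed=0):
    """Return (weights, bias) for a linear layer mapping in_dim -> out_dim."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(feature, weights, bias):
    """Apply y = W x + b to a single encoder feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


# Hypothetical toy sizes; real encoder and LLM dimensions are far larger.
ENCODER_DIM, LLM_DIM = 8, 16
weights, bias = make_linear_projection(ENCODER_DIM, LLM_DIM)

image_feature = [0.5] * ENCODER_DIM  # stand-in for a modality embedding
llm_input = project(image_feature, weights, bias)
print(len(llm_input))  # prints 16
```

<p>
In this sketch only <code>weights</code> and <code>bias</code> would be trainable, which mirrors the idea that the encoder
and the LLM stay frozen while a small projection layer is tuned.
</p>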
<br>
<img class="columns is-centered has-text-centered" src="./static/images/config.png" alt="Teaser" width="85%"
style="margin:0 auto">
</div>
<br/>
</div>
</div>
<!-- Inference -->
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">• System Inference</h2>
<div class="content has-text-justified">
<p>
In Figure 2 we further illustrate the inference procedure of NExT-GPT.
Given user inputs in any combination of modalities, the corresponding modal encoders and projectors transform them
into feature representations, which are passed to the LLM (except text inputs, which are fed into the LLM directly).
Then, the LLM decides what content to generate, i.e., textual tokens and modality signal tokens.
If the LLM determines that content in a certain modality (other than language) should be produced, it outputs a special type of
token indicating the activation of that modality; the absence of such a token means that modality stays deactivated.
Technically, we design '&lt;IMG<sub>i</sub>&gt;' (i=0,...,4) as image signal tokens;
'&lt;AUD<sub>i</sub>&gt;' (i=0,...,8) as audio signal tokens; and
'&lt;VID<sub>i</sub>&gt;' (i=0,...,24) as video signal tokens.
After the LLM, the text responses are output to the user, while the representations of the signal tokens of the activated
modalities are passed to the corresponding diffusion decoders for content generation.
</p>
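<p>
The routing decision described above can be sketched as follows. The token index ranges come from the text; everything else
is an assumption for illustration, including the helper names, the concrete string form of the tokens, and the rule that a
modality counts as activated as soon as any one of its signal tokens appears in the output.
</p>

```python
import re

# Index ranges taken from the text: IMG 0..4, AUD 0..8, VID 0..24.
SIGNAL_TOKEN_COUNTS = {"IMG": 5, "AUD": 9, "VID": 25}

SIGNAL_PATTERN = re.compile(r"<(IMG|AUD|VID)(\d+)>")


def signal_token(prefix, index):
    """Format one signal token from a modality prefix and an index."""
    return "<{}{}>".format(prefix, index)


def build_signal_vocab():
    """Map each modality prefix to its full list of signal tokens."""
    return {prefix: [signal_token(prefix, i) for i in range(count)]
            for prefix, count in SIGNAL_TOKEN_COUNTS.items()}


def split_response(llm_output):
    """Separate the user-facing text from the set of activated modalities.

    Tokens whose index is out of range do not activate a modality,
    but they are still stripped from the returned text.
    """
    activated = set()
    for match in SIGNAL_PATTERN.finditer(llm_output):
        prefix, index = match.group(1), int(match.group(2))
        if index < SIGNAL_TOKEN_COUNTS[prefix]:
            activated.add(prefix)
    text = " ".join(SIGNAL_PATTERN.sub("", llm_output).split())
    return text, activated


# Hypothetical LLM output: a sentence followed by all audio signal tokens.
output = "Here is a soft melody. " + "".join(build_signal_vocab()["AUD"])
text, activated = split_response(output)
print(text, sorted(activated))
```

<p>
In a full system the hidden representations of the matched tokens, rather than the token strings, would be handed to the
decoder of each activated modality; the sketch only shows the activation logic.
</p>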
<br>
<img class="columns is-centered has-text-centered" src="./static/images/inference.png" alt="Teaser" width="100%"
style="margin:0 auto">
<br>
<figcaption>
<p style="text-align: center; color: #061E61;">
<b>Figure 2:</b> NExT-GPT inference process. Grey denotes deactivated modules.
</p>
</figcaption>
</div>
<br/>
</div>
</div>
<!-- Alignment -->
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">• Lightweight Multimodal Alignment Learning</h2>
<div class="content has-text-justified">
<p>
We design the system as three loosely coupled tiers, so we only need to update the two projection layers, one on the
encoding side and one on the decoding side.
<ul>
<li>
<b>Encoding-side LLM-centric Multimodal Alignment.</b>
We align the different input multimodal features with the text feature space, i.e., the representations that the core LLM
can understand.
</li>
<li>
<b>Decoding-side Instruction-following Alignment.</b>
We minimize the distance between the LLM's modal signal token representations (after each Transformer-based projection layer)
and the conditional text representations of the diffusion models.
Since only the textual condition encoders are used (with the diffusion backbone frozen), the learning relies purely on
caption texts, i.e., without any visual or audio inputs.
</li>
</ul>
</p>
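<p>
The decoding-side objective can be sketched as a distance between two vectors. The text only says that a distance is
minimized, so the choice of mean squared error here is an assumption, as are the function names and the toy vectors. For
brevity the sketch updates the projected vector directly, as a stand-in for updating the projection layer's parameters, while
the caption representation stays fixed like the output of a frozen condition encoder.
</p>

```python
def mean_squared_distance(signal_repr, caption_repr):
    """Mean squared error between two equally sized vectors."""
    if len(signal_repr) != len(caption_repr):
        raise ValueError("representations must have the same length")
    return sum((s - c) ** 2
               for s, c in zip(signal_repr, caption_repr)) / len(signal_repr)


def gradient_step(signal_repr, caption_repr, lr=0.1):
    """One gradient-descent step on the signal side only.

    The gradient of the mean squared error with respect to s_i
    is 2 * (s_i - c_i) / n; the caption side is never modified.
    """
    n = len(signal_repr)
    return [s - lr * 2.0 * (s - c) / n
            for s, c in zip(signal_repr, caption_repr)]


caption_repr = [0.2, -0.4, 0.9, 0.0]  # hypothetical frozen text-encoder output
signal_repr = [0.0, 0.0, 0.0, 0.0]    # hypothetical projected signal tokens

initial_distance = mean_squared_distance(signal_repr, caption_repr)
for _ in range(50):
    signal_repr = gradient_step(signal_repr, caption_repr)
final_distance = mean_squared_distance(signal_repr, caption_repr)
print(initial_distance, final_distance)
```

<p>
Because both inputs are derived from text (signal tokens and captions), a loop like this needs no image, video, or audio
data, which is the point made in the list above.
</p>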
<img class="columns is-centered has-text-centered" src="./static/images/alignment.png" alt="Teaser" width="100%"
style="margin:0 auto">
<br>
<figcaption>
<p style="text-align: center; color: #061E61;">
<b>Figure 3:</b> Illustration of the lightweight multimodal alignment learning of encoding and decoding.
</p>
</figcaption>
</div>
<br/>
</div>
</div>
<!-- Instruction -->
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">• Modality-switching Instruction Tuning (MosIT)</h2>
<div class="content has-text-justified">
<p>
Further instruction tuning (IT) is necessary to enhance the capabilities and controllability of the LLM.
To facilitate the development of any-to-any MM-LLMs, we propose a novel Modality-switching Instruction Tuning (<b>MosIT</b>).
As illustrated in Figure 4, when an IT dialogue sample is fed into the system, the LLM reconstructs and generates the textual
content of the input (and represents the multimodal content with the multimodal signal tokens).
The optimization is driven by the gold annotations and the LLM's outputs.
In addition to tuning the LLM, we also fine-tune the decoding end of NExT-GPT.
We align the modal signal token representation encoded by the output projection with the gold multimodal caption
representation encoded by the diffusion condition encoder.
Thereby, the comprehensive tuning process brings the system closer to the goal of faithful and effective interaction with users.
</p>
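<p>
The two training signals described above can be combined in a small sketch: a token-level loss against the gold annotation
and an alignment loss between the projected signal representation and the gold caption representation. The use of
cross-entropy and mean squared error, the weighted sum, the <code>align_weight</code> parameter, and all names and toy values
are assumptions for illustration. The key <code>"IMG0"</code> is a placeholder standing for the image signal token with i=0.
</p>

```python
import math


def token_cross_entropy(predicted_probs, gold_tokens):
    """Average negative log-likelihood of the gold token at each position."""
    total = 0.0
    for probs, gold in zip(predicted_probs, gold_tokens):
        total -= math.log(probs[gold])
    return total / len(gold_tokens)


def alignment_loss(signal_repr, caption_repr):
    """Mean squared error between signal and gold caption representations."""
    return sum((s - c) ** 2
               for s, c in zip(signal_repr, caption_repr)) / len(signal_repr)


def mosit_loss(predicted_probs, gold_tokens, signal_repr, caption_repr,
               align_weight=1.0):
    """Hypothetical combined objective: text loss plus weighted alignment loss."""
    text_loss = token_cross_entropy(predicted_probs, gold_tokens)
    return text_loss + align_weight * alignment_loss(signal_repr, caption_repr)


# Toy example: two output positions, each with a distribution over tokens.
predicted = [{"Sure": 0.5, "No": 0.5}, {"IMG0": 0.25, "Sure": 0.75}]
gold = ["Sure", "IMG0"]
loss = mosit_loss(predicted, gold, [0.0, 0.0], [1.0, 1.0])
print(round(loss, 4))
```

<p>
Treating the signal token as just another gold token in the sequence is what lets the same text loss teach the LLM both what
to say and when to switch modality.
</p>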
<br>
<img class="columns is-centered has-text-centered" src="./static/images/instruction_tuning.png" alt="Teaser" width="100%"
style="margin:0 auto">
<br>
<figcaption>
<p style="text-align: center; color: #061E61;">
<b>Figure 4:</b> Illustration of modality-switching instruction tuning.
</p>
</figcaption>
</div>
<br/>
<h4 class="title is-4">MosIT Data</h4>
<p>
Existing IT datasets fail to meet the requirements of our any-to-any MM-LLM scenario.
We thus construct <b>MosIT</b>, a high-quality dataset.
The data encompasses a wide range of multimodal inputs and outputs, offering the necessary complexity and variability to
facilitate the training of MM-LLMs that can handle diverse user interactions and deliver desired responses accurately.
</p>
<img class="columns is-centered has-text-centered" src="./static/images/MosIT.png" alt="Teaser" width="100%" style="margin:0 auto">
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<h2 class="title is-2">Demonstrations</h2>
<br>
</div>
<div class="columns is-centered">
<div class="column is-full-width">
<!-- <h2 class="title is-3">Generated Images</h2>-->
<br/>
<!-- Example-1-->
<h3 class="title is-4">• Example-1: Text → Text+Image+Audio</h3>
<div class="content has-text-justified">
<div class="wrapper svelte-nab2ao"
style="box-sizing: border-box;border-width: 2px;border-style: solid;border-color: #f66c0f;border-radius: 11px;padding: 25px;padding-top: 30px;padding-bottom: 30px;">
<div class="message-wrap svelte-1uvwjgr">
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>I've recently started practicing mindfulness and meditation as part of my daily routine, but sometimes it's challenging.
Can you help me with some techniques?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Of course! Mindfulness and meditation can be a wonderful way to cultivate a sense of calm and focus.
One technique you might find helpful is called "mindful breathing."
It involves taking slow, deep breaths and focusing your attention on the sensation of the breath entering and leaving your body.
Another technique is "body scan meditation," where you consciously scan your entire body from head to toe, paying attention to any sensations or discomfort.
You can also try guided meditations that provide a structured practice.
Would you like me to find a video that demonstrates mindful breathing or body scan meditation?</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user"></div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>I wish to listen to a music for the meditation first.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>I guess you may want some musics that sound soft.</p>
<p><audio playsinline="" controls="">
<source type="audio/wav" src="./static/images/demo1/qplfj20g.wav"></audio>
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot"><p>Base on the audio content, visualize it in an image.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo1/3x8lhkc4.jpg">
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Can you infer what time is in the picture?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Based on the image, it is likely that the time is either early morning or late afternoon.
The sunlight is either low or setting, creating a serene atmosphere.</p>
</span>
</div>
</div>
</div>
</div>
</div>
<!-- End Example-1-->
<br/>
<!-- Example-2-->
<h3 class="title is-4">• Example-2: Text+Image → Text+Image+Audio+Video</h3>
<div class="content has-text-justified">
<div class="wrapper svelte-nab2ao"
style="box-sizing: border-box;border-width: 2px;border-style: solid;border-color: #f66c0f;border-radius: 11px;padding: 25px;padding-top: 30px;padding-bottom: 30px;">
<div class="message-wrap svelte-1uvwjgr">
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Show me another image in similar contents.
<br/><br/>
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo2/s8fj4t4g.png" >
</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo2/wpx3ziw1.jpg">
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user"></div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Is there any suitable place to see this scene?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Yes, there is a suitable place to see this scene.
You can see the beautiful pink cherry blossoms in full bloom on a tree next to a river.
The image is taken from a distance, allowing you to appreciate the vibrant colors and the serene atmosphere.
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot"><p>Use a sound to describe the beautiful scene.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Sure!</p>
<audio playsinline="" controls=""><source type="audio/wav" src="./static/images/demo2/ntzx1pe_.wav"></audio>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>It would be great to see a video that show the stunning scene.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Certainly! Here is the video scene.</p>
<video src="./static/images/demo2/j3uw7df1.mp4" style="display: inline-block;" width="500" playsinline="" controls=""></video>
</span>
</div>
</div>
</div>
</div>
</div>
<!-- End Example-2-->
<br/>
<!-- Example-3-->
<h3 class="title is-4">• Example-3: Text+Video → Text+Image</h3>
<div class="content has-text-justified">
<div class="wrapper svelte-nab2ao"
style="box-sizing: border-box;border-width: 2px;border-style: solid;border-color: #f66c0f;border-radius: 11px;padding: 25px;padding-top: 30px;padding-bottom: 30px;">
<div class="message-wrap svelte-1uvwjgr" style="width: 100%;">
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Is the guy in the video iron man?
<br/><br/>
<video src="./static/images/demo3/sf9asd3.mp4" style="display: inline-block;" width="500" playsinline="" controls=""></video>
</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Yes, the guy in the video is iron man.
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user"></div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Could you show me a comic-style picture of iron man?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Certainly! Here is a comic-style picture of Iron Man: <br/><br/>
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo3/42taa5y_.jpg">
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot"><p>I need a more abstract one.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Sure, here's an abstract image of Iron Man:</p>
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo3/390hnpkm.jpg">
</span>
</div>
</div>
</div>
</div>
</div>
<!-- End Example-3-->
<br/>
<!-- Example-4-->
<h3 class="title is-4">• Example-4: Text+Video → Text+Audio</h3>
<div class="content has-text-justified">
<div class="wrapper svelte-nab2ao"
style="box-sizing: border-box;border-width: 2px;border-style: solid;border-color: #f66c0f;border-radius: 11px;padding: 25px;padding-top: 30px;padding-bottom: 30px;">
<div class="message-wrap svelte-1uvwjgr">
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>What is the man doing in the video?
<br/><br/>
<video src="./static/images/demo4/birthday.mp4" style="display: inline-block;" width="500" playsinline="" controls=""></video>
</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>
The man in the video is celebrating his birthday by lighting a candle on a cake and making a wish.
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user"></div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>What is the man wearing?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>The man is wearing a white shirt.
</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot"><p>How about synthesizing a song to celebrate his birthday please?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>It would be a great idea to synthesize a song to celebrate the man's birthday. Here's a sample of a song that could be used:</p>
<ul>
<li>
Verse 1: Happy birthday, dear friend, You're shining bright, With a smile so wide. You're the light in our lives
</li>
<li>
Chorus: Celebrate, celebrate, With laughter and joy, With love and happiness, Bringing happiness to those around
</li>
<li>
Verse 2: You're the one who brings us together, With your warmth and your grace, You're the one who makes us feel, A part of this special place.
</li>
<li>
Chorus: Celebrate, celebrate, With laughter and joy, With love and happiness, Bringing happiness to those around
</li>
<li>
Bridge: So let's raise a glass, To a life filled with love and grace. With laughter and joy, May your birthday be filled with peace
</li>
<li>
Chorus: Celebr
</li>
</ul>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>I think you misunderstood me. I mean, play the song as audio.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Sorry for the mistake! Here's an audio of the song:</p>
<audio playsinline="" controls=""><source type="audio/wav" src="./static/images/demo4/7fcxy3yi.wav"></audio>
</span>
</div>
</div>
</div>
</div>
</div>
<!-- End Example-4-->
<br/>
<!-- Example-5-->
<h3 class="title is-4">• Example-5: Text+Audio → Text+Image+Video</h3>
<div class="content has-text-justified">
<div class="wrapper svelte-nab2ao"
style="box-sizing: border-box;border-width: 2px;border-style: solid;border-color: #f66c0f;border-radius: 11px;padding: 25px;padding-top: 30px;padding-bottom: 30px;">
<div class="message-wrap svelte-1uvwjgr">
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Describe what can be heard from the sound:</p>
<audio playsinline="" controls=""> <source type="audio/wav" src="./static/images/demo5/jsda8mk3.wav"></audio>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>The sound of a growling lion is a deep, rumbling noise that starts low in the chest and gradually builds in intensity.
It is often associated with a sense of power and strength, as well as a sense of danger or warning.
The growl of a lion can be heard from a distance and is often used by the animal to communicate with other members of its pride.</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user"></div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Displaying an image to help visualize it would be fantastic.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Sure, here is an image of a lion growling:
</p><br>
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo5/mg523p_t.jpg">
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot"><p>That looks great! How about showing a video that explains how lions catch their prey?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<video src="./static/images/demo5/cwrt5vco.mp4" style="display: inline-block;" width="500" playsinline="" controls=""></video>
</span>
</div>
</div>
</div>
</div>
</div>
<!-- End Example-5 -->
<br/>
<!-- Example-6 -->
<h3 class="title is-4">• Example-6: Text → Text+Image+Video+Audio</h3>
<div class="content has-text-justified">
<div class="wrapper svelte-nab2ao"
style="box-sizing: border-box;border-width: 2px;border-style: solid;border-color: #f66c0f;border-radius: 11px;padding: 25px;padding-top: 30px;padding-bottom: 30px;">
<div class="message-wrap svelte-1uvwjgr">
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>I've recently taken up painting as a hobby, and I'm looking for some inspiration. Can you help me find some interesting ideas?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Of course! Painting is a wonderful creative outlet.
What kind of subjects are you interested in exploring? Landscapes, portraits, abstract art, or something else?</p>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user"></div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Abstract art</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Abstract art sounds great. It allows you to express your emotions and ideas without any specific subject matter.
You can experiment with different techniques like brushstrokes, color mixing, and layering. Here's an image of an abstract painting that might inspire you:
</p><br>
<img style="display: inline-block;width: 250px;max-height: 400px;" src="./static/images/demo6/lkjw574_.jpg">
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot"><p>And also water painting. Maybe I could start by watching some teaching videos.</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr latest" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>That's a good idea. Please check out this tutorial video:</p>
<video src="./static/images/demo6/ks2oqs8s.mp4" style="display: inline-block;" width="500" playsinline="" controls=""></video>
</span>
</div>
</div>
<div class="message-row user-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/user.png" alt="avatar-user">
</div>
<div data-testid="user" class="message user svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Fantastic! Can you describe the video?</p>
</span>
</div>
</div>
<div class="message-row bot-row svelte-1uvwjgr">
<div class="avatar-container svelte-1uvwjgr">
<img class="avatar-image svelte-1uvwjgr" src="./static/images/bot.png" alt="avatar-bot"></div>
<div data-testid="bot" class="message bot svelte-1uvwjgr" dir="ltr" style="border-style: solid;">
<span class="md svelte-9tftx4 chatbot">
<p>Of course! The video is a tutorial on watercolor painting.
It covers the basics of mixing colors, creating textures and layering techniques.