Commit 0d99ece

Merge pull request #35 from h-2/docdoc
[doc] lots of documentation
2 parents e39f274 + 651c4af commit 0d99ece

17 files changed

+449
-60
lines changed

doc/main_page.md

Lines changed: 6 additions & 3 deletions
@@ -1,12 +1,15 @@
11
# Welcome {#mainpage}
22

33
Welcome to the documentation of the B.I.O. library.
4-
This web-site contains the API reference (documentation of our interfaces) and more elaborate Tutorials and
5-
How-Tos.
4+
This web-site contains the API reference (documentation of our interfaces) and some small Tutorials and HowTos.
5+
6+
B.I.O. makes use of SeqAn3, and it is recommended to have a look at [their documentation](https://docs.seqan.de) first.
67

78

89
## Overview
910

11+
This section contains a very short overview of the most important parts of the library.
12+
1013

1114
### General IO Utilities
1215

@@ -20,7 +23,7 @@ The transparent streams can be used in place of the standard library streams. Th
2023
compressions such as GZip, BZip2 and BGZip.
2124

2225

23-
### Readers and Writers
26+
### Record-based I/O
2427

2528

2629
| Reader | Writer | Description |

doc/record_based/1_introduction.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
# Introduction {#record_based_intro}
2+
3+
Most files in bioinformatics are comprised of *records*, i.e. multiple, independent entries that each consist of one or
4+
more *fields*.
5+
For example, a FastA file contains one or more sequence records that each contain an ID field and sequence field.
6+
7+
[TOC]
8+
9+
```
10+
>myseq1
11+
ACGT
12+
13+
>myseq2
14+
GAGGA
15+
16+
>myseq3
17+
ACTA
18+
```
19+
20+
<center>
21+
↓↓↓↓↓↓↓
22+
</center>
23+
24+
25+
| ID field | sequence field |
26+
|:----------:|:--------------:|
27+
| "myseq1" | "ACGT" |
28+
| "myseq2" | "GAGGA" |
29+
| "myseq3" | "ACTA" |
30+
31+
Each line in this table is conceptually "a record", and each file is modeled as a series of these records.
32+
The process of "reading a file" is transforming the on-disk representation (the FastA text at the top) into the abstraction shown in the table.
33+
The process of "writing a file" is the reverse.
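
To make the transformation concrete, here is a minimal sketch that uses only the C++ standard library. It is not how B.I.O. reads files; `toy_record` and `parse_fasta` are hypothetical names introduced for illustration, and the sketch assumes well-formed input in which every sequence line follows a `>` header line.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record with an ID field and a sequence field.
using toy_record = std::pair<std::string, std::string>;

// Turns FastA text into a series of (ID, sequence) records.
std::vector<toy_record> parse_fasta(std::string const & text)
{
    std::vector<toy_record> records;
    std::istringstream      in{text};
    std::string             line;

    while (std::getline(in, line))
    {
        if (line.empty())
            continue; // blank lines separate records but carry no data

        if (line.front() == '>')
        {
            // A header line starts a new record; the ID is everything after '>'.
            records.emplace_back(line.substr(1), std::string{});
        }
        else
        {
            // Assumption: sequence data never appears before the first header.
            assert(!records.empty());
            records.back().second += line;
        }
    }
    return records;
}
```

Calling `parse_fasta()` on the FastA text above would yield three records whose fields match the rows of the table.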
34+
35+
Details on how records are defined are available here: \ref record_faq
36+
37+
## Readers
38+
39+
So-called *readers* are responsible for detecting the format and decoding a file into a series of records:
40+
41+
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file
42+
43+
The reader is an *input range* which is C++ terminology for "something that you can iterate over (once)".
44+
The last bit is important: it implies that once you reach the end, the reader will be "empty". To iterate over it again, you need to recreate it.
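
The single-pass behaviour is not specific to B.I.O.; standard input streams behave the same way. The following sketch shows the effect with `std::istream_iterator` on a string stream instead of a reader. `count_tokens_twice` is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>

// Iterates over the same stream twice and returns how many
// whitespace-separated tokens each pass saw.
std::pair<std::size_t, std::size_t> count_tokens_twice(std::string const & text)
{
    std::istringstream stream{text};

    auto count = [&stream]
    {
        std::size_t n = 0;
        for (std::istream_iterator<std::string> it{stream}, end{}; it != end; ++it)
            ++n;
        return n;
    };

    std::size_t const first = count();
    assert(stream.eof()); // the first pass consumed the stream completely
    std::size_t const second = count(); // nothing is left for a second pass

    return {first, second};
}
```

The second pass sees no tokens because the underlying stream was consumed by the first one. To iterate again, the stream (or, by analogy, the reader) has to be created anew.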
45+
46+
<!-- Details on how readers are defined is available here: \ref reader_writer_faq -->
47+
48+
## Writers
49+
50+
TODO

doc/record_based/2_record_faq.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
1+
# Record FAQ {#record_faq}
2+
3+
Records in B.I.O. are implemented as a specialisation of the bio::record template.¹
4+
This behaves very similarly to a std::tuple, with the difference that a bio::field identifier is associated with every
5+
element and a corresponding member function is provided, so you can easily access the elements without knowing the order.
6+
7+
<small>¹ With the exception of bio::plain_io which uses bio::plain_io::record.</small>
8+
9+
[TOC]
10+
11+
\note This page contains details on how records are defined. It is meant to provide a better understanding of the design and performance implications. We recommend starting with the snippets shown in the API (e.g. bio::seq_io::reader, bio::var_io::reader, …) and only return to this page if you have questions or want to fine-tune things.
12+
13+
## What is the full type of my record? {#record_type}
14+
15+
Most records you interact with are produced by readers.
16+
17+
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file
18+
19+
In this example, `rec` is the record and with each iteration of the loop, a new record is generated from the file. The exact type of the record depends on the reader. In the above example, it is:
20+
21+
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file_type
22+
23+
That is quite long and difficult to remember (even though definitions of X* and Y* are omitted here),
24+
so we write `auto &` instead.
25+
But it is important to know which fields are contained in the record (in this case ID, SEQ and QUAL).
26+
The documentation for the reader will tell you this, e.g. bio::seq_io::reader.
27+
28+
## How can I access the fields?
29+
30+
The easiest way to access a field is by calling the respective member function:
31+
32+
\snippet test/snippet/seq_io/seq_io_reader.cpp simple_usage_file
33+
34+
Here, `.id()` (bio::record#id()) and `.seq()` (bio::record#seq()) are used to access the fields. Note that the
35+
documentation has entries for all field-accessor member functions, but it depends on the specific specialisation
36+
(used by the reader) whether that function is available.
37+
So, on the record defined by bio::seq_io::reader above, the members `.id()`, `.seq()`, `.qual()` are available, but
38+
the member `.pos()` would not be.
39+
40+
When the number of fields in the record is low and you know the order, you can also use
41+
[structured bindings](https://en.cppreference.com/w/cpp/language/structured_binding)
42+
to decompose the record into its fields:
43+
44+
\snippet test/snippet/seq_io/seq_io_reader.cpp decomposed
45+
46+
Note that the order of the fields is fixed (in this case it is defined by bio::seq_io::default_field_ids).
47+
It is independent of the names you give to the bindings, so this syntax is error-prone when used with large records
48+
(e.g. those defined by bio::var_io::reader).
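
The pitfall can be reproduced with a plain `std::tuple`, which decomposes by position in the same way. In this sketch, `toy_record` is a hypothetical stand-in for a three-field record (ID, SEQ, QUAL) and is not a bio::record.

```cpp
#include <cassert>
#include <string_view>
#include <tuple>

// Hypothetical stand-in for a shallow (ID, SEQ, QUAL) record; not a bio::record.
using toy_record = std::tuple<std::string_view, std::string_view, std::string_view>;

// Deliberately misnames the bindings to show that only the position matters.
std::string_view misnamed_first_binding(toy_record const & rec)
{
    auto const & [seq, id, qual] = rec; // wrong names, same fixed order
    assert(seq == std::get<0>(rec));    // "seq" is still the 0-th field, i.e. the ID
    return seq;
}
```

The function returns the ID although the binding is called `seq`. The compiler cannot detect this kind of mistake when neighbouring fields have the same type, which is why the named member functions are the safer choice for large records.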
49+
50+
In generic contexts, you can also access fields via `get<0>(rec)` (returns the 0-th field in the record) or
51+
`get<bio::field::id>(rec)` (the same as calling `rec.id()`); but most users will never need this.
52+
53+
54+
## Does my record own the data? (Shallow vs deep records) {#shallow_vs_deep}
55+
56+
As shown above, every field has an identifier (e.g. bio::field::id) and a type (e.g. std::string_view).
57+
You may have wondered why std::string_view is used as a type and what these `transform_view`s are.
58+
These imply that the record is a *shallow* data structure, i.e. the fields *appear* like strings or vectors, but they
59+
are implemented more like references or pointers.
60+
See the SeqAn3 documentation for an in-depth [Tutorial on Ranges and Views](http://docs.seqan.de/seqan/3-master-user/tutorial_ranges.html).
61+
62+
Shallow records imply fewer memory allocations and/or copy operations during reading. This results in a **better
63+
performance** but also in some important limitations:
64+
65+
* Shallow records cannot be modified (as easily²).
66+
* Shallow records cannot be "stored"; they depend on internal caches and buffers of the reader and become invalid
67+
as soon as the next record is read from the file.
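
The second limitation can be illustrated without B.I.O. at all. The sketch below assumes a hypothetical `toy_reader` that reuses one fixed-size buffer for every line it loads, which is similar in spirit to (but much simpler than) a reader reusing its internal caches. A `std::string_view` plays the role of the shallow field and a `std::string` the role of the deep field.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical toy reader: every call to load() overwrites the same buffer.
class toy_reader
{
public:
    // Copies `line` into the internal buffer and returns a shallow view of it.
    std::string_view load(std::string_view line)
    {
        assert(line.size() <= buffer.size());
        std::copy(line.begin(), line.end(), buffer.begin());
        return std::string_view{buffer.data(), line.size()};
    }

private:
    std::array<char, 64> buffer{};
};

// A shallow field refers to the buffer, so it changes when the next line is loaded.
bool shallow_view_survives()
{
    toy_reader       reader;
    std::string_view shallow = reader.load("ACGT");
    reader.load("TTTT"); // "next record": the buffer is overwritten
    return shallow == "ACGT";
}

// A deep field owns a copy, so loading the next line does not affect it.
bool deep_copy_survives()
{
    toy_reader  reader;
    std::string deep{reader.load("ACGT")};
    reader.load("TTTT");
    return deep == "ACGT";
}
```

In this sketch the shallow view silently shows the data of the following line, while the deep copy keeps its value. With a real reader the stale view may refer to memory in an arbitrary state, so it must not be used at all after the next record has been read.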
68+
69+
70+
If you need to change a record in-place and/or "store" the record for longer than one iteration of the reader, you need to use *deep records* instead.
71+
You can tell the reader that you want deep records by providing the respective options:
72+
73+
\snippet test/snippet/seq_io/seq_io_reader.cpp options2
74+
75+
This snippet behaves similarly to the previous one, except that the type of `rec` is now the following:
76+
77+
\snippet test/snippet/seq_io/seq_io_reader.cpp options2_type
78+
79+
This allows you to call std::vector's `.push_back()` member function (which is not possible in the default case).
80+
Creating this kind of record is likely a bit slower than the shallow record.
81+
82+
**Summary**
83+
84+
* The records generated by readers are *shallow* by default.
85+
* This setting has the best performance; but it is less flexible than a *deep* record.
86+
* Readers can be configured to produce *deep* records via the options.
87+
88+
<small>² Some modifying operations are possible on views, too, but this depends on the exact types.</small>
89+
90+
## How can I change the field types?
91+
92+
In the previous section, we showed how to change the field types from being shallow to deep.
93+
For some readers, more options are available, e.g. bio::seq_io::reader assumes nucleotide data for the SEQ field by default, but you might want to read protein data instead.
94+
95+
\snippet test/snippet/seq_io/seq_io_reader.cpp options
96+
97+
The snippet above illustrates how the alphabet can be changed (and how to provide another option at the same time).
98+
99+
Instead of using these pre-defined `field_types`, you can also define them completely manually. You can decide to even read only a subset of the fields by changing the `.field_ids` member:
100+
101+
\snippet test/snippet/seq_io/seq_io_reader_options.cpp example_advanced2
102+
103+
This code makes FASTA the only legal format and creates records with only the sequence field as a std::string.
104+
105+
But you can also use this mechanism to make some fields shallow and other fields deep. It also allows you
106+
to choose different container types.
107+
See the API documentation of the respective `reader_options` for more advanced use-cases and the
108+
exact restrictions on allowed types.
109+
110+
## How can I create record variables?
111+
112+
There are various easy ways to create a bio::record that do not involve manually providing the template arguments:
113+
114+
1. Deduce from the reader.
115+
2. Use an alias.
116+
3. Use bio::make_record or bio::tie_record.
117+
118+
### Deduce from the reader {#record_type_from_reader}
119+
120+
When iterating over a reader, it is easy to use `auto &` to deduce the record type, but sometimes you need
121+
the record type outside of the for-loop or in a separate context.
122+
123+
This snippet demonstrates how to read an interleaved FastQ file and process the read pairs together (at every second iteration of the loop):
124+
125+
\snippet test/snippet/detail/reader_base.cpp read_pair_processing
126+
127+
To do this, you need to use deep records, because shallow records become invalid after the loop iteration.
128+
Note how it is possible to "ask" the reader for the type of its record to create the local variable.
129+
130+
### Record type aliases {#record_aliases}
131+
132+
When writing a file without reading a file previously, you can use one of the predefined aliases:
133+
134+
* bio::var_io::default_record
135+
136+
This longer example illustrates using an alias:
137+
138+
\snippet test/snippet/var_io/var_io_writer.cpp creation
139+
\snippet test/snippet/var_io/var_io_writer.cpp simple_usage_file
140+
141+
Here bio::var_io::default_record is the type that a bio::var_io::reader would generate if it is defined without any options, **except that the alias is deep by default.**
142+
This is based on the assumption that aliases are typically used to define local variables whose values you want to change.
143+
144+
### Making and tying records {#record_make_tie}
145+
146+
There are convenience functions for making and tying records, similar to std::make_tuple and std::tie:
147+
\snippet test/snippet/record.cpp make_and_tie_record
148+
149+
The type of rec1 is:
150+
\snippet test/snippet/record.cpp make_and_tie_record_type_rec1
151+
152+
The type of rec2 is:
153+
\snippet test/snippet/record.cpp make_and_tie_record_type_rec2
154+
155+
When creating a record from existing variables, you can use bio::tie_record to avoid needless copies.
156+
Instead of manually entering the identifiers as a bio::vtag, you can use bio::seq_io::default_field_ids (or the respective defaults of another reader/writer).
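
Since the two functions are modelled on std::make_tuple and std::tie, the difference between copying and referencing can be seen with the standard functions alone. This sketch does not use bio::make_record or bio::tie_record; `observe_after_rename` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>

// Returns what a copying tuple and a referencing tuple report
// after the original variable has been changed.
std::pair<std::string, std::string> observe_after_rename()
{
    std::string id = "myseq1";

    auto made = std::make_tuple(id); // std::tuple<std::string>: holds a copy
    auto tied = std::tie(id);        // std::tuple<std::string &>: holds a reference

    assert(&std::get<0>(tied) == &id); // the tied element is the original variable

    id = "renamed";

    return {std::get<0>(made), std::get<0>(tied)};
}
```

The copied element keeps the old value, whereas the tied element reflects the change. The same trade-off applies to records: tying avoids copies but requires the original variables to outlive the record.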

include/bio/detail/reader_base.hpp

Lines changed: 14 additions & 1 deletion
@@ -37,6 +37,7 @@ namespace bio
3737
// ----------------------------------------------------------------------------
3838

3939
/*!\brief This is a (non-CRTP) base-class for I/O readers.
40+
* \ingroup bio
4041
* \tparam options_t Type of the reader options.
4142
* \details
4243
*
@@ -72,7 +73,19 @@ class reader_base : public std::ranges::view_base
7273
* \brief The exact type of the record depends on the options!
7374
* \{
7475
*/
75-
//!\brief The type of the record, a specialisation of bio::record; acts as a tuple of the selected field types.
76+
/*!\brief The type of the record, a specialisation of bio::record.
77+
* \details
78+
*
79+
* ### Example
80+
*
81+
* This snippet demonstrates how to read an interleaved FastQ file and process the read pairs
82+
* together (at every second iteration of the loop):
83+
*
84+
* \snippet test/snippet/detail/reader_base.cpp read_pair_processing
85+
*
86+
* To be able to easily back up the first record of a mate-pair, you need to create a temporary
87+
* variable (`last_record`). This type alias helps define it.
88+
*/
7689
using record_type = record<decltype(options_t::field_ids), decltype(options_t::field_types)>;
7790
//!\brief The iterator type of this view (an input iterator).
7891
using iterator = detail::in_file_iterator<reader_base>;

include/bio/misc.hpp

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ namespace bio
3434
* Typically used to configure a class template to have members that are vectors/strings VS members that are views.
3535
* The "shallow" version of such a class is typically cheap to copy (no dynamic memory) while the "deep" version
3636
* is expensive to copy (holds dynamic memory).
37+
*
38+
* See \ref shallow_vs_deep on what this means in practice.
3739
*/
3840
enum class ownership
3941
{

include/bio/record.hpp

Lines changed: 26 additions & 13 deletions
@@ -107,23 +107,22 @@ namespace bio
107107
/*!\brief The class template that file records are based on; behaves like an std::tuple.
108108
* \implements seqan3::tuple_like
109109
* \ingroup bio
110-
* \tparam field_types The types of the fields in this record as a seqan3::type_list.
111110
* \tparam field_ids A vtag_t type with bio::field IDs corresponding to field_types.
111+
* \tparam field_types The types of the fields in this record as a seqan3::type_list.
112112
* \details
113113
*
114-
* This class template behaves just like an std::tuple, with the exception that it provides an additional
114+
* This class template behaves like a std::tuple, with the exception that it provides an additional
115115
* get-interface that takes a bio::field identifier. The traditional get interfaces (via index and
116116
* via type) are also supported, but discouraged, because accessing via bio::field is unambiguous and
117117
* better readable.
118118
*
119-
* ### Example
119+
* In addition to the get()-interfaces, member accessors are provided with the same name as the fields.
120120
*
121-
* For input files this template is specialised automatically and provided by the file via its `record_type` member.
122-
* For output files you my define it locally and pass instances of this to the output file's `push_back()`.
121+
* See bio::seq_io::reader for how this data structure is used in practice.
123122
*
124-
* This is how it works:
123+
* See #make_record() and #tie_record() for easy ways to create stand-alone record variables.
125124
*
126-
* \todo include test/snippet/io/record_2.cpp
125+
* See the \ref record_faq for more details.
127126
*/
128127
template <typename field_ids_, typename field_types_>
129128
struct record : seqan3::detail::transfer_template_args_onto_t<field_types_, std::tuple>
@@ -372,15 +371,22 @@ auto const && get(record<field_ids, field_types> const && r)
372371
// make_record
373372
//-------------------------------------------------------------------------------
374373

375-
/*!\brief Create a bio::record and deduce type from arguments (like std::make_tuple for std::tuple).
374+
/*!\brief Create a deep bio::record from the arguments (like std::make_tuple for std::tuple).
375+
* \param[in] tag A tag that specifies the identifiers of the subsequent arguments.
376+
* \param[in] fields The arguments to put into the record.
377+
* \returns A bio::record with copies of the field arguments.
376378
* \details
377379
*
380+
* The record will contain copies of the arguments.
381+
*
382+
* For more information, see \ref record_type and \ref record_make_tie
383+
*
378384
* ### Example
379385
*
380-
* TODO
386+
* \snippet test/snippet/record.cpp make_and_tie_record
381387
*/
382388
template <auto... field_ids, typename... field_type_ts>
383-
constexpr auto make_record(vtag_t<field_ids...>, field_type_ts &&... fields)
389+
constexpr auto make_record(vtag_t<field_ids...> BIO_DOXYGEN_ONLY(tag), field_type_ts &&... fields)
384390
-> record<vtag_t<field_ids...>, seqan3::type_list<std::decay_t<field_type_ts>...>>
385391
{
386392
return {std::forward<field_type_ts>(fields)...};
@@ -390,15 +396,22 @@ constexpr auto make_record(vtag_t<field_ids...>, field_type_ts &&... fields)
390396
// tie_record
391397
//-------------------------------------------------------------------------------
392398

393-
/*!\brief Create a bio::record of references (like std::tie for std::tuple).
399+
/*!\brief Create a shallow bio::record from the arguments (like std::tie for std::tuple).
400+
* \param[in] tag A tag that specifies the identifiers of the subsequent arguments.
401+
* \param[in] fields The arguments to represent in the record.
402+
* \returns A bio::record with references to the field arguments.
394403
* \details
395404
*
405+
* The record will contain references to the arguments.
406+
*
407+
* For more information, see \ref record_type and \ref record_make_tie
408+
*
396409
* ### Example
397410
*
398-
* TODO
411+
* \snippet test/snippet/record.cpp make_and_tie_record
399412
*/
400413
template <auto... field_ids, typename... field_type_ts>
401-
constexpr auto tie_record(vtag_t<field_ids...>, field_type_ts &... fields)
414+
constexpr auto tie_record(vtag_t<field_ids...> BIO_DOXYGEN_ONLY(tag), field_type_ts &... fields)
402415
-> record<vtag_t<field_ids...>, seqan3::type_list<field_type_ts &...>>
403416
{
404417
return {fields...};

include/bio/seq_io/reader.hpp

Lines changed: 5 additions & 0 deletions
@@ -92,6 +92,11 @@ namespace bio::seq_io
9292
* at the first whitespace:
9393
* \snippet test/snippet/seq_io/seq_io_reader.cpp options
9494
*
95+
* If you need to modify or store the records, request *deep records* from the reader:
96+
* \snippet test/snippet/seq_io/seq_io_reader.cpp options2
97+
*
98+
* For more information on *shallow* vs *deep*, see \ref shallow_vs_deep
99+
*
95100
* For more advanced options, see bio::seq_io::reader_options.
96101
*/
97102
template <typename... option_args_t>

include/bio/seq_io/reader_options.hpp

Lines changed: 2 additions & 0 deletions
@@ -104,6 +104,8 @@ inline constinit auto field_types_protein = field_types<ownership::shallow, seqa
104104
* \details
105105
*
106106
* Configures a shallow record where sequence and quality data are plain characters.
107+
* This can be used in cases where the application needs to handle nucleotide *and*
108+
* protein data.
107109
*/
108110
inline constinit auto field_types_char = field_types<ownership::shallow, char, char>;
109111
//!\}

include/bio/var_io/reader.hpp

Lines changed: 1 addition & 2 deletions
@@ -55,8 +55,7 @@ namespace bio::var_io
5555
* are returned by default also correspond to VCF specification (i.e. 1-based positions, string as strings and not
5656
* as numbers) **with one exception:** the genotypes are not grouped by sample (as in the VCF format) but by
5757
* genotype field (as in the BCF format).
58-
* This results in a notably better performance when reading BCF files. See below for information on how to change
59-
* this.
58+
* This results in a notably better performance when reading BCF files.
6059
*
6160
* This reader supports the following formats:
6261
*
