January 2025
Notes about OpenCL interfaces. How these work will impact the Sensory Atomese design.
- SVM "Shared Virtual Memory" shares vector data between GPU and CPU, avoiding explicit copy semantics. The ctors need a cl::Context. Older hardware doesn't support SVM.
- Non-SVM vector I/O is done with a cl::Buffer object (per vector). Explicit copy-in/copy-out is needed for each access. The cl::Buffer() ctor requires a cl::Context.
- The kernel itself needs only a cl::Program, which holds the actual kernel code.
- Kernels are executed by placing them onto a cl::CommandQueue. This queue takes both a cl::Context and a cl::Device in its ctor. Kernels are executed async.
- Programs written in *.cl or *.clcpp can be "offline compiled" and placed into an *.spv binary file. This requires OpenCL version 2.0.
Different ideas for communicating with GPUs.
- Use GroundedSchemaNodes to wrap kernels. Old-style, yucky. Why yucky? Because it's stateless: besides the string name of the node, nothing is stored in the atom. There's no open/close phasing. It's just a function call. Maps poorly onto stateful I/O.
- Use StorageNodes to wrap kernels. This presents other issues. First, traditional StorageNodes were meant to send/receive Atoms, while here we need to send/receive vectors represented as FloatValues. Maybe other kinds of vectors, but for now, FloatValues. The other problem is that the StorageNode provides a BackingStore API that is not suitable for operating on vectors. Update (August 2025): StorageNode has been modified to use ObjectNode as its base class, and it now behaves like an object. The BackingStore must necessarily remain in place.
- Create something new, inspired by BackingStore. This removes some of the issues with StorageNodes, but leaves others in place. One is that, again, the API sits outside of Atomese. The other is that this becomes yet another engineered solution. Of course, engineered solutions are inescapable, but... for example: the traditional Unix open/close/read/write 4-tuple provides an abstraction that works very well for many I/O tasks: open/close to manage the channel, and read/write to use it. But the OpenCL interfaces do not map cleanly to open/close/read/write; they use a more complex structure. Update (August 2025): the new idea behind ObjectNode is that such atoms can respond to arbitrary messages. This has now been implemented in https://github.com/opencog/sensory and seems to work reasonably well.
The abstraction I'm getting at is this: open near point, open far point, select far-point message recipient, send message. That mapping would have pseudocode like so:
open_near_point() {
Using cl::Device
Create cl::Context and retain pointer to it
}
open_far_point() {
Using cl::Context from near point, and externally specified
cl::Device, create a cl::CommandQueue. Retain pointer to the
    cl::CommandQueue. It's the comms handle.
}
select_recipient() {
Use the externally supplied cl::Program, kernel name string,
number of inputs, outputs, and size.
Create kernel, retain pointer to it.
    Call cl::Kernel::setArg() for each argument, to wire up the kernel.
}
send() {
Using the selected recipient,
Call cl::CommandQueue::enqueueWriteBuffer() to send data.
Wait on cl::Event
}
exec() {
Using the selected recipient,
Call cl::CommandQueue::enqueueNDRangeKernel() to run kernel.
Wait on cl::Event
}
recv() {
Using the selected recipient, wait on cl::Event
Call cl::CommandQueue::enqueueReadBuffer() to get data.
Wait on cl::Event
}
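The phased protocol above can be sketched as ordinary C++, using stand-in types in place of the real cl:: classes, so that no OpenCL runtime is needed. All names here (Channel, the stand-in Context, CommandQueue, Kernel) are illustrative inventions, not any existing API; the "kernel" is just a CPU function:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Stand-ins for cl::Context and cl::CommandQueue.
struct Context { bool open = false; };
struct CommandQueue { Context* ctx = nullptr; };

// Stand-in for a compiled kernel: a plain CPU function of two vectors.
using Kernel =
    std::function<std::vector<float>(const std::vector<float>&,
                                     const std::vector<float>&)>;

// One communication channel, following the open-near/open-far/
// select-recipient/send/exec/recv phasing from the pseudocode.
class Channel
{
    Context ctx_;
    CommandQueue queue_;
    Kernel kernel_;
    std::vector<float> a_, b_, out_;
public:
    // open_near_point(): create the context.
    void open_near_point() { ctx_.open = true; }

    // open_far_point(): create the command queue; it's the comms handle.
    void open_far_point() { queue_.ctx = &ctx_; }

    // select_recipient(): pick the kernel and wire up its arguments.
    void select_recipient(Kernel k) { kernel_ = std::move(k); }

    // send(): stand-in for enqueueWriteBuffer().
    void send(std::vector<float> a, std::vector<float> b)
    { a_ = std::move(a); b_ = std::move(b); }

    // exec(): stand-in for enqueueNDRangeKernel().
    void exec() { out_ = kernel_(a_, b_); }

    // recv(): stand-in for enqueueReadBuffer().
    std::vector<float> recv() const { return out_; }
};
```

The point of the sketch is the sequencing: the near point must be opened before the far point, the recipient selected before send/exec/recv make any sense.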
Caveat: the above is already an oversimplification of the OpenCL
interfaces, because a cl::Context is not created out of thin air,
but requires a vector of cl::Device in its ctor. And devices need a
cl::Platform. Let's ignore these additional complexities for the
moment.
The above API is more complex than open/close/read/write. There are four choices:
- Collapse the first three steps into a generic open(). Collapse the send and exec steps into a generic write().
- The multi-stage open-near, open-far, select-recipient is fairly generic for network communications. E.g. for TCP/IP, open-near corresponds to opening the local socket (initializing the Ethernet adapter), open-far corresponds to the remote socket, and select-recipient corresponds to selecting the port number. This multi-stage open-and-contact-recipient can be codified as "sufficiently generic", claiming that e.g. CUDA would also fit into this model.
- Recognize the above as a cascade of open()s, each requiring the prior, so that it resembles the peeling back of an onion. That is, provide three distinct objects, each having a single open() on it, and require these to be chained, in order to open a channel to the far-point station.
- Recognize that the peeling-of-an-onion model is too 'linear', and that there is a network of interactions between the various cl:: classes. That network is not a line, but a DAG. Encode the DAG as Atomese. That is, create Atoms that are more-or-less in 1-1 correspondence with the OpenCL classes. Communication then requires connecting the Atomese DAG in the same way that the OpenCL API expects the connections.
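The fourth option can be made concrete by writing the class network down as an explicit dependency DAG. A minimal sketch: the edge list below just transcribes the ctor requirements mentioned in these notes (it is an assumption that this list is complete), and a DFS topological sort gives a valid construction order, which is the "wiring" pass that an Atomese encoding of the DAG would have to perform:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Each OpenCL-style object, with the objects its ctor requires.
static const std::map<std::string, std::vector<std::string>> deps = {
    {"Platform",     {}},
    {"Device",       {"Platform"}},
    {"Context",      {"Device"}},
    {"CommandQueue", {"Context", "Device"}},
    {"Program",      {"Context"}},
    {"Kernel",       {"Program"}},
    {"Buffer",       {"Context"}},
};

// DFS topological sort: every object appears only after everything
// its ctor needs has already been constructed.
std::vector<std::string> construction_order()
{
    std::vector<std::string> order;
    std::set<std::string> done;
    std::function<void(const std::string&)> visit =
        [&](const std::string& name) {
            if (done.count(name)) return;
            done.insert(name);
            for (const auto& d : deps.at(name)) visit(d);
            order.push_back(name);
        };
    for (const auto& [name, unused] : deps) visit(name);
    return order;
}
```

Note that the DAG is not a line: CommandQueue needs both Context and Device, which is exactly what breaks the onion-peeling model.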
The first option seems easiest. The fourth option seems most generic. How would this work, in practice?
Four choices for wiring diagrams:
- ExecutionLink -- function-call-like
- RuleLink -- inference-like
- FilterLink -- electronics-circuit-like
- Section -- grammatical/sheaf-like
Pros and cons:
ExecutionLink: olde-school. Uses a GroundedSchemaNode for the function
name, and a list of inputs, but no defined outputs. Can be given a TV
to indicate probability, but there is no clear-cut interpretation of
the function arguments. Replaced by EdgeLink for performance/size
reasons.
RuleLink: originally intended to be a base class for rewriting (for PLN).
- Fairly compact.
- Distinct "input" and "output" areas.
- Rules are nameless (anonymous) with no built-in naming mechanism.
- Rules are explicitly heterosexual (everything is either an input or an output, and must be one or the other). This is problematic if inputs or outputs are to be multiplexed, or given non-directional (monosexual) connections.
- No explicit connection/gluing semantics. There is implicit gluing via conventional proof theory.
RuleLinks specify term rewrites. Ideal for forward-chained rewriting.
With great difficulty, it can be backwards-chained; that this was
difficult was exposed by PLN development.
The FilterLink is used to specify a filter that can be applied to a
stream. The stream can be Atoms or Values or both. (This distinguishes
it from the QueryLink, for which the source and target must lie within
the AtomSpace itself.)
Wrapping a RuleLink with a FilterLink is an effective way of specifying
rewrite rules to be applied to the stream that the filter is filtering.
The problem is that this notion is very fine-grained. The filter can accept
one or more streams as input, apply a rewrite, and generate one or more
outputs. The inputs and the outputs are streams, but to specify the
"wiring diagram", in Atomese, those streams are necessarily anchored
at well-known locations (typically, under some key on some
AnchorNode.)
This makes processing with FilterLinks resemble an electronics circuit
diagram. Each "dot" in the circuit is some Atom-with-Key location, that
must be explicitly specified. Each circuit element, spanning two or more
dots, is a FilterLink + RuleLink combo. The connection to a "dot"
is done with a ValueOfLink that explicitly names the dot.
Async elements, such as QueueValues exist and work. This allows
readers to stall and wait for input. mux and demux are not an issue:
QueueValues automatically demux, and a mux can be created with a
RuleLink that simply copies one input to two (or more) places.
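The queue behaviour just described can be modelled in a few lines of ordinary C++. QueueValue here is a stand-in for the Atomese value of the same name, and copy_rule stands in for a RuleLink that copies one input to two places; both are illustrative, not the real API:

```cpp
#include <cassert>
#include <queue>
#include <utility>

// Stand-in for a QueueValue. Several readers popping from the same
// queue each receive distinct items, so concurrent readers
// automatically demux the stream.
template <typename T>
struct QueueValue
{
    std::queue<T> q;
    void push(T v) { q.push(std::move(v)); }
    bool pop(T& out)
    {
        if (q.empty()) return false;   // reader would stall here
        out = std::move(q.front());
        q.pop();
        return true;
    }
};

// A mux (fanout): copy each item on one input queue to two output
// queues, in the manner of a RuleLink-style copy rule.
template <typename T>
void copy_rule(QueueValue<T>& in, QueueValue<T>& out1, QueueValue<T>& out2)
{
    T v;
    while (in.pop(v)) { out1.push(v); out2.push(v); }
}
```

A real QueueValue also blocks the reader until input arrives; the bool return above merely marks the spot where that stall would happen.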
Arbitrary processing DAGs can be built in this way, running on the host CPU. The proximal question is: can this be converted into a system for wiring up flows on GPUs?
Sections and connectors were originally invented to generalize the concept of tensors, by allowing the specification of arbitrary star-shaped patterns aka semi-assembled jigsaws. The types of the connectors are loosened: instead of having just two types, "input" or "output", there can be any number of types. The connection rules are loosened as well: instead of assuming "heterosexual" rules, where outputs must connect to inputs (only), one can have arbitrary mating rules. Thus, a connection is allowed, if the mating rules allow it: typically, the types must match, and the "sex" of the connectors must be compatible.
As abstract mathematics, this is a very powerful abstraction. As a programming API, it is verbose and hard to work with. In particular, a wiring engine is needed. Such a wiring engine has not yet been created.
It is not particularly hard to specify a GPU kernel as a sheaf section, but perhaps verbose. But that just specifies (describes) it. It also needs to be wired in place. How?
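For concreteness, the loosened mating rules can be written down as a toy model. The type labels and the three "sexes" below are stand-in choices, not part of any existing connector implementation; the rule itself (types must match, sexes must be compatible) is the one stated above:

```cpp
#include <cassert>
#include <string>

// A connector: a type label plus a "sex". Instead of just
// input/output, any number of types is allowed.
struct Connector
{
    std::string type;
    char sex;   // '+', '-', or '0' (non-directional)
};

// Mating rule: types must match, and the sexes must be compatible.
// Here '+' mates with '-' (the heterosexual rule), while '0' mates
// only with '0' (a non-directional, "monosexual" connection).
bool can_mate(const Connector& a, const Connector& b)
{
    if (a.type != b.type) return false;
    if (a.sex == '0' || b.sex == '0') return a.sex == b.sex;
    return a.sex != b.sex;
}
```

A wiring engine would repeatedly apply can_mate() to join half-assembled jigsaw pieces; writing that engine is the part that has not yet been done.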
For a prototype proof-of-concept demo, what's actually needed?
- Basic vector I/O to GPUs. The minimal function is to move data to the GPU, call a kernel, and get data back.
- Multi-step dataflow wiring. Define sophisticated data-flow wiring. Similar to TensorFlow (???)
- AtomSpace on GPU. Run the AtomSpace on the GPU. Use the StorageNode API to talk to it.
For bootstrapping, let's stick to the basics:
the ability to stream data to the GPU. One-shot is a special case.
Pseudocode:
(Update August 2025: the pseudocode below uses the now-obsolete
sensory-v0 interfaces. These have been redesigned, and sensory
version "half" is now in production. See the examples
directory for working code.)
; Location of kernel code as path in local filesystem.
; OpenCLNode isA SensoryNode
(SensoryNode "opencl://platform:device/file/path/kernel.cl")
(SensoryNode "opencl://platform:device/file/path/kernel.clcpp")
(SensoryNode "opencl://platform:device/file/path/kernel.spv")
; Apply standard Open to SensoryNode
; Returns stream handle, which must be stored at some anchor.
(Open
   (Type 'OpenclStream)
   (Sensory "opencl://Clover:AMD/tmp/vec-mult.cl"))
; Write two vectors to GPU. Apply 'vect_mult' kernel function to
; the vectors. The 'List' will typically be a 'ListValue'.
(WriteLink
(ValueOf anchor)
(List
(Predicate "vect_mult")
(Number 1 2 3 4)
(Number 2 2 2 2)))
; Place output stream at well-known location
(cog-set-value!
(Anchor "some anchor point")
(Predicate "some key")
(StreamValue "0 0 0"))
; Copy results from GPU to AtomSpace.
(ReadLink
(ValueOf anchor)
(ValueOf (Anchor "some anchor point") (Predicate "some key")))
; TODO: a RuleLink + FilterLink combo that provides an all-in-one
; wrapper for the above. See the sensory examples for working code.