Simulated Sense of Sight

Written by:

Robert McIntyre

1 JMonkeyEngine natively supports multiple views of the same world.

Vision is one of the most important senses for humans, so I need to build a simulated sense of vision for my AI. I will do this with simulated eyes. Each eye can be independently moved and should see its own version of the world depending on where it is.

Making these simulated eyes a reality is simple because jMonkeyEngine already contains extensive support for multiple views of the same 3D simulated world. The reason jMonkeyEngine has this support is because the support is necessary to create games with split-screen views. Multiple views are also used to create efficient pseudo-reflections by rendering the scene from a certain perspective and then projecting it back onto a surface in the 3D world.

Figure 1: jMonkeyEngine supports multiple views to enable split-screen games, like GoldenEye, which was one of the first games to use split-screen views.

1.1 `ViewPorts`, `SceneProcessors`, and the `RenderManager`.

jMonkeyEngine allows you to create a ViewPort, which represents a view of the simulated world. You can create as many of these as you want. Every frame, the RenderManager iterates through each ViewPort, rendering the scene in the GPU. For each ViewPort there is a FrameBuffer which represents the rendered image in the GPU.

Figure 2: ViewPorts are cameras in the world. During each frame, the RenderManager records a snapshot of what each view is currently seeing; these snapshots are FrameBuffer objects.

Each ViewPort can have any number of attached SceneProcessor objects, which are called every time a new frame is rendered. A SceneProcessor receives its ViewPort's FrameBuffer and can do whatever it wants to the data. Often this consists of invoking GPU specific operations on the rendered image. The SceneProcessor can also copy the GPU image data to RAM and process it with the CPU.

1.2 From Views to Vision

Each eye in the simulated creature needs its own ViewPort so that it can see the world from its own perspective. To this ViewPort, I add a SceneProcessor that feeds the visual data to any arbitrary continuation function for further processing. That continuation function may perform both CPU and GPU operations on the data. To make this easy for the continuation function, the SceneProcessor maintains appropriately sized buffers in RAM to hold the data. It does not do any copying from the GPU to the CPU itself because it is a slow operation.

(defn vision-pipeline
  "Create a SceneProcessor object which wraps a vision processing
  continuation function. The continuation is a function that takes 
  [#^Renderer r #^FrameBuffer fb #^ByteBuffer b #^BufferedImage bi],
  each of which has already been appropriately sized."
  [continuation]
  (let [byte-buffer (atom nil)
        renderer (atom nil)
        image (atom nil)]
  (proxy [SceneProcessor] []
    (initialize
     [renderManager viewPort]
     (let [cam (.getCamera viewPort)
           width (.getWidth cam)
           height (.getHeight cam)]
       (reset! renderer (.getRenderer renderManager))
       (reset! byte-buffer
             (BufferUtils/createByteBuffer
              (* width height 4)))
        (reset! image (BufferedImage.
                      width height
                      BufferedImage/TYPE_4BYTE_ABGR))))
    (isInitialized [] (not (nil? @byte-buffer)))
    (reshape [_ _ _])
    (preFrame [_])
    (postQueue [_])
    (postFrame
     [#^FrameBuffer fb]
     (.clear @byte-buffer)
     (continuation @renderer fb @byte-buffer @image))
    (cleanup []))))

The continuation function given to vision-pipeline above will be given a Renderer and three containers for image data. The FrameBuffer references the GPU image data, but the pixel data can not be used directly on the CPU. The ByteBuffer and BufferedImage are initially "empty" but are sized to hold the data in the FrameBuffer. I call transferring the GPU image data to the CPU structures "mixing" the image data. I have provided three functions to do this mixing.

(defn frameBuffer->byteBuffer!
  "Transfer the data in the graphics card (Renderer, FrameBuffer) to
   the CPU (ByteBuffer)."  
  [#^Renderer r #^FrameBuffer fb #^ByteBuffer bb]
  (.readFrameBuffer r fb bb) bb)

(defn byteBuffer->bufferedImage!
  "Convert the C-style BGRA image data in the ByteBuffer bb to the AWT
   style ABGR image data and place it in BufferedImage bi."
  [#^ByteBuffer bb #^BufferedImage bi]
  (Screenshots/convertScreenShot bb bi) bi)

(defn BufferedImage!
  "Continuation which will grab the buffered image from the materials
   provided by (vision-pipeline)."
  [#^Renderer r #^FrameBuffer fb #^ByteBuffer bb #^BufferedImage bi]
  (byteBuffer->bufferedImage!
   (frameBuffer->byteBuffer! r fb bb) bi))

Note that it is possible to write vision processing algorithms entirely in terms of BufferedImage inputs. Just compose that BufferedImage algorithm with BufferedImage!. However, a vision processing algorithm that is entirely hosted on the GPU does not have to pay for this convenience.

2 Optical sensor arrays are described with images and referenced with metadata

The vision pipeline described above handles the flow of rendered images. Now, we need simulated eyes to serve as the source of these images.

An eye is described in blender in the same way as a joint. They are zero dimensional empty objects with no geometry whose local coordinate system determines the orientation of the resulting eye. All eyes are children of a parent node named "eyes" just as all joints have a parent named "joints". An eye binds to the nearest physical object with bind-sense.

(in-ns 'cortex.vision)

(defn add-eye!
  "Create a Camera centered on the current position of 'eye which
   follows the closest physical node in 'creature. The camera will
   point in the X direction and use the Z vector as up as determined
   by the rotation of these vectors in blender coordinate space. Use
   XZY rotation for the node in blender."
  [#^Node creature #^Spatial eye]
  (let [target (closest-node creature eye)
        [cam-width cam-height] 
        ;;[640 480] ;; graphics card on laptop doesn't support
                    ;; arbitray dimensions.
        (eye-dimensions eye)
        cam (Camera. cam-width cam-height)
        rot (.getWorldRotation eye)]
    (.setLocation cam (.getWorldTranslation eye))
    (.lookAtDirection
     cam                           ; this part is not a mistake and
     (.mult rot Vector3f/UNIT_X)   ; is consistent with using Z in
     (.mult rot Vector3f/UNIT_Y))  ; blender as the UP vector.
    (.setFrustumPerspective
     cam (float 45)
     (float (/ (.getWidth cam) (.getHeight cam)))
     (float 1)
     (float 1000))
    (bind-sense target cam) cam))

Here, the camera is created based on metadata on the eye-node and attached to the nearest physical object with bind-sense

2.1 The Retina

An eye is a surface (the retina) which contains many discrete sensors to detect light. These sensors can have different light-sensing properties. In humans, each discrete sensor is sensitive to red, blue, green, or gray. These different types of sensors can have different spatial distributions along the retina. In humans, there is a fovea in the center of the retina which has a very high density of color sensors, and a blind spot which has no sensors at all. Sensor density decreases in proportion to distance from the fovea.

I want to be able to model any retinal configuration, so my eye-nodes in blender contain metadata pointing to images that describe the precise position of the individual sensors using white pixels. The meta-data also describes the precise sensitivity to light that the sensors described in the image have. An eye can contain any number of these images. For example, the metadata for an eye might look like this:

{0xFF0000 "Models/test-creature/retina-small.png"}

Figure 3: The retinal profile image "Models/test-creature/retina-small.png". White pixels are photo-sensitive elements. The distribution of white pixels is denser in the middle and falls off at the edges and is inspired by the human retina.

Together, the number 0xFF0000 and the image image above describe the placement of red-sensitive sensory elements.

Meta-data to very crudely approximate a human eye might be something like this:

(let [retinal-profile "Models/test-creature/retina-small.png"]
  {0xFF0000 retinal-profile
   0x00FF00 retinal-profile
   0x0000FF retinal-profile
   0xFFFFFF retinal-profile})

The numbers that serve as keys in the map determine a sensor's relative sensitivity to the channels red, green, and blue. These sensitivity values are packed into an integer in the order |_|R|G|B| in 8-bit fields. The RGB values of a pixel in the image are added together with these sensitivities as linear weights. Therefore, 0xFF0000 means sensitive to red only while 0xFFFFFF means sensitive to all colors equally (gray).

For convenience I've defined a few symbols for the more common sensitivity values.

(def sensitivity-presets
  "Retinal sensitivity presets for sensors that extract one channel
   (:red :blue :green) or average all channels (:all)"
  {:all    0xFFFFFF
   :red    0xFF0000
   :blue   0x0000FF
   :green  0x00FF00})

2.2 Metadata Processing

retina-sensor-profile extracts a map from the eye-node in the same format as the example maps above. eye-dimensions finds the dimensions of the smallest image required to contain all the retinal sensor maps.

(defn retina-sensor-profile
  "Return a map of pixel sensitivity numbers to BufferedImages
   describing the distribution of light-sensitive components of this
   eye. :red, :green, :blue, :gray are already defined as extracting
   the red, green, blue, and average components respectively."
   [#^Spatial eye]
   (if-let [eye-map (meta-data eye "eye")]
     (map-vals
      load-image
      (eval (read-string eye-map)))))

(defn eye-dimensions 
  "Returns [width, height] determined by the metadata of the eye."
  [#^Spatial eye]
  (let [dimensions
          (map #(vector (.getWidth %) (.getHeight %))
               (vals (retina-sensor-profile eye)))]
    [(apply max (map first dimensions))
     (apply max (map second dimensions))]))

3 Importing and parsing descriptions of eyes.

First off, get the children of the "eyes" empty node to find all the eyes the creature has.

(def
  ^{:doc "Return the children of the creature's \"eyes\" node."
    :arglists '([creature])}
  eyes
  (sense-nodes "eyes"))

Then, add the camera created by add-eye! to the simulation by creating a new viewport.

(in-ns 'cortex.vision)
(defn add-camera!
  "Add a camera to the world, calling continuation on every frame
  produced." 
  [#^Application world camera continuation]
  (let [width (.getWidth camera)
        height (.getHeight camera)
        render-manager (.getRenderManager world)
        viewport (.createMainView render-manager "eye-view" camera)]
    (doto viewport
      (.setClearFlags true true true)
      (.setBackgroundColor ColorRGBA/Black)
      (.addProcessor (vision-pipeline continuation))
      (.attachScene (.getRootNode world)))))

The eye's continuation function should register the viewport with the simulation the first time it is called, use the CPU to extract the appropriate pixels from the rendered image and weight them by each sensor's sensitivity. I have the option to do this processing in native code for a slight gain in speed. I could also do it in the GPU for a massive gain in speed. vision-kernel generates a list of such continuation functions, one for each channel of the eye.

(in-ns 'cortex.vision)

(defrecord attached-viewport [vision-fn viewport-fn]
  clojure.lang.IFn
  (invoke [this world] (vision-fn world))
  (applyTo [this args] (apply vision-fn args)))

(defn pixel-sense [sensitivity pixel]
  (let [s-r (bit-shift-right (bit-and 0xFF0000 sensitivity) 16)
        s-g (bit-shift-right (bit-and 0x00FF00 sensitivity)  8)
        s-b (bit-and 0x0000FF sensitivity)

        p-r (bit-shift-right (bit-and 0xFF0000 pixel)       16)
        p-g (bit-shift-right (bit-and 0x00FF00 pixel)        8)
        p-b (bit-and 0x0000FF pixel)

        total-sensitivity (* 255 (+ s-r s-g s-b))]
        (float (/ (+ (* s-r p-r)
                     (* s-g p-g)
                     (* s-b p-b))
                  total-sensitivity))))

(defn vision-kernel
  "Returns a list of functions, each of which will return a color
   channel's worth of visual information when called inside a running
   simulation."
  [#^Node creature #^Spatial eye & {skip :skip :or {skip 0}}]
  (let [retinal-map (retina-sensor-profile eye)
        camera (add-eye! creature eye)
        vision-image
        (atom
         (BufferedImage. (.getWidth camera)
                         (.getHeight camera)
                         BufferedImage/TYPE_BYTE_BINARY))
        register-eye!
        (runonce
         (fn [world]
           (add-camera!
            world camera
            (let [counter  (atom 0)]
              (fn [r fb bb bi]
                (if (zero? (rem (swap! counter inc) (inc skip)))
                  (reset! vision-image
                          (BufferedImage! r fb bb bi))))))))]
     (vec
      (map
       (fn [[key image]]
         (let [whites (white-coordinates image)
               topology (vec (collapse whites))
               sensitivity (sensitivity-presets key key)]
           (attached-viewport.
            (fn [world]
              (register-eye! world)
              (vector
               topology
               (vec 
                (for [[x y] whites]
                  (pixel-sense 
                   sensitivity
                   (.getRGB @vision-image x y))))))
            register-eye!)))
         retinal-map))))

(defn gen-fix-display
  "Create a function to call to restore a simulation's display when it
   is disrupted by a Viewport."
  []
  (runonce
   (fn [world]
     (add-camera! world (.getCamera world) no-op))))

Note that since each of the functions generated by vision-kernel shares the same register-eye! function, the eye will be registered only once the first time any of the functions from the list returned by vision-kernel is called. Each of the functions returned by vision-kernel also allows access to the Viewport through which it receives images.

The in-game display can be disrupted by all the ViewPorts that the functions generated by vision-kernel add. This doesn't affect the simulation or the simulated senses, but can be annoying. gen-fix-display restores the in-simulation display.

3.1 The `vision!` function creates sensory probes.

All the hard work has been done; all that remains is to apply vision-kernel to each eye in the creature and gather the results into one list of functions.

(defn vision!
  "Returns a list of functions, each of which returns visual sensory
   data when called inside a running simulation."
  [#^Node creature & {skip :skip :or {skip 0}}]
  (reduce
   concat 
   (for [eye (eyes creature)]
     (vision-kernel creature eye))))

3.2 Displaying visual data for debugging.

It's vital to have a visual representation for each sense. Here I use view-sense to construct a function that will create a display for visual data.

(in-ns 'cortex.vision)

(defn view-vision
  "Creates a function which accepts a list of visual sensor-data and
  displays each element of the list to the screen." 
  []
  (view-sense
   (fn 
     [[coords sensor-data]]
     (let [image (points->image coords)]
       (dorun
        (for [i (range (count coords))]
          (.setRGB image ((coords i) 0) ((coords i) 1)
                   (gray (int (* 255 (sensor-data i)))))))
       image))))

4 Demonstrations

4.1 Demonstrating the vision pipeline.

This is a basic test for the vision system. It only tests the vision-pipeline and does not deal with loading eyes from a blender file. The code creates two videos of the same rotating cube from different angles.

(in-ns 'cortex.test.vision)

(defn test-pipeline
  "Testing vision:
   Tests the vision system by creating two views of the same rotating
   object from different angles and displaying both of those views in
   JFrames.

   You should see a rotating cube, and two windows,
   each displaying a different view of the cube."
  ([] (test-pipeline false))
  ([record?]
     (let [candy
           (box 1 1 1 :physical? false :color ColorRGBA/Blue)]
       (world
        (doto (Node.)
          (.attachChild candy))
        {}
        (fn [world]
          (let [cam (.clone (.getCamera world))
                width (.getWidth cam)
                height (.getHeight cam)]
            (add-camera! world cam 
                         (comp
                          (view-image
                           (if record?
                             (File. "/home/r/proj/cortex/render/vision/1")))
                          BufferedImage!))
            (add-camera! world
                         (doto (.clone cam)
                           (.setLocation (Vector3f. -10 0 0))
                           (.lookAt Vector3f/ZERO Vector3f/UNIT_Y))
                         (comp
                          (view-image
                           (if record?
                             (File. "/home/r/proj/cortex/render/vision/2")))
                          BufferedImage!))
            (let [timer (IsoTimer. 60)]
              (.setTimer world timer)
              (display-dilated-time world timer))
            ;; This is here to restore the main view
            ;; after the other views have completed processing
            (add-camera! world (.getCamera world) no-op)))
        (fn [world tpf]
          (.rotate candy (* tpf 0.2) 0 0))))))

YouTube

A rotating cube viewed from two different perspectives.

Creating multiple eyes like this can be used for stereoscopic vision simulation in a single creature or for simulating multiple creatures, each with their own sense of vision.

4.2 Demonstrating eye import and parsing.

To the worm from the last post, I add a new node that describes its eyes.

Figure 4: The worm with newly added empty nodes describing a single eye.

The node highlighted in yellow is the root level "eyes" node. It has a single child, highlighted in orange, which describes a single eye. This is the "eye" node. It is placed so that the worm will have an eye located in the center of the flat portion of its lower hemispherical section.

The two nodes which are not highlighted describe the single joint of the worm.

The metadata of the eye-node is:

(cortex.sense/meta-data
 (.getChild (.getChild (cortex.test.body/worm) "eyes") "eye") "eye")

"(let [retina \"Models/test-creature/retina-small.png\"]
    {:all retina :red retina :green retina :blue retina})"

This is the approximation to the human eye described earlier.

(in-ns 'cortex.test.vision)

(defn change-color [obj color]
  ;;(println-repl obj)
  (if obj
    (.setColor  (.getMaterial obj) "Color" color)))

(defn colored-cannon-ball [color]
  (comp #(change-color % color)
         (fire-cannon-ball)))

(defn gen-worm
  "create a creature acceptable for testing as a replacement for the
   worm."
  []
  (nodify
   "worm"
   [(nodify
     "eyes"
     [(doto
          (Node. "eye1")
        (.setLocalTranslation (Vector3f. 0 -1.1 0))
        (.setUserData
         
         "eye" 
         "(let [retina
                \"Models/test-creature/retina-small.png\"]
                {:all retina :red retina
                 :green retina :blue retina})"))])
    (box
     0.2 0.2 0.2
     :name "worm-segment"
     :position (Vector3f. 0 0 0)
     :color ColorRGBA/Orange)]))



(defn test-worm-vision 
  "Testing vision:
   You should see the worm suspended in mid-air, looking down at a
   table. There are four small displays, one each for red, green blue,
   and gray channels. You can fire balls of various colors, and the
   four channels should react accordingly.

   Keys:
     r  : fire red-ball
     b  : fire blue-ball
     g  : fire green-ball
     <space> : fire white ball"
  
  ([] (test-worm-vision false))
  ([record?] 
     (let [the-worm (doto (worm)(body!))
           vision (vision! the-worm)
           vision-display (view-vision)
           fix-display (gen-fix-display)
           me (sphere 0.5 :color ColorRGBA/Blue :physical? false)
           x-axis
           (box 1 0.01 0.01 :physical? false :color ColorRGBA/Red
                :position (Vector3f. 0 -5 0))
           y-axis
           (box 0.01 1 0.01 :physical? false :color ColorRGBA/Green
                :position (Vector3f. 0 -5 0))
           z-axis
           (box 0.01 0.01 1 :physical? false :color ColorRGBA/Blue
                :position (Vector3f. 0 -5 0))

           ]

       (world
        (nodify [(floor) the-worm x-axis y-axis z-axis me])
        (merge standard-debug-controls
               {"key-r" (colored-cannon-ball ColorRGBA/Red)
                "key-b" (colored-cannon-ball ColorRGBA/Blue)
                "key-g" (colored-cannon-ball ColorRGBA/Green)})
        
        (fn [world]
          (light-up-everything world)
          (speed-up world)
          (let [timer (IsoTimer. 60)]
                (.setTimer world timer)
                (display-dilated-time world timer))
          ;; add a view from the worm's perspective
          (if record?
            (Capture/captureVideo
             world
             (File.
              "/home/r/proj/cortex/render/worm-vision/main-view")))
            
          (add-camera!
           world
           (add-eye! the-worm (first (eyes the-worm)))
           (comp
            (view-image
             (if record?
               (File.
                "/home/r/proj/cortex/render/worm-vision/worm-view")))
            BufferedImage!))
            
          (set-gravity world Vector3f/ZERO)
          (add-camera! world (.getCamera world) no-op))
        
        (fn [world _]
          (.setLocalTranslation me (.getLocation (.getCamera world)))
           (vision-display
            (map #(% world) vision)
            (if record?
              (File. "/home/r/proj/cortex/render/worm-vision")))
          (fix-display world)
          )))))

The world consists of the worm and a flat gray floor. I can shoot red, green, blue and white cannonballs at the worm. The worm is initially looking down at the floor, and there is no gravity. My perspective (the Main View), the worm's perspective (Worm View) and the 4 sensor channels that comprise the worm's eye are all saved frame-by-frame to disk.

5 Demonstration of Vision

YouTube

Simulated Vision in a Virtual Environment

5.1 Generate the Worm Video from Frames

(ns cortex.video.magick2
  (:import java.io.File)
  (:use clojure.java.shell))

(defn images [path]
  (sort (rest (file-seq (File. path)))))

(def base "/home/r/proj/cortex/render/worm-vision/")

(defn pics [file]
  (images (str base file)))

(defn combine-images []
  (let [main-view (pics "main-view")
        worm-view (pics "worm-view")
        blue   (pics "0")
        green  (pics "1")
        red    (pics "2")
        gray   (pics "3")
        blender (let [b-pics (pics "blender")]
                  (concat b-pics (repeat 9001 (last b-pics))))
        background (repeat 9001 (File. (str base "background.png")))
        targets (map
                 #(File. (str base "out/" (format "%07d.png" %)))
                 (range 0 (count main-view)))]
    (dorun
     (pmap
      (comp
       (fn [[background main-view worm-view red green blue gray blender target]]
         (println target)
         (sh "convert"
             background
             main-view "-geometry" "+18+17"    "-composite"
             worm-view "-geometry" "+677+17"   "-composite"
             green     "-geometry" "+685+430"  "-composite"
             red       "-geometry" "+788+430"  "-composite"
             blue      "-geometry" "+894+430"  "-composite"
             gray      "-geometry" "+1000+430" "-composite"
             blender   "-geometry" "+0+0"      "-composite"
             target))
       (fn [& args] (map #(.getCanonicalPath %) args)))
      background main-view worm-view red green blue gray blender targets))))

cd /home/r/proj/cortex/render/worm-vision
ffmpeg -r 25 -b 9001k -i out/%07d.png -vcodec libtheora worm-vision.ogg

6 Onward!

As a neat bonus, this idea behind simulated vision also enables one to capture live video feeds from jMonkeyEngine.
Now that we have vision, it's time to tackle hearing.

#+appendix

7 Headers

(ns cortex.vision
  "Simulate the sense of vision in jMonkeyEngine3. Enables multiple
  eyes from different positions to observe the same world, and pass
  the observed data to any arbitrary function. Automatically reads
  eye-nodes from specially prepared blender files and instantiates
  them in the world as actual eyes."
  {:author "Robert McIntyre"}
  (:use (cortex world sense util))
  (:import com.jme3.post.SceneProcessor)
  (:import (com.jme3.util BufferUtils Screenshots))
  (:import java.nio.ByteBuffer)
  (:import java.awt.image.BufferedImage)
  (:import (com.jme3.renderer ViewPort Camera))
  (:import (com.jme3.math ColorRGBA Vector3f Matrix3f))
  (:import com.jme3.renderer.Renderer)
  (:import com.jme3.app.Application)
  (:import com.jme3.texture.FrameBuffer)
  (:import (com.jme3.scene Node Spatial)))

(ns cortex.test.vision
  (:use (cortex world sense util body vision))
  (:use cortex.test.body)
  (:import java.awt.image.BufferedImage)
  (:import javax.swing.JPanel)
  (:import javax.swing.SwingUtilities)
  (:import java.awt.Dimension)
  (:import javax.swing.JFrame)
  (:import com.jme3.math.ColorRGBA)
  (:import com.jme3.scene.Node)
  (:import com.jme3.math.Vector3f)
  (:import java.io.File)
  (:import (com.aurellem.capture Capture RatchetTimer IsoTimer)))

8 Source Listing

This org file

source-repository

9 Next

I find some ears for the creature while exploring the guts of jMonkeyEngine's sound system.