A long time ago, I was thinking about systems that perceive only video input. But the more I read about image processing, the more I came to the notion that video alone is not enough.
The system must be able to change its viewpoint according to its internal directives. It must also move through the world and receive sensory feedback about its own movements. This would add so much context that no manual object segmentation would be required.
For more on why richer sensor data helps a system understand the world better, refer to Embodied Cognition.
However, if you want to focus on a visual processing system, then I would recommend looking at YOLO v4.
It's written in C, on the Darknet framework. The original YOLO made a breakthrough in real-time object detection in 2016, and v4 continues that line.
Do you know what I like about this model? It compiles easily on Windows and is highly optimized for different GPUs. But most of all, I like that it implements many biological principles one can find in Hubel’s book “Eye, Brain, and Vision”.
Main questions
Image size
What would the spatial mapping be between a 1024x1024-pixel image and the rods and cones of the human eye, assuming the image covers the full field of view? In such a model, one pixel corresponds to an area of rods and cones. To keep the system small and efficient, we keep only one receptor per pixel, choosing a rod or a cone according to their distribution in the retina. How many photoreceptors do we need for such a model?
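As a back-of-the-envelope check, here is a minimal numpy sketch. The receptor counts (~120 million rods, ~6 million cones) are approximate textbook values, and the eccentricity falloff for cones is a made-up placeholder, not a fitted retina model:

```python
import numpy as np

# Rough literature values, not from the text above: ~120M rods, ~6M cones.
N_RODS, N_CONES = 120e6, 6e6
H = W = 1024

# A 1024x1024 image covering the full field of view maps roughly
# 120 photoreceptors onto each pixel.
receptors_per_pixel = (N_RODS + N_CONES) / (H * W)
print(f"~{receptors_per_pixel:.0f} photoreceptors per pixel")

# Keeping one receptor per pixel, pick its type by eccentricity:
# cones dominate the fovea, rods the periphery. The exponential
# falloff below is a placeholder, not real retinal density data.
yy, xx = np.mgrid[0:H, 0:W]
ecc = np.hypot(yy - H / 2, xx - W / 2) / (H / 2)  # 0 at center, ~1 at edge
p_cone = np.exp(-5.0 * ecc)                       # placeholder density falloff
is_cone = np.random.rand(H, W) < p_cone
print(f"cones kept: {is_cone.mean():.1%} of pixels")
```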
Halftone images
Human eyes do not exactly perceive the world as a matrix of RGB pixels; there are several types of retinal ganglion cells. What if we convert images using a halftone technique (python 1, python 2, c opencv)? See also: Rod and Cone Connections With Bipolar Cells in the Rabbit Retina.
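For illustration, here is a minimal sketch of one classic halftone technique, Floyd-Steinberg error diffusion, in plain numpy (the linked posts cover other variants):

```python
import numpy as np

def floyd_steinberg(gray):
    """Binarize a grayscale image in [0, 1] by diffusing the
    quantization error of each pixel to its unvisited neighbors."""
    img = gray.astype(np.float64).copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 1.0 if old >= 0.5 else 0.0
            img[y, x] = new
            err = old - new
            # Standard Floyd-Steinberg weights: 7/16, 3/16, 5/16, 1/16.
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return img

# Example: dither a smooth horizontal gradient into a binary pattern.
gradient = np.tile(np.linspace(0, 1, 64), (64, 1))
binary = floyd_steinberg(gradient)
```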
How does a convolution kernel get trained?
2D convolution can be expressed as a matrix-matrix multiplication via the im2col trick, and the kernel is trained by backpropagating through that product. See here with pictures and formulas. And see visualization of active neurons and conv filters.
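A minimal numpy sketch of that view (function names are mine): unroll the input into a patch matrix with im2col, so the forward convolution is one matrix product, and the kernel gradient in backprop falls out of the same patch matrix:

```python
import numpy as np

def im2col(x, k):
    """Unroll all k x k patches of a 2D input into rows of a matrix."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((oh * ow, k * k))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.random.randn(6, 6)
kernel = np.random.randn(3, 3)

# Forward pass: convolution (cross-correlation) as one matrix product.
cols = im2col(x, 3)
out = (cols @ kernel.ravel()).reshape(4, 4)

# Backward pass: given dL/dout from upstream, the kernel gradient is
# the same patch matrix multiplied by the upstream gradient.
dout = np.random.randn(4, 4)  # stand-in for the upstream gradient
dkernel = (cols.T @ dout.ravel()).reshape(3, 3)
```

With a single kernel this is a matrix-vector product; stacking several kernels as columns turns both passes into true matrix-matrix multiplications.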
Papers
A good list compiled in this post on Towards Data Science
- Network in Network Link
- A guide to convolution arithmetic for deep learning Link
- Deconvolution and Checkerboard Artifacts Link
- Multi-Scale Context Aggregation by Dilated Convolutions Link
- ResNeXt: Aggregated Residual Transformations for Deep Neural Networks Link
- Going deeper with convolutions Link
- Flattened convolutional neural networks for feedforward acceleration Link
- Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs Link
- Xception: Deep Learning with Depthwise Separable Convolutions Link
- Rethinking the Inception Architecture for Computer Vision Link
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications Link
- ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices Link
Blogs
- Convolutional Neural Networks backpropagation: from intuition to derivation by Grzegorz Gwardys
- Backpropagation In Convolutional Neural Networks by Jefkine