Previous studies have shown that spike-timing-dependent plasticity (STDP) can be used in spiking neural networks (SNN) to extract visual features of low or intermediate complexity in an unsupervised manner. These studies, however, used relatively shallow architectures, and only one layer was trainable. Another line of research has demonstrated – using rate-based neural networks trained with back-propagation – that having many layers increases the recognition robustness, an approach known as deep learning. We thus designed a deep SNN, comprising several convolutional (trainable with STDP) and pooling layers. We used a temporal coding scheme where the most strongly activated neurons fire first, and less activated neurons fire later or not at all. The network was exposed to natural images. Thanks to STDP, neurons progressively learned features corresponding to prototypical patterns that were both salient and frequent. Only a few tens of examples per category were required, and no labels were needed. After learning, the complexity of the extracted features increased along the hierarchy, from edge detectors in the first layer to object prototypes in the last layer. Coding was very sparse, with only a few thousand spikes per image, and in some cases the object category could be reasonably well inferred from the activity of a single higher-order neuron. More generally, the activity of a few hundred such neurons contained robust category information, as demonstrated using a classifier on the Caltech 101, ETH-80, and MNIST databases. We also demonstrate the superiority of STDP over other unsupervised techniques such as random crops (HMAX) or auto-encoders. Taken together, our results suggest that the combination of STDP with latency coding may be key to understanding the way the primate visual system learns, its remarkable processing speed, and its low energy consumption.
These mechanisms are also interesting for artificial vision systems, particularly for hardware solutions.
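To make the two core mechanisms of the abstract concrete, the following is a minimal NumPy sketch of (i) the intensity-to-latency coding scheme, in which more strongly activated neurons fire earlier and weakly activated ones never fire, and (ii) a simplified order-based STDP rule with a multiplicative soft bound `w(1 - w)` that keeps weights in [0, 1]. Function names, thresholds, and learning rates (`a_plus`, `a_minus`) are illustrative assumptions, not the exact values used in the network.

```python
import numpy as np

def intensity_to_latency(activations, t_max=1.0, threshold=0.1):
    """Latency coding: the most strongly activated neurons fire first.
    Activations below `threshold` never fire (latency = +inf)."""
    a = np.asarray(activations, dtype=float)
    # Linearly map activation to latency: max activation -> t = 0.
    return np.where(a >= threshold, t_max * (1.0 - a / a.max()), np.inf)

def stdp_update(w, pre_t, post_t, a_plus=0.004, a_minus=0.003):
    """Simplified STDP: potentiate a synapse if its presynaptic spike
    precedes (or coincides with) the postsynaptic spike, depress it
    otherwise. The w*(1-w) factor softly bounds weights in [0, 1]."""
    dw = np.where(pre_t <= post_t, a_plus, -a_minus) * w * (1.0 - w)
    return np.clip(w + dw, 0.0, 1.0)

# Example: three input neurons with different activation levels.
latencies = intensity_to_latency([0.9, 0.5, 0.05])   # strongest fires first
weights = stdp_update(np.array([0.5, 0.5]),
                      pre_t=np.array([0.2, 0.8]),    # presynaptic spike times
                      post_t=np.array([0.5, 0.5]))   # postsynaptic spike time
```

Note that with this order-based rule only the sign of the spike-time difference matters, not its magnitude, which is consistent with a single feed-forward wave of spikes per image.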
The primate visual system solves the object recognition task through hierarchical processing along the ventral pathway of the visual cortex (DiCarlo, Zoccolan, & Rust, 2012). Through this hierarchy, the visual preference of neurons gradually increases from oriented bars in primary visual cortex (V1) to complex objects in inferotemporal cortex (IT), where neural activity provides a robust, invariant, and linearly-separable object representation (DiCarlo & Cox, 2007; DiCarlo et al., 2012). Despite the extensive feedback connections in the visual cortex, the first feed-forward wave of spikes in IT (∼ 100–150 ms post-stimulus presentation) appears to be sufficient for crude object recognition (Hung, Kreiman, Poggio, & DiCarlo, 2005; Liu, Agam, Madsen, & Kreiman, 2009; Thorpe, Fize, Marlot, et al., 1996). During the last decades, various computational models have been proposed to mimic this hierarchical feed-forward processing (Fukushima, 1980; LeCun & Bengio, 1998; Lee, Grosse, Ranganath, & Ng, 2009; Masquelier & Thorpe, 2007; Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007). Despite the limited successes of the early models (Ghodrati, Farzmahdi, Rajaei, Ebrahimpour, & Khaligh-Razavi, 2014; Pinto, Barhomi, Cox, & DiCarlo, 2011), recent advances in deep convolutional neural networks (DCNN) have led to high-performing models (Krizhevsky, Sutskever, & Hinton, 2012; Simonyan & Zisserman, 2014; Zeiler & Fergus, 2014). Beyond their high precision, DCNNs can tolerate object variations as humans do (Kheradpisheh, Ghodrati, Ganjtabesh, & Masquelier, 2016a, b), use IT-like object representations (Cadieu et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014), and match the spatiotemporal dynamics of the ventral visual pathway (Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016).
Although the architecture of DCNNs is somewhat inspired by the primate visual system (LeCun, Bengio, & Hinton, 2015) (a hierarchy of computational layers with gradually increasing receptive fields), they totally neglect the actual neural processing and learning mechanisms in the cortex. The computing units of DCNNs send floating-point values to each other corresponding to their activation level, whereas biological neurons communicate by sending electrical impulses (i.e., spikes). The amplitude and duration of all spikes are almost the same, so they are fully characterized by their emission time. Interestingly, mean spike rates are very low in the primate visual system (perhaps only a few hertz; Shoham, O'Connor, & Segev, 2006). Hence, neurons appear to fire a spike only when they have to send an important message, and some information can be encoded in their spike times. Such spike-time coding leads to fast and extremely energy-efficient neural computation in the brain (the whole human brain consumes only about 10–20 W of energy; Maass, 2002).