AV@CAR Database

AV@CAR is a free multichannel, multimodal database for automatic audio-visual speech recognition, including both studio and in-car recordings. The audio database is composed of seven audio channels, including clean speech (captured with a close-talk microphone), noisy speech from several microphones placed on the overhead of the cabin, a noise-only signal coming from the engine compartment, and information about the speed of the car. For the video database, a small video camera sensitive to the visible and near-infrared bands is placed on the windscreen and used to capture the face of the driver. Recordings were made under different light conditions, both during the day and at night. Additionally, the same individuals were recorded in the laboratory under controlled environmental conditions to obtain noise-free speech signals, 2D images and 3D + texture face models.

A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, and E. Zacur. AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition. In IV International Conference on Language Resources and Evaluation (LREC), volume 3, pages 763–767, Lisbon, Portugal, 2004.

Audio-Visual-Lab

The Audio-Visual-Lab data set of AV@CAR contains sequences of 20 people recorded under controlled conditions while repeating predefined phrases or sentences. There are 197 sequences for each person, recorded in AVI format. The video data has a spatial resolution of 768x576 pixels, 24-bit pixel depth and 25 fps, and is compressed at an approximate rate of 50:1. The audio data is a single channel at 16 bits and a 32 kHz sampling frequency. Five additional audio channels are supplied separately from the video.
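As a quick orientation for working with these files, the hedged sketch below shows how one sequence could be inspected with OpenCV; the file name is a made-up placeholder and the actual naming scheme of the corpus may differ.

```python
# Hedged sketch: inspect one of the AVI sequences with OpenCV.
# The file name is a hypothetical placeholder; the actual naming
# scheme of the distributed corpus may differ.
import cv2

cap = cv2.VideoCapture("avcar_lab_speaker01_session2_phrase01.avi")
if not cap.isOpened():
    raise IOError("Could not open the sequence")

width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected: 768
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected: 576
fps = cap.get(cv2.CAP_PROP_FPS)                   # expected: 25.0
frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

print(f"{width}x{height} @ {fps:.1f} fps, {frames} frames (~{frames / fps:.1f} s)")
cap.release()
```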
The sequences are divided into 9 sessions and were captured in frontal view under different illumination conditions and speech tasks. The table below summarizes the sequences in the Audio-Visual-Lab dataset:

Session  Videos  Illumination  Description of speech task               Sample
1        18      FR            Long text                                Video_1
2        25      FR            Phonetically-balanced phrases            Video_2
3        25      LAT           Phonetically-balanced phrases            Video_3
4        25      FR            Application-oriented phrases             Video_4
5        19      FR            Word spelling                            Video_5
6        10      FR            Isolated digits                          Video_6
7        25      FR            Names of streets, cities and countries   Video_7
8        25      IR + BKG      Phonetically-balanced phrases            Video_8
9        25      IR            Phonetically-balanced phrases            Video_9
FR = Frontal illumination with white light
LAT = Lateral illumination with white light (from the speaker's left)
IR = Frontal Infra-red (near band) illumination
BKG = Background illumination (ambient light)


The phrases were selected specifically for each speaker, minimizing repetitions among different persons (except for the balanced phrases). For sessions 2, 3, 8 and 9 each speaker used a single 25-phrase set; that is, the same 25 phrases were repeated under 4 different illumination conditions. The total set of phonetically-balanced phrases comprises 250 sentences, from which 25 were randomly chosen to be spoken by each of the 20 individuals.
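To make the design concrete, the following sketch mimics the phrase assignment described above (a random 25-sentence subset of the 250-sentence pool per speaker). It is purely illustrative and not the script actually used to build the corpus.

```python
# Illustrative sketch of the phrase-set design described above: each of
# the 20 speakers is assigned 25 phrases drawn at random from a pool of
# 250 phonetically-balanced sentences.  Hypothetical code, not the
# script actually used to build the corpus.
import random

POOL_SIZE = 250          # total phonetically-balanced sentences
PHRASES_PER_SPEAKER = 25
NUM_SPEAKERS = 20

pool = [f"sentence_{i:03d}" for i in range(POOL_SIZE)]
assignments = {
    speaker: random.sample(pool, PHRASES_PER_SPEAKER)
    for speaker in range(1, NUM_SPEAKERS + 1)
}
# Each speaker then repeats his/her 25 phrases in sessions 2, 3, 8 and 9,
# i.e. under the four different illumination conditions.
```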

Equipment

Equipment image

For the laboratory sessions, inexpensive electret microphones were used along with a VA 2000 voice array microphone (GN NetCom, Denmark), a close-talk (head-worn) C 477 W R microphone (AKG, Austria) and a Q501T microphone (AKG, Austria). In order to capture far-field speech, the two electret microphones were placed in the upper corners of a 1.86x2.86x2.11 m cabin. The NetCom array and the Q501T microphone were placed in front of the speaker, 1 meter away. The audio part of the database is sampled at 16 kHz and 16 bits for each channel. For the video acquisition, a small V-1204A camera (Marshall Electronics, USA) sensitive to the visible and near-infrared bands was placed in front of the speaker. The camera board includes six infrared LEDs which guarantee enough illumination without any other lighting source. These LEDs were used as the illumination source for two of the nine recording sessions.

Audio-Visual-Car

The Audio-Visual-Car data set of AV@CAR contains sequences of 20 people recorded under real driving conditions while repeating predefined phrases or sentences. There are 157 sequences for each person, recorded in AVI format. The video data has a spatial resolution of 768x576 pixels, 24-bit pixel depth at 25 fps, and is compressed at an approximate rate of 50:1. The audio data is a single channel at 16 bits and a 32 kHz sampling frequency. Seven additional audio channels are supplied separately from the video.
The sequences are divided into 7 sessions. The first two sessions were captured with the speaker reading the phrases, with the car parked and the engine turned off. Sessions 3 to 7 were captured while the speaker was driving the car. In these sessions the phrases were first read aloud by an assistant and then repeated by the driver. For each speaker, the text to repeat was the same as in the Lab recordings, except for sessions 5 and 6, which were simplified to avoid distracting the driver.
Moreover, of the 5 driving sessions, 2 were recorded while driving in the city and the other 3 were recorded on the highway. The sessions selected as city sessions were changed across speakers, so that for each session (from 3 to 7) there were 8 speakers recorded in the city and 12 recorded on the highway. While the total recorded videos contain around 15 minutes of data for each speaker, the total recording time for each session was between 1 and 1.5 hours, depending on the driver and the traffic conditions. To allow for different illumination conditions, some speakers were recorded during the night and others during the day, at different times and under different weather conditions.

Session  Videos  Car engine  Description of speech task               Sample
1        18      Off         Long text (*)                            Video_1
2        25      Off         Phonetically-balanced phrases            Video_2
3        25      On          Phonetically-balanced phrases            Video_3
4        25      On          Application-oriented phrases             Video_4
5        19      On          Word spelling                            Video_5
6        10      On          Isolated digits                          Video_6
7        25      On          Names of streets, cities and countries   Video_7

Equipment

In order to acquire seven synchronized audio channels plus the information about the speed of the car, a notebook with a standard sound card could not be used. Instead, an Audio PCM+ board (Bittware, USA) was used, with eight 24-bit input channels and eight 24-bit output channels. This board allows high-speed data transfer from or to the host PC where the corpus is stored. The recordings were made using a DC-powered PC connected to a 12 V battery isolated from the electrical system of the car to avoid electrical noise. Q501T microphones (AKG, Austria) were used for the in-vehicle acquisition because their high-pass filter characteristic makes them appropriate for the car environment. Six microphones were placed on the overhead of the car, three of them over the front seats and the rest over the rear seats. Clean speech is captured by means of a close-talk (head-worn) C444L microphone (AKG, Austria). One input audio channel is used to acquire engine noise via an electret microphone placed in the engine compartment. Synchronized car-speed information is also added to the AV@CAR corpus, acquired through one audio input of the acquisition board.

For the video database, a small V-1204A camera (Marshall Electronics, USA) sensitive to the visible and near-infrared bands is placed on the windscreen beside the rear-view mirror and used to capture the face of the driver. The camera board includes six infrared LEDs which guarantee enough illumination without any other lighting source. This makes it possible to deal with different recording conditions, both during the day and at night. The images provided are black and white, digitized to a spatial resolution of 768x576 pixels, 8-bit pixel depth and a frame rate of 25 fps, using a DT3120 frame grabber (Data Translation Inc., USA). This relatively high resolution is justified because the camera not only has to capture the head, but also has to allow for free movements of the driver within its field of view while still capturing the whole face.
One of the challenging tasks of this database was to ensure synchronization between the audio part and the video part. To make this independent of hardware delays or errors, an array of eight red LEDs is captured by the camera in a corner of each frame. These LEDs are lit and turned off sequentially, every 5 ms, according to a sync signal generated by the audio board. This signal also produces an acoustic tone through the loudspeakers of the car or the lab, registered by the microphones at the beginning of each recording. The LED array keeps blinking throughout the whole video at 1-second intervals, controlled by the audio board, to allow synchronism verification.
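The LED/audio sync scheme lends itself to a simple automatic check. The hedged sketch below locates the first LED flash in a corner of the video and the first acoustic pulse in an audio channel, then compares their time stamps; the file names, corner coordinates and thresholds are placeholders, not part of the corpus specification.

```python
# Hedged sketch of an A/V synchronisation check based on the LED array:
# find the first LED flash in a corner of the video and the first sync
# pulse in an audio channel, then compare their time stamps.  File names,
# the corner region and the thresholds are made-up placeholders.
import cv2
import numpy as np
import soundfile as sf

FPS = 25.0
ROI = (slice(0, 40), slice(728, 768))   # assumed corner holding the LEDs

cap = cv2.VideoCapture("avcar_car_speaker01_session3.avi")   # placeholder
led_frame, idx = None, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if gray[ROI].mean() > 128:          # LEDs light up the corner region
        led_frame = idx
        break
    idx += 1
cap.release()

# Mono audio channel assumed; a crude threshold detector for the pulse.
audio, sr = sf.read("avcar_car_speaker01_session3_ch1.wav")  # placeholder
pulse_sample = int(np.argmax(np.abs(audio) > 0.5 * np.abs(audio).max()))

if led_frame is not None:
    offset_ms = (led_frame / FPS - pulse_sample / sr) * 1000.0
    print(f"Estimated audio-video offset: {offset_ms:.1f} ms")
```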

The car used for the recordings
Video camera placed beside the rear-view mirror
The three frontal microphones
The three rear microphones

Static Images

The Static Images dataset from AV@CAR contains color images of 40 people in PNG format (with lossless compression), at a spatial resolution of 768x576 pixels and 24-bit pixel depth. There are 19 shots per subject, showing multiple viewpoints, expressions and occlusion by glasses. Since the images were obtained from a video camera, there are cases in which several images are available for a single shot. Clear cases of this situation are the 4 changing-rotation shots, for which there are several images with small angular displacements between them. Please see the Recording Protocol for further details. To date, there is only one recording session for 36 people, and two sessions for the other 4 people, separated in time by approximately 4 months. These data can be obtained for free along with annotations on some of the images. Due to the (huge) amount of data that must be classified, we are asking for help with the annotation of the database in exchange for the data. Please see the details here.

3D Data

The 3D dataset from AV@CAR contains 4D face data for 40 people (4D meaning 3D to describe the surface plus gray-level intensity information to describe the texture). The 4D data is described in the standard VRML format. The resources needed to store the data are approximately 10 MB for the VRML file and 2.5 MB for the VTK + texture information, for each person and each expression. Therefore, the total 3D storage requirement for each person, including VRML, VTK and texture maps, is about 165 MB (13 shots x roughly 12.5 MB). There are 13 shots per subject; they correspond to the tasks marked "Yes" in the 3D column of the Recording Protocol table below.

Each shot of the 3D acquisition was recorded in a color video sequence. The sequences are stored in AVI format, with a compression rate of approximately 50:1, a spatial resolution of 768x576 pixels, 24-bit pixel depth and 25 fps.
Additionally, the data of each sequence was processed (before compression to AVI) to select a number of images to be saved in PNG: these are the images of the Static Images dataset. The criterion of this selection is to keep an image only if it is significantly different from the previously stored ones. Therefore, the number of images kept from each shot depends on how much the subject moved during the acquisition and is always greater than or equal to one. In other words, for each task in which a video sequence was captured, we store the full video in compressed format and the most significant frames undamaged by the compression.
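The exact selection criterion and threshold are not given here, but a rule of this kind can be sketched as follows; the mean-absolute-difference test against the last kept frame is just one plausible choice.

```python
# Sketch of a "keep only significantly different frames" rule of the kind
# described above.  The actual criterion and threshold used to build the
# Static Images dataset are not specified here; the mean-absolute-difference
# test against the last kept frame is just one plausible choice.
import cv2
import numpy as np

def select_keyframes(video_path, threshold=15.0):
    """Return grayscale frames that differ enough from the last kept one."""
    cap = cv2.VideoCapture(video_path)
    kept = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if not kept or np.abs(gray - kept[-1]).mean() > threshold:
            kept.append(gray)
    cap.release()
    return kept   # at least one frame is kept, as in the dataset
```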

Example of video recordings of the 3D acquisition

Equipment

A total of seven cameras were used for the acquisitions: six for the 3D reconstruction and one for independent 2D recording. The 3D scanner (BiPod, Vision RT, UK) is a multi-camera system whose 3D surface reconstruction is based on structured light, in which a pattern is projected onto the face. Four cameras are used for the surface reconstruction; with the help of the projected pattern they recover point correspondences and, based on a previous calibration of the system, reconstruct the surface. After the pattern lights are turned off, two additional cameras take a picture of each half-side of the face. These photographs are projected onto the acquired surfaces and a final 4D face is obtained (4D meaning 3D to describe the surface plus gray-level intensity information to describe the texture).
The reconstruction software, provided by Vision RT (UK), controls the 3D hardware during the acquisition, processes the images, builds the surfaces and computes the texture map. The outputs of the system are triangulated meshes and mapping coordinates from each node to a texture image, in OBJ format.
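Since OBJ is a plain-text format, the meshes can be read with a few lines of code. The sketch below parses only the record types mentioned above (vertices, texture coordinates and triangular faces with vertex/texcoord indices) and ignores anything else a real OBJ file may contain.

```python
# Minimal sketch of a reader for the OBJ output described above: vertex
# positions (v), texture coordinates (vt) and triangular faces (f) whose
# entries use the "vertex_index/texcoord_index" form.  Real OBJ files may
# contain other record types, which this sketch simply ignores.
def read_obj(path):
    vertices, texcoords, faces = [], [], []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "v":
                vertices.append(tuple(float(x) for x in parts[1:4]))
            elif parts[0] == "vt":
                texcoords.append(tuple(float(x) for x in parts[1:3]))
            elif parts[0] == "f":
                # convert 1-based "v/vt" indices to 0-based (vertex, texcoord)
                faces.append(tuple(
                    tuple(int(i) - 1 for i in p.split("/")[:2])
                    for p in parts[1:4]
                ))
    return vertices, texcoords, faces
```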
The individual 2D data was recorded using a 1352-5000 color video camera (Cohu Inc., USA) equipped with a Navitar TV zoom lens (12.5-75 mm, F 1.8). The captured video was digitized to a spatial resolution of 768x576 pixels, 24-bit color depth and 25 fps by means of a DT-3120 acquisition board (Data Translation Inc., USA).
The figure below schematically shows the distribution of the cameras in the laboratory where the recordings took place. Approximate distances among the different elements are also provided. The circles labeled 1, 2 and 3 represent visible marks placed on the walls of the room to help position the head of the subject at different angles with respect to the camera system.

Diagram

Mark 1 was placed right behind the 2D camera, such that when the subject seated in the chair (represented by the triangle in the figure above) looked at this mark, his or her frontal view was captured. Marks 2 and 3 were placed 2.5 m away on each side, at the same height as the 2D camera (and Mark 1). As a result, when the subject looks at them the right and left profile views can be captured.
Two more marks were used which are not displayed in the figure. They were both in front of the subject (like Mark 1), but one was on the ceiling (4) and the other on the floor (5). Their distance to the central camera is approximately 1.5 meters each, and they help in recording upward and downward head tilting.
The illumination sources employed can be divided into two groups according to their use: permanent and 3D-only. The permanent illumination was present in all recordings. It was composed of the ambient light of the laboratory, provided by a number of fluorescent tubes on the ceiling, and two 100 W white-light lamps, labeled as "additional illumination". The objective of these sources was to produce reasonably uniform illumination on the subject's face.

Recording Protocol

Each individual to be recorded was first carefully positioned so that he or she was correctly captured by all the cameras. After that, 19 video sequences were recorded, according to the tasks detailed in the following table:

Task  Name        3D   Description
1     Frontal     Yes  The person is standing with neutral expression, looking at Mark 1.
2     TurnLeft    No   The person starts looking at Mark 1 and progressively turns to the left until being 90 degrees from the starting position (full right profile); then he/she turns back (also progressively) until looking straight at Mark 2.
3     pLeft       Yes  Standing with neutral expression looking left, at Mark 2.
4     TurnRight   No   Idem TurnLeft but to the right side, starting at Mark 1 and ending at Mark 3.
5     pRight      Yes  Standing with neutral expression looking right, at Mark 3.
6     MoveUp      No   The person starts looking at Mark 1 and progressively moves the head up until looking straight at Mark 4.
7     Up          Yes  Standing with neutral expression looking up, at Mark 4.
8     MoveDown    No   Idem MoveUp but moving down, starting at Mark 1 and ending at Mark 5.
9     Down        Yes  Standing with neutral expression looking down, at Mark 5.
10    Happy       Yes  Looking at Mark 1 while smiling (showing teeth).
11    Surprise    Yes  Looking at Mark 1 while showing the referred expression.
12    Yawn        Yes  Idem Surprise.
13    Anger       Yes  Idem Surprise.
14    Disgust     Yes  Idem Surprise.
15    Fear        Yes  Idem Surprise.
16    Sadness     Yes  Idem Surprise.
17    Glasses     No   Looking at Mark 1 with neutral expression, wearing transparent glasses.
18    SunGlasses  Yes  Looking at Mark 1 with neutral expression, wearing sunglasses.
19    Happy2      No   Looking at Mark 1 while smiling and wearing transparent glasses.


The column "3D" shows that there are a number of tasks for which only a 2D video sequence is recorded. In all other cases, the video sequence starts before the lights of the 3D system are switched on and continues until after they are switched off.
The tasks can be divided into three groups: multiple viewpoints (1-9), multiple expressions (1 and 10-16) and occlusion by glasses (17-19). The most difficult group to design was that of multiple expressions, since to date there is no well-established criterion for choosing them. The expressions selected for the corpus are based on the gesture classification of [Ekman, 1975], plus another one we have called "Yawn". This addition is based on our experience with the AR database [Martinez, 1998], where there is a similar expression that introduces strong variability into the statistical models we are investigating.

Annotated multi-view dataset

The source used to generate our variable-viewpoint dataset (hereinafter VVDB) is the AV@CAR database. This database provides two sets of variable-pose shots. The first one is composed of four shots with the subject looking at specific markers designed to displace the head to the left, right, top and bottom with respect to a reference shot, labelled as frontal, in which the subject is asked to look straight at the camera. The images captured in this way suffer from the fact that the "look at" instruction has a subjective interpretation: the way different people focus on the same target varies, so the resulting dataset does not present a uniform head rotation among subjects. This problem can be observed in other datasets acquired with similar strategies. The second set of data from AV@CAR consists of video sequences in which people were asked to rotate their heads, starting from the frontal position, according to the following instructions:
  1. Rotate left up to 90 degrees, and then go back to a specific marker.
  2. Rotate right up to 90 degrees, and then go back to a specific marker.
  3. Rotate up until looking at a specific marker.
  4. Rotate down until looking at a specific marker.
Although the final (and perhaps the initial) points differ from one person to another, there is a significant amount of intermediate poses that can be used to generate a homogeneous VVDB. These videos, together with the frontal shot, were the material used for generating the VVDB. Six left-right rotation angles (three to each side) and two nodding angles (one up, one down) were selected for each of the 40 individuals in the database. The total size of the dataset is therefore 280 images. The figure below shows examples of the resulting VVDB. For every image, a number of key points (landmarks) were placed manually according to the 98-point template described in [1,2]:

Face sets

The landmarks can be freely downloaded for research purposes here.
The images can be obtained by contacting Eduardo Lleida.
Please cite the following two papers when using the annotations.
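The format of the annotation files is not described on this page; assuming a plain-text file with one "x y" pair per line for the 98 points (the real format may differ), a landmark overlay could be produced as follows.

```python
# Hedged sketch for overlaying the 98 manually placed landmarks on an image.
# It assumes a plain-text annotation file with one "x y" pair per line; the
# actual format of the distributed annotations may differ.
import cv2
import numpy as np

def load_landmarks(path):
    pts = np.loadtxt(path)                       # expected shape: (98, 2)
    assert pts.shape == (98, 2), "unexpected number of landmarks"
    return pts

def draw_landmarks(image_path, landmark_path, out_path):
    img = cv2.imread(image_path)
    for x, y in load_landmarks(landmark_path):
        cv2.circle(img, (int(round(x)), int(round(y))), 2, (0, 255, 0), -1)
    cv2.imwrite(out_path, img)
```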

References to include in the paper
  1. C. Butakoff and A. F. Frangi. Multi-view face segmentation using fusion of statistical shape and appearance models. Computer Vision and Image Understanding, 114(3):311–321, 2010.
  2. F. M. Sukno, J. J. Guerrero, and A. F. Frangi. Projective active shape models for pose-variant image analysis of quasi-planar objects: Application to facial analysis. Pattern Recognition, 43(3):835–849, 2010.