data files - nsteinme/steinmetz-et-al-2019 GitHub Wiki

Data format and general notes about files

Extraction

If you downloaded the files as one big zip file from figshare, you will extract that zip file to get "allData.tar", extract that tar file to get individual .tar files for each recording, and then extract each of these. It is convenient to do this last step using WinRAR in Windows by selecting all, right-clicking on one of them, and choosing "extract each to separate folder".

File naming

Files are named using the "ALF" naming convention, which you can read about in detail here. The main point is that each file contains a "property" of an "object", and the file name is objectName.propertyName.extension. All properties of a given object share the same number of data elements, n (specifically, the same number of rows). So spikes.times and spikes.clusters give two properties of the spikes object, and each has one entry per spike. In rare cases an underscore in the property name is used to indicate a sub-property, e.g. trials.visualStim_contrastLeft.
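As an illustration, the convention can be parsed mechanically (a minimal sketch, not an official ALF parser; the function name `parse_alf` is made up here):

```python
def parse_alf(filename):
    """Split an ALF-style filename into its object, property, and extension.
    Sub-properties keep their underscore inside the property name."""
    obj, prop, ext = filename.split(".")
    return {"object": obj, "property": prop, "extension": ext}

parse_alf("spikes.times.npy")
# {'object': 'spikes', 'property': 'times', 'extension': 'npy'}
parse_alf("trials.visualStim_contrastLeft.npy")
# {'object': 'trials', 'property': 'visualStim_contrastLeft', 'extension': 'npy'}
```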

Files about timing

The only files with special content relate to timing: .times, .intervals, and .timestamps. These contain:

  • .times: a length-n vector of time values in seconds
  • .intervals: an (n x 2) array of start and end times in seconds
  • .timestamps: an (m x 2) array where the first column gives sample numbers or frame numbers at which timestamps are known, and the second column gives the timestamps of those samples. It is understood that the times of any unspecified samples should be linearly interpolated between those given. Thus the simplest such file, for an evenly sampled object, has m=2, where the first column is [1 n] and the second column is [firstSampleTime lastSampleTime].
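The interpolation above is a one-liner in numpy (a sketch; `sample_times` is a made-up helper name, and sample numbering is assumed to follow the [1 n] convention of the example above):

```python
import numpy as np

def sample_times(timestamps, n_samples):
    """Interpolate the time of every sample from an (m x 2) .timestamps array.
    Column 0: sample/frame numbers at which times are known; column 1: those times."""
    return np.interp(np.arange(1, n_samples + 1), timestamps[:, 0], timestamps[:, 1])

# Simplest evenly-sampled case: m=2, first column [1 n].
ts = np.array([[1, 0.0],
               [5, 2.0]])
sample_times(ts, 5)  # array([0. , 0.5, 1. , 1.5, 2. ])
```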

File formats and how to load

The file formats are 'npy', 'tsv', or 'mj2'.

Npy refers to numpy files, which can be loaded natively in Python, or in Matlab using "readNPY" from the npy-matlab repository.

Tsv files are tab-separated value text files with a header row; they can be read with any text editor (like vim or notepad) or loaded with standard commands, for example in Matlab: >> t = readtable('probes.rawFilename.tsv', 'FileType', 'text', 'Delimiter', '\t').

Mj2 are video files in the Motion JPEG 2000 codec. In matlab they can be read with VideoReader.
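In Python, the .npy files load natively with numpy and the .tsv files with the standard csv module (a sketch; the filenames are examples from this page, and `load_tsv` is a made-up helper). For the .mj2 videos, any reader that supports the Motion JPEG 2000 codec will do.

```python
import csv
import numpy as np

def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path) as f:
        return list(csv.DictReader(f, delimiter="\t"))

# spike_times = np.load("spikes.times.npy")         # .npy: natively via numpy
# probes_meta = load_tsv("probes.rawFilename.tsv")  # .tsv: one dict per row
```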

File contents

Here units (which are SI when applicable) are given in square brackets and array sizes in parentheses, e.g. [mm^2] (nFrames, 2) would be a property with two values for each of nFrames, in units of millimeters squared. All times are in [s].

Behavioral data

  • eye. Features extracted from the video of the right eye.
    • area.npy : [arb. units] (nFrames) The area of the pupil extracted with DeepLabCut. Note that the pupil is relatively small during the discrimination task and the passive replay, because the three screens are medium-grey at those times and black otherwise - the much brighter overall luminance leads to relatively constricted pupils.
    • blink.npy : [logical] (nFrames) Times when a blink was detected, to be excluded from analysis.
    • xyPos.npy : [arb. units] (nFrames,2) The 2D position of the center of the pupil in the video frame. This is not registered to degrees visual angle, but could be used to detect saccades or other changes in eye position.
    • timestamps.npy
  • face. Features extracted from the video of the frontal aspect of the subject, including the subject's face and forearms.
    • motionEnergy.npy : [arb. units] (nFrames) The integrated motion energy across the whole frame, i.e. sum( (thisFrame-lastFrame)^2 ). Some smoothing is applied before this operation.
    • timestamps.npy
  • lickPiezo. Voltage values from a thin-film piezo connected to the lick spout, so that values are proportional to deflection of the spout and licks can be detected as peaks of the signal.
    • raw.npy : [V] (nSamples)
    • timestamps.npy
  • licks. Extracted times of licks, from the lickPiezo signal.
    • times.npy (nLicks)
  • spontaneous. Intervals of sufficient duration when nothing else is going on (no task or stimulus presentation).
    • intervals.npy (nSpontaneousIntervals, 2)
  • wheel. The position reading of the rotary encoder attached to the rubber wheel that the mouse pushes left and right with its forelimbs.
    • position.npy : [encoder ticks] (nSamples) The wheel has radius 31 mm and 1440 ticks per revolution, so multiply by 2*pi*r/tpr = 0.135 to convert to millimeters. Positive velocity (increasing values) corresponds to clockwise turns (viewing the wheel from behind the mouse), i.e. turns in the correct direction for stimuli presented on the left; likewise, negative velocity corresponds to right choices.
    • timestamps.npy
  • wheelMoves. Detected wheel movements
    • type.npy [enumerated type] (nDetectedMoves) 0 for 'flinches' or otherwise unclassified movements, 1 for left/clockwise turns, 2 for right/counter-clockwise turns (where again "left" means "would be the correct direction for a stimulus presented on the left"). A detected movement is counted as 'left' or 'right' only if it was of sufficient amplitude that it would have registered a correct response (and possibly did), within a minimum amount of time from the start of the movement. Movements failing those criteria are of the flinch/unclassified type.
    • intervals.npy (nDetectedMoves, 2)
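The tick-to-millimeter conversion for wheel.position works out as follows (a sketch; the constant and function names are made up):

```python
import numpy as np

WHEEL_RADIUS_MM = 31.0   # wheel radius, from the description above
TICKS_PER_REV = 1440     # encoder ticks per revolution
MM_PER_TICK = 2 * np.pi * WHEEL_RADIUS_MM / TICKS_PER_REV  # ~0.135 mm per tick

def wheel_position_mm(position_ticks):
    """Convert wheel.position values from encoder ticks to millimeters.
    Positive changes are clockwise turns, i.e. correct for leftward stimuli."""
    return np.asarray(position_ticks, dtype=float) * MM_PER_TICK
```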

Visual discrimination task

  • trials. A behavioral trial, as described in the manuscript. All times are relative to the same time base as every other time in the dataset, not to the start of the trial.
    • feedbackType.npy [enumerated type] (nTrials) -1 for negative feedback (white noise burst); +1 for positive feedback (water reward delivery).
    • feedback_times.npy (nTrials)
    • goCue_times.npy (nTrials) The 'goCue' is referred to as the 'auditory tone cue' in the manuscript.
    • included.npy [logical] (nTrials) Importantly, while this variable gives inclusion criteria according to the definition of disengagement (see manuscript Methods), it does not give inclusion criteria based on the time of response, as used for most analyses in the paper.
    • repNum.npy [integer] (nTrials) Trials are repeated if they are "easy" trials (high contrast stimuli with large difference between the two sides, or the blank screen condition) and this keeps track of how many times the current trial's condition has been repeated.
    • response_choice.npy [enumerated type] (nTrials) The response registered at the end of the trial, which determines the feedback according to the contrast condition. Note that in a small percentage of cases (~4%, see manuscript Methods) the initial wheel turn was in the opposite direction. -1 for right choice (i.e. correct when stimuli are on the right); +1 for left choice; 0 for NoGo choice.
    • response_times.npy (nTrials)
    • visualStim_contrastLeft.npy [proportion contrast] (nTrials) A value of 0.5 means 50% contrast. 0 is a blank screen: no change to any pixel values on that side (completely undetectable).
    • visualStim_contrastRight.npy [proportion contrast] (nTrials)
    • visualStim_times.npy (nTrials)
    • intervals.npy (nTrials,2)
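A sketch of how these sign conventions combine (`rewarded_choice` is a made-up helper; equal nonzero contrasts were rewarded at random, per the manuscript Methods, so they return None here):

```python
def rewarded_choice(contrast_left, contrast_right):
    """Which response_choice value would earn positive feedback."""
    if contrast_left > contrast_right:
        return 1            # left choice (+1)
    if contrast_right > contrast_left:
        return -1           # right choice (-1)
    if contrast_left == 0 and contrast_right == 0:
        return 0            # NoGo on blank-screen trials
    return None             # equal nonzero contrasts: rewarded at random

rewarded_choice(0.5, 0.25)   # 1 (left stimulus has higher contrast)
rewarded_choice(0.0, 0.0)    # 0 (NoGo is correct on blank trials)
```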

Receptive field mapping 'task'

  • sparseNoise. White squares shown on the screen with randomized positions and timing - see manuscript Methods.
    • positions.npy [degrees visual angle] (nStimuli, 2) The altitude (first column) and azimuth (second column) of the square.
    • times.npy (nStimuli)
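For instance, a coarse receptive-field map for one cluster can be estimated by averaging spike counts in a short window after each square (a sketch; the 100 ms window and the helper name `rf_map` are illustrative choices, not from this page):

```python
import numpy as np

def rf_map(spike_times, stim_times, stim_positions, window=0.1):
    """Mean spike count per presentation at each (altitude, azimuth) position."""
    alts = np.unique(stim_positions[:, 0])   # altitude, first column
    azis = np.unique(stim_positions[:, 1])   # azimuth, second column
    counts = np.zeros((alts.size, azis.size))
    n_pres = np.zeros_like(counts)
    for t, (alt, azi) in zip(stim_times, stim_positions):
        i = np.searchsorted(alts, alt)
        j = np.searchsorted(azis, azi)
        counts[i, j] += np.count_nonzero((spike_times >= t) & (spike_times < t + window))
        n_pres[i, j] += 1
    return counts / np.maximum(n_pres, 1)
```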

Passive stimulus replay 'task'

  • passiveBeeps. Auditory tones of the same frequency as the auditory tone cue in the task.
    • times.npy (nBeeps)
  • passiveValveClick. Opening of the reward valve, but with a clamp in place such that no water flows. Therefore the auditory sound of the valve is heard, but no water reward is obtained.
    • times.npy (nClicks)
  • passiveVisual. Gratings of the same size, spatial frequency, position, etc. as during the discrimination task.
    • contrastLeft.npy [proportion contrast] (nGratings)
    • contrastRight.npy [proportion contrast] (nGratings)
    • times.npy (nGratings)
  • passiveWhiteNoise. The sound that accompanies an incorrect response during the discrimination task.
    • times.npy (nBursts)

Neural data

  • channels. A recorded electrophysiological signal that originated from a particular site on a Neuropixels probe.
    • brainLocation.tsv (nChannels,4) comprising:
      • ccf_ap [µm] (nChannels) The AP position in Allen Institute's Common Coordinate Framework.
      • ccf_dv [µm] (nChannels)
      • ccf_lr [µm] (nChannels)
      • allen_ontology [enumerated string] (nChannels) The acronym of the brain region determined to contain this channel in the Allen CCF.
    • probe.npy [integer] (nChannels) The index of the probe containing the channel (0-indexed).
    • rawRow.npy [integer] (nChannels) The row of the original data file that contained the data from this channel.
    • site.npy [integer] (nChannels) The site number, in within-probe numbering, of the channel. In practice, for this dataset, this always starts at zero and counts up to 383 on each probe, so it is equivalent to the channel number - but if the probe's switches had been used, the site number could have differed from the channel number.
    • sitePositions.npy [µm] (nChannels, 2) The x- and y-position of the site relative to the face of the probe (where the first column is across the face of the probe laterally and the second is the position along the length of the probe; the sites nearest the tip have second column=0).
  • clusters. A collection of spikes to be analyzed together. They are considered to arise from a single neuron except where annotated otherwise.
    • _phy_annotation.npy [enumerated type] (nClusters) 0 = noise (these are already excluded and don't appear in this dataset at all); 1 = MUA (i.e. presumed to contain spikes from multiple neurons; these are not analyzed in any analyses in the paper); 2 = Good (manually labeled); 3 = Unsorted. The 'Good' label was applied to included neurons in only some recordings, so in general the neurons with _phy_annotation>=2 are the ones that should be included.
    • depths.npy [µm] (nClusters) The position of the center of mass of the template of the cluster, relative to the probe. The deepest channel on the probe is depth=0, and the most superficial is depth=3820.
    • originalIDs.npy [integer] (nClusters) The ID number of the cluster as it was during the original manual sorting in Phy. Can be ignored here.
    • peakChannel.npy [integer] (nClusters) The channel number of the location of the peak of the cluster's waveform.
    • probes.npy [integer] (nClusters) The probe on which the cluster was detected.
    • templateWaveformChans.npy [integer] (nClusters,50) The indices of the top 50 channels for this neuron's waveform, by amplitude.
    • templateWaveforms.npy [arb units] (nClusters,82,50) The template waveform shapes (across 82 time samples at 30 kHz) on the top 50 channels, by amplitude. This dataset is to be considered together with templateWaveformChans. From the two of these, you can construct a full matrix of size (nClusters,82,384) of the template shapes across all channels.
    • waveformDuration.npy [s] (nClusters) The trough-to-peak duration of the waveform on the peak channel.
  • probes. A Neuropixels probe that was inserted in the brain during this session.
    • description.tsv (nProbes) Will always be 'Neuropixels Phase3A opt3' here. See neuropix.cortexlab.net for details about this nomenclature if required.
    • insertion.tsv (nProbes,6) comprising the elements below. See the documentation here for definitions of these.
      • entry_point_rl (nProbes)
      • entry_point_ap (nProbes)
      • vertical_angle (nProbes)
      • horizontal_angle (nProbes)
      • axial_angle (nProbes)
      • distance_advanced (nProbes)
    • rawFilename.tsv (nProbes) The original filename of the recorded data, for reference.
    • sitePositions.npy (nSites,2) The positions of sites on the probe. This can be ignored in favor of channels.sitePositions described above. (In principle this ought to be of size (nProbes, nSites, 2) and contain the positions of all 960 sites on the probe - but this is irrelevant here, since identical probes were used throughout and only the deepest sites were recorded.)
  • spikes. Each element corresponds to one detected action potential.
    • amps.npy [µV] (nSpikes) The peak-to-trough amplitude, obtained from the template and template-scaling amplitude returned by Kilosort (not from the raw data).
    • clusters.npy (nSpikes) The cluster number of the spike, 0-indexed, matching rows of the clusters object.
    • depths.npy [µm] (nSpikes) The position of the center of mass of the spike on the probe, determined from the principal component features returned by Kilosort. The deepest channel on the probe is depth=0, and the most superficial is depth=3820.
    • times.npy (nSpikes)
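The reconstruction described under clusters.templateWaveforms can be sketched like this (the helper name is made up; it assumes templateWaveformChans holds 0-based within-probe channel indices):

```python
import numpy as np

def full_templates(waveforms, waveform_chans, n_channels=384):
    """Scatter each cluster's (82 x 50) template into a full
    (nClusters, 82, n_channels) array; channels outside the top 50 stay zero."""
    n_clusters, n_samples, _ = waveforms.shape
    full = np.zeros((n_clusters, n_samples, n_channels))
    for c in range(n_clusters):
        full[c][:, waveform_chans[c]] = waveforms[c]
    return full

# To restrict analyses to well-isolated units, as recommended above:
#   good = np.load("clusters._phy_annotation.npy") >= 2
```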