23andMe to plink

Raw genome data downloaded from 23andMe must be converted to a different file format before plink can process it. The easiest conversion is to plink’s map and ped formats. On this page, I’ll briefly describe these file formats and provide a Python script to perform the conversion.

Pipeline

23andMe’s Raw Genome File Format

A raw genome file downloaded from 23andMe is a tab-separated ASCII text file consisting of four columns for an RSID (Reference SNP Cluster ID), a chromosome, a position on that chromosome, and a pair of nucleotides. Each record in the file represents a SNP. For example,

:
# Below is a text version of your data.  Fields are TAB-separated.  Each line
# corresponds to a single SNP.  For each SNP, we provide its identifier (an
# rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference
# sequence.  We are using reference human assembly build 37 (also known as
# Annotation Release 104).  Note that it is possible that data downloaded at
# different times may be different due to ongoing improvements in our ability
# to call genotypes.
:
# rsid        chromosome   position   genotype
rs12564807    1            734462     AA
rs3131972     1            752721     AG
rs148828841   1            760998     CC
rs12124819    1            776546     AA
rs115093905   1            787173     GG
:

These SNP records continue for several hundred thousand lines—actually 602,347 records, to be exact, in the 723 files I recently had the pleasure of analyzing.

The rsid field provides a unique identifier for the record. No two lines in the raw genome file ever contain the same RSID. For RSIDs that begin with ‘rs’, information regarding the SNP can be found in the dbSNP database from the National Center for Biotechnology Information.

The chromosome field ranges from 1 to 22, inclusive; and it may also contain entries X, Y, XY, and MT. For females, SNPs on the Y chromosome are not available.

The position field is an integer that ranges from 3 to 249218992 in the data I have analyzed. The combination of the chromosome field and the position field may not be unique within the file. Furthermore, the genotypes for SNPs with the same chromosome and position may differ as well. For example, we may see something like the following scattered throughout the file:

# rsid        chromosome   position   genotype
rs35669628    11           5246865    GG
rs63751034    11           5246865    II
:
i4000438      15           72640388   CC
i4000439      15           72640388   CC
i5004858      15           72640388   GG
rs76173977    15           72640388   CC
i6051228      15           72640388   CC

Lastly, the genotype field may contain one or two symbols representing nucleotides. If only one symbol is present, it is assumed the second nucleotide is the same as the first; e.g. A represents the same information as AA. This field may also contain dashes (--), which indicate the SNP information is not available. Otherwise, each symbol is one of the following: A, C, T, G, D, or I.

plink’s Flat File Formats

plink reads a collection of SNP information from two files. The first file contains the pedigree and nucleotide information for an individual; this is called the ped file. The second file specifies the order of the SNPs within the pedigree file, and it provides meta-data about those SNPs; this is the map file.

plink’s ped File Format

A ped file is a whitespace-separated ASCII text file consisting of a header and then some number of nucleotide symbols. For example,

family_id individual_id dad_id mom_id 2 -9
A A A G C C A A G G G G A A A C C C
A A G G C T C C T T G G G G G G G G
G G C C C C G G G G G G G G A A C C
A A C C C C C C A A C C A G 0 0 G G
:

Perhaps the most important field of the header is the individual’s identification code. This identifier is used to locate the individual after performing some analysis with plink. The exact text stored for the individual’s identifier is domain specific, and in the examples that follow, I will simply use foo.

Other identifiers in the header specify the family, father, and mother. When converting data from 23andMe’s raw genome files, however, it’s unlikely this information will be available. In the examples that follow, I will simply use foo_FAM, foo_FATHER, and foo_MOTHER.

After the identifiers come two more fields for the gender and the phenotype. The gender is encoded as follows: 1 for male and 2 for female; any other value indicates the gender is unknown. The last field, the phenotype, is generally not available from the 23andMe data, and it is common to store -9 in this case, which indicates the phenotype is missing.

After the header comes (many) nucleotides, again separated by whitespace. The symbols stored here match the symbols provided by 23andMe’s raw genome files with one exception: instead of encoding missing genotype information with dashes (--), plink expects them to be encoded with 0s (0 0).

An important requirement to note is that the order of the nucleotides must match the order of the SNPs from a corresponding map file. The number of nucleotides, therefore, will equal twice the number of SNPs present in that file.

plink’s map File Format

A map file is a tab-separated ASCII text file consisting of four columns for a chromosome, a SNP identifier, an optional position, and a base-pair coordinate. Each record in the file provides information about a SNP, and the order of the records must match the order of the base-pairs present in one or more corresponding ped files. For example,

1    rs12564807    0    734462
1    rs3131972     0    752721
1    rs148828841   0    760998
1    rs12124819    0    776546
1    rs115093905   0    787173
:

The first column of the map file indicates the chromosome, and the values here are the same as the ones from 23andMe’s raw genome file—with a few exceptions. In samples I have seen, plink chromosomes range from 1 to 26, inclusive, where 23, 24, 25, and 26 correspond to chromosomes X, Y, XY, and MT, respectively.

The second column indicates a variant identifier, which directly corresponds to 23andMe’s rsid field. Every record in the map file contains a unique identifier.

The third column indicates the variant’s position in morgans, but this column is optional. To ignore this column, we provide 0, and this seems to be a valid and appropriate entry for many scenarios.

Finally, the fourth column indicates base-pair coordinate, which directly corresponds to 23andMe’s position field.

The map file does not include information specific to an individual. This means that for a set of raw genome files downloaded from 23andMe, a single map file can be used for all individuals in the study.

If all SNPs are converted from a raw genome file from 23andMe, then the number of records in the map file will equal the number of records in the genome file.

Conversion Using Python

The following Python script performs the conversion described above for a file specified on the command line. The inputed genome file is transformed into two output files with the same base-name but different extensions, map and ped. If gender information is available, it can be specified on the command line using the --gender option, for which valid arguments are male and female.