Finding and downloading raw data

Overview

Teaching: 0 min
Exercises: 0 min

Questions

Key question (FIXME)

Objectives

First learning objective. (FIXME)

In this tutorial we will be looking at some published data on the binding of Gata1 and Gata2 in G1E cells (an erythroid cell line in which Gata1 is fused to an ER domain, allowing for induction of terminal erythroid differentiation using tamoxifen).

The paper with this information is Trompouki et al., published in Cell in 2011.

SRX100313

We need to download the file from the EBI directly to our server. We can do this with a program called wget:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR351/SRR351407/SRR351407.fastq.gz

This downloads a single file called SRR351407.fastq.gz.
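Large downloads are sometimes cut short, so it can be worth checking that a compressed file is intact before working with it. The sketch below uses gzip -t, which tests a file without writing any decompressed output. The file names are made up for illustration: a tiny stand-in file is created in a temporary directory rather than using the real download.

```shell
# Illustration only: build a tiny gzip file to stand in for a real download.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$workdir/example.fastq.gz"

# gzip -t tests the compressed data without writing any output;
# an exit status of 0 means the file decompresses cleanly.
if gzip -t "$workdir/example.fastq.gz" 2>/dev/null; then
    intact_status="ok"
else
    intact_status="corrupt"
fi
echo "complete file:  $intact_status"

# Simulate an interrupted download by keeping only the first 15 bytes.
head -c 15 "$workdir/example.fastq.gz" > "$workdir/truncated.fastq.gz"
if gzip -t "$workdir/truncated.fastq.gz" 2>/dev/null; then
    truncated_status="ok"
else
    truncated_status="corrupt"
fi
echo "truncated file: $truncated_status"
```

On the real data, the equivalent check would be gzip -t SRR351407.fastq.gz, which prints nothing if the file is fine.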

Let’s try to examine the contents of the file:

head SRR351407.fastq.gz

��}��85��"U�<�3c��N@�H����ĩJX�|�O�O������3����	�~9��������m����������?�
o��������3թ>��R�I��{�������|�u]�V�ժmZoUg���MW{��qVjߵǮp+Xmt�y��qV;�j_��[U�Y
�%��9Ϊ���j�}V-Z�ű{�,$����}�R�N�k���J�Iw�`�TJ��,��:v#���*˲Uە|�7�Ux��@Ò�*˾Q�.
z�RrTe�V��,�R}y�a�]�u��*�_��R�X�Q��TS6�X�QI�l�����]V�TG�]VӐ�:r_�.�i���T�.�
鋣}e%����+�eu�.~â�����U������<�=V�����+z�^�G?��T���

We get a lot of nonsense because the file is compressed with gzip. The zcat command decompresses this kind of file and prints the contents to the terminal. Since this file has millions of lines, we pipe the output to head so that only the first few lines are shown.

zcat SRR351407.fastq.gz | head -n 12

@SRR351407.1 WICMT-SOLEXA2:3:1:1801:997/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
####################################
@SRR351407.2 WICMT-SOLEXA2:3:1:2545:999/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
####################################
@SRR351407.3 WICMT-SOLEXA2:3:1:3463:995/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
####################################

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a ‘@’ character and is followed by the read name
Line 2 is the raw DNA sequence
Line 3 begins with a ‘+’ character
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
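Because each read takes exactly four lines, the number of reads in a file is the line count divided by four. The sketch below shows this on a small made-up FASTQ file (three invented reads written to a temporary directory), and also uses awk to check that every quality line is the same length as its sequence line.

```shell
# Illustration only: a made-up FASTQ file with three reads.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' \
    | gzip > "$workdir/example.fastq.gz"

# Four lines per read, so reads = lines / 4.
lines=$(zcat "$workdir/example.fastq.gz" | wc -l)
reads=$((lines / 4))
echo "reads: $reads"

# Line 2 of each record is the sequence and line 4 the qualities;
# count the records where the two lengths differ.
mismatched=$(zcat "$workdir/example.fastq.gz" \
    | awk 'NR % 4 == 2 { n = length($0) }
           NR % 4 == 0 && length($0) != n { bad++ }
           END { print bad + 0 }')
echo "records with mismatched lengths: $mismatched"
```

The same two pipelines should work on SRR351407.fastq.gz, although they will take longer because the whole file has to be decompressed.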

So our first read is called SRR351407.1 WICMT-SOLEXA2:3:1:1801:997/1 and its sequence is NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN. We obtained no useful sequence information from this read. This is quite common at the beginning of a FASTQ file, especially from older Illumina machines. We need to look at some reads that are further into the file.

Peeking inside the middle of a file

Write a command that will allow us to view lines 10001-10012 of SRR351407.fastq.gz (the three reads that start at line 10001).

Solution

You need to combine zcat, head and tail in a pipe to do this: head keeps the first 10012 lines, and tail then keeps the last 12 of those.

zcat SRR351407.fastq.gz | head -n 10012 | tail -n 12

@SRR351407.2501 WICMT-SOLEXA2:3:1:7736:1233/1
ACACCTTTTCCTGCAGGGACATCGTCTGCCACCGAC
+
GGGEGGEDEDGGGGG>3?5AG@GGDGGEGD8BAAAD
@SRR351407.2502 WICMT-SOLEXA2:3:1:7847:1231/1
GTCACTGCCACTGTTGGCCAGGATGCACACACACAC
+
IIIIIIIIIIIIIIGIIIIFIIEIIEEGEG@IGGGG
@SRR351407.2503 WICMT-SOLEXA2:3:1:7925:1234/1
GGAGGCAAACCTGACCTGCCTTCCCTGTAACGGTGG
+
GIIIIIIHIIIIIIIIIIIIIIIIIIIHIIFIIDID
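The same range of lines can also be selected with sed, which some people find easier to read because the start and end lines are written out explicitly. This is an alternative sketch rather than part of the original solution. It uses a made-up file of numbered lines so that the result is easy to check.

```shell
# Illustration only: a gzipped file whose Nth line contains the number N.
workdir=$(mktemp -d)
seq 1 20000 | gzip > "$workdir/numbers.txt.gz"

# The approach from the solution: first 10012 lines, then the last 12 of those.
with_head_tail=$(zcat "$workdir/numbers.txt.gz" | head -n 10012 | tail -n 12)

# Alternative: print lines 10001 to 10012, then quit so that sed
# does not read the rest of the file.
with_sed=$(zcat "$workdir/numbers.txt.gz" | sed -n '10001,10012p;10012q')

first_line=$(echo "$with_sed" | head -n 1)
echo "first line selected: $first_line"
```

Both commands select the same twelve lines. For the real data the sed version would be zcat SRR351407.fastq.gz | sed -n '10001,10012p;10012q'.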

These FASTQ files are big, and will take a long time to process. To be able to complete this tutorial in a reasonable amount of time, we can download some data files which have had most of the reads removed, leaving only those that map to chromosome 11. We use wget for this, and then unpack the downloaded archive using a program called tar.

wget https://rob.beagrie.com/media/chip-tutorial/chip-tutorial-files.tar.gz
tar zxvf chip-tutorial-files.tar.gz
cd chip-tutorial
ls fastqs

G1E_ATAC_0h_r1A_end1.chr11.fastq.gz  G1E_ATAC_30h_r1_end1.chr11.fastq.gz
G1E_ATAC_0h_r1A_end2.chr11.fastq.gz  G1E_ATAC_30h_r1_end2.chr11.fastq.gz
G1E_ATAC_0h_r1B_end1.chr11.fastq.gz  G1E_ATAC_30h_r2_end1.chr11.fastq.gz
G1E_ATAC_0h_r1B_end2.chr11.fastq.gz  G1E_ATAC_30h_r2_end2.chr11.fastq.gz
G1E_ATAC_0h_r1_end1.chr11.fastq.gz   SRR351406.chr11.fastq.gz
G1E_ATAC_0h_r1_end2.chr11.fastq.gz   SRR351407.chr11.fastq.gz
G1E_ATAC_0h_r2_end1.chr11.fastq.gz   SRR351409.chr11.fastq.gz
G1E_ATAC_0h_r2_end2.chr11.fastq.gz   SRR351410.chr11.fastq.gz
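In the tar command above, z means the archive is gzip-compressed, x means extract, v lists each file as it is processed and f gives the archive file name. Replacing x with t lists the contents without extracting anything, which is a handy way to see what an archive will create. The sketch below demonstrates this on a small made-up archive built in a temporary directory. The directory and file names are invented for illustration.

```shell
# Illustration only: build a small archive with an invented layout.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo/fastqs"
echo "placeholder" > "$workdir/demo/fastqs/sample.fastq"
tar czf "$workdir/demo.tar.gz" -C "$workdir" demo

# t = list contents, z = gzip-compressed, f = archive file name.
listing=$(tar tzf "$workdir/demo.tar.gz")
echo "$listing"
```

For the tutorial archive, the equivalent would be tar tzf chip-tutorial-files.tar.gz.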

Our example data contains some ATAC-seq data files, which have sensible names. We’ve also downloaded four files from Trompouki et al. which are named using their run accessions from the Sequence Read Archive (SRA). It’s going to be difficult to remember which sample each of these files contains. We could rename them, but then we might forget which file relates to which piece of published data. One solution here is to make a symbolic link using ln -s:

cd fastqs
ln -s SRR351406.chr11.fastq.gz G1E_ChIP_Input_0h.chr11.fastq.gz
ln -s SRR351407.chr11.fastq.gz G1E_ChIP_Gata2_0h.chr11.fastq.gz
zcat G1E_ChIP_Input_0h.chr11.fastq.gz | head
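A symbolic link is only a pointer to the original file, so the data is not copied and the original name is kept. To check where a link points, ls -l or readlink can be used. The sketch below shows this with made-up file names in a temporary directory.

```shell
# Illustration only: invented file names in a temporary directory.
workdir=$(mktemp -d)
echo "placeholder" > "$workdir/SRR000000.fastq"

# The target is stored as written and resolved relative to the
# directory that contains the link.
ln -s SRR000000.fastq "$workdir/sample_name.fastq"

# readlink prints the target that the link points to.
target=$(readlink "$workdir/sample_name.fastq")
echo "link points to: $target"

# Reading through the link gives the contents of the original file.
content=$(cat "$workdir/sample_name.fastq")
echo "content: $content"
```

In the fastqs directory, ls -l G1E_ChIP_Input_0h.chr11.fastq.gz should show the link followed by an arrow and the name of the original file.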

Linking files

Make symbolic links to name the two remaining files from the G1E paper following the same convention.

Solution

SRR351409 and SRR351410 are the input and the Gata1 ChIP from G1ER cells treated for 24h with estradiol. So a sensible name would look something like this:

ln -s SRR351409.chr11.fastq.gz G1E_ChIP_Input_24h.chr11.fastq.gz
ln -s SRR351410.chr11.fastq.gz G1E_ChIP_Gata1_24h.chr11.fastq.gz

Key Points

First key point. Brief Answer to questions. (FIXME)
