Sequence Errors in Your Gene for Expression?

yelluk

Apr 3, 2017, 3 min read

Updated: Nov 28, 2023

If you don’t achieve the levels of protein expression expected in your project, have you ever considered that there may be sequence errors in your gene? This applies to all expression systems, not just those based on baculoviruses. We should be clear that we are not talking about sequence errors introduced when your gene is synthesized by an outsourcing company.

Quality control by such companies is such that it is extremely rare for there to be a problem. No, we are referring to data that you may have downloaded from online sources, principally GenBank. These databases host information from the very early days of sequencing, when deriving the sequence of an entire gene was a major achievement.

The methods used then now seem archaic, comprising such arcane procedures as Maxam and Gilbert chemical sequencing, which used reagents that would have present-day Health and Safety officials running screaming for the hills! A more convenient method was Sanger sequencing, which avoided the particularly hazardous chemicals. However, both methods required the operator to make up polyacrylamide gels so that radiolabelled DNA fragments could be resolved. At first the results also had to be read manually, transcribed to paper and then entered into a computer. Our CEO recalls the days when there was only one computer terminal per department and people queued up to use it! He also used both sequencing methods in his prime and recalls running one-meter-long polyacrylamide gels.

We digress from the main topic of this blog. The point about historic sequencing techniques is that they were prone to quite high error rates, depending on the method used, the skill of the operator and the nature of the DNA target. Modern, high-throughput sequencing tends to be done by a few specialised centres with massive duplication of reads so that every DNA strand is covered many times. So if you base a gene construct on a modern-day sequencing project you are unlikely to encounter problems with errors in the DNA. If, however, you pick a gene sequence from relative antiquity, say the 1980s or 1990s, there is a significant risk of mistakes in the raw data. These may be relatively minor, affecting only a few amino acids, but they can also involve quite large deletions that will have a major impact on protein function.

Why do we think that this is an issue? Well, recently we were trying to make a protein for a customer based on sequence data from over 30 years ago. We worked from a GenBank data file to design and construct a cDNA for expression. After making a recombinant virus and testing expression we were surprised to find no significant amount of protein. This was particularly perplexing as the gene had been expressed from a cDNA soon after it was sequenced by the original authors. After much head scratching we did a little database searching and lined up other examples of this gene. What was apparent was that related sequences displayed quite large differences from our target. These were so significant that it is hard to see how our gene sequence could be correct. Further, they also occurred in regions of the protein involved in processing and targeting, so they will almost certainly affect stability.
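For readers who want to try this kind of comparison themselves, here is a minimal Python sketch. It is not the procedure we used; it simply assumes you already have your target and a few related protein sequences aligned to the same length (with "-" marking gaps), and it flags positions where the related sequences agree with one another but not with the target. The function name, the 75% agreement threshold and the toy sequences are all made up for illustration.

```python
from collections import Counter

def divergent_positions(target, homologues, min_agreement=0.75):
    """List (position, target_residue, majority_residue) for alignment columns
    where the homologues largely agree with each other but not with the target.

    Assumes all sequences are already aligned to the same length.
    Positions are 1-based alignment columns.
    """
    if any(len(h) != len(target) for h in homologues):
        raise ValueError("all sequences must be pre-aligned to the same length")
    flagged = []
    for i, target_residue in enumerate(target):
        column = [h[i] for h in homologues]
        residue, count = Counter(column).most_common(1)[0]
        # Only flag columns where the homologues form a clear majority
        if count / len(column) >= min_agreement and residue != target_residue:
            flagged.append((i + 1, target_residue, residue))
    return flagged

# Toy, invented sequences: the target carries a two-residue deletion
target = "MKT--LLVAGA"
homologues = ["MKTALLLVAGA", "MKTALLLVSGA", "MKTALLLVAGA"]

for position, ours, theirs in divergent_positions(target, homologues):
    print(f"column {position}: target has {ours!r}, related sequences have {theirs!r}")
```

A run of flagged columns clustered in one region, particularly gaps in the target, is the sort of pattern that would make us suspicious of an old database entry.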

So how could the original authors of our target sequence manage to express the gene and make a protein? When we looked back at the original paper we noticed that the gene had been expressed from a cDNA derived from an mRNA library. This cDNA was also used to sequence the gene. These two aspects of the project were completely separate. Although the sequence data were used to some extent to design an expression vector, it didn’t really matter if there were a few errors in the results. Because the authors expressed the DNA copy of the mRNA itself, the protein was probably made from a completely faithful copy of the gene.

What can we do to rescue our project and obtain gene expression? Well, we have just designed and ordered a new synthetic gene construct based on a consensus of data files for the target, which we consider to represent the most authentic sequence. Once we have this gene we will make another recombinant virus and test protein production again. Our failure to produce the protein using our initial construct may have nothing to do with sequence errors, but the sequence comparisons we made strongly resemble a “smoking gun”. We will let you know how successful this is in a future blog.
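As a rough illustration of what "a consensus of data files" can mean, the Python sketch below takes a simple majority vote at each column of a set of pre-aligned protein sequences. This is an assumption made for the sake of the example, not a description of how our construct was designed: the function name, the "X" placeholder for ties and the toy sequences are all hypothetical, and a real design would also weigh the quality and source of each entry.

```python
from collections import Counter

def majority_consensus(aligned, ambiguous="X"):
    """Build a consensus from pre-aligned sequences by majority vote per column.

    Columns with a tie for first place are marked with `ambiguous`.
    Columns where a gap ("-") wins the vote are dropped from the result.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    if any(len(seq) != len(aligned[0]) for seq in aligned):
        raise ValueError("all sequences must be pre-aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        ranked = Counter(column).most_common()
        top_residue, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        consensus.append(ambiguous if tied else top_residue)
    return "".join(consensus).replace("-", "")

# Toy, invented sequences: the third entry has a deletion the others lack
entries = ["MKTALLLVAGA", "MKTALLLVSGA", "MKT--LLVAGA"]
print(majority_consensus(entries))
```

Any position that comes out ambiguous is worth checking by hand against the original papers before ordering a synthetic gene.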
