Sequence Errors in Your Gene for Expression? (Parts I + II)

Published on April 3, 2017

If you don’t achieve the levels of protein expression expected in your project, have you ever considered that there may be sequence errors in your gene? This applies to all expression systems, not just those based on baculoviruses. We should be clear that we are not talking about errors introduced when your gene is synthesized by an outsourcing company. Quality control at such companies is such that it is extremely rare for there to be a problem. No, we are referring to data that you may have downloaded from online sources, principally GenBank. These databases host information from the very early days of sequencing, when deriving the sequence of an entire gene was a major achievement.
The methods used then now seem archaic, comprising such arcane procedures as Maxam and Gilbert chemical sequencing, which used reagents that would have present-day Health and Safety officials running screaming for the hills! Sanger sequencing was more convenient and avoided the most hazardous chemicals. However, both methods required the operator to make up polyacrylamide gels so that radiolabelled DNA fragments could be resolved. At first the results also had to be read manually and transcribed to paper, then to computer. Our CEO recalls the days when there was only one computer terminal per department and people queued up to use it! He also used both sequencing methods in his prime and recalls running one-meter-long polyacrylamide gels.
We digress from the main topic of this blog. The point about historic sequencing techniques is that they were prone to quite high error rates, depending on the method used, the skill of the operator and the nature of the DNA target. Modern, high-throughput sequencing tends to be done by a few specialised centres with massive duplication of reads, so that every DNA strand is covered many times. So if you base a gene construct on a modern-day sequencing project you are unlikely to encounter problems with errors in the DNA. If, however, you pick a gene sequence from relative antiquity, say the 1980s or 1990s, there is a significant risk of mistakes in the raw data. These may be relatively minor, only affecting a few amino acids, but they can also involve quite large deletions that will have a major impact on protein function.
Why do we think that this is an issue? Well, recently we were trying to make a protein for a customer based on sequence data from over 30 years ago. We worked from a GenBank data file to design and construct a cDNA for expression. After making a recombinant virus and testing expression we were surprised to find no significant amount of protein. This was particularly perplexing as the gene had been expressed from a cDNA soon after it was sequenced by the original authors. After much head scratching we did a little database searching and lined up other examples of this gene. What was apparent was that related sequences displayed quite large differences from our target. These were so significant that it was hard to see how our gene sequence could be correct. Further, they occurred in regions of the protein involved in processing and targeting, which would almost certainly affect stability.
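For readers who want to try this kind of line-up themselves, here is a minimal Python sketch. It is an illustration only, not a record of how the comparison in this project was done. It assumes the sequences have already been aligned to equal length by an alignment program, with "-" marking gaps. The function name `flag_discrepancies`, the `min_agreement` threshold and the toy sequences are all hypothetical.

```python
from collections import Counter


def flag_discrepancies(target, relatives, min_agreement=0.8):
    """List alignment columns where the relatives agree with each other
    but not with the target.

    All sequences must be pre-aligned to the same length ('-' = gap).
    Returns (position, target_residue, majority_residue) tuples, with
    positions counted from 1.
    """
    if not relatives:
        raise ValueError("need at least one related sequence")
    if any(len(r) != len(target) for r in relatives):
        raise ValueError("sequences must be aligned to the same length")

    flagged = []
    for pos, column in enumerate(zip(target, *relatives), start=1):
        target_res, others = column[0], column[1:]
        # Most common residue among the related sequences in this column
        majority, count = Counter(others).most_common(1)[0]
        if majority != target_res and count / len(others) >= min_agreement:
            flagged.append((pos, target_res, majority))
    return flagged


# Toy, made-up alignment: the old entry lacks two residues that all
# related entries contain.
old_entry = "MKTAAV--LLG"
related = ["MKTAAVPPLLG", "MKTAAVPPLLG", "MKSAAVPPLLG"]
print(flag_discrepancies(old_entry, related))
# [(7, '-', 'P'), (8, '-', 'P')]
```

A run of flagged columns in the old entry, where the newer entries all agree, is the sort of pattern that should prompt a closer look before ordering a synthetic gene.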
So how could the original authors of our target sequence manage to express the gene and make a protein? When we looked back at the original paper we noticed that the gene had been expressed from a cDNA derived from an mRNA library. This cDNA was also used to sequence the gene, but these two aspects of the project were completely separate. Although the sequence data were used to some extent to design an expression vector, it didn’t really matter if there were a few errors in the results. Because the authors expressed the protein directly from the DNA copy of the mRNA, they were probably working with a completely faithful copy of the gene.
What can we do to rescue our project and obtain gene expression? Well, we have just designed and ordered a new synthetic gene construct based on a consensus of data files for the target, which we consider to represent the most authentic sequence. Once we have this gene we will make another recombinant virus and test protein production again. Our failure to produce the protein using our initial construct may have nothing to do with sequence errors, but the sequence comparisons we made strongly resemble a “smoking gun”. We will let you know how successful this is in a future blog.
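As a rough illustration of what "a consensus of data files" can mean in practice, the sketch below takes a simple majority vote at each column of a set of pre-aligned sequences. This is a generic technique shown under stated assumptions, not a description of how our construct was designed: the function name `majority_consensus` and the example sequences are hypothetical, and real design work would also weigh the quality and independence of each entry.

```python
from collections import Counter


def majority_consensus(aligned):
    """Majority-vote consensus of pre-aligned, equal-length sequences.

    '-' marks a gap. Columns where a gap wins the vote are dropped from
    the result. Ties go to the residue seen first in the input order.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    winners = []
    for column in zip(*aligned):
        # Most common character in this alignment column
        residue, _count = Counter(column).most_common(1)[0]
        winners.append(residue)
    return "".join(winners).replace("-", "")


# Toy, made-up entries for the same protein
entries = ["MKTLLV", "MKTLLV", "MKSLLV"]
print(majority_consensus(entries))
# MKTLLV
```

A majority vote is only as good as the entries that feed it. If several records were copied from the same original submission, they count as one witness rather than several.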

And here is Part II!

Sequence errors in your gene for expression Part II follows on from an earlier blog where we discussed a problem we were having producing a particular protein after synthesizing a gene based on very old sequence data. The baculovirus we made containing this synthetic gene produced very low levels of recombinant protein that were hardly visible after immunoblot analysis. This wasn’t a good start to a project where a purified protein was required for diagnostic purposes.
We went on to discuss sequence errors in the target gene as a possible reason why we were not seeing much protein production. Our suspicions were heightened because earlier reports, in which a cDNA was used for expression, had yielded good levels of recombinant material. Unfortunately, we didn’t have access to this cDNA clone to enable us to remake the virus for protein production.
We resorted to comparing more recent data for our target gene with the information we had originally downloaded from GenBank. This highlighted an alarming number of inconsistencies between the sequences. In consequence, we designed a new synthetic gene, had it made and then inserted it into one of our flashBAC™ vectors. A few days later, having made our recombinant virus and done some quick tests for protein expression, we were very pleased to see a nice band reacting with our specific antiserum on immunoblots. The next steps are to scale up production and purify the protein, which unfortunately we can’t identify in this blog.
The lesson we have learned from this experience is never to assume that a GenBank sequence is 100% accurate! Although our second round of gene synthesis also used a GenBank file, we could be reasonably sure it was accurate because many other independent data entries supported it. This still left a slight uncertainty in the outcome of the project but, as anyone doing protein expression will testify, there is nothing like a shot of adrenalin in the afternoon as you wait for your western blot to develop bands of recombinant protein!
If you are having a particular problem producing your recombinant protein, send us an email via the OET Ltd website with some general information about your project and we will attempt to help.
