I ran into this same issue.  For whatever reason, setting 
OMP_NUM_THREADS or MKL_NUM_THREADS did not seem to fix the problem: 
MKL was still creating many threads per process, though perhaps only 
one thread was active at a time.
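
If you want to check this on your own machine, here is a minimal sketch for 
counting the threads of a running process.  It assumes Linux (it reads 
/proc), and count_threads is just a helper name I made up for illustration; 
replace $$ with the PID of your own job.

```shell
# Print the number of threads of a process, given its PID (Linux only).
count_threads() {
    # The "Threads:" field of /proc/<pid>/status holds the thread count.
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Example: thread count of the current shell.
count_threads "$$"
```

A count well above 1 for each MPI process would suggest the threaded 
library is still in use.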

A better solution, which also gave increased performance at least for 
parallel jobs, was to link against the sequential (non-threaded) MKL 
library.  This can be done by setting the following lines in make.sys 
(for MKL 10):

BLAS_LIBS =  -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
LAPACK_LIBS =  -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

then running
make clean (probably unnecessary)
and rebuilding.
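
To confirm which MKL threading layer the rebuilt executable picked up, one 
option is to inspect the output of ldd.  The sketch below is only an 
illustration: mkl_threading_layer is a made-up helper name, the executable 
path in the usage comment is hypothetical, and it prints "unknown" for a 
statically linked binary, where ldd lists no MKL libraries.

```shell
# Classify the MKL threading layer from ldd output read on stdin.
mkl_threading_layer() {
    out=$(cat)
    case "$out" in
        # Threaded layers: Intel OpenMP or GNU OpenMP runtime.
        *libmkl_intel_thread*|*libmkl_gnu_thread*) echo threaded ;;
        # Sequential layer, as selected by -lmkl_sequential.
        *libmkl_sequential*) echo sequential ;;
        *) echo unknown ;;
    esac
}

# Usage (the path to the executable is hypothetical):
# ldd ./bin/my_program | mkl_threading_layer
```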

This gives WORSE performance when you are running only a single job, 
since the sequential library does not thread to take advantage of the 
other cores.  But if you are trying to fully utilize your nodes by 
running several jobs or parallel jobs, using the threaded library 
results in a giant mess.

