Calculating a sample size based on a confidence level

It's not surprising you're a bit confused; understanding what's really going on with confidence intervals can be tricky.

The short version: If you don't want to check all the files, you have to choose two different percentages: the confidence level (95% in your example) and how far off you're willing to be at that level (20% in your example). These percentages refer to two different quantities, so it doesn't make sense to add them to or subtract them from each other. Once you've made these choices, I think it's fine to use the online calculator to get a sample size.

If you want more detail on what's going on, here's the explanation: You're trying to estimate the true percentage of files that have correct data. Let's call that percentage $p$. Since you don't want to check every file to calculate $p$ exactly, you have to choose how far off you are willing to be with your estimate, say 20%. Unfortunately, you can't even be certain that your estimate will be within 20% of $p$, so you also have to choose a level of confidence that it will be. You have chosen 95%. The online calculator then gives you the sample size of 23 that you need to estimate $p$ to within 20% at 95% confidence.
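To make the two percentages concrete, here is a minimal sketch of the textbook normal-approximation formula $n = z^2 \, p(1-p) / E^2$, using the worst case $p = 0.5$. I'm assuming this formula for illustration; I don't know what the online calculator actually does, and the function name `required_sample_size` is made up. For a very large population it gives a number close to, but not exactly, the calculator's 23, so the calculator may be adjusting for population size or rounding differently.

```python
import math
from statistics import NormalDist


def required_sample_size(margin, confidence, assumed_p=0.5):
    """Sample size for estimating a proportion to within +/- margin.

    Textbook normal-approximation formula for a very large population.
    assumed_p=0.5 is the worst case (it maximizes p * (1 - p)).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * assumed_p * (1 - assumed_p) / margin ** 2
    return math.ceil(n)


# The two percentages enter the formula separately:
# the confidence level sets z, the margin of error sets the denominator.
print(required_sample_size(margin=0.20, confidence=0.95))
print(required_sample_size(margin=0.04, confidence=0.99))
```

With a 20% margin at 95% confidence the unrounded value is about 24, which rounds up to 25. Tightening to a 4% margin at 99% confidence pushes the requirement past a thousand, which shows why the two percentages can't be traded off by simple addition or subtraction.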

But what does that 95% really mean? Basically, it means that if you were to choose lots and lots of samples of size 23 and calculate a confidence interval from each one, about 95% of the resulting confidence intervals would actually contain the unknown value of $p$. The other 5% would give an interval that does not include $p$. (Some of those would sit too high, others too low.) Another way to look at it is that choosing a 95% confidence interval means that you're choosing a method that gives correct results (i.e., produces a confidence interval that actually contains the value of $p$) 95% of the time.
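The "lots and lots of samples" idea can be checked with a small simulation. This is only a sketch under assumptions of my own: I pretend the true $p$ is 0.8 (a made-up value, since the real one is unknown), sample with replacement, and use the simple interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. The function name `simulated_coverage` is hypothetical.

```python
import random
from statistics import NormalDist


def simulated_coverage(true_p, n, confidence=0.95, trials=20000, seed=0):
    """Fraction of simulated confidence intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Draw one sample of n files; each is correct with probability true_p.
        correct = sum(rng.random() < true_p for _ in range(n))
        p_hat = correct / n
        half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
        # Count the interval as a success if it contains the true value.
        if p_hat - half_width <= true_p <= p_hat + half_width:
            hits += 1
    return hits / trials


# Hypothetical true percentage of correct files: 80%.
print(simulated_coverage(true_p=0.8, n=23))
```

The printed fraction should land near 0.95 rather than exactly on it, because this simple interval is itself an approximation for samples this small, and its coverage shifts with the true value of $p$.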

To answer your specific questions:

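"Does that mean that 'I can be 95% confident that 80% to 100% of the files are correct'?" Not precisely. It really means that you can be 95% confident that the true percentage of correct files, $p$, is between 80% and 100%. That's a subtle distinction: $p$ is a fixed but unknown number, and the 95% describes how often the method produces an interval that contains it.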

"And only then I can say with 95% confidence that the files are correct? (99% +- 4% = 95% to 100%)?" No, this is confusing the two kinds of percentages. The 99% is the fraction of confidence intervals that would contain $p$ if you were to construct lots of them. The 4% is an error margin of $\pm$4% on the estimated percentage of correct files.

One other thing to remember is that the sample size estimator assumes that the population you're drawing from is much larger than the sample you end up with. Since your population is fairly small, you can get away with a smaller sample at the same level of confidence. Determining exactly how small, though, is a much more difficult calculation, beyond what you would have seen in a basic statistics class. I'm not sure how to do it; maybe someone else on the site does. (EDIT: Even better: take Jyotirmoy Bhattacharya's suggestion and ask on Stats Stack Exchange.) But this is the only justification for using a sample smaller than 23, not the fact that you would abort the confidence interval calculation if you found anything other than 100% for your sample's estimate of the true value of $p$.
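For a rough sense of how much a small population helps, one commonly taught shortcut is the finite population correction, $n = n_0 / (1 + (n_0 - 1)/N)$, where $n_0$ is the large-population sample size and $N$ is the population size. The sketch below is an approximation layered on the normal-approximation formula, not the exact calculation; the population of 100 files is a made-up number, and the function name `adjusted_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def adjusted_sample_size(margin, confidence, population_size, assumed_p=0.5):
    """Approximate sample size with the finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Large-population sample size (left unrounded on purpose).
    n0 = z ** 2 * assumed_p * (1 - assumed_p) / margin ** 2
    # Shrink it to account for sampling without replacement from N items.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


# Hypothetical population of 100 files, 20% margin, 95% confidence.
print(adjusted_sample_size(0.20, 0.95, population_size=100))
# With a huge population the correction has almost no effect.
print(adjusted_sample_size(0.20, 0.95, population_size=1_000_000))
```

For the made-up population of 100 this comes out to 20 files, compared with 25 when the population is effectively unlimited. Stats Stack Exchange is still the better place to ask about an exact treatment.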