
Best practices when it comes to importing CSV files


I often find myself in situations where I, as a web developer, have to link two different applications together. In many cases, this link is created by simply exchanging CSV documents with a third party. Now when it comes to importing and exporting in PHP (or any other language, for that matter), it's important that you have a protocol that is as rock-solid as possible. In this article I'll try to explain how I handle these kinds of cases, based on my experience writing numerous import and export scripts.

Step 1: Communicate with your client

A very bad thing to do in a situation like this is to immediately start programming without consulting your client or the third party that works on the other end of the connection you're trying to establish. Even if they tell you to "just go ahead" and "we'll match our system to whatever you come up with", don't do it! You're gonna have a bad time!
Instead, talk to the people on the other end of the table. What do they do? What's so important about the system you're going to synchronize with? How does it fit into the bigger picture? The more you know, the more proactive you can be with your work, and the better your solution will be.
The most important part of this communication is to get a definition of the format in which you are going to build the connection:

  • If it's a CSV file, how many columns does it have and what are their properties?
  • If it's an XML document, how is it structured? (This can be used to create a DTD.)
  • How often is the synchronization going to run?
  • What are the names / descriptions of the various fields?
  • What is the type of each field? What values can it have? Is there a maximum length?
  • What format and encoding are going to be used?
  • Is there a maximum file size? What file size can I expect?
  • Etc.

The most important part of this step is to document it all. You'll have a document to fall back on when new demands arrive or when the client thinks you've missed something. It's also a great aid for future developers who will continue the work you've already done, or for the third party if they also have to create an implementation on their end.

Step 2: Validate the data

One important aspect of every importer and exporter is validating the data you're working with. Especially when importing an external feed, you can't blindly assume the people on the other end know what they're doing. Some examples of what can happen (and what I've seen occasionally):

  • Invalid CSV data (unescaped double quotes in text, for example).
  • Columns that are misspelled, shuffled, missing, or added.
  • Required fields that are left empty.
  • Invalid values for some fields (‘1’ or ‘true’ instead of ‘yes’, for example).
  • The wrong document encoding, breaking some of the characters.

As you can see, it's very important to validate the data you have to work with. After all, you are the one creating the connection, so if anything goes wrong because a third party sends out wrong information, you will be the first one held accountable.
Another thing to consider is how to deal with invalid results. If a row is invalid, should we just skip it? Must we stop the entire import? Perhaps we should validate every row before we even start importing the first one? The answers to these questions depend on the kind of import you are doing and your client's requirements.

Examples

In these examples we assume we are creating an importer in PHP that imports CSV files into our application.

Header validation

The most basic validation we can do before importing is checking whether the headers of the CSV file match the headers we expect. This functionality could look something like this:
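A minimal sketch of such a check; the semicolon delimiter and the column names are assumptions for this example and should follow whatever you agreed on in step 1:

```php
<?php
// Compare the first row of the CSV against the agreed-upon column names.
function validateHeaders(string $csvFile, array $expectedHeaders): bool
{
    $handle = fopen($csvFile, 'r');
    if ($handle === false) {
        throw new RuntimeException("Could not open {$csvFile}");
    }
    $headers = fgetcsv($handle, 0, ';'); // delimiter as agreed in step 1
    fclose($handle);

    return $headers === $expectedHeaders;
}

// Usage with a dummy feed (hypothetical column names):
file_put_contents('feed.csv', "sku;name;price;stock\n8719;Blue widget;9.99;12\n");
var_dump(validateHeaders('feed.csv', ['sku', 'name', 'price', 'stock'])); // bool(true)
```

If the headers don't match, the safest move is usually to abort the whole import and log the mismatch, since shuffled columns would silently corrupt every row.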

Required fields

Another common situation is that you have required fields in your CSV that must be populated. A validation like this can simply be done like so:
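One way to sketch this, assuming each row has already been mapped to an associative array of header => value (the field names are hypothetical):

```php
<?php
// Return a list of error messages for every required field that is empty.
function validateRequiredFields(array $row, array $requiredFields): array
{
    $errors = [];
    foreach ($requiredFields as $field) {
        if (!isset($row[$field]) || trim((string) $row[$field]) === '') {
            $errors[] = "Required field '{$field}' is empty.";
        }
    }
    return $errors;
}

$row = ['sku' => '8719', 'name' => '', 'price' => '9.99'];
print_r(validateRequiredFields($row, ['sku', 'name', 'price']));
// Reports that 'name' is empty.
```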

More specific validations

Depending on your CSV and your application, different validation rules may apply. Fields might be numeric, floating point, yes/no, have a maximum length, etc. For these cases you can write your own validators:
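For instance, a set of per-field validators expressed as callables; the rule set below is entirely hypothetical and should mirror the specification from step 1:

```php
<?php
// Map of field name => validation callable returning true when valid.
$validators = [
    'price'  => fn($v) => is_numeric($v) && (float) $v >= 0,
    'stock'  => fn($v) => ctype_digit((string) $v),
    'active' => fn($v) => in_array($v, ['yes', 'no'], true),
    'name'   => fn($v) => mb_strlen($v) <= 255,
];

// Run every applicable validator against a row and collect the failures.
function validateRow(array $row, array $validators): array
{
    $errors = [];
    foreach ($validators as $field => $isValid) {
        if (array_key_exists($field, $row) && !$isValid($row[$field])) {
            $errors[] = "Field '{$field}' has an invalid value: '{$row[$field]}'";
        }
    }
    return $errors;
}

$row = ['name' => 'Blue widget', 'price' => '9.99', 'stock' => '12', 'active' => 'true'];
print_r(validateRow($row, $validators));
// 'active' fails: 'true' is not in the yes/no whitelist.
```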

Step 3: Run the import

After you've validated the data, you are ready to import it into your application. The nice thing about validating first is that you don't have to validate again. That means your import methods don't have to worry about whether data is present or not; your validation already took care of that. And less code means fewer bugs (right?)

Import or update?

One mistake that is often made when creating an importer (especially an ad-hoc one) is that it's just that: it imports new data. In most cases, however, you need to be aware that your importer must also run as an updater. Even more so: my experience is that you should always create your importer as an updater.
Even if the job is described as "we only have to run the import once" or "they're always new entries, never existing ones", you can be sure that sooner or later your client will send an e-mail with the new ‘latest’ feed. And you don't want to manually delete old entries before each import, now do you?
When it comes to the choice "import or update?", it's important to have a unique key for each row in your product feed, so you can check whether your application already knows the entry. Needless to say, you need to store this key in your application to provide this check for your importer (even if it's not used by your application at all):
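A sketch of that decision, using the feed's key to pick between INSERT and UPDATE; the table name, column names and `sku` key are assumptions for this example:

```php
<?php
// Insert a new row, or update the existing one matched by the feed's key.
function importRow(PDO $db, array $row): void
{
    $stmt = $db->prepare('SELECT id FROM products WHERE external_sku = ?');
    $stmt->execute([$row['sku']]);
    $id = $stmt->fetchColumn();

    if ($id !== false) {
        // Known entry: update it instead of creating a duplicate.
        $db->prepare('UPDATE products SET name = ?, price = ? WHERE id = ?')
           ->execute([$row['name'], $row['price'], $id]);
    } else {
        // New entry: insert it, storing the feed's key for future runs.
        $db->prepare('INSERT INTO products (external_sku, name, price) VALUES (?, ?, ?)')
           ->execute([$row['sku'], $row['name'], $row['price']]);
    }
}

// Demo with an in-memory SQLite database:
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE products (id INTEGER PRIMARY KEY, external_sku TEXT, name TEXT, price REAL)');
importRow($db, ['sku' => '8719', 'name' => 'Blue widget', 'price' => '9.99']);
importRow($db, ['sku' => '8719', 'name' => 'Blue widget v2', 'price' => '10.99']); // updates, no duplicate
```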

It's a different case when it's a system requirement that an entry in your application must not be overwritten (or at least some of its fields). But that depends on the client's requirements and the kind of job you are trying to accomplish.

Step 4: Send feedback

Another equally important aspect of importing and exporting is sending feedback. Some possible outcomes for your solution can be:

  • Something went horribly wrong: the system couldn’t import or export anything!
  • Not everything passed the validation process.
  • All data is valid, but the application you're importing into is not cooperating.
  • Everything went a-ok.

No matter the outcome, you must always log what your solution is doing. And if necessary: report it to your client and/or yourself.

Logging methods

A basic logging class could look like this:
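A hypothetical minimal version: it echoes each message directly (handy for CLI importers) and keeps everything in an array so the destructor can flush it all to a file at the end of the run:

```php
<?php
class ImportLogger
{
    private array $messages = [];

    public function __construct(private string $logFile) {}

    // Timestamp the message, remember it, and show it immediately.
    public function log(string $message): void
    {
        $line = date('[Y-m-d H:i:s] ') . $message;
        $this->messages[] = $line;
        echo $line . PHP_EOL;
    }

    // When the script finishes, write the complete log to disk.
    public function __destruct()
    {
        file_put_contents($this->logFile, implode(PHP_EOL, $this->messages) . PHP_EOL);
        // This is also a natural place to send a notification e-mail.
    }
}

$logger = new ImportLogger('import.log');
$logger->log('Import started.');
$logger->log('Row 42 skipped: invalid price.');
```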

This class can be used to save a message to the log and display it directly, which is ideal when the importer is run from the command line (I always write my importers as CLI tools).
The benefit of keeping all logs in an array as well is that you can use the destructor of the class to write all log data to a file. The destructor can also be used to send a notification e-mail when everything is done.

Notifications

Now in what scenarios do you want to be notified? I can imagine that if the importer didn't finish, you at least want to know why. Even when the script crashes completely you want to be notified. Logging everything and wrapping your methods in try/catch statements can be a great aid here:
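A sketch of such a guard; the import callable is a hypothetical stand-in for your own entry point, and the commented-out mail() call is only a placeholder for whatever notification channel you use:

```php
<?php
// Run the importer inside a guard; return null on success, or the
// error description when it crashed.
function runGuarded(callable $import): ?string
{
    try {
        $import();
        return null; // success, nothing to report
    } catch (Throwable $e) {
        $message = sprintf(
            'Import crashed: %s in %s on line %d',
            $e->getMessage(), $e->getFile(), $e->getLine()
        );
        error_log($message);
        // mail('you@example.com', 'Import failed', $message); // optional notification
        return $message;
    }
}

// A crashing import is reported instead of silently dying:
$result = runGuarded(function () {
    throw new RuntimeException('Could not open feed.csv');
});
echo $result . PHP_EOL;
```

Note that this only catches exceptions thrown inside the guarded callable; a hard fatal error (out of memory, for example) still needs monitoring outside the script itself.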

Also think about the situations in which you want to receive an e-mail notification from your solution. Of course you want this on errors and/or warnings, but you might also want to receive e-mails during the first weeks (or months) after the solution is installed or updated, to make sure it works as intended.

Always log

It’s recommended to log all information that might be helpful in scenarios where you have to debug or fix your importer. Things like:

  • Validation messages (on validation failures).
  • Row numbers of records that you are trying to import.
  • Data that fails validation.
  • Other parameters, like memory usage or duration.

Stuffing your importer with log messages might be helpful, but don't overdo it: try to log only the information that matters. Otherwise your log fills up with useless data that only makes tracking down a potential problem more cumbersome:
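A hypothetical contrast between the two approaches; validate() here is a stand-in that returns an array of error messages for a row:

```php
<?php
function validate(array $row): array
{
    return trim($row['sku'] ?? '') === '' ? ["Row is missing its 'sku'."] : [];
}

$rows = [['sku' => '8719'], ['sku' => ''], ['sku' => '8720']];

// Bad example: logs something for every row, valid or not.
foreach ($rows as $i => $row) {
    echo "Validating row {$i}...\n";
    $errors = validate($row);
    echo $errors === []
        ? "Row {$i} is valid.\n"
        : "Row {$i} is invalid: " . implode(' ', $errors) . "\n";
}

// Better example: only logs when there is something worth logging.
foreach ($rows as $i => $row) {
    foreach (validate($row) as $error) {
        echo "Row {$i}: {$error}\n";
    }
}
```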

In the above example, the ‘bad example’ outputs all kinds of data that's of no great aid for your log. Imagine what your log will look like when 9,999 records are valid but only 1 has failed. Exactly! The ‘better example’ only logs when there's something worth logging, making tracking down problems much easier.

Step 5: Test it!

Last but not least: make sure your code works as intended. Test it thoroughly and make use of tools like PHPUnit to write extensive tests for your solution. You can do this locally by generating dummy CSV feeds and a test database, for example, but that's beyond the scope of this article.
It is important that you not only test the import or export process itself, but also the validation, the logging and the notifications. For all you know, the importing process works as expected, but the notification e-mail is never sent. The result might then be a solution that isn't working correctly without anybody knowing. Well… usually it's the client who will notice that the stock status of his webshop isn't updating correctly and that new products aren't shown in the shop. Anyway… test it!

In conclusion

Writing an importer or exporter for an application is a task we all have to do at some point in our careers as web developers. I myself have written various import and export solutions for applications like Magento, WordPress and various other CMSes.
Although a task like this might look a bit difficult at first, if you take the right approach and write your importer with this advice in mind, you'll not only have a solid importer, you'll also sleep a lot better at night. And in the end, isn't that worth coding for?

One thought on “Best practices when it comes to importing CSV files”

  1. Thomas Cheng says:

    Interesting article. Just a question by curiosity. I see that you save all the log data on destructor (and possibly sending a notification email after everything is done there). You mentioned that when the script totally crashes, you want to be notified. But from the implementation, I believe the log won’t be saved and notifications won’t be sent either when the script totally crashes. Or have I misunderstood something?
