Filter bad input characters using whitelists in PHP

When working with input data using PHP and storing it using MySQL, it can be a bit of a nightmare to filter out bad characters. Any time I’ve had to work with a large dataset full of customer data or product data, it was inevitable that there would be invalid characters, either because it wasn’t supported by the charset of the database, or because whatever system created the original datafile file didn’t support the characters, leading to garbage text. When dealing with customer data it was important not to lose any information during the importing process. This usually involved a conversion table of one kind or another, converting the invalid characters to friendly equivalents so they could be viewed correctly. Especially important, for example, when dealing with customer addresses. Even if it was possible to store the original characters, many third-party shipping services didn’t support them anyway.

So for customer data this was fine, but where this manual approach wasn’t very practical was when creating a large retail product database, which then also had to be uploaded to a third-party platform to list the products in other e-commerce stores. The product data was obtained from a variety of difference sources, which further increased the likelihood of compatibility issues. Time and time again, there would be an unusual character somewhere in the data that would break the entire product description when it was uploaded to the third-party platform. I traced the problem down to the third-party’s MySQL query, which wasn’t robust enough to handle unsupported characters. It broke the query right at that point the invalid character was found, and the rest of the data wasn’t uploaded. Because this data also included a html template for the product description, it led to some very broken product listings. It wasn’t possible to fix this problem at the source, so I had to look at ways of sanitising my data.

Given the amount of products that were being uploaded, it simply wasn’t practical to manually correct each one, especially when there was no real way of knowing which product listings were correct and which were broken due to invalid characters, and in almost every case it was just one stray invalid character that didn’t have any impact on the product description. The conversion table option was also out, as this would have taken just as long. Given there were time constraints, this gave me two options; a blacklist of invalid characters, or a whitelist of supported characters. A blacklist seemed just as much work, as it would still require finding all of the invalid characters, and also dealing with the issue of finding them correctly in PHP, which is rarely straight forward when dealing with different charsets. This can involve needing to find the hex code of a lot of invalid characters in order to replace them. This left me with the option of using a whitelist; a list of characters I knew would not break the third-party’s MySQL query, and would result in almost negligible data loss in the product listings.

This is what I came up with;

This function takes in a whitelist string and an input string. Anything not in the whitelist is removed, and then the string is returned by the function with only the characters from the whitelist remaining. For ease of use with my script, I stored my whitelist string in an external .txt file so it could be edited when needed. This was the whitelist I used;

    ! “#$%&'()*+,-./0123456789:<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Which was sufficient in removing the stray unsupported characters from the imported product listings without having any major impact on the quality of the data. While not a perfect solution, it is a good compromise, and additional characters can be added or removed from the whitelist to refine it over time.

Leave a Reply

Your email address will not be published. Required fields are marked *