An FTP wrapper class for file handling in PHP

Some time ago, I had to develop a system for handling a large number of CSV files that were automatically being uploaded to an FTP location. As it was an automated system, I needed a class flexible enough to handle the common FTP functions I would be making use of. It also contains a couple of additional methods that were more specific to my needs. The latest_file method takes the latest file from the current directory, based on the date each file was last modified. This was essential at the time, as the directory was regularly being filled with new, irregularly named CSV files, so I had to ensure I was only taking the latest one.

The class is also able to make several connection attempts to an FTP server, which you can configure using the set_attempts method, and the set_delay method, which sets the delay between attempts in seconds.

Here’s a basic usage example;
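As a hypothetical sketch (assuming a wrapper class named FTPClient exposing connect, get and close alongside the latest_file, set_attempts and set_delay methods described above), usage might look something like this:

```php
<?php
// Hypothetical usage sketch; the class and method names other than
// latest_file, set_attempts and set_delay are assumptions.
$ftp = new FTPClient();
$ftp->set_attempts(5);   // try to connect up to 5 times
$ftp->set_delay(10);     // wait 10 seconds between attempts

if ($ftp->connect('ftp.example.com', 'user', 'password')) {
    $file = $ftp->latest_file();          // most recently modified file
    if ($file !== false) {
        $ftp->get($file, '/local/path/' . $file);
    }
    $ftp->close();
}
```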

Here is the entire class;
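A minimal sketch of what such a wrapper might look like, built on PHP's ftp_* functions. Only latest_file, set_attempts and set_delay are named above, so the class name and the remaining method names are illustrative assumptions:

```php
<?php
// Sketch of an FTP wrapper class built on PHP's ftp_* extension.
class FTPClient
{
    private $connection = false;
    private $attempts = 3;   // connection attempts before giving up
    private $delay = 5;      // seconds between attempts

    public function set_attempts($attempts) { $this->attempts = (int)$attempts; }
    public function set_delay($delay)       { $this->delay = (int)$delay; }

    // Try to connect and log in, retrying up to $this->attempts times.
    public function connect($host, $user, $password, $port = 21)
    {
        for ($i = 0; $i < $this->attempts; $i++) {
            $this->connection = @ftp_connect($host, $port);
            if ($this->connection !== false
                && @ftp_login($this->connection, $user, $password)) {
                ftp_pasv($this->connection, true);
                return true;
            }
            sleep($this->delay);
        }
        return false;
    }

    // Return the name of the most recently modified file in the
    // current directory, or false on failure.
    public function latest_file()
    {
        $files = ftp_nlist($this->connection, '.');
        if ($files === false) {
            return false;
        }
        $latest = false;
        $latest_time = -1;
        foreach ($files as $file) {
            $time = ftp_mdtm($this->connection, $file);
            if ($time > $latest_time) {
                $latest_time = $time;
                $latest = $file;
            }
        }
        return $latest;
    }

    public function get($remote, $local)
    {
        return ftp_get($this->connection, $local, $remote, FTP_BINARY);
    }

    public function put($local, $remote)
    {
        return ftp_put($this->connection, $remote, $local, FTP_BINARY);
    }

    public function close()
    {
        if ($this->connection) {
            ftp_close($this->connection);
        }
    }
}
```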

Filter bad input characters using whitelists in PHP

When working with input data in PHP and storing it in MySQL, it can be a bit of a nightmare to filter out bad characters. Any time I’ve had to work with a large dataset full of customer data or product data, it was inevitable that there would be invalid characters, either because they weren’t supported by the charset of the database, or because whatever system created the original data file didn’t support them, leading to garbage text. When dealing with customer data, it was important not to lose any information during the importing process. This usually involved a conversion table of one kind or another, converting the invalid characters to friendly equivalents so they could be viewed correctly. This was especially important when dealing with, for example, customer addresses. Even if it was possible to store the original characters, many third-party shipping services didn’t support them anyway.

So for customer data this was fine, but where this manual approach wasn’t very practical was when creating a large retail product database, which then also had to be uploaded to a third-party platform to list the products in other e-commerce stores. The product data was obtained from a variety of different sources, which further increased the likelihood of compatibility issues. Time and time again, there would be an unusual character somewhere in the data that would break the entire product description when it was uploaded to the third-party platform. I traced the problem down to the third-party’s MySQL query, which wasn’t robust enough to handle unsupported characters. It broke the query at the point where the invalid character was found, and the rest of the data wasn’t uploaded. Because this data also included an HTML template for the product description, it led to some very broken product listings. It wasn’t possible to fix this problem at the source, so I had to look at ways of sanitising my data.

Given the number of products that were being uploaded, it simply wasn’t practical to manually correct each one, especially when there was no real way of knowing which product listings were correct and which were broken due to invalid characters, and in almost every case it was just one stray invalid character that didn’t have any impact on the product description. The conversion table option was also out, as this would have taken just as long. Given the time constraints, this left me two options: a blacklist of invalid characters, or a whitelist of supported characters. A blacklist seemed just as much work, as it would still require finding all of the invalid characters, and matching them correctly in PHP is rarely straightforward when dealing with different charsets; it can involve hunting down the hex codes of a lot of invalid characters in order to replace them. This left me with the option of using a whitelist; a list of characters I knew would not break the third-party’s MySQL query, and would result in almost negligible data loss in the product listings.

This is what I came up with;
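A sketch of such a filter (the function name filter_whitelist is an assumption; the logic follows the description above: walk the input string and keep only the characters that appear in the whitelist):

```php
<?php
// Keep only the characters of $input that appear in $whitelist.
function filter_whitelist($whitelist, $input)
{
    $output = '';
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        // strpos returns false when the character is not in the whitelist.
        if (strpos($whitelist, $input[$i]) !== false) {
            $output .= $input[$i];
        }
    }
    return $output;
}

// Example: load the whitelist from an external .txt file, as described.
// $whitelist = file_get_contents('whitelist.txt');
// $clean = filter_whitelist($whitelist, $raw_description);
```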

This function takes in a whitelist string and an input string. Anything not in the whitelist is removed, and then the string is returned by the function with only the characters from the whitelist remaining. For ease of use with my script, I stored my whitelist string in an external .txt file so it could be edited when needed. This was the whitelist I used;

    ! "#$%&'()*+,-./0123456789:<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

This was sufficient to remove the stray unsupported characters from the imported product listings without having any major impact on the quality of the data. While not a perfect solution, it is a good compromise, and characters can be added to or removed from the whitelist to refine it over time.

Obtain the first or last word inside a string in C++

If you do a lot of work with strings in C++, one thing you’ll probably find yourself doing is writing utility functions. The std::string class provides many useful built-in methods, but when it comes to searching, replacing and text manipulation, you may find that you’ll need to create your own functions to carry out these tasks, sometimes combining the built-in class methods to carry out more complex tasks within your projects.

The two following functions can be used to obtain the first and last words contained within a string. An empty string is returned if the function fails. The functions make use of the built-in std::string methods find_first_of and find_last_of;

Having a decent set of your own utility functions can save a lot of time, and will make your code a lot more manageable. A good strategy for code re-usability is to break a bigger task up into smaller functions that you can see yourself re-using in the future, saving you time and making your code easier to understand.

This is akin to your toolbox, and if you find yourself needing the same tool over and over again, then that’s when it’s a good idea to write your own utility function or use someone else’s. There’s always that temptation to write everything for yourself, but one of the key aspects of code re-usability is to not re-invent the wheel.

Writing your own utility functions does have some benefits, however. Firstly, you’ll understand the code very well and learn new techniques, and secondly, it can improve your overall skills as a programmer, especially in problem solving. So yes, there are times when it is better to write your own, and other times when using someone else’s solution is the best approach. You may even decide to compromise by checking other people’s solutions to help you write your own. There’s no right or wrong way to approach it. Take the approach that suits you or the situation best.

Digital Cityscape – Fractal Art

Created using Ultra Fractal 5.04. You can see a higher quality render of the artwork here or by clicking the image above.

Contains 6 individual layers in total, utilising one fractal formula and three different colouring algorithms. Might experiment with this particular fractal formula again in future, as it has some interesting properties for geometric fractals.

Prime numbers, a simple function in C++

A prime number is simply a number that is only divisible by one and itself. One of the most basic ways to find out if a number is prime, then, is to loop through every number from two up to the number minus one (we skip 1, as all positive integers are divisible by 1), and conclude that the number is not prime if it is evenly divisible by any of them.

This is a good example use of the modulus operator. The modulus operation returns the remainder of a division of two numbers. So, for example, 15 % 5 would equal 0, as 5 goes into 15 exactly 3 times with no remainder. 17 % 5, on the other hand, would equal 2, as 5 goes into 17 three times with 2 left over.

How it works:

Knowing this, and knowing what makes a prime number, we can check if a number is prime by checking every number between two and itself minus one, and if any of these numbers gives us a modulus result of 0, then it is exactly divisible by another number and therefore not prime. If no modulus result of 0 is found by the time i is equal to x-1, then we know x to be prime. As each integer is always divisible by itself, we have no need to check that case. If we did want to check it, our loop conditional statement would be i <= x;

This type of function is known as a predicate. A predicate is simply a function that returns true or false. In this instance, given an input of x, it will return true if the number is prime, and false if it is not.

If you wanted to implement this function and check every number beyond 1 to see if it was prime, you could do it like so;

This will iterate indefinitely until the maximum value that can be held in an unsigned long int on your machine is reached. At that point, it will wrap around back to 0. You can use a larger data type, such as long long int, if your compiler supports it. But as you'll see, the main drawback of this function is that the larger the number gets, the more numbers it needs to check itself against to determine whether or not it is prime, so you may never reach that limit, as each successive number in the infinite while loop takes progressively longer to check. The speed at which the program runs will depend on the processing power of the machine you are running it on.

If you needed to quickly check if a number is prime without needing to calculate it, then you could store the results in a text file, and check to see if that number exists in the file. If you store the results numerically, then you can stop searching once you reach the next prime number in the file beyond the one you're searching for. This is a very simple example of a lookup table, which saves computation time by storing the results of an operation rather than having to re-calculate them each time. This makes sense when you're performing a large number of checks of a particular calculation where the result is always going to be the same, and especially if you're going to be performing the same calculations many thousands of times over, and performance is a factor.

There are much more efficient techniques out there for finding prime numbers too, known as prime number sieves. One such algorithm is known as the Sieve of Eratosthenes, and I may follow up this post in future with a discussion on that.