Linear Regression In PHP (part 2)

In the last post we had a simple stepping algorithm, and a gradient descent implementation, for fitting a line to a set of points with one variable and one ‘outcome’. As I mentioned though, it’s fairly straightforward to extend that to multiple variables, and even to curves, rather than just straight lines.

Read More

Benfords Law

Benfords Law is not an exciting new John Nettles based detective show, but an interesting observation about the distribution of the first digit in sets of numbers originating from various processes. It says, roughly, that in a big collection of data you should expect to see a number starting with 1 about 30% of the time, but starting with 9 only about 5% of the time. Precisely, the proportion for a given digit can be worked out as:

Read More

Monte Carlo Simulations

Monte Carlo simulations are a handy tool for looking at situations that have some aspect of uncertainty, by modelling them with a pseudo-random element and conducting a large number of trials. There isn’t a hard and fast Monte Carlo algorithm, but the process generally goes: start with a situation you wish to model, write a program to describe it that includes a random input, run that program many times, and look at the results.

Read More

Bayesian Opinion Mining

The web is a great place for people to express their opinions, on just about any subject. Even the professionally opinionated, like movie reviewers, have blogs where the public can comment and respond with what they think, and there are a number of sites that deal in nothing more than this. The ability to automatically extract people’s opinions from all this raw text can be a very powerful one, and it’s a well studied area - no doubt because of the commercial possibilities.

Read More

PageRank In PHP

Google was a better search engine than it’s predecessors for a number of reasons, but probably the most well known one is PageRank, the algorithm for measuring the importance of a page based on what links to it. Though not necessarily that useful on its own, this kind of link analysis can be very helpful as part of a general information retrieval system, or when looking at any kind of network, such as a friend graph from a social network.

Read More

Text Generation

After a rather technical post last week, something a bit lighter. Text and language generation is a fun topic with applications that run from randomly generating scientific papers for conferences, to the practical tasks of generating speech and automated responses. In this post we’ll look at how we can generate some nonsense text based on existing documents, which isn’t on the overly practical side, though it can make a fun change from Lorem Ipsum for holding copy. The code is throughout, but you can also grab the lot in a zip.

Read More

Support Vector Machines In PHP

When it comes to classification, and machine learning in general, at the head of the pack there’s often a Support Vector Machine based method. In this post we’ll look at what SVMs do and how they work, and as usual there’s a some example code. However, even a simple PHP only SVM implementation is a little bit long, so this time the complete source is available separately in a zip file.

Read More

Part Of Speech Tagging

Until now, all the posts here have looked at text in a purely statistical way. What the words actually were was less important than how common they were, and whether they occurred in a query or a category. There are plenty of applications, however, where a deeper parsing of the text could be huge beneficial, and the first step in such parsing is often part of speech tagging.

Read More