Information Overload!: Databases and SQL

Monday's lecture was about databases, in particular looking at SQL (pronounced either as the letters, or as 'sequel' depending on how you feel about it). We had an introduction to databases to bring everybody up to speed - I was surprised how much I remembered from GCSE IT. My teacher would have been proud. We discussed when a database is a good idea, and when it's not such a good idea. A rule of thumb seems to be that it's a good plan when it's your data as opposed to someone else's. For that it might be a better idea to think about information retrieval (that's next week's lecture!).

It is a good idea to have a centrally managed and designed database. Problems of the past still haunt our offices today - it's now so easy to make a database a lot of people still use local ones rather than having a nicely linked up central database. This causes all manner of problems due to inaccurate data (not to mention legal issues like Data Protection!). Instead, organisations should have their data stored in one place, with each department accessing only the data that is relevant to them. The Database Management System (DBMS) controls access to the data by various departments / people.

So to be clear, the DBMS is the software which lets the users get at the data, whereas a database, to quote my course handout, is an 'integrated collection of data shareable between users and application systems'. So it's the stuff behind the scenes rather than the shiny interface the users get to see. Using a DBMS is advantageous as it assures information integrity (insofar as it gets updated everywhere when it's updated once). It also removes the reliance on a certain program as it is a collection of data which can be used by more than one application.

So what should it look like? Perhaps the way to explain this best is to say what you don't want. You don't want duplication within a table, and the best way to make sure this doesn't happen is to look at relationships and break up your data into different tables. I'll give you an example. Let's say we have some form of music database, with song titles, album names, artists and release date. So you'd have a table that looks like this:

Song_Title	Album_Name	Artist	Release_Year	Genre
Don't Marry Her	Blue is the Colour	The Beautiful South	1996	Pop
Perfect 10	Quench	Beautiful South	1998	Pop
Be Here to Love Me	Feels Like Home	Norah Jones	2004	Chill Out
Sunrise	Feels Like Home	Jones, Norah	2004	Chillout
Never Gonna Give You Up	Greatest Hits	Risk Astley	2002	80s

There's a few things to notice about this table. Firstly, did you notice how I put underscores in the titles? This is because it's generally not a good idea to put space in your titles - lots of programs don't like it. Just because Microsoft Access lets you do it, it doesn't mean it's a good idea! So use underscores, make it all one word, remember the user doesn't see this stuff.

This is probably a good time to mention primary keys. Did you notice that there isn't any one field that you could use to uniquely identify a song? Every table needs one of these, and the good news is most DBMSs can create these for you. Primary keys are important when it comes to making separate tables, which I will come to in a minute!

So back to the music, as it were. There are lots of duplications, for example the album name, the artist, the release date and the genre! This can be fixed by dividing the data up into the following tables:

Did you also notice the errors? The Beautiful South perform two of these songs but the band name is different in two rows. This is why having duplication in one table can be such a bad idea, and this is why we need more than one table.

Song

Song_ID	Song_Title	Album_ID
1	Don't Marry Her	1
2	Perfect 10	2
3	Be Here to Love Me	3
4	Sunrise	3
5	Never Gonna Give You Up	4

Album

Album_ID	Album_Name	Artist_ID	Release_Year
1	Blue is the Colour	1	1996
2	Quench	1	1998
3	Feels Like Home	2	2004
4	Greatest Hits	4	2002
5	Nobody's Prefect	5	2001

Artist

Artist_ID	Artist_Name	Genre_ID
1	The Beautiful South	1
2	Norah Jones	3
3	Jimmy Eat World	2
4	Rick Astley	5
5	The Cheapskates	4

Genre

Genre_ID	Genre
1	Pop
2	Rock
3	Chill Out
4	Ska
5	80s

This removes the duplication issue. Each table has a primary key, or unique identifier. In some of the tables, a primary key from a different table is in a column to match tables together. For example, in the table 'Artist' there is a column for 'Genre_ID'. When an identifier is in another table, it is known as a foreign key.

Now to look at some SQL and query our database! This is the most important bit. When searching you want to get out helpful data, and that means getting what you want and not getting what you don't want. It sounds obvious, but people often sift through lots of rubbish. Think about searching Google - it's a great tool but if you don't put in a sensible search you can get a lot of irrelevant results. If you don't believe me, try searching 'tablet' - you want to see what you could get instead of an iPad, but you end up with lots of things you didn't want, don't you?

Anyway, before I get sidetracked and start thinking about iPads, let's get back to SQL. To search, the basic syntax you need looks like this:

SELECT fields
FROM tables;

So if we were searching the database above for an album and its release date, we'd need:

SELECT Album_Name, Release_Year
FROM Album;

Which would give you:

Album_Name	Release_Year
Blue is the Colour	1996
Quench	1998
Feels Like Home	2004
Greatest Hits	2002
Nobody's Prefect	2001

When using select, there is no limit to the number of fields you can put in there. However, if you want everything then you can use an asterisk:

SELECT *
FROM Artist;

Which would give you:

Artist_ID	Artist_Name	Genre_ID
1	The Beautiful South	1
2	Norah Jones	3
3	Jimmy Eat World	2
4	Rick Astley	5
5	The Cheapskates	4

This is good if you want to see everything, but realistically an end user doesn't want to see the foreign keys in this table, they want the data, and this is where you need to start making your queries a bit more interesting. It isn't as obvious as it first looks, which can cause people some problems. If you wanted the genre of all the albums, you might this of tying in:

SELECT Artist_Name, Genre,
FROM Artist, Genre;

It seems logical, but this would actually give you a table with everything in it - every artist next to every genre. This is because computers aren't smart enough to know that you actually want the relevant Genre. It just pulls everything out of the database and dumps it in front of you. Instead, you need to tell it what you want by typing in this:

SELECT Artist_Name, Genre,
FROM Artist, Genre
WHERE Artist.Genre_ID = Genre.Genre_ID;

This will bring back a list where the Genre ID matches on both tables, and you get your answer. There's some new syntax in this code. There is a comparison operator '=' in the WHERE clause. There are a few of these, including (but not limited to)

= equal to
< less than
> greater than
<= less than or equal to
>= greater than or equal to
!= not equal to

The important thing to know about using the WHERE clause is that numbers do not require quotation marks, but any other characters do. So if you are looking for a date you would only type in:

WHERE Release_Year = 1998

If you are looking for text then you need quotation marks:

WHERE Song_Title = 'Sunrise'

Another thing to be aware of when using the WHERE clause is a wildcard search. This means you can truncate a word, so if you know that 'Nobody's Prefect' is in there somewhere, but you aren't sure if the database uses the apostrophe or not, you could search for this:

WHERE Album_Name = 'Nobody%'

This brings you everything which has Nobody at the beginning. You can use the % at the beginning and end if you like, so for song titles with the letter P in them you could search for '%p%' (although I'm not sure why you would need to!).

Finally, you can use WHERE more than once, for example you want songs from a certain artist from a certain year. You want to know if there is anything in the database by The Beautiful South after 1997. The syntax would be:

SELECT Artist_Name, Album_Name, Release_Year
FROM Album, Artist
WHERE Artist.Artist_ID = Album.Artist_ID
AND Artist_Name = 'The Beautiful South'
AND Release_Year >= 1997;

Did you notice how some of the fields say the table they are in and some don't? This is when the same field is in more than one table (remember foreign keys?) You need to define where you want the data from otherwise you will confuse the computer and it will probably tell you it can't do it.

So to conclude querying two tables, the syntax generally looks something like:

SELECT (whatever fields you want)
FROM tableA, tableB
WHERE tableA.primarykey = tableB.foreignkey;

If you forget to do this then you will probably get a lot more results than you had counted on, so be careful!

A final note on results; if you get results where the same thing appears on more than one row (this is possible even with neat data tables) then you can make sure it only appears once. For example, if you wanted to know which albums were released before 2000, you would get The Beautiful South twice, as both their albums were release before 2000. To prevent this from happening, you can add in 'distinct', which means it will only list it once, whether the results bring them back twice, or two thousand times! So the syntax would be:

SELECT DISTINCT Artist_Name
FROM Album, Artist
WHERE Artist.Artist_ID = Album.Artist_ID
AND Release_Year < 2000;

Apologies for the length of this post - it's all those tables! I have tried to present this clearly, but the best way to understand using SQL is to have a go. When doing this in the lab myself I got very cross about adding in commas - they only go in between items not at the end (so above you would not put FROM Album, Artist, because there is a comma at the end and the computer doesn't like it).

Information Overload!

Thursday, 13 October 2011

Databases and SQL

No comments:

Post a Comment