Porting my blog for the second time, walk the old data part 3

This is post #15 of my series about how I port this blog from Blogengine.NET 2.5 ASPX on a Windows Server 2003 to a Linux Ubuntu server, Apache2, MySQL and PHP. A so called LAMP. The introduction to this project can be found in this blog post /post/Porting-my-blog-for-the-second-time-Project-can-start.

In my previous post I started walking the directory and after 37 posts it crashed and I discovered I had forgotten about the built-in method you can use to insert images in BlogEngine.NET. The image URL of these images looks like this

≺code class="xml plain"≻≺≺/code≻≺code class="xml keyword"≻img≺/code≻ ≺code class="xml color1"≻src≺/code≻≺code class="xml plain"≻=≺/code≻≺code class="xml string"≻"/image.axd?picture=2011%2f9%2fJens+Blog+QR.png"≺/code≻ ≺code class="xml color1"≻alt≺/code≻≺code class="xml plain"≻=≺/code≻≺code class="xml string"≻""≺/code≻ ≺code class="xml plain"≻/≻≺/code≻

There were several issues with this. My routine detecting if a link is going to become a local link did not see this. I changed that into this:

# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-walk-the-old-data-part-3.aspx
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-images-part-4.aspx
$bIsLocalURL = $tag =~ /(malmgren.nl|blogspot.com|googleusercontent.com|/image.axd)/i;

The tag detection routine only looked at URLs starting with HTTP and now it had to look for links starting with /image.axd as well.

# Replace an URL with an URL place holder. Store the URL in dictURLToDatabaseID so that
# later on it is possible to search for URLs a second time.
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-walk-the-old-data-part-3.aspx
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-images-part-3.aspx
my $iDatabaseIDofThisURLentry = -1;
if ($tag =~ /(.*?)(https?://|/image.axd?)(.+?)["']/i)
{
	my $strParsedURL = $2.$3;
	$strTarget = uri_decode($strParsedURL);
	$strTarget =~ s//image/http://www.jens.malmgren.nl/image/;

In case I found this /image... links I had to make the link absolute so that I could fetch the image file with LWP::Simple. I did this with a search replace operation.

It worked! I could start walking again. Now my program could walk all the way to post 60 and downloading 174 images.

So now this is getting tedious. I know there is an error but I don't know exactly where the error is, how far the program has come in processing the URLs. So before I can find the error I need a rock solid method of calculating the offset into the content I am processing. For this I had to create a new variable $contentParsed storing the part of the content that we already parsed.

# URL and Image processing
# Here we parse the post content. Extract URLs and store these in the URL entity.
# Download images to be stored locally. Find and replace all URLs with URL place holders.
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-walk-the-old-data-part-3.aspx
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-images-part-1.aspx
my $iUrlCount = 0;
my $content = $dictStringFieldToValue{"content"};
my $contentParsed = "";
my $contentResult = "";
while ($content =~ /^(.*?)(≺a.+?≻|≺img.+?/≻)/msi)
{
	print "~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
";
	print "~-~-~-~-~-~ PROCESS URL ~-~-~-~-~-~-
";
	print "~-~-~-~-~-~-~- START -~-~-~-~-~-~-~-
";
	
	$iUrlCount++;
	$content = substr($content, length($1 . $2));
	$contentParsed .= $1;
	$contentResult .= $1; # Wait with saving the tag until after it has been replaced with a place holder.
	my $tag = $2;
	print "Tag at offset : " . length($contentParsed =~ s/
/
/r) . "
";
	$contentParsed .= $tag;
	$bIsAlreadyLoadedInAnotherPost = 0;

On Linux the default line separator is one linefeed but on windows it is carriage return and a line feed. If I calculate the offset based on what I am reading in Linux and then view the file on a windows machine then I don't get spot on the location. So I replace with so that the offsets are windows compatible. I do this with the non destructive substitution modifier /r.

So when I had done this it was really easy to find out I had yet another method for storing images on my previous blog, images I stored locally and most of these I stored in a subdirectory called Images.

So we already had the situation /image... covered and now we have /Image... So that is almost the same!

# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-walk-the-old-data-part-3.aspx
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-images-part-4.aspx
$bIsLocalURL = $tag =~ /(malmgren.nl|blogspot.com|googleusercontent.com|/image)/i;

With case insensitive search /i it is possible to find both cases.

# Replace an URL with an URL place holder. Store the URL in dictURLToDatabaseID so that
# later on it is possible to search for URLs a second time.
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-walk-the-old-data-part-3.aspx
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-images-part-3.aspx
my $iDatabaseIDofThisURLentry = -1;
if ($tag =~ /(.*?)(https?://|/image)(.+?)["']/i)
{
	my $strParsedURL = $2.$3;
	$strTarget = uri_decode($strParsedURL);
	$strTarget =~ s//([iI])mage/http://www.jens.malmgren.nl/$1mage/;

Here I can find both i and I with [iI] and then I save it in a parenthesis so that when I replace the target URL so that it becomes absolute then I insert the correct casing i.

So now I can start walk again...

Now I came to post 82 before the program choked.

~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
~-~-~-~-~-~ PROCESS URL ~-~-~-~-~-~-
~-~-~-~-~-~-~- START -~-~-~-~-~-~-~-
Tag at offset : 9610
Already present: https://lh5.googleusercontent.com/-2L5JYN1sxy8/TzgaXqHaeQI/AAAAAAAABjU/culLZDw1BVQ/s1600/DSC_2738.JPG|a
DBD::mysql::db do failed: Duplicate entry '82-225' for key 'PRIMARY' at /usr/local/bin/analyze.pl line 314.
DBD::mysql::db do failed: Duplicate entry '82-225' for key 'PRIMARY' at /usr/local/bin/analyze.pl line 314.

This was slightly harder problem. What MySQL is saying here is that for post ID 82 we already created a link to URL 225. So with other words if there is a post using the same URL twice in the same post then it is not possible to setup another link. This is logical. Had not thought about it before. I could solve this in a complex way or in an easy way. I go for the easy way. That is to "ignore" the problem. Found the way to do that with DBI in Perl.

# Ignore duplicates of links between a Post and URL because that indicated that the same URL is used twize in a post.
# http://www.jens.malmgren.nl/post/Porting-my-blog-for-the-second-time-walk-the-old-data-part-3.aspx
eval {
	$dbh-≻do(
		'INSERT INTO PostURL (URLID,                    PostID  )' .
		'VALUES 		     (?,  	                    ?       )' , undef,
							  $iDatabaseIDofThisURLentry,$postID);
	};
print "PostURL already available. PostID: " . $postID . ", URLID: " . $iDatabaseIDofThisURLentry . ". Message: $@
" if $@;

And the result looks like this:

~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
~-~-~-~-~-~ PROCESS URL ~-~-~-~-~-~-
~-~-~-~-~-~-~- START -~-~-~-~-~-~-~-
Tag at offset : 9610
Already present: https://lh5.googleusercontent.com/-2L5JYN1sxy8/TzgaXqHaeQI/AAAAAAAABjU/culLZDw1BVQ/s1600/DSC_2738.JPG|a
DBD::mysql::db do failed: Duplicate entry '1-1' for key 'PRIMARY' at /usr/local/bin/analyze.pl line 316.
PostURL already available. PostID: 1, URLID: 1. Message: DBD::mysql::db do failed: Duplicate entry '1-1' for key 'PRIMARY' at /usr/local/bin/analyze.pl line 316.

This is when I tried only loading this post so here the PostID was 1 and URL ID was 1 as well. Now I can walk again...

And how that walking ended and where... that is for the next time.