Since I purchased some web space, I have been thinking about how to get the most out of it. I contacted Dr. Shiming Fang about the possibility of mirroring his “New Thread” website, as I have been a frequent visitor for more than four years (maybe too frequently and for too long).
Anyway, Dr. Fang generously allowed me to mirror his site and gave me some instructions (written by Slashdot and Squirrel; I think Fang may not know much about those technical details either). Following the suggestions, I downloaded almost all the webpages from the New Thread using Teleport Pro (suggested by Squirrel). The New Thread is surprisingly big, with more than 400MB of material. Not without pain, I uploaded everything to my website. That was a full day’s job.
Then came the trouble. The web service I have now only includes FTP, PHP/Perl, a blog, and other common tools. They are preset and I have no right to change any settings. I can get ssh/shell access if I am willing to fax them a photo ID, but I would rather forsake that right.
However, both Slashdot’s and Squirrel’s instructions assume that you have complete access to the web server with administrator authority. That is, you are the person who set up the server. I guess if I had bought a dedicated server, I might have had these privileges. Otherwise, I am doomed.
Where there is a will, there is a way. Maybe I could find a mirroring program that automatically copies files from server to server. The best would be a script written in Perl or PHP, so that it could run on the server with one click. Yes, there is one, w3mir, written in Perl. However, you must have shell access to install w3mir, and it may be overkill for my website. In addition, after trying to install w3mir and writing some small Perl programs myself, I gave up. It seemed I had some trouble running Perl programs on my website.
But I knew I could run PHP on my website because my blog software is written in PHP. Furthermore, because I had spent some time tweaking my blog software (which greatly improved my blog), I decided to write something in PHP.
The first route I tried was to copy all files under the New Thread directory to my website. It would be great if this could be done from server to server. Yes, there was a way to do it if I had FTP access to the New Thread website. But requesting FTP access is asking a little too much. Mirroring a site should not impose a security burden on the original site. In addition, requesting dangerous read/write access to the full New Thread site would prove nothing but my incompetence. This was not the right way.
A better route emerged. I could retrieve the “what’s new” webpage, follow all the URL links in it, and download those files to my website. It is the same approach web crawlers/spiders use to download websites. It is straightforward.
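To make the crawling idea concrete, here is a minimal sketch of the link-collecting step. It is written in Python rather than PHP and is not the script used for this mirror. The page content and the `mirror-source.example` address are made up for illustration, and the names `LinkCollector` and `extract_same_site_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "a/1.txt" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_same_site_links(html, base_url):
    """Return the unique links in `html` that stay on the host of `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    unique = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique


# Hypothetical page content and address, for illustration only.
sample = (
    '<a href="a/1.txt">one</a> '
    '<a href="http://other.example/x">elsewhere</a> '
    '<a href="a/1.txt">duplicate</a>'
)
print(extract_same_site_links(sample, "http://mirror-source.example/new.html"))
```

Links to other hosts are dropped so that the mirror only follows pages that belong to the site being copied, and duplicates are removed so each page is fetched once.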
I set out to write the program on my own. Unfortunately, writing it all by myself was much harder than I thought. Pretty soon I was tired of struggling with little details such as quotes, parameters, and array indexing. But I have to admit PHP is better structured than Perl.
Modern human beings shouldn’t reinvent the wheel. After realizing this eternal truth, I went online to find some free scripts. There we go. There were a couple of snippets doing this sort of thing. I toyed with several of them for several hours and finally picked the best one. By running this little program, I obtained a list of webpages to download.
Then I needed to download and copy those files. I used fopen() to open a remote webpage (from the New Thread). I also tried fgets(), fwrite(), ftp_put(), etc. The program had only several lines, but it took me two days to figure out the problem: my website doesn’t allow me to open remote files. The option “allow_url_fopen” has been turned off on my website. That sucks.
Retreating to bed in the middle of the night, I decided to give up the whole mirroring idea. If I couldn’t download the remote files, there was no easy way to mirror the site. I didn’t want to copy individual files manually.
All of a sudden, I recalled that Perl can be run on a local PC like an ordinary programming language. I had used Perl when building an R package before. If Perl can do that, PHP can too. Then, if PHP is running on my local PC (no personal web server is allowed, for security reasons), maybe I can download webpages from the New Thread and upload them to my website. This idea struck me so hard that for almost an hour I could not sleep, frantically composing programs in my head.
This morning, I was glad to discover that PHP indeed works like Perl: there is a command-line tool. In no time, I installed PHP and checked php.ini to make sure “allow_url_fopen” was on. I tested my URL exploring program, and it worked perfectly. I ran the download program, and it worked again, impeccably. This is great!
A couple of days ago, I wrote an FTP upload program in PHP for uploading my images and text/html files to my website without opening an FTP client. It was my first PHP program. Very handy and useful. After a little cut and paste from that program, I assembled a bigger program that reads remote html files, parses them to obtain a URL list, logs on to my FTP server (the same as my website), and then reads the webpages in the URL list one by one and uploads them to my server. Everything was processed automatically. Wonderful!
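The download-then-upload loop described above can be sketched roughly as follows. This is an illustration in Python using the standard `ftplib` and `urllib` modules, not the actual mirrorxys.php code. The function names, and the choice to store a page at the same path it has on the source site, are assumptions made for the example.

```python
import io
from ftplib import FTP
from urllib.parse import urlparse
from urllib.request import urlopen


def remote_path_for(url):
    """Map a page URL to the path it should have under the FTP root."""
    path = urlparse(url).path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"  # assumed default name for directory URLs
    return path


def fetch_over_http(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def mirror_pages(urls, ftp, fetch=fetch_over_http):
    """Fetch each URL and store it through an already logged-in FTP session."""
    uploaded = []
    for url in urls:
        data = fetch(url)
        target = remote_path_for(url)
        # storbinary() reads from a file-like object, so wrap the bytes.
        ftp.storbinary("STOR " + target, io.BytesIO(data))
        uploaded.append(target)
    return uploaded


def run(host, user, password, urls):
    """Log on once, then mirror every URL over the same connection.

    host, user and password are placeholders for the mirror's FTP account.
    """
    with FTP(host) as ftp:
        ftp.login(user, password)
        return mirror_pages(urls, ftp)
```

Because everything runs on the local PC, the pages travel from the source site to the PC and then on to the mirror, which is exactly what gets around the hosting restriction on opening remote files. Note that `STOR` fails if the target directory does not exist on the server yet.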
Sure, there was more to do. A real mirroring tool should compare the modification times of the original and mirrored files to reduce the network burden. It should be able to create a new directory if it doesn’t exist on the mirror site. It should report its progress. It should have many error checks. I spent another several hours adding a couple of functions, tidying up the program, and enthusiastically mirroring the site again and again. It never failed after the first success.
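Two of these refinements, skipping unchanged files and creating missing directories, might look something like the sketch below. It is in Python and is only one possible approach, not a description of what mirrorxys.php does. It assumes the source server sends a `Last-Modified` header and that the time of the last upload is kept somewhere locally as a Unix timestamp; both helper names are hypothetical.

```python
from email.utils import parsedate_to_datetime
from ftplib import error_perm


def needs_update(source_last_modified, mirror_timestamp):
    """Return True when the source copy is newer than the mirrored one.

    source_last_modified: an HTTP date string such as
        "Wed, 21 Oct 2015 07:28:00 GMT" (assumed to come from the
        Last-Modified header of the source page).
    mirror_timestamp: Unix time of the last upload, or None if the file
        has never been mirrored.
    """
    if mirror_timestamp is None:
        return True
    source_time = parsedate_to_datetime(source_last_modified)
    return source_time.timestamp() > mirror_timestamp


def ensure_remote_dirs(ftp, remote_path):
    """Create each missing parent directory of remote_path, top level first."""
    current = ""
    for part in remote_path.split("/")[:-1]:
        if not part:
            continue
        current = part if not current else current + "/" + part
        try:
            ftp.mkd(current)
        except error_perm:
            # Servers typically refuse MKD when the directory already
            # exists. This also hides genuine permission errors, so a
            # fuller tool would want to tell the two cases apart.
            pass
```

Calling `ensure_remote_dirs(ftp, "a/b/c.html")` before an upload tries to create `a` and then `a/b`, so the later `STOR` has somewhere to put the file.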
I sent an email to Dr. Fang and posted a notice in the New Thread announcing the arrival of my fabulous mirror site. The whole story is perfect.
In summary, for those who are fortunate enough to own a website and domain, and thus have some space and bandwidth available, here is the procedure to mirror the New Thread.
1) Create a subdomain or subdirectory under your main domain. This directory is for the mirror.
2) Download all webpages from the New Thread website to your local PC and upload them to the mirroring directory. I suggest using Teleport Pro to download and WS_FTP to upload files. Be sure to maintain the directory structure.
3) Create an FTP account specifically for mirroring. That is, the default FTP directory should be the mirroring directory.
4) Download and install the PHP package on your own local PC. I suggest you install PHP under C:\php or D:\php to avoid problems with Windows directory names.
5) Make sure “allow_url_fopen” = 1 in the php.ini file. Read the PHP installation text.
6) Obtain a copy of mirrorxys.php from me (it can be downloaded from my website). Put mirrorxys.php in the PHP directory to avoid any hidden conflicts or settings problems.
7) Change the FTP server settings in mirrorxys.php. Make any other modifications you wish.
8) Run mirrorxys.php using the command “php mirrorxys.php” to mirror all new materials in the “what’s new” pages. Use “php mirrorxys.php -f” to mirror forum posts.
9) It’s done. Just run it every day.
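For readers curious how a switch like “-f” can select what gets mirrored, here is a small Python sketch. It is not taken from mirrorxys.php: both start-page addresses are invented placeholders, and treating “-f” as "forum instead of what's new" is an assumption about how such a switch could behave.

```python
import argparse

# Hypothetical start pages; the real addresses are not given in this post.
WHATS_NEW_URL = "http://source.example/new.html"
FORUM_URL = "http://source.example/forum/index.html"


def choose_start_page(argv):
    """Pick the page to crawl from, based on the command-line arguments."""
    parser = argparse.ArgumentParser(description="Mirror new pages to an FTP site.")
    parser.add_argument(
        "-f",
        "--forum",
        action="store_true",
        help="mirror forum posts instead of the what's-new page",
    )
    args = parser.parse_args(argv)
    return FORUM_URL if args.forum else WHATS_NEW_URL


print(choose_start_page([]))
print(choose_start_page(["-f"]))
```

In a real script the argument list would come from `sys.argv[1:]`, and the chosen page would be handed to the link-collecting step.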
For some reason, the mirroring program runs slowly. I guess it is because I upload files over FTP one by one, so there are quite a lot of back-and-forth transactions between the sites.
For your information, mirrorxys is only my second PHP program, and it was written in Notepad (what a pity!). Don’t expect too much. And be polite: give suggestions, not criticism.