Project Phoenix

Rebuilding and expanding a mission‑critical Solaris/SPARC cluster across multiple sites.

Overview

The client
A major pan‑European financial institution serving around 15 million clients across Italy, Germany, Austria, and Central & Eastern Europe.
Its markets and investment banking division operates at multi‑billion‑euro scale, supporting international trading, treasury, and structured finance activities. The organisation is known for its large, complex technology landscape, combining legacy platforms with modern digitalisation initiatives to support high‑volume, real‑time financial operations.
The Project
To deliver a full relocation and expansion of a Tier‑1 business‑critical trading cluster from Germany to Italy, replacing the original 2-node active/active (prod/DR) environment with three 2-node clusters spanning production, disaster recovery, and a dedicated non‑production replica.
Application Stack
The platform provided front‑to‑back support for repo and securities‑lending operations, combining trade capture, pricing, position management, and lifecycle processing in a single integrated system.
It was used across front‑office, middle‑office, and back‑office teams, supporting execution, risk, settlement, collateral, and operational workflows.
The system was tightly integrated with a market‑connectivity layer that handled electronic interfaces to repo markets and trading venues, enabling real‑time interaction with external liquidity sources.
Note:
All system names, domains, client references, and geographic details have been anonymised for confidentiality.

Background

The existing environment consisted of:
  • A 2-node active/active cluster, split across 2 German data centres
  • Each node had the capacity to handle the full load, so both acted as Production and Disaster Recovery
  • Sun SPARC Enterprise M4000 servers (circa 2007–2008)
  • Oracle Solaris 10
  • Solaris Cluster 3.x
All components were long past Premier Support, with hardware EOSL in the mid‑2010s and Solaris 10/Cluster 3.x only receiving limited Extended Support until 2027.

Key Implications of the Legacy Stack

  • No new patches, firmware, or vendor fixes
  • Deeply embedded customisations, operational tooling, and additional components unavailable for reinstall
  • No ability to reinstall or rebuild from scratch
  • Hardware too old and fragile to be physically relocated
  • Cluster version incompatible with modern hardware
  • Zero vendor support for troubleshooting

Project objectives

The Project was split into three major phases:
Phase 1 - New Production Cluster
Replicate the 2-node cluster to Italian Data Center 1, active/active, as Production only.
The German cluster then becomes the Disaster Recovery site.
Phase 2 - New Disaster Recovery Cluster
Replicate the new production cluster in Italy to Data Center 2.
Work out how to synchronise data between the two clusters for failover.
This second cluster becomes the Disaster Recovery site.
The German cluster can then be decommissioned.
Phase 3 - Third Copy Cluster
Create another copy of the production cluster in Italian Data Center 3.
This becomes the inactive third copy.

Major challenges

1. Hardware too old to move or power‑cycle safely

The original SPARC M4000 systems were at high risk of failure if transported or even rebooted. This forced a strategy of non‑intrusive extraction, cloning, and remote analysis.

2. No installation media for critical software

Many components of the original application stack and cluster configuration were no longer available from the vendor, not stored internally, and not reproducible from scratch. This made image‑based cloning and rsync‑driven reconstruction the only viable method.

3. Zero vendor support

With hardware EOSL and Solaris 10/Cluster 3.x in limited Extended Support, there was effectively no vendor assistance, no updated documentation, no new patches, and no troubleshooting help. Every step — from cluster reconstruction to storage provisioning — had to be researched, tested, and validated manually.

4. Rebuilding a cluster without reinstalling Solaris

Because fresh installation was impossible, the new clusters had to be built from system images, re‑parameterised, re‑networked, re‑clustered, re‑quorumed, and re‑storage‑mapped, all without breaking compatibility with the legacy application stack.

5. Multi‑site, multi‑cluster consistency

Three clusters had to behave identically despite different hardware, storage arrays, network topologies, interconnects, and site‑level constraints. This required repeatable automation, custom scripts, and extensive failover testing. One recurring example: metadbs were not recognised after replication and had to be recreated.

Key technical work

System imaging and reconstruction

  • Created Solaris FLAR images from the original systems to capture a consistent baseline.
  • Rebuilt environments on newer SPARC hardware using image‑based deployment.
  • Phase 1: Used rsync to migrate application, configuration, and database data.
  • Phase 2+3: data synchronised via SAN-level LUN synchronisation.

Cluster rebuild

  • Reconstructed Solaris Cluster configuration from exported definitions.
  • Re‑created resource groups, HAStoragePlus, GDS, and application services.
  • Re‑established quorum devices and private interconnect networks.
  • Validated fencing, failover, and recovery behaviour end‑to‑end.

Storage and filesystem engineering

  • Replicated all SVM metasets and metadevices: 16 LUNs, 3 metasets.
  • Mapped new LUNs and rebuilt storage configurations on the target arrays.
  • Recreated filesystems with application‑specific parameters for database workloads.
  • Ensured compatibility with existing database and application expectations.

Network and redundancy

  • Implemented IPMP for network interface failover and redundancy.
  • Rebuilt interconnect networks with strict isolation to avoid cross‑cluster interference.
  • Reconstructed resolver, LDAP, and related service configurations for the new sites.

Testing and validation

  • Performed controlled failover simulations across all clusters.
  • Validated cluster membership, quorum behaviour, and fencing logic.
  • Ensured application services behaved consistently across all three environments.
  • Documented procedures to support future migrations and operational tasks.

Outcome

Despite the absence of vendor support, missing installation media, and the fragility of the original hardware, the project delivered three fully functional Solaris/SPARC clusters with identical behaviour across environments. The work provided a safe migration path away from EOSL hardware, a validated disaster‑recovery strategy, and a reproducible process for future rebuilds and migrations.

This engagement required deep knowledge of Solaris internals, clustering, storage, and legacy systems, as well as extensive problem‑solving in an environment with no practical vendor safety net.


Technical Implementation

The following section provides a reconstructed and sanitised walkthrough of the technical steps involved in rebuilding and migrating the legacy Solaris/SPARC cluster environment. All hostnames, domains, and client‑specific identifiers have been removed. Disk sizes, set names, and similar details have also been changed, but the number of devices and resources has not, so the complexity of the project remains unaltered.

Data Collection

Hosts
  1. Original German Production/DR Cluster: originhost01, originhost02
  2. New Italian Production Cluster: prodhost01, prodhost02
  3. New Italian Disaster Recovery Cluster: prodhost03, prodhost04
  4. New Italian 3rd Copy Cluster: prodhost05, prodhost06
Storage - LUNs
Number x size of LUNs provisioned for each of the 4 clusters:
  1. 2 x 1GB
  2. 1 x 5GB
  3. 1 x 50GB
  4. 8 x 100GB
  5. 2 x 150GB
  6. 2 x 500GB
Metasets, metadevices and filesystems
  1. Application: APP-DS:
    d100 -m d101 (2 x 150gb stripe) -> /opt/app
  2. Database Metaset: DBSID-DS
    d110 -m d111 (4 x 100gb stripe) -> /data/DBSID/sybase01
    d120 -m d121 (4 x 100gb stripe) -> /data/DBSID/sybase02
    d130 -m d131 (2 x 500gb stripe) -> /data/DBSID/backup
    d140 -m d141 (1 x 1gb) -> /opt/sybase/admin/DBSID
  3. MQ File Transfer Edition metaset:MQFTE-DS
    d150 -m d151 (1 x 5gb) -> /data/mqfte/config
    d160 -m d161 (1 x 50gb) -> /data/mqfte/files
Cluster Resource Groups
There are 3 cluster resource groups:
  1. APP-rg
  2. DBSID-rg
  3. MQFTE-rg
Cluster Resources
Each cluster runs the same 16 cluster resources, spread across the 3 resource groups.

Phase 1 - New Production Cluster

1. Capture Imaging of Primary Hosts

1.1 Create FLAR images of the original SPARC systems
This was the first time-critical event that had to be arranged with the bank's business side.
The system had to be taken down to single-user mode, to ensure a minimal number of files were altered.
Only the network interface and routing were started, to allow the image to be written to the NFS server.
# flarcreate -x  -x  -S -n originhost01 -L cpio /mnt/originhost01.flar
# flarcreate -x  -x  -S -n originhost02 -L cpio /mnt/originhost02.flar
1.2. Export Cluster Configuration
# cluster export > /mnt/origHost1/clusterconfig.xml
1.3 Copy Disk configuration settings
# metaset | grep -i set
# metastat -s DBSID-DS -p > /mnt/origHost1/DBSID-DS.lst
# metastat -s APP-DS -p > /mnt/origHost1/APP-DS.lst
# metastat -s MQFTE-DS -p > /mnt/origHost1/MQFTE-DS.lst
# metadb -i > /mnt/origHost1/metadb-i.lst
# cat /etc/vfstab  > /mnt/origHost1/vfstab.lst
# cat /etc/hosts > /mnt/origHost1/hosts.lst
# echo | format > /mnt/origHost1/format.lst
# cfgadm -al > /mnt/origHost1/cfgadm-al.lst
# devfsadm -v > /mnt/origHost1/devfsadm-v.lst
# mpathadm list lu > /mnt/origHost1/multipath-list.lst
# luxadm probe > /mnt/origHost1/luxadm-probe.lst
# scdidadm -l  > /mnt/origHost1/scdidadm-l.lst

2. Hardware Preparation

  • Rack and cable new SPARC systems
  • Install FC and network cards
  • Configure ILOM/XSCF
  • Request switch ports and validate connectivity
  • Insert Solaris 10 installation media

3. OS Installation on New Hardware

3.1 Connect to the eXtended System Control Facility (XSCF)
ssh prodhost01-rsa -l admin
XSCF> poweron -a
XSCF> showhardconf
3.2 Connect to the domain console
XSCF> console -d0
3.3 Confirm Disks are ok, and then boot off cdrom
{0} ok probe-scsi-all
{0} ok boot cdrom
3.4 The Solaris Installation Program
Networked  [X] Yes
Network interfaces  [X] nxge3
Use DHCP: No
Host name: prodhost01
IP address: 10.10.10.121
System part of a subnet: Yes
Netmask: 255.255.255.0
Enable IPv6: No
Default Route: Specify one
Router IP Address: 10.10.10.1
System identification complete.
Starting Solaris installation program...
Executing JumpStart preinstall phase...
Searching for SolStart directory...
Checking rules.ok file...
Using begin script: install_begin
Using finish script: patch_finish
Executing SolStart preinstall phase...
Executing begin script "install_begin"...
Begin script install_begin execution completed.
3.5 Exit installer, configure netmask, NIC/VLAN/IP and nfs mount
Press F5 (or ESC-5) to exit the installer:
If you exit the Solaris Interactive Installation program, your
profile is deleted. However, you can restart the Solaris
Interactive Installation program from the console window.
F2_Exit Installation    F5_Cancel
Press F2 (ESC-2) to continue to a shell prompt:
To restart the Solaris installation program,
type "install-solaris".
Solaris installation program exited.
# echo "10.0.0.128 255.255.255.224" >> /etc/netmasks
# echo 10.0.0.130 nfsfiler >> /etc/hosts
# ifconfig nxge3777 plumb
# ifconfig nxge3777 down
# ifconfig nxge3777 10.0.0.131 netmask 255.255.255.224 broadcast +
# ifconfig nxge3777 up 
# route add net default 10.0.0.1
# mount nfsfiler:/import/SPARC /mnt
3.6a Troubleshooting
# ls -la /mnt/originhost01.flar
ls: can't read ACL on /mnt/originhost01.flar: Permission denied
# chown nobody:nobody /mnt/originhost01.flar
# getfacl /mnt/originhost01.flar
# file: /mnt/originhost01.flar
# owner: nobody4
# group: nogroup
user::rwx
group::rwx              #effective:rwx
mask:rwx
other:rwx
# mount -o vers=3 nfsfiler:/import/SPARC/ /mnt
3.7 Continue Installer
# install-solaris
Select install from flar:
First DR host - select nfsfiler:/import/SPARC/originhost01.flar
2nd DR host - select nfsfiler:/import/SPARC/originhost02.flar
Select disk, and other OS install options:
Installation Option: Flash
Boot Device: c0t0d0
Root File System Type: ZFS
Client Services: None
Software: 1 Flash Archive
local file: originhost01.flar
Pool Name: rpool
Boot Environment Name: s10s_u11wos_24a
Pool Size: 858407 MB
Devices in Pool: c0t0d0
c0t1d0
Preparing system for Flash install
Configuring disk (c0t0d0)
        - Creating Solaris disk label (VTOC)
Configuring disk (c0t1d0)
        - Creating Solaris disk label (VTOC)
        - Creating pool rpool
        - Creating swap zvol for pool rpool
3.8 Reconfigure the cloned OS image
# mkdir /tmp/A
# zfs set mountpoint=/tmp/A rpool/ROOT/s10s_u11wos_24a
# zfs mount rpool/ROOT/s10s_u11wos_24a
# cd /tmp/A/etc
# vi passwd   (add adminx user)
# vi shadow   (add adminx user)
# vi /etc/sudoers (add adminx user)
# vi /etc/nsswitch.conf (change to files, as ldap will not work on new network)
# zfs umount rpool/ROOT/s10s_u11wos_24a
# zpool export rpool
# sync;sync; halt
{0} ok boot -x

4. Post-Installation Cleanup

The system is now in single-user maintenance mode.
4.1 Update IP address, netmask, hostname, nodename, and VIPs
Point the originhosts at the loopback address:
echo 127.0.0.1 originhost01  originhost01.domain.net >> /etc/hosts
echo 127.0.0.1 originhost02  originhost02.domain.net >> /etc/hosts
4.2 Comment out all SVM mounts in /etc/vfstab
vi /etc/vfstab
4.3 Update DNS servers and domain search paths
vi /etc/resolv.conf
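For example (the nameserver addresses are placeholders; the search domain follows the anonymised domain.net used elsewhere in this document):

```
search domain.net
nameserver 10.10.10.53
nameserver 10.10.11.53
```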
4.4 Disable ldap
svcadm disable ldapclient
4.5 Reconfigure LDAP for the new location

4.6 Restart ldap
svcadm enable ldapclient
4.7 Identify HBA WWNs
List the connected HBAs:
root@prodhost01:~ 12:46:24 luxadm -e port |grep CONNECTED
/devices/pci@1,700000/SUNW,qlc@0/fp@0,0:devctl                     CONNECTED
/devices/pci@3,700000/SUNW,qlc@0/fp@0,0:devctl                     CONNECTED
Verify FC ports are connected and configured:
root@prodhost01:~ 12:48:36  cfgadm -al -o show_FCP_dev |grep fc-fabric
c1                             fc-fabric    connected    configured   unknown
c2                             fc-fabric    connected    configured   unknown
Request SAN storage to be provisioned to these WWNs.

4.8 Create a backup boot environment
#  lucreate -n s10u11_ProdImage.01clean

5. Cluster Clean up - remove nodes

5.1 Remove each host from the old cluster
# clnode remove
Verifying that no unexpected global mounts remain in /etc/vfstab ... done
Verifying that no device services still reference this node ... done
Archiving the following to /var/cluster/uninstall/uninstall.29656/archive:
    /etc/cluster ...
    /etc/path_to_inst ...
    /etc/vfstab ...
    /etc/nsswitch.conf ...
Removing the private hostname from "ntp.conf.sc" on node "originhost01" ...done
Removing the private hostname from "ntp.conf.sc" on node "originhost02" ...done
dumb: Unknown terminal type
clnode:  Unable to remove "etc/cluster/nodeid" entry from the boot archive ("/boot/solaris/filelist.ramdisk")

Attempting to contact the cluster ...
    Trying "originhost01" ... timed out
    Trying "originhost02" ... timed out
Unable to contact the cluster.
Additional housekeeping may be required to unconfigure
originhost02 from the active cluster.

Removing the following:
    /etc/cluster ...
    /dev/global ...
    /dev/md/shared ...
    /.globaldevices ...
    /dev/did ...
    /devices/pseudo/did@0:* ...
The private host entry of this node has been removed from
/etc/inet/ntp.conf.sc, but the NTP service is still enabled. If you
have no further use for the NTP service, you can disable it after the
uninstall command has completed.

The /var/cluster directory has not been removed.
Among other things, this directory contains
uninstall logs and the uninstall archive.
You may remove this directory once you are satisfied
that the logs and archive are no longer needed.
Log file - /var/cluster/uninstall/uninstall.29656/log
# devfsadm -Cv
# init 6
5.2 Create another backup boot environment
# lucreate -n s10u11_ProdImage.02.nocluster

6. Create the New Cluster

6.1 Solaris Cluster Install:
# scinstall
 *** Main Menu ***
    Please select from one of the following (*) options:
      * 1) Create a new cluster or add a cluster node
        2) Configure a cluster to be JumpStarted from this install server
        3) Manage a dual-partition upgrade
        4) Upgrade this cluster node
      * 5) Print release information for this cluster node
      * ?) Help with menu options
      * q) Quit
    Option:  1

  *** New Cluster and Cluster Node Menu ***
    Please select from any one of the following options:
        1) Create a new cluster
        2) Create just the first node of a new cluster on this machine
        3) Add this machine as a node in an existing cluster
        ?) Help with menu options
        q) Return to the Main Menu
    Option:  1

  *** Create a New Cluster ***
    This option creates and configures a new cluster.
    You must use the Oracle Solaris Cluster installation media to install
    the Oracle Solaris Cluster framework software on each machine in the
    new cluster before you select this option.
    If the "remote configuration" option is unselected from the Oracle
    Solaris Cluster installer when you install the Oracle Solaris Cluster
    framework on any of the new nodes, then you must configure either the
    remote shell (see rsh(1)) or the secure shell (see ssh(1)) before you
    select this option. If rsh or ssh is used, you must enable root access
    to all of the new member nodes from this node.
    Press Control-D at any time to return to the Main Menu.
    Do you want to continue (yes/no) [yes]?  yes

  >>> Typical or Custom Mode <<<
    This tool supports two modes of operation, Typical mode and Custom
    mode. For most clusters, you can use Typical mode. However, you might
    need to select the Custom mode option if not all of the Typical mode
    defaults can be applied to your cluster.
    For more information about the differences between Typical and Custom
    modes, select the Help option from the menu.
   Please select from one of the following options:
        1) Typical
        2) Custom
        ?) Help
        q) Return to the Main Menu
    Option [1]:  2
    What is the name of the cluster you want to establish [sc_prodapp]?
    Node name (Control-D to finish):  prodhost01
    Node name (Control-D to finish):  prodhost02
    Do you need to use DES authentication (yes/no) [no]?
    Should this cluster use at least two private networks (yes/no) [yes]?
    Does this two-node cluster use switches (yes/no) [yes]?
    What is the name of the first switch in the cluster [switch1]?
    What is the name of the second switch in the cluster [switch2]?
    Select the first cluster transport adapter: nxge1
    Will this be a dedicated cluster transport adapter (yes/no) [yes]?  yes
 For node "prodhost01",
    Name of the switch to which "nxge1" is connected [switch1]?
 For node "prodhost01",
    Use the default port name for the "nxge1" connection (yes/no) [yes]?
   Select the second cluster transport adapter:nxge2
    Will this be a dedicated cluster transport adapter (yes/no) [yes]?
 For node "prodhost01",
    Name of the switch to which "nxge2" is connected [switch2]?
 For node "prodhost01",
    Use the default port name for the "nxge2" connection (yes/no) [yes]?
 For all other nodes,
    Autodiscovery is the best method for configuring the cluster
    transport. However, you can choose to manually configure the remaining
    adapters and cables.
    Is it okay to use autodiscovery for the other nodes (yes/no) [yes]?
    Is it okay to accept the default network address (yes/no) [yes]?
    Is it okay to accept the default netmask (yes/no) [yes]?
    Do you want to turn off global fencing (yes/no) [no]?
Global Devices File System 
    The default is to use lofi.
 For node "prodhost01",
    Is it okay to use this default (yes/no) [yes]?
For node "prodhost02",
    Is it okay to use this default (yes/no) [yes]?
    Configuring global device using lofi on prodhost02: done
    Is it okay to create the new cluster (yes/no) [yes]?
Interrupt cluster creation for cluster check errors (yes/no) [no]?
  Cluster Creation
    Log file - /var/cluster/logs/install/scinstall.log.10359

    Starting discovery of the cluster transport configuration.
    The following connections were discovered:
        prodhost01:nxge1  switch1  prodhost02:nxge1
        prodhost01:nxge2  switch2  prodhost02:nxge2
    Completed discovery of the cluster transport configuration.
    Started cluster check on "prodhost01".
    Started cluster check on "prodhost02".
    cluster check failed for "prodhost01".
    cluster check failed for "prodhost02".
The cluster check command failed on both of the nodes.
Refer to the log file for details.
The name of the log file is /var/cluster/logs/install/scinstall.log.10359.
    Configuring "prodhost02" ... done
    Rebooting "prodhost02" ... done
    Configuring "prodhost01" ... done
    Rebooting "prodhost01" ...
Log file - /var/cluster/logs/install/scinstall.log.10359
Note: nxge1 and nxge2 are the interconnects. There are no physical switches; the links are direct crossover cables, and the "switch" names above are logical placeholders.

7. Storage Configuration

7.1 Confirm HBAs are connected
List the connected HBAs:
root@prodhost01:~ 12:46:24 luxadm -e port |grep CONNECTED
/devices/pci@1,700000/SUNW,qlc@0/fp@0,0:devctl                     CONNECTED
/devices/pci@3,700000/SUNW,qlc@0/fp@0,0:devctl                     CONNECTED

Verify FC ports are connected and configured:
root@prodhost01:~ 12:48:36  cfgadm -al -o show_FCP_dev |grep fc-fabric
c1                             fc-fabric    connected    configured   unknown
c2                             fc-fabric    connected    configured   unknown
7.2 Scan for the new LUNs
Note that the disk IDs of the LUNs have no apparent sequential order in relation to the metadevices.
This is because storage had been added, removed, and migrated many times over the years.
The SAN storage administrators provided a list of WWNs for all LUNs.
# cfgadm -c configure cX
# cfgadm -al
# cfgadm -al -o show_SCSI_LUN
# devfsadm -Cv
# scgdevs
Confirm that all 16 new LUNs are visible:
# luxadm probe | grep "Logical Path" | wc -l 
Confirm sizes of LUNs, and sort by number of each size:
luxadm probe |grep Logical|awk -F\: '{print"echo "$2";luxadm display "$2"|grep capacity"}'\
   |sh|grep capacity|awk '{print $3" "$4}'| sort | uniq -c |sort
1 5120 MBytes   (1 x 5GB)
1 51200 MBytes  (1 x 50GB)
2 1024 MBytes   (2 x 1GB) 
2 153600 MBytes (2 x 150GB)
2 512000 MBytes (2 x 500GB)
8 102400 MBytes (8 x 100GB)
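The counting pipeline above generates shell via awk and pipes it back through sh. The same tally can be sketched as a single awk pass over saved `luxadm display` output; the file name `luxadm-display.out` is illustrative, and the exact format of the capacity line is an assumption to verify against your output:

```shell
# Tally LUN counts per capacity from saved "luxadm display" output.
# Assumes capacity lines of the form: "Unformatted capacity: 102400.000 MBytes"
awk '/capacity/ { count[$3 " " $4]++ }
     END { for (size in count) print count[size], size }' luxadm-display.out | sort -n
```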
Create a table with the LUN IDs and sizes:
LUN ID                         Size
<WWN_SAN_ID>000000000A1d0      1GB

7.3 Confirm all paths to Storage are active
Each LUN should have 4 paths and all operational:

mpathadm list lu
/dev/rdsk/c3t<WWN_SAN_ID>00A1d0s2
	Total Path Count: 4
	Operational Path Count: 4
Confirm all 16 Luns have 4 paths, and all operational:
mpathadm list lu | grep "Total Path Count: 4" | wc -l 
mpathadm list lu | grep "Operational Path Count: 4" | wc -l 
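The two `wc -l` counts can hide which LU is degraded. A small awk pass over saved `mpathadm list lu` output (file name illustrative, format assumed as shown above) names any LU with fewer than 4 operational paths:

```shell
# Flag LUs whose operational path count is below the expected 4.
awk '/\/dev\/rdsk\// { lu = $1 }
     /Operational Path Count/ && $NF != 4 { print "DEGRADED:", lu, "paths:", $NF }' mpathadm.out
```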
7.4 Format and Label each disk
Confirm there are 18 disks showing (2 internal OS disks + 16 LUNs):
# echo | format | wc -l 
Format, select each disk in turn, and label it.
# format c3t<WWN_SAN_ID>000000000A1d0
selecting c3t<WWN_SAN_ID>000000000A1d0
[disk formatted]
Disk not labeled.  Label it now? y
format> q
7.5 Add the DID for each LUN to the table:
LUN ID                         Size   DID
<WWN_SAN_ID>000000000A1d0      1GB    d4

7.6 Determine the metasets for each disk
We have the LUN sizes, so next label which metaset each LUN belongs to:
LUN ID                         Size    DID   metaset
<WWN_SAN_ID>000000000A1d0      1GB     d4    quorum
<WWN_SAN_ID>000000000A2d0      1GB     d10   DBSID-DS
<WWN_SAN_ID>000000000A3d0      5GB     d5    MQFTE-DS
<WWN_SAN_ID>0000000012Cd0      150GB   d19   APP-DS

7.7 Configure the Quorum
7.7.1 Check initial status

clq status
=== Cluster Quorum ===
--- Quorum Votes Summary from (latest node reconfiguration) ---

            Needed   Present   Possible
            ------   -------   --------
            1        1         1

--- Quorum Votes by Node (current status) ---
Node Name        Present       Possible      Status
---------        -------       --------      ------
prodhost01       1             1             Online
prodhost02       0             0             Online
7.7.2 Add the shared LUN to the quorum
clq add d4
7.7.3 Recheck status

clq status
=== Cluster Quorum ===
--- Quorum Votes Summary from (latest node reconfiguration) ---
            Needed   Present   Possible
            ------   -------   --------
            2        3         3


--- Quorum Votes by Node (current status) ---
Node Name        Present       Possible      Status
---------        -------       --------      ------
prodhost01         1             1             Online
prodhost02         1             1             Online

--- Quorum Votes by Device (current status) ---
Device Name       Present      Possible      Status
-----------       -------      --------      ------
d4                1            1             Online
7.8 Create slices for metadbs on the quorum disk
Create 2 x 128MB slices on the quorum disk.
Use the first slice for the metadbs on host 1:
metadb -f -a -c 3 /dev/dsk/c3t<WWN_SAN_ID>00A1d0s0
Use the second slice for the metadbs on host 2:
metadb -f -a -c 3 /dev/dsk/c3t<WWN_SAN_ID>00A1d0s1
7.9 Assign metadevice IDs and mount references to the LUN table
LUN Size DID metaset metadevice mount
<WWN_SAN_ID>00A1d0 1GB d4 quorum quorum N/A
<WWN_SAN_ID>00A2d0 1GB d10 DBSID-DS d141 admin
<WWN_SAN_ID>00A3d0 5GB d5 MQFTE-DS d151 config
<WWN_SAN_ID>00A4d0 50GB d22 MQFTE-DS d161 files
<WWN_SAN_ID>011Ad0 100GB d18 DBSID-DS d111 sybase01
<WWN_SAN_ID>011Bd0 100GB d9 DBSID-DS d111 sybase01
<WWN_SAN_ID>011Cd0 100GB d11 DBSID-DS d111 sybase01
<WWN_SAN_ID>011Dd0 100GB d7 DBSID-DS d111 sybase01
<WWN_SAN_ID>011Ed0 100GB d15 DBSID-DS d121 sybase02
<WWN_SAN_ID>011Fd0 100GB d17 DBSID-DS d121 sybase02
<WWN_SAN_ID>012Ad0 100GB d28 DBSID-DS d121 sybase02
<WWN_SAN_ID>012Bd0 100GB d35 DBSID-DS d121 sybase02
<WWN_SAN_ID>012Cd0 150GB d19 APP-DS d101 app
<WWN_SAN_ID>012Dd0 150GB d25 APP-DS d101 app
<WWN_SAN_ID>012Ed0 500GB d6 DBSID-DS d131 backup
<WWN_SAN_ID>012Fd0 500GB d14 DBSID-DS d131 backup

7.10 Recreate metasets and metadevices
The metasets must be recreated in the same order as on the existing production hosts:
Set name = APP-DS, Set number = 1
Set name = DBSID-DS, Set number = 2
Set name = MQFTE-DS, Set number = 3
7.10.1 Purge any references to the old metasets transferred during the initial setup:
metaset -s APP-DS -P
metaset -s DBSID-DS -P
metaset -s MQFTE-DS -P
7.10.2 Check the cluster disk group status; if the sets are still showing, they must be removed:

cldg status
/usr/cluster/lib/sc/dcs_config -c remove -s APP-DS
/usr/cluster/lib/sc/dcs_config -c remove -s DBSID-DS
/usr/cluster/lib/sc/dcs_config -c remove -s MQFTE-DS
7.10.3 Recreate the application disk set, devices and filesystem:
metaset -s APP-DS -a -h prodhost01 prodhost02
metaset -s  APP-DS  -a /dev/did/rdsk/d19
metaset -s  APP-DS  -a /dev/did/rdsk/d25
metainit -s APP-DS d101 2 1  /dev/did/rdsk/d19s0 1 /dev/did/rdsk/d25s0
=> APP-DS/d101: Concat/Stripe is setup
metainit -s APP-DS d100 -m d101
=> APP-DS/d100: Mirror is setup
newfs /dev/md/APP-DS/rdsk/d100
=> newfs: construct a new file system /dev/md/APP-DS/rdsk/d100: (y/n)? y
mount /dev/md/APP-DS/dsk/d100 /opt/app
df -h !$
umount !$
7.10.4 Recreate the database disk set, devices and filesystem:
metaset -s DBSID-DS -a -h prodhost01 prodhost02
metaset -s DBSID-DS -a /dev/did/rdsk/d18
metaset -s DBSID-DS -a /dev/did/rdsk/d9
metaset -s DBSID-DS -a /dev/did/rdsk/d11
metaset -s DBSID-DS -a /dev/did/rdsk/d7
metaset -s DBSID-DS -a /dev/did/rdsk/d15
metaset -s DBSID-DS -a /dev/did/rdsk/d17
metaset -s DBSID-DS -a /dev/did/rdsk/d28
metaset -s DBSID-DS -a /dev/did/rdsk/d6
metaset -s DBSID-DS -a /dev/did/rdsk/d14
metaset -s DBSID-DS -a /dev/did/rdsk/d35
metaset -s DBSID-DS -a /dev/did/rdsk/d10

metainit -s DBSID-DS d111 4 1 /dev/did/rdsk/d18s0 1 /dev/did/rdsk/d9s0 1 /dev/did/rdsk/d11s0 1 /dev/did/rdsk/d7s0
=> DBSID-DS/d111: Concat/Stripe is setup
metainit -s DBSID-DS d110 -m d111
=> DBSID-DS/d110: Mirror is setup
newfs /dev/md/DBSID-DS/rdsk/d110
mount /dev/md/DBSID-DS/dsk/d110 /data/DBSID/sybase01
df -h !$
umount !$

metainit -s DBSID-DS d121 4 1 /dev/did/rdsk/d15s0 1 /dev/did/rdsk/d17s0 1 /dev/did/rdsk/d28s0 1 /dev/did/rdsk/d35s0
=> DBSID-DS/d121: Concat/Stripe is setup
metainit -s DBSID-DS d120 -m d121
=> DBSID-DS/d120: Mirror is setup
newfs /dev/md/DBSID-DS/rdsk/d120
mount /dev/md/DBSID-DS/dsk/d120 /data/DBSID/sybase02
df -h !$
umount !$

metainit -s DBSID-DS d131 2 1 /dev/did/rdsk/d6s0 1 /dev/did/rdsk/d14s0
metainit -s DBSID-DS d130 -m d131
newfs /dev/md/DBSID-DS/rdsk/d130
mount /dev/md/DBSID-DS/dsk/d130 /data/DBSID/backup
df -h !$
umount !$
 
metainit -s DBSID-DS d141 1 1 /dev/did/rdsk/d10s0
metainit -s DBSID-DS d140 -m d141
newfs /dev/md/DBSID-DS/rdsk/d140
mount /dev/md/DBSID-DS/dsk/d140 /opt/sybase/admin/DBSID
df -h !$
umount !$
 
7.10.5 Recreate the Message Queue disk set, devices and filesystem:

metaset -s MQFTE-DS -a -h prodhost01 prodhost02
metaset -s MQFTE-DS -a /dev/did/rdsk/d5
metaset -s MQFTE-DS -a /dev/did/rdsk/d22

metainit -s MQFTE-DS d151 1 1 /dev/did/rdsk/d5s0
=> MQFTE-DS/d151: Concat/Stripe is setup
metainit -s MQFTE-DS d150 -m d151
=> MQFTE-DS/d150: Mirror is setup
newfs /dev/md/MQFTE-DS/rdsk/d150
mount /dev/md/MQFTE-DS/dsk/d150 /data/mqfte/config
umount !$

metainit -s MQFTE-DS d161 1 1 /dev/did/rdsk/d22s0
metainit -s MQFTE-DS d160 -m d161
newfs /dev/md/MQFTE-DS/rdsk/d160
mount /dev/md/MQFTE-DS/dsk/d160 /data/mqfte/files
umount !$
7.10.6 Re-enable metadevice mounts
Add updated mount entries:
vi /etc/vfstab
/dev/md/APP-DS/dsk/d100      /dev/md/APP-DS/rdsk/d100    /opt/app                    ufs 2 no logging,forcedirectio,largefiles,noatime
/dev/md/DBSID-DS/dsk/d110    /dev/md/DBSID-DS/rdsk/d110  /data/DBSID/sybase01        ufs 2 no logging,forcedirectio,largefiles,noatime
/dev/md/DBSID-DS/dsk/d120    /dev/md/DBSID-DS/rdsk/d120  /data/DBSID/sybase02        ufs 2 no logging,forcedirectio,largefiles,noatime
/dev/md/DBSID-DS/dsk/d130    /dev/md/DBSID-DS/rdsk/d130  /data/DBSID/backup          ufs 2 no logging,forcedirectio,largefiles
/dev/md/DBSID-DS/dsk/d140    /dev/md/DBSID-DS/rdsk/d140  /opt/sybase/admin/DBSID     ufs 2 no logging,forcedirectio,largefiles,noatime
/dev/md/MQFTE-DS/dsk/d150    /dev/md/MQFTE-DS/rdsk/d150  /data/mqfte/config          ufs 2 no logging,largefiles,noatime
/dev/md/MQFTE-DS/dsk/d160    /dev/md/MQFTE-DS/rdsk/d160  /data/mqfte/files           ufs 2 no logging,largefiles,noatime
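Before rebooting with the updated vfstab, a quick field-count check helps catch malformed entries. A sketch, assuming the edits are staged in a copy named vfstab.new before installing it (a valid vfstab entry has exactly 7 fields):

```shell
# Report any non-comment vfstab entry that does not have exactly 7 fields:
# device-to-mount, device-to-fsck, mount-point, FS-type, fsck-pass,
# mount-at-boot, mount-options.
awk 'NF > 0 && $1 !~ /^#/ && NF != 7 { print "BAD LINE " NR ": " $0 }' vfstab.new
```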

7.11 Create another backup boot environment
#  lucreate -n s10u11_ProdImage.03withSAN

8. Data & Filesystem Migration

This is a time-critical activity, as the live production systems can only be taken down for 24 hours on a Sunday.
The database and all applications must be stopped to ensure no data is altered during the copy.
8.1 Application files
Activity RunBook:
  1. take originhost01 down to single user mode: init S
  2. start network interface, and set route
  3. Copy data using rsync:
    # cd /opt/app; rsync -rugpotvl . prodhost01:/opt/app
8.2 Database files
Activity RunBook:
  1. take originhost02 down to single user mode: init S
  2. start network interface, and set route
  3. with almost 1TB of data, the copy needs to be optimized
    6 concurrent rsync sessions are the optimum, before transfer rate is impacted:
    1. # cd /data/sybase01; rsync -rugpotvl -progress hist* prodhost02:/data/sybase01/
      # cd /data/sybase01; rsync -rugpotvl -progress repo* prodhost02:/data/sybase01/
      # cd /data/sybase01; rsync -rugpotvl -progress temp*. prodhost02:/data/sybase01/
      # cd /data/sybase02; rsync -rugpotvl -progress hist*. prodhost02:/data/sybase02/
      # cd /data/sybase02; rsync -rugpotvl -progress repo. prodhost02:/data/sybase02/
      # cd /data/sybase02; rsync -rugpotvl -progress temp. prodhost02:/data/sybase02/
      
      When those large files finish clean up the remaining files:
      • cd /data/sybase01; rsync -rugpotvl -progress . prodhost02:/data/sybase01/
      • cd /data/sybase02; rsync -rugpotvl -progress . prodhost02:/data/sybase02/
      • cd /data/DBSID/backup; rsync -rugpotvl -progress . /data/DBSID/backup/
      • cd /opt/sybase/DBSID/admin; rsync -rugpotvl -progress . /opt/sybase/DBSID/admin
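The concurrent sessions above can also be launched as background jobs and collected with `wait` before the clean-up pass begins. The sketch below shows that pattern with a local `cp` stand-in (temporary directories, dummy files) so it is self-contained; on the real systems each background job would be one of the rsync commands above.

```shell
# Parallel-copy pattern: one background session per filesystem, then
# wait for all of them before the clean-up pass starts.
# Local cp + temp dirs stand in for the remote rsync sessions.
set -e
SRC=$(mktemp -d); DST=$(mktemp -d)
mkdir -p "$SRC/sybase01" "$SRC/sybase02" "$DST/sybase01" "$DST/sybase02"
echo hist > "$SRC/sybase01/hist01.dat"
echo repo > "$SRC/sybase02/repo01.dat"
for d in sybase01 sybase02; do
    cp -p "$SRC/$d"/*.dat "$DST/$d/" &   # one background copy per filesystem
done
wait                                     # block until every session finishes
echo "bulk copy complete"
```

Scripting it also guarantees the clean-up rsyncs never start while a bulk session is still writing.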

9. Configure IPMP Network Redundancy

Rebuild IPMP configuration for the new environment

Ensure interconnect networks are isolated from existing clusters.
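On Solaris 10, an IPMP group is defined in the /etc/hostname.<interface> files. A minimal link-based active/standby sketch is shown below; the interface names (nxge0/nxge4) and the group name sc_ipmp0 are illustrative assumptions, not the actual production values.

```shell
# /etc/hostname.nxge0 -- active interface, carries the host address
prodhost01 netmask + broadcast + group sc_ipmp0 up

# /etc/hostname.nxge4 -- standby interface in the same IPMP group
group sc_ipmp0 standby up
```

After a reboot (or equivalent ifconfig commands), `ifconfig -a` shows both interfaces in group sc_ipmp0, and failure of the active link moves the data address to the standby interface.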

Allow remote configuration access before running the installer (rpcbind must not be restricted to local connections), then install the cluster software and retrieve the original configuration:
# svccfg -s network/rpc/bind setprop config/local_only=false
# svcadm refresh network/rpc/bind
# scinstall
# svcadm enable network/rpc/scrinstd
# scp nfsfiler:/mnt/originhost01/clusterconfig.xml /var/tmp/clusterconfig.xml

10. Recreating the Cluster Configuration XML


10.1 Copy Original Cluster Config
mkdir /var/tmp/NewCluster
cd !$
scp nfsfiler:/mnt/originhost01/clusterconfig.xml .
10.2 Create newfooter.xml with all the Resource Configs
The file needs to be split at these two lines:
       </devicegroupList>
       <resourcetypeList>
Determine which line numbers they are at:
 cat -n clusterconfig.xml | more
 /resourcetypeList
will show, for example:
   949    </devicegroupList>
   950    <resourcetypeList>
Make a copy of the config file as a new footer:
cp clusterconfig.xml newfooter.xml
Remove all entries up to (but not including) the <resourcetypeList> line:
vi newfooter.xml
:1,949d
10.3 Update newfooter.xml to reflect new cluster config
:%s/originhost01/prodhost01/g
:%s/originhost02/prodhost02/g
:%s/origincluster/prodcluster/g
10.4 Export XML for new Production Cluster in Verona
cluster export > /var/tmp/NewCluster/NewClusterConfig.xml
cd /var/tmp/NewCluster; cp NewClusterConfig.xml newheader.xml
10.5 Create newheader.xml with new cluster settings
This removes all resource definitions from the exported config.
Determine which line number <resourcetypeList> is at:
 cat -n newheader.xml | more
 /resourcetypeList
will show, for example:
   739    </devicegroupList>
   740    <resourcetypeList>
Remove all resource entries (from that line to the end of the file):
vi newheader.xml
:740
dG
10.6 Merge the two configs
cat newheader.xml newfooter.xml > MergedCluster.xml
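The manual vi surgery can also be scripted, which avoids looking line numbers up by hand: awk splits each file at the <resourcetypeList> marker and sed performs the renames. The sketch below runs against two tiny synthetic XML files; on the real system the inputs are clusterconfig.xml and NewClusterConfig.xml and the sed patterns are the hostnames above.

```shell
# Scripted split/rename/merge of the cluster configs (synthetic inputs).
set -e
cd "$(mktemp -d)"
printf '%s\n' '<cluster name="origincluster">' '</devicegroupList>' \
    '<resourcetypeList>' '<resource node="originhost01"/>' > clusterconfig.xml
printf '%s\n' '<cluster name="prodcluster">' '</devicegroupList>' \
    '<resourcetypeList>' > NewClusterConfig.xml
# footer: everything from <resourcetypeList> down in the OLD config
awk '/<resourcetypeList>/{f=1} f' clusterconfig.xml > newfooter.xml
# rename old hosts/cluster to the new names
sed -e 's/originhost01/prodhost01/g' -e 's/originhost02/prodhost02/g' \
    -e 's/origincluster/prodcluster/g' newfooter.xml > newfooter.renamed.xml
# header: everything ABOVE <resourcetypeList> in the new cluster's export
awk '/<resourcetypeList>/{exit} {print}' NewClusterConfig.xml > newheader.xml
cat newheader.xml newfooter.renamed.xml > MergedCluster.xml
cat MergedCluster.xml
```

Because the split key is the marker line rather than a line number, the same script works no matter how large either config grows.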

11. Recreate the Cluster Resource Groups

  1. # clrg create -i /var/tmp/NewCluster/MergedCluster.xml APP-rg
  2. # clrg create -i /var/tmp/NewCluster/MergedCluster.xml DBSID-rg
  3. # clrg create -i /var/tmp/NewCluster/MergedCluster.xml MQFTE-rg

12. Resource Registration

List resource types, and their version, installed on the system:
clrt list -v
Resource Type	          Node-List
----------------------    ---------
SUNW.LogicalHostname:2    <ALL>
SUNW.SharedAddress:2      <ALL>
SUNW.gds:6    	          <ALL>
SUNW.HAStoragePlus:8      <ALL>
SUNW.apache:4.2           <ALL>
SUNW.sybase:5             <ALL>
Register these specific versions:
# clrt register SUNW.LogicalHostname:2
# clrt register SUNW.SharedAddress:2
# clrt register SUNW.gds:6
# clrt register SUNW.HAStoragePlus:8
# clrt register SUNW.apache:4.2
# clrt register SUNW.sybase:5
You might see "already registered" errors for SUNW.HAStoragePlus and SUNW.LogicalHostname if they were pre-registered when the cluster was created; these can be ignored.
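The six registrations can be driven from one loop over the list reported by `clrt list -v`. In the sketch below `clrt` is replaced by a stub shell function so the example is self-contained; on the cluster the real command is used and the `||` branch catches the already-registered cases instead of aborting.

```shell
# Stub for illustration only -- remove on the real cluster node.
clrt() { echo "clrt $*"; }

for t in SUNW.LogicalHostname:2 SUNW.SharedAddress:2 SUNW.gds:6 \
         SUNW.HAStoragePlus:8 SUNW.apache:4.2 SUNW.sybase:5; do
    clrt register "$t" || echo "WARN: $t may already be registered"
done
```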

13. Resource Creation

13.1 Create and enable Storage Resources
  1. # clrs create -g APP-rg -t SUNW.HAStoragePlus -i /var/tmp/NewCluster/MergedCluster.xml APP-Stor
  2. # clrs enable APP-Stor
  3. # clrs create -g DBSID-rg -t SUNW.HAStoragePlus -i /var/tmp/NewCluster/MergedCluster.xml DBSID-Stor
  4. # clrs enable DBSID-Stor
  5. # clrs create -g MQFTE-rg -t SUNW.HAStoragePlus -i /var/tmp/NewCluster/MergedCluster.xml MQFTE-Stor
  6. # clrs enable MQFTE-Stor
13.2 Create and enable the VIPs
  1. # clrs create -g APP-rg -t SUNW.LogicalHostname -i /var/tmp/NewCluster/MergedCluster.xml APP-VIP
  2. # clrs enable APP-VIP
  3. # clrs create -g DBSID-rg -t SUNW.LogicalHostname -i /var/tmp/NewCluster/MergedCluster.xml DBSID-VIP
  4. # clrs enable DBSID-VIP
13.3 Create and enable the Database Resources
Create the database resources:
clrs create -g DBSID-rg -i /var/tmp/NewCluster/MergedCluster.xml DBSID-Syb
clrs enable DBSID-Syb
clrs create -g DBSID-rg -i /var/tmp/NewCluster/MergedCluster.xml DBSID-ssh
clrs enable DBSID-ssh
Online the resource group:
clrg online -M DBSID-rg
13.4 Create and enable the Application Resources
There are multiple dependencies, so each resource must be created and enabled in order:
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml APP-ssh
# clrs enable APP-ssh
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Net_server
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Naming_service
# clrs enable Net_server Naming_service
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Monitoring
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Listener
# clrs enable Monitoring Listener
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Apache-APP
# clrs enable Apache-APP
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml WebGUI
# clrs enable WebGUI
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Gds-agt-APP_QUEUE001_PROD
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml GServBSInput
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml GServXMLTransTicket
# clrs enable Gds-agt-APP_QUEUE001_PROD GServBSInput GServXMLTransTicket
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Ticketdaemon
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Feeddaemon
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Tradeserv
# clrs enable Ticketdaemon Feeddaemon Tradeserv
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml PositionPublisher
# clrs enable PositionPublisher
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Tran
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Ticketsorter
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Trade
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Position
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml PosPubApp
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Prepaytran
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Cache
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Cacheupdate
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Editserver
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Limits
# clrs create -g APP-rg -i /var/tmp/NewCluster/MergedCluster.xml Margin
Enable all remaining resources in the group:
# clrs enable -g APP-rg +
Online the resource group:
# clrg online -M APP-rg
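Because every resource must exist before its dependents are created, the sequence above proceeds in strict batches: create each resource in a batch, then enable the batch together. A sketch of driving that from a batch list (one line per batch, abbreviated here to the first three batches) is shown below; `clrs` is stubbed with a shell function so the example is self-contained.

```shell
CONFIG=/var/tmp/NewCluster/MergedCluster.xml
clrs() { echo "clrs $*"; }            # stub for illustration only

while read -r batch; do
    for r in $batch; do               # create every resource in the batch
        clrs create -g APP-rg -i "$CONFIG" "$r"
    done
    clrs enable $batch                # then enable the whole batch together
done <<'EOF'
APP-ssh
Net_server Naming_service
Monitoring Listener
EOF
```

Keeping the dependency order in one list makes it trivial to replay the sequence on the DR and non-production clusters later.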

14. Testing

  • Perform full failover and recovery tests
  • Validate application behaviour across nodes

15. Final Data Sync

Repeat all actions from Step 8 (Data & Filesystem Migration) to capture any changes made since the initial copy.

16. Failover/Migration to New Cluster

Immediately after the final data sync, service was failed over completely to the new Italian hosts:
  • DNS entries for APP-VIP and DBSID-VIP were changed to the Italian addresses

Phase 2 - New Disaster Recovery Cluster


  • Blacklist the LUNs replicated from production
  • Map the production storage layout to the DR hosts

Phase 3 - Third Copy Cluster