<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Ramblings of a Sysadmin]]></title><description><![CDATA[Random stories about life, problems and technology all within the world of IT.]]></description><link>https://www.gdwnet.com/</link><image><url>http://www.gdwnet.com/favicon.png</url><title>Ramblings of a Sysadmin</title><link>https://www.gdwnet.com/</link></image><generator>Ghost 1.20</generator><lastBuildDate>Wed, 29 Apr 2026 16:42:01 GMT</lastBuildDate><atom:link href="https://www.gdwnet.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[BCP and extreme weather events]]></title><description><![CDATA[In this blog I discuss extreme weather events especially the current heatwave and why people exercising precautions are sensible, not snowflakes. ]]></description><link>https://www.gdwnet.com/2022/07/18/bcp-and-extreme-weather-events/</link><guid isPermaLink="false">62d54b9a3b2a44069851f812</guid><category><![CDATA[BCP]]></category><category><![CDATA[Disaster Recovery]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Mon, 18 Jul 2022 12:47:52 GMT</pubDate><media:content url="https://www.gdwnet.com/content/images/2022/07/cover.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://www.gdwnet.com/content/images/2022/07/cover.png" alt="BCP and extreme weather events"><p>As I write this I'm working from home in the UK on what could be the hottest day of the year so far. It could get even hotter tomorrow and even as I write this there are people (ironically often working from home) who are suggesting that we are snowflakes for not going out and about to keep working. They are wrong.</p>
<p>Before I start I should disclose that I am an advocate of working from home. There are a lot of very good reasons for remote working, such as reduced commute times, more time with family and cost savings. I will concede that the occasional visit to the office can be useful, but the majority of the time remote working is far more beneficial. That's not what this blog is about though, so I'll save it for another one!</p>
<p>Firstly, extreme weather in the UK is not the same as extreme weather in other countries. Just because Canada can deal with extreme cold and Saudi Arabia with extreme heat doesn't mean that we can. Certain engineering choices have been made to cope with the majority of weather events in the UK. The extreme ones, like today, will catch them out.</p>
<p><img src="https://www.gdwnet.com/content/images/2022/07/Screenshot-2022-07-18-131141.png" alt="BCP and extreme weather events"></p>
<p>Secondly, there is a lot of official advice recommending remote working during the extreme heat we are experiencing. Working remotely makes sense. Even if offices are air conditioned you still have to get <em>to</em> them and <em>from</em> them. Both likely require public transport that is struggling as much as people are. By reducing the number of people using the system we reduce the stress on that system and make things just a little more comfortable for those who have no choice other than to use those systems.</p>
<p><img src="https://www.gdwnet.com/content/images/2022/07/Screenshot-2022-07-18-131432.png" alt="BCP and extreme weather events"></p>
<p>So, with all that in mind I am horrified at how some politicians are calling people &quot;snowflakes&quot; and other terms for pointing out the recommendations from a health service that is severely stretched and at breaking point. Surely we should do anything we can to help reduce their workload?<br>
If you're a team lead, manager or C-level, do check on staff, make sure they have whatever they need to get through these extreme events and most importantly - <strong>BE FLEXIBLE.</strong> Different people react in different ways; some may well be fine in this weather, others not so much.</p>
<p>And that nicely brings me on to the topic of BCP. Make no mistake that today and tomorrow are days when BCP should be considered and enacted in similar ways to the pandemic. People will be remote, some may not be able to work due to the heat in the home or due to other family issues caused by the heat. All of this needs to be taken into account and extra flexibility allowed for.<br>
Events like this are also a good practice run for BCP, so take advantage of them and find the flaws now before another once-in-a-lifetime event happens - because they are no longer once-in-a-lifetime events.</p>
<p>As this is primarily an IT blog it would be remiss of me to not point out that there are risks of power cuts and aircon failures in comms rooms, server rooms, etc. While datacentres should have mitigation in place there is still a higher risk of a problem occurring, so it's still worth doing additional checks on your infrastructure to ensure it is also able to cope with the heat. If you don't already, do consider:</p>
<ul>
<li>Per rack temperature monitoring - if you don't have rack temperature monitors you can use the server at the top of the rack as a sensor - heat rises so it'll get the hottest (see the sketch below this list).</li>
<li>Alarms on servers for high temps.</li>
<li>Spare air conditioning capacity.</li>
<li>Turning kit off that you can to reduce the overall heat and power loading.</li>
</ul>
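<p>On the first point, most servers expose their thermal sensors through the management controller (iDRAC, iLO, generic IPMI), so you can often get per-rack readings without buying anything extra. A rough sketch using ipmitool - the host, credentials and interface here are placeholders, and the sensor names vary by vendor:</p>
<pre><code># ask the BMC of the top-of-rack server for its temperature sensors
ipmitool -I lanplus -H 192.0.2.10 -U monitor -P 'changeme' sdr type temperature

# the inlet/exhaust readings can then be fed into whatever monitoring and alerting you already run
</code></pre>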
<p>A few small steps taken now can pay dividends later on. With extreme weather events likely to become the norm we need to be prepared. Do stay safe and as cool as possible.</p>
</div>]]></content:encoded></item><item><title><![CDATA[iDRAC firmware introduces a security check that could trip you up]]></title><description><![CDATA[<div class="kg-card-markdown"><p>A couple of weeks ago, I had a Dell server completely lock up on me, including the iDRAC, which meant a 90-minute trip into the datacentre to give the server a kick. Annoying, but not a major issue.</p>
<p>When I got to the datacentre I power cycled the server</p></div>]]></description><link>https://www.gdwnet.com/2021/08/25/idrac-firmware-introduces-a-security-check-that-could-trip-you-up/</link><guid isPermaLink="false">61261d063b2a44069851f7cc</guid><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Wed, 25 Aug 2021 10:55:42 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>A couple of weeks ago, I had a Dell server completely lock up on me, including the iDRAC, which meant a 90-minute trip into the datacentre to give the server a kick. Annoying, but not a major issue.</p>
<p>When I got to the datacentre I power cycled the server and everything came back up. I thought it would be worth doing a firmware upgrade in case that was the cause of the lock-up. At the time, the latest iDRAC firmware was 2.81.81.81.</p>
<p>Installing iDRAC firmware is pretty easy; once it was done the iDRAC rebooted, and that is when I had a panic.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/08/1.png" alt="1"></p>
<p>At first I thought that the firmware had somehow corrupted, but it turns out that Dell have introduced a new security check which is documented in the <a href="https://dl.dell.com/topicspdf/idrac8-lifecycle-controller-v2818181_release-notes_en-us.pdf">release notes</a> that I did not read.</p>
<p>In short, Dell have introduced a host name check into the iDRAC firmware in order to mitigate a known security issue <a href="https://www.dell.com/support/kbdoc/en-uk/000183758/dsa-2021-041-dell-emc-idrac-8-security-update-for-a-host-header-injection-vulnerability">https://www.dell.com/support/kbdoc/en-uk/000183758/dsa-2021-041-dell-emc-idrac-8-security-update-for-a-host-header-injection-vulnerability</a></p>
<p>The fix is pretty simple: access the iDRAC using its IP and ensure that under the iDRAC network settings the DNS name and host name are set correctly. Once this is done and saved your access problems will go away.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/08/2.PNG" alt="2"></p>
</div>]]></content:encoded></item><item><title><![CDATA[CDNs - the new internet-based single point of failure?]]></title><description><![CDATA[Fastly had a global CDN outage. With it went a large chunk of the internet. Why did this happen and why aren't companies including upstream providers in their DR and BCP plans?]]></description><link>https://www.gdwnet.com/2021/06/09/cdns-the-new-internet-based-single-point-of-failure/</link><guid isPermaLink="false">60c09b1d3b2a44069851f7ac</guid><category><![CDATA[Disaster Recovery]]></category><category><![CDATA[Business Continuity]]></category><category><![CDATA[CDNs]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Wed, 09 Jun 2021 11:33:02 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>On 8th June, the CDN provider Fastly had an outage that they associated with a 'configuration issue'. What that issue was, Fastly have so far refused to say, although they have alluded to it being a <a href="https://www.bbc.co.uk/news/technology-57413224">customer's configuration</a> and not something they directly did. They have also admitted that it is something that they should have forecast.</p>
<p>Outages are going to happen. Anyone who works in IT knows this. Things break, and I will admit that my banner here is a little click-baity: the reason companies and even individuals use providers like Fastly is that they don't have to maintain the infrastructure themselves, and an outage like we saw on 8th June is a very rare thing.</p>
<p>Despite everything I have said, I do wonder just how many companies that have full DR and business continuity plans actually take into account an outage of an upstream provider like Fastly. During the outage it was pretty obvious just how many sites were either fully behind Fastly or had some elements reliant on Fastly.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/06/Several-linked-1.png" alt="Several-linked-1"></p>
<p>As a side note, it is interesting to see that Forbes shows up as not secure. This makes me wonder if they are using SSL offload with Fastly and if the data from Fastly down to Forbes is actually unencrypted.</p>
<p>Above are screenshots of several websites that had issues during the outage. There's a lot more of course, but I didn't want to cover this whole blog with similar screenshots! It's pretty clear that some big websites like Reddit and Forbes are behind Fastly, and this makes sense.<br>
The more hits a site gets, the greater the need for a CDN to help manage the load on the servers themselves, and it is far easier to use a service like Fastly rather than put servers all over the world and deal with the inevitable sync issues.</p>
<p>While issues like the one that Fastly suffered are rare, I do think that as IT pros we need to consider what happens should upstream providers like ISPs and CDNs have a major failure, suffer an attack or even go out of business. All of these are potentially real scenarios. All of these should have some level of DR and BCP planning. It's time for DR and BCP plans to evolve to take into account this brave new world of upstream SaaS and PaaS providers.</p>
<p>Focusing on this outage, I do worry a little as Fastly are being somewhat tight-lipped about it:</p>
<p><img src="https://www.gdwnet.com/content/images/2021/06/fstatus.PNG" alt="fstatus"></p>
<p>I always get suspicious when a company doesn't go into details about an issue. I do think companies like Fastly have a duty to disclose more technical information simply because they have so much internet traffic going through them, but of course Fastly are a private company and as such don't have to say a single thing about this outage. Until and unless there are data sharing requirements around such outages, I do think that companies need to add upstream providers into both their risk portfolios and their DR and BCP practices.</p>
<p><strong>UPDATE</strong></p>
<p>Well, it seems that Akamai felt the need to copy Fastly: just six weeks after Fastly took a nosedive, so did Akamai (<a href="https://www.bleepingcomputer.com/news/security/akamai-dns-global-outage-takes-down-major-websites-online-services/">https://www.bleepingcomputer.com/news/security/akamai-dns-global-outage-takes-down-major-websites-online-services/</a>)</p>
<p>Akamai have released a statement saying that the problem was DNS:</p>
<p><em>At 15:46 UTC today, a software configuration update triggered a bug in the DNS system, the system that directs browsers to websites. This caused a disruption impacting availability of some customer websites</em></p>
<p>Once again, this highlights the fragility of the internet when you put everything behind a single company's load balancers.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Restoring VCentre from a file-based backup]]></title><description><![CDATA[Restoring VCentre from a VCentre backup due to cat-induced corruption... Yup, that's a thing that happened! ]]></description><link>https://www.gdwnet.com/2021/04/19/restoring-vcentre-from-file-backup/</link><guid isPermaLink="false">607d44343b2a44069851f78b</guid><category><![CDATA[vcentre]]></category><category><![CDATA[Vmware]]></category><category><![CDATA[Backups]]></category><category><![CDATA[Disaster Recovery]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Mon, 19 Apr 2021 12:04:10 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Thanks to a combination of a playful cat and a loose power cable, one of my storage systems had an inadvertent removal of power and an outage. This was the storage that VCentre was running on and, as seen in my previous blog <a href="https://blog.gdwnet.com/2020/09/07/recovering-a-corrupt-vcentre">recovering a corrupt vcentre</a>, vcentre really does not like having its power removed. On the plus side, this gave me the opportunity to try out a restore from the vcentre backups that I'd configured vcentre to send to a NAS, and I'm glad I did as I learned quite a few lessons about how to successfully recover vcentre from such an incident.</p>
<p>Before I start with the details, I should point out that this is VCentre 6.7 - I've not tried out 7.0 yet but it's very much on the to-do list.</p>
<p>Once I'd recovered power to the storage array I needed to have a look at the state of the VMs, and I was able to log directly in to the ESXi host to see that most recovered just fine. VCentre, as we've seen before, really hates having the rug pulled from under it, so it was no surprise that it was not exactly in a healthy state. It would not even boot due to corruption in the file system itself. I used <a href="https://kb.vmware.com/s/article/2149838">https://kb.vmware.com/s/article/2149838</a> to get VCentre to at least partially boot, but the postgres DB was in such a state that the only option was a restore. Thanks Cat!!</p>
<p><img src="https://www.gdwnet.com/content/images/2021/04/0.PNG" alt="0"></p>
<p>Now, me being me, I did not read any instructions on how to do this as I thought it would be a point-and-click exercise, but it is a little bit more involved than that, so if you just want to see how to do a restore skip down to the part titled &quot;Getting it all to work&quot;.</p>
<p><strong>Not getting it to work</strong></p>
<p>I thought I'd be able to kick off a restore by firing up the GUI and selecting restore from the menu but that doesn't work at all.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/04/1.png" alt="1"></p>
<p>For some reason, performing the restore this way leads to a screen where no matter what you try you get an &quot;unsupported protocol&quot; message. I did track down a KB article which stated that <a href="https://kb.vmware.com/s/article/75041">restores from NFS and SMB were not supported</a>, which just seems strange to me.<br>
The NAS that I have the backups hosted on does support FTP, but getting FTP set up is quite involved, so I thought it would be easier to restore using another method and perform the restore using the restore.job API.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/04/5.PNG" alt="5"></p>
<p>And nope. It seems that VCentre doesn't support a restore from a system where the firstboot has run - i.e. any system that has previously actually worked! I did have a quick search to see if there was a flag I could reset or delete to remove this firstboot restore lockout, but couldn't find anything, so I gave up on that approach.</p>
<p><strong>Getting it all to work</strong></p>
<p>After some searching, the process I followed to get the restore working was to delete the corrupted VCentre appliance and then deploy a fresh one but rather than select restore on the install screen I selected install in the same way as I would for deploying a new appliance.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/04/6.png" alt="6"></p>
<p>The appliance install is split into two parts. For the restore you need to complete stage 1 only.<br>
After that you select the restore option. Note that the restore path must be the full path down to the JSON file because if you select the root folder where the backups are sent then you will get a non-helpful error.<br>
Do make sure you're using a file path down to a folder that looks like the screenshot below. The folders themselves are in date order so make sure you select the folder with the date you want to restore back to.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/04/10.PNG" alt="10"></p>
<p>Once done you should see a screen looking like this. My backups are held on an NFS volume but I found it interesting that the restore shows it is using port 21.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/04/8.PNG" alt="8"></p>
<p>This process took me about an hour as it was a mix of install and then restore. I am not sure why selecting restore from the installer doesn't go through this particular process but this method worked well and it was fully GUI driven. Once done the GUI started all the services without needing to do a reboot and VCentre was back.</p>
<p>In summary, the restore process is pretty easy and quite slick if you restore the components in the right order, and the right order appears to be to ignore the restore option on the installer, perform a stage 1 VCentre install and then select a restore.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Solarwinds password issue - the intern did it.]]></title><description><![CDATA[Solarwinds have decided that an intern was to blame for a bad 'backdoor' style password being used in their code. Here I explain why this simple statement shows that solarwinds have a lot of issues that need to be dealt with before I'll trust them again. ]]></description><link>https://www.gdwnet.com/2021/03/02/solarwinds-blame-intern/</link><guid isPermaLink="false">603d5fb13b2a44069851f76c</guid><category><![CDATA[Security]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Tue, 02 Mar 2021 13:52:24 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><blockquote>
<p>Former CEO Kevin Thompson echoed Ramakrishna's statement during the testimony. &quot;That related to a mistake that an intern made, and they violated our password policies and they posted that password on their own private GitHub account,&quot; Thompson said. &quot;As soon as it was identified and brought to the attention of my security team, they took that down.&quot;</p>
</blockquote>
<p><em><a href="https://thehackernews.com/2021/03/solarwinds-blame-intern-for-weak.html">https://thehackernews.com/2021/03/solarwinds-blame-intern-for-weak.html</a></em></p>
<p>The above comment was made last week by the former CEO of Solarwinds, Kevin Thompson, and his replacement Sudhakar Ramakrishna.<br>
In short, the outgoing and incoming CEOs of Solarwinds put the blame for the recent security issue on an intern.</p>
<p>Now, there is absolutely no reason to doubt this statement and I am going to accept it as fact. It's also the reason that I'll never be able to trust Solarwinds again, and I suggest that anyone using Solarwinds products think very hard about continuing to do so. Let me explain why.</p>
<p>Firstly, let's have a look at what an intern is.</p>
<blockquote>
<p>a student or trainee who works, sometimes without pay, in order to gain work experience or satisfy requirements for a qualification.</p>
</blockquote>
<p>To summarise - Solarwinds allowed someone who was basically a trainee to modify and submit code containing a weak password, which was then pushed to a product, and during that process no system, no checks and no management caught it.<br>
Of course, Solarwinds takes your security seriously.</p>
<p><img src="https://www.gdwnet.com/content/images/2021/03/image-1.PNG" alt="image-1"></p>
<p>I worked at Symbian for a number of years. During my time there I became a storage admin on their NetApp systems. One of the things we hosted on the NetApp was the Perforce repository. During my time there an iniative was started around code signing, the idea being that apps would need to be signed so that they could be installed on a mobile phone. It's something we see a lot today and it just guarentees the integrity of the app.<br>
Even though I was only a storage admin I still had to go on a two day internal course about how to handle requests to access the area, the perforce repository and so on. Symbian really did take the security seriously and as such, no incidents occured and certainly no intern would ever be given access to such a sensitive area. Sure, they might be told about it, they might even be show that area with a specalist engineer sitting with them but they sure wouldn't have any sort of access to be able to modify anything and this was in something like 2007.</p>
<p>I have several questions for Solarwinds that I'd really like an answer to, here they are:</p>
<ol>
<li>Why did an intern have access to such sensitive data?</li>
<li>Why was an intern allowed to modify such sensitive data?</li>
<li>Why wasn't there any monitoring in place for anything being posted to private GitHub accounts?</li>
<li>Why wasn't the intern working with a senior developer who could have checked for such code submissions?</li>
<li>Why didn't automated code testing find the password?</li>
<li>Did the intern have any training for handling sensitive code?</li>
<li>Why did the intern modify the code? Was it accidental? Malicious? To fix an issue?</li>
<li>What changes have you made to the integrity of the code checking process to ensure that this can never happen again?</li>
<li>How can you claim to treat security seriously when, not long ago, an intern was able to put a backdoor password into sensitive code and it went unchecked for a year?</li>
</ol>
<p>I will make a bet now: before March is out, the new CEO Sudhakar Ramakrishna will backtrack on his statement. It's too late now because the damage is done, and it's not just the damage to the company's reputation for security; it goes deeper than that.</p>
<p>I'm no psychologist, but I suspect that Sudhakar Ramakrishna made an off-the-cuff statement and blamed an intern for this backdoor password. I suspect he did it because he thought that saying &quot;the intern did it&quot; would have people going &quot;Oh, that makes sense, of course someone junior could make a mistake like that&quot;, but it opens up two sets of problems: the first being the security aspects I mentioned above, and the second being that Sudhakar Ramakrishna just revealed that, under his leadership, he will have no issue pointing the finger of blame at someone and throwing them under a bus.<br>
This is not acceptable.</p>
<p>If you blame someone for a mistake they made then you are showing others that you will out them and you will not have their backs. This makes for a very uncomfortable working environment where people will not want to come forward in case others get pushed under the bus.<br>
This also means that other flaws might go unnoticed by Solarwinds because people are too afraid to point them out.</p>
<p>Sudhakar Ramakrishna just made the security and working standards at Solarwinds many times worse and he probably didn't even realise it. The fact that Kevin Thompson backed him up suggests to me that the working standards problem may have been an issue at Solarwinds for some considerable time.</p>
<p>I predict more issues at Solarwinds over the course of this year.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Microsoft have a big problem with their patching and QA processes]]></title><description><![CDATA[Microsoft have admitted to a bug in a recent patch release but their documentation, download site and "fix" for the issue are a woeful mess. ]]></description><link>https://www.gdwnet.com/2021/01/11/ms-qa-and-patching-issues-and-concerns/</link><guid isPermaLink="false">5ffa1f533b2a44069851f755</guid><category><![CDATA[Security]]></category><category><![CDATA[patching]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[Windows 10]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Mon, 11 Jan 2021 13:10:09 GMT</pubDate><media:content url="https://www.gdwnet.com/content/images/2021/01/tls-1.PNG" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://www.gdwnet.com/content/images/2021/01/tls-1.PNG" alt="Microsoft have a big problem with their patching and QA processes"><p>My original intention was to write up a fairly simple blog going through the recent chkdsk fix as seen in this Register article <a href="https://www.theregister.com/2020/12/22/buggy_chkdsk_in_windows_update/">here</a>.<br>
My plan was to take a look at the chkdsk files so I could show a way of validating them across a network and confirming that the version you had installed across your networks was bug-free, as well as to point you to the patch fix so you could run the same tests I did and validate the results.</p>
<p>Unfortunately I can't do that, because the Microsoft patching and QA process is such a mess that I instead want to highlight the range of issues I hit in trying to do something that should be fairly simple - but I'm getting ahead of myself.</p>
<p>Let me start at the beginning and the opening paragraph of the Register article above:</p>
<blockquote>
<p>A Windows 10 update rolled out by Microsoft contained a buggy version of chkdsk that damaged the file system on some PCs and made Windows fail to boot.</p>
</blockquote>
<p>the article then adds the line:</p>
<blockquote>
<p>The updates that included the fault are KB4586853 and KB4592438.</p>
</blockquote>
<p>So, it should be pretty easy to get the list of files in those patches and compare the versions of chkdsk, right?</p>
<p>Well, no. I use Chrome and the latest build has a new feature that warns when downloading over an invalid connection.<br>
While the Windows Update download site doesn't use an insecure connection, the secure connection it does have is invalid because the certificate presented does not match the URL of the site:</p>
<p><img src="https://www.gdwnet.com/content/images/2021/01/One.PNG" alt="Microsoft have a big problem with their patching and QA processes"></p>
<p>Bearing in mind that this is literally the download location for patching, I for one certainly expect MS to have decent if not top-notch security on the site.<br>
I would expect the site to be behind a Web Application Firewall (WAF) (I don't know if it is or not) with at least the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security">HSTS header</a> set and older protocols blocked, so I thought it worth double-checking via the Qualys SSL test site, and the result is pretty dismal:</p>
<p><img src="https://www.gdwnet.com/content/images/2021/01/tls.PNG" alt="Microsoft have a big problem with their patching and QA processes"></p>
<p>That's the first round of issues. To be fair to MS, they are not show stoppers but it is a very poor setup for the site that hosts patches. Moving on!</p>
<p>What about the update file itself?<br>
The one I decided to look into more closely is <a href="https://support.microsoft.com/en-us/help/4592438/windows-10-update-kb4592438">KB4592438</a></p>
<p>The issue with chkdsk is listed close to the bottom of the page and I've screencapped it with the relevant part highlighted:</p>
<p><img src="https://www.gdwnet.com/content/images/2021/01/one-1.PNG" alt="Microsoft have a big problem with their patching and QA processes"></p>
<p>This confuses me because the update itself has a problem with chkdsk, but the text says that the issue is resolved while neglecting to say how it is resolved. The text talks about it taking &quot;24 hours for the resolution to propagate to non-managed devices&quot;.<br>
What resolution? Is this a patch that is downloaded automatically but not listed on the download page?<br>
And what is a &quot;non managed device&quot;? Is it one that's not on the domain? One that isn't in intune? The text doesn't mention anything else about these devices.</p>
<p>The rest of the text is just as baffling:</p>
<blockquote>
<p>enterprise managed devices that have installed the update AND encountered the issue, it can be resolved by installing and configuring a special group policy</p>
</blockquote>
<p>If a group policy resolves it then I guess that managed devices must be on a domain and that is likely the difference between managed and non managed but seriously, why not just say &quot;those on a domain&quot; and make it clearer?</p>
<p>I should also point out that the issue that some machines encounter is that they cannot boot. Just how does a group policy help fix a machine <strong>that cannot boot</strong>?</p>
<p>Anyway, the Group Policy link is actually an MSI file. Opening that up shows two group policy template files - I suspect that they are in an MSI so that they'll install into the policy definitions folder in the sysvol, but I never tested this out as I just extracted the files using 7-Zip and placed them into the necessary folders. I then launched GPMC expecting to see some information on what the GPO does and oh boy was I wrong:</p>
<p><img src="https://www.gdwnet.com/content/images/2021/01/two.PNG" alt="Microsoft have a big problem with their patching and QA processes"></p>
<p>If I'm reading the GPO correctly, this GPO, called KB84586853 issue 002 rollback (catchy name!), will either enable a feature preview if the GPO is enabled or roll back <em>something</em> if it is set to disabled.</p>
<p>What does this actually mean? This is supposedly a fix for a known issue. Why does the GPO talk about feature previews? What sort of feature preview would you even have with chkdsk? One that doesn't corrupt the disk, presumably!!</p>
<p>I suspect that the text is just generic text that has been copied across from other GPOs, but that doesn't excuse the fact that the language is difficult to comprehend in light of what it is supposed to cover and doesn't actually say what it is disabling or enabling.</p>
<p>If I were to install this GPO into any corporate environment I'd expect to have to explain some technical details in a change request, and I just don't have any details of what the GPO does or how it does it. That is a very poor show from Microsoft.</p>
<p>The fix, whatever it is, should be a patch containing just the files necessary to fix this issue. It should have a proper name and not &quot;KB84586853 issue 002 rollback&quot;. It should at least mention chkdsk or autochk as chkdsk is the command but autochk is actually the file that runs the check disk process.</p>
<p>I will be keeping an eye on the next set of patches to see if MS provide any more information or a proper fix for chkdsk. The next set of patches are only a few days away, so I would like to do a follow-up blog to this in a few weeks' time - if time allows, of course.</p>
<p>Feedback and comments are always appreciated, either here or on twitter @garyw_</p>
</div>]]></content:encoded></item><item><title><![CDATA[Finding and fixing a broken cert for VCentre]]></title><description><![CDATA[<div class="kg-card-markdown"><p>I recently needed to add a cert to a vcentre 7.0 environment to allow <a href="https://www.vmware.com/uk/support/services/skyline.html">Skyline</a> and this is normally a straightforward process. Certainly, 7.0 is many times easier than previous versions.</p>
<p>Adding the cert is just a case of going into vcentre -&gt; menu -&gt;</p></div>]]></description><link>https://www.gdwnet.com/2020/12/09/finding-and-fixing-a-broken-cert-for-vcentre/</link><guid isPermaLink="false">5fcd3bf23b2a44069851f737</guid><category><![CDATA[vcentre]]></category><category><![CDATA[Security]]></category><category><![CDATA[Certificates]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Wed, 09 Dec 2020 13:37:01 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>I recently needed to add a cert to a vcentre 7.0 environment to allow <a href="https://www.vmware.com/uk/support/services/skyline.html">Skyline</a> and this is normally a straightforward process. Certainly, 7.0 is many times easier than previous versions.</p>
<p>Adding the cert is just a case of going into vcentre -&gt; menu -&gt; administration -&gt; certificate management</p>
<p>If you're not familiar with certs then I need to quickly explain that certs are basically just files that contain cryptographic elements. For VCentre there are generally three files involved; there may be more if the certificate is issued by a sub CA (Certificate Authority). If the certificate is issued by a root CA then there are just three files involved, and they would be:</p>
<p>CA Public Key<br>
VCentre Public Key<br>
VCentre Private Key</p>
<p>The CA's public key is required as the CA is the issuing body for that cert. The cert's public key is required so that vcentre can pass that cert out to client endpoints, and the private key is required so that the data can be decrypted when it's received.</p>
<p>In VCentre, cert management is basically broken up into two sections, the first section is for the CA cert and the second is for the cert itself.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/12/1.PNG" alt="1"></p>
<p>I suspect that most places will use an internal CA to generate the cert for VCentre and this is perfectly fine, there really is no difference between an internal and an external cert except for how many clients the trust model covers. In the case of an internal cert, only your internal clients should trust the CA.</p>
<p>VCentre requires certs to be in a PEM format which is the standard for many cert generation tools - if you open up the cert file in something like notepad++ then you'll see something like this:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/12/certs.PNG" alt="certs"></p>
<p>As long as it has the <mark>-----BEGIN CERTIFICATE-----</mark> line you know you've got a PEM format cert and can continue.</p>
<p>Adding the root CA cert is just a matter of getting the public key (Which should be easily obtainable, it is designed to be public after all) and adding it to the cert management section.</p>
<p>Adding in the cert for vcentre itself needs the cert to be generated and then the public/private key exported in the previously mentioned PEM format. If you are doing this using a Windows CA then you might end up with a PFX format cert. Windows, liking to be different, assumes that all clients will be Windows and so doesn't provide any option to obtain a PEM format. Not an issue, as there are two ways to convert it. The easiest is via <a href="https://www.sslshopper.com/ssl-converter.html">SSL Shopper</a>.<br>
If you are not keen on giving your cert to a remote site and allowing them to do the conversion then you can do the change yourself if you have access to OpenSSL, which is a standard Linux tool. If you have the Windows Subsystem for Linux then you can use that as well.</p>
<p>The commands in OpenSSL to convert a PFX to PEM format are:</p>
<p><em>openssl pkcs12 -in clientssl.pfx -out clientssl.pem -clcerts</em></p>
<p><em>openssl pkcs12 -in clientssl.pfx -out root.pem -cacerts</em></p>
<p>Now that we have both parts of the certificate in the correct format it is just a matter of adding them to VCentre by clicking on the Machine Cert -&gt; Import and replace certificate -&gt; replace with external CA certificate which will give you this screen:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/12/2.PNG" alt="2"></p>
<p>The first box needs the public part of the certificate and the last box needs the private key. The middle box, the one that says &quot;chain of trusted root certificates&quot;, only needs the public key of the CA cert if you generated the cert from the root CA, or it needs the public keys of the CA and sub-CA if you generated the cert from a sub-CA. I don't use a sub-CA in my environment, but if you do then you just need the public key of the root CA and sub-CA in the same file - it would basically be a file containing two or more -----BEGIN CERTIFICATE----- entries.</p>
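<p>Building that combined chain file is just a case of concatenating the PEM files. A quick sketch - the file names are only examples:</p>
<pre><code># concatenate the issuing chain into one file (sub-CA then root is the usual convention)
cat subCA.pem rootCA.pem &gt; chain.pem

# sanity check - you should see one BEGIN CERTIFICATE block per CA in the chain
grep -c "BEGIN CERTIFICATE" chain.pem
</code></pre>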
<p>When I attempted to add the cert I'd generated I got a strange error.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/12/Capture.PNG" alt="Capture"></p>
<p>I have to admit that this stumped me for a few days. I could not figure out why VCentre was rejecting my cert and as you can see, the error itself is not very descriptive. Even my searches for the error 'Exception Found (Certifcate Exception. Caught exception unable to initalize java.IOexception)' were not finding very much at all.</p>
<p>To cut a long story short I eventually decided to run the certs through openssl to see if it could spot any issues and that's when I discovered that one part of the vcentre cert itself was corrupt. I am not sure how this happened but I suspect that the cert was opened in notepad and notepad being the text editor that it is somehow mangled the format. Either way, the problem was pretty easy to fix as I just did a fresh export of both parts of the cert from the internal CA tool and this time VCentre accepted the three parts I needed and everything worked correctly.</p>
<p>Validating certs in OpenSSL is quite easy to do and it's useful to know how to do it:</p>
<p><em>openssl rsa -in privateKey.key -check</em></p>
<p><em>openssl x509 -in certificate.crt -text -noout</em></p>
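<p>If you want to go one step further and confirm that the public cert and the private key actually belong together (a mismatch produces equally unhelpful errors), comparing the modulus of each is a common trick for RSA certs - a sketch using the same file names as above:</p>
<pre><code># both commands should print the same hash if the cert and key are a pair
openssl x509 -noout -modulus -in certificate.crt | openssl md5
openssl rsa -noout -modulus -in privateKey.key | openssl md5
</code></pre>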
<p>If you're not familiar with certs then it is worth <a href="https://www.gdwnet.com/2018/05/04/pfsense-internal-ca/">setting up a CA server</a> and playing around with internal certs a little. I also <a href="https://www.youtube.com/watch?v=vPYPRLR3sKg">published a vlog</a> on setting up a basic external cert with Let's Encrypt which can be an easier way to start as you do not have to set up your own CA.</p>
</div>]]></content:encoded></item><item><title><![CDATA[The difference between DR and BCP]]></title><description><![CDATA[The differences between BCP and DR are very important to understand as each has an impact on a business's ability to run. Here I break down the key differences between the two.]]></description><link>https://www.gdwnet.com/2020/09/28/the-difference-between-dr-and-bcp/</link><guid isPermaLink="false">5f706c733b2a44069851f714</guid><category><![CDATA[Disaster Recovery]]></category><category><![CDATA[Business Continuity]]></category><category><![CDATA[Brexit]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Mon, 28 Sep 2020 13:04:04 GMT</pubDate><media:content url="https://www.gdwnet.com/content/images/2020/09/DR-Cloud.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://www.gdwnet.com/content/images/2020/09/DR-Cloud.png" alt="The difference between DR and BCP"><p>Something I have been wanting to do a blog on for a few months now is the difference between DR and BCP, as I still feel that there is confusion over the difference between the two. Another reason for writing this is simply that the UK is in a very strange position where BCP is likely to take on a whole new meaning with the upcoming fiasco called Brexit. More on that in another blog.</p>
<p>DR = Disaster Recovery: something has broken so badly that it needs to be recovered. This could be an Exchange server that has failed or a site that has had a fire. In those scenarios you will have systems down, potentially damaged hardware, and will need to recover from backup or replace what was lost.</p>
<p>BCP = Business Continuity Planning: firstly, this is an awful acronym as it suggests that business continuity only needs to be planned and never needs to be tested. With the current world situation, it is clear that the occasional practice always comes in useful.<br>
BCP is all about continuing to allow staff the ability to do their jobs if a site or system is not available. BCP could even be considered as in play when a node of a cluster goes down, as the work of the business continues even while the service is degraded due to the node failure.</p>
<p>In IT, a lot of DR and BCP planning is not formalised. It's pretty obvious that if an office loses its internet connection the people there will still need to be able to do their jobs, so the most likely outcome would be to send them home to use the VPN. Same for serious server issues. If an AD server has a fault you should have a second one to take on the load, but the fault still has to be fixed. Technically, you are into BCP here - the business continues to run as AD isn't a single point of failure, but the service has still suffered a failure as it is not working as planned.</p>
<p>DR always incorporates elements of BCP whereas BCP can be initiated without any need for any sort of preceding IT disaster as was seen with the UK wide COVID-19 lockdown. At that point, the offices were perfectly usable but outside elements forced the enactment of company BCP practices to allow the continuation of business (or as much as could be continued) without access to the usual office-based resources. I should also note that BCP goes far beyond the IT department.</p>
<p>BCP needs to involve every team within the business, as they all have their own unique requirements that can affect their ability to work remotely. It is only when experiencing such an event that the process is put to work, often with a large amount of stress, quick fixes and other 'sticking plaster' solutions that go undocumented, increase tech debt and become something of a time bomb for the future.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Recovering a corrupt 6.7 VCentre after a storage outage]]></title><description><![CDATA[A power outage on my NAS corrupted VCentre, this is how I recovered it. ]]></description><link>https://www.gdwnet.com/2020/09/07/recovering-a-corrupt-vcentre/</link><guid isPermaLink="false">5f4ed1e63b2a44069851f705</guid><category><![CDATA[vcentre]]></category><category><![CDATA[Vmware]]></category><category><![CDATA[Disaster Recovery]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Mon, 07 Sep 2020 14:50:25 GMT</pubDate><media:content url="https://www.gdwnet.com/content/images/2020/09/vcentre-down-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://www.gdwnet.com/content/images/2020/09/vcentre-down-1.png" alt="Recovering a corrupt 6.7 VCentre after a storage outage"><p>I think that there is a law in the UK which dictates that bank holidays can never be nice.<br>
During this year's August summer bank holiday (and the coldest day of summer for some years), my lab Synology decided to crash. Unfortunately, this Synology holds a bunch of VMs including my VCentre VM, and VCentre does not like having its disk removed from under it.</p>
<p>If you are ever unfortunate enough to have a storage issue where you lose the storage and corrupt VCentre, best practice is to restore VCentre from backup; it does not matter if that backup has been taken by the inbuilt VCentre appliance backup or by a backup tool like Veeam. It is just important to have a backup. In this case, though, I will admit that I was curious to see if VCentre had suffered any issues and to look at the options to repair and recover if possible.</p>
<p>With VCentre down I had to connect directly to the ESXi host that the VCentre VM is hosted on and check the status of VCentre via the console. In this case VCentre was in a strange state where it was running but with lots of SCSI sense errors. I was not able to cleanly shut it down so it had to have a hard reboot. VCentre booted up just fine and at first I thought I might have gotten away with it, as everything looked good from the appliance screen:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/appliance-lies.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>That summary page only talks about CPU, RAM and so on. The real detail is under the services, and it was here that I could see that the vpxd service wasn't running. Starting it provided a nasty error message:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/Vcentre-appliance.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>At this point I needed to figure out why vpxd was crashing. The best way to do this was to start the service manually from the shell and see if I got any details. I run VCentre 6.7 and it is a fairly simple task to SSH onto the appliance and then get into a shell with the 'shell' command.</p>
<p>Running the command <em>service-control --start vmware-vpxd</em> gave me an error saying that the service couldn't be started. The error message was not helpful:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/manual-start.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>There was one last place I could look to see if I could spot why the vmware-vpxd service was having a bad day, and that is in the vpxd log.<br>
Taking a look I saw this:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/db-corruption.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>That last line is interesting, it says 'error when executing truncate table vpx_hist_stat1_98'</p>
<p>In SQL, truncate table means to dump all the data from a table; it's basically a delete-all while keeping the table structure intact. If the reason that vpxd cannot start is that it cannot truncate a table, then it should be a simple matter to create a table with the correct structure for vpxd to truncate.</p>
<p>In order to do this, I needed to get into the postgres database. Accessing postgres on the VCSA is pretty easy; it can be done with the command:</p>
<p><em>/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres</em></p>
<p>Listing all the tables can be done with the command<br>
<em>\dt vpx_hist_stat9</em>*</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/missing-db.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>And what do you know? vpx_hist_stat1_98 is missing. I can see plenty of other tables with a similar naming scheme, so I'm assuming that vpx_hist_stat1_97 is a previously truncated table that is just kept around for reasons. Now, Postgres, like all SQL implementations, allows for an existing table structure to be copied to a new table, so if I do that I should have an empty vpx_hist_stat1_98 table for the start-up script to truncate.</p>
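<p>The clone itself is a standard Postgres pattern - something along these lines (a sketch rather than a transcript of exactly what I typed), using vpx_hist_stat1_97 as the template table:</p>
<p><em>CREATE TABLE vpx_hist_stat1_98 (LIKE vpx_hist_stat1_97 INCLUDING ALL);</em></p>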
<p>Of course, this is a bank holiday and so things are not allowed to be that easy!</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/shrondingers-table.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>As you can see above, firstly I tried to see the contents of the vpx_hist_stat1_98 table and got an error because it didn't exist; then I tried to clone the table from a previous one, but that wasn't allowed because it already exists; and finally I tried to drop the table, but that also didn't work.<br>
In essence, Postgres is telling me that the table both exists and doesn't exist depending on what I'm trying to do to it.</p>
<p>So, what to do?</p>
<p>In all SQL implementations I've come across there is a table of tables - this table pretty much tells the database engine what tables exist in the system. Postgres is no different, so what I need to do is tell the Postgres system catalog that vpx_hist_stat1_98 doesn't exist. This means updating the catalog with the command:</p>
<p><em>delete from pg_type where typname~'vpx_hist_stat1_98';</em></p>
<p>This is a bit of a brute force delete and not exactly advisable as it is really getting down into the depths of postgres but I was curious if this would work so I gave it a go.</p>
<p>Now that postgres has been told that the offending table doesn't exist, it should be possible to clone the table from another one - and this time, success!</p>
<p><img src="https://www.gdwnet.com/content/images/2020/09/clone-db.PNG" alt="Recovering a corrupt 6.7 VCentre after a storage outage"></p>
<p>The last command to run is the one to change the owner from postgres to vc. I suspect that if I tried to start up vmware-vpxd with the owner set as postgres it would fail with permissions errors.</p>
<p>Changing the owner of a table in postgres is done with the command:</p>
<p><em>alter table vpx_hist_stat1_98 owner to vc;</em></p>
<p>Now that the vmware-vpxd service has an empty table that it can happily truncate, it should start - and indeed it did; after a few minutes VCentre was fully up and running for me to use.</p>
<p>I have run with this version of vcentre for a week now with no issues, so I suspect that it is all okay, although I do need to add that this is not something I recommend unless there is literally no other option for getting a broken vcentre up and running. Of course, the issue may well be with a different table from the one I had to fix, and I was very lucky that the one I had trouble with was one that VMWare wanted postgres to truncate anyway, so there was no concern about data loss.</p>
<p>The last thing I will add is that if you don't already - back up your VCentre. If you are not sure how to do that, check back here in a few days when I will show you how to do it using the in-built VMWare appliance tools.</p>
</div>]]></content:encoded></item><item><title><![CDATA[Windows DNS flaw is serious - patch now.]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Hopefully you have all heard of the &quot;SIGRed&quot; vulnerability in DNS, a security issue which has existed in Windows DNS servers for 17 years. That means that if you run a Windows-based DNS server you need to patch it ASAP. If you have a Windows DNS server</p></div>]]></description><link>https://www.gdwnet.com/2020/07/15/sigred-dns-flaw-summary/</link><guid isPermaLink="false">5f0f0b273b2a44069851f6f1</guid><category><![CDATA[Security]]></category><category><![CDATA[Windows Server]]></category><category><![CDATA[Dns]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Wed, 15 Jul 2020 21:50:52 GMT</pubDate><media:content url="https://www.gdwnet.com/content/images/2020/07/sigred-1200.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://www.gdwnet.com/content/images/2020/07/sigred-1200.png" alt="Windows DNS flaw is serious - patch now."><p>Hopefully you have all heard of the &quot;SIGRed&quot; vulnerability in DNS, a security issue which has existed in Windows DNS servers for 17 years. That means that if you run a Windows-based DNS server you need to patch it ASAP. If you have a Windows DNS server facing the internet then you really need to patch it now.</p>
<p>The reason for this blog is to hopefully act as a concise summary of the vulnerability and the MS patch numbers you need to deploy and check for. Personally, I find Microsoft's security portal a major pain to navigate, and trying to determine which files actually get replaced to fix this vulnerability is a major headache as it is no longer as simple as that. I hope that this post helps cut through the noise and helps make systems safe from this vulnerability.</p>
<p>The vulnerability is nicely demonstrated by Check Point Software in a video they posted <a href="https://www.youtube.com/watch?time_continue=12&amp;v=PUlMmhD5it8">here</a>. In this example, they use the exploit to crash the DNS server, but I understand that it can also be used to gain admin rights, as a lot of people run DNS on their AD servers.<br>
There is also <a href="https://research.checkpoint.com/2020/resolving-your-way-into-domain-admin-exploiting-a-17-year-old-bug-in-windows-dns-servers/">https://research.checkpoint.com/2020/resolving-your-way-into-domain-admin-exploiting-a-17-year-old-bug-in-windows-dns-servers/</a> which goes into an awesome amount of detail about exactly how this vulnerability works.</p>
<p>In short, the vulnerability can be fixed in one of two ways:</p>
<p><strong>Option 1</strong> - Deploy the monthly security rollup patch for July 2020. The actual KB number changes depending on your OS version; here is a list:<br>
Windows 2008 32/64 bit SP2 - KB4565536<br>
Windows 2012 Core/GUI - KB4565537<br>
Windows 2012 R2 Core/GUI - KB4565541<br>
Windows 2016 Core/GUI - KB4565511<br>
Windows 2019 build 1903/1909 - KB4565483<br>
Windows 2019 build 2004 - KB4565503</p>
<p>It's worth noting that this patch can trigger a few existing bugs and likely triggers some new ones. The most interesting I've seen so far is <a href="https://support.microsoft.com/en-gb/help/4467684">https://support.microsoft.com/en-gb/help/4467684</a> where password errors can occur in the cluster service stating that a password is too short. This appears to occur only if the minimum password length policy is 14 chars or greater.</p>
<p>Another interesting potential side effect is Windows not booting on Fujitsu and Lenovo laptops if they have less than 8GB RAM, so it is worth testing if you use those laptop types.</p>
<p><strong>Option 2</strong> - Change a registry key:<br>
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters<br>
DWORD = TcpReceivePacketSize<br>
Value = 0xFF00</p>
<p>Restart the DNS Server service</p>
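<p>If you want to script that workaround rather than click through regedit, it boils down to one registry write and a service restart. A sketch using built-in commands from an elevated prompt - do double-check the value against the MS advisory before rolling it out:</p>
<pre><code>REM cap the TCP packet size the DNS server will accept
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters" /v TcpReceivePacketSize /t REG_DWORD /d 0xFF00 /f

REM restart the DNS Server service so the new value takes effect
net stop DNS &amp;&amp; net start DNS
</code></pre>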
<p>The full list of OSes, patches and more on the registry key can be found on the MS site <a href="https://portal.msrc.microsoft.com/en-US/security-guidance/advisory/CVE-2020-1350">here</a></p>
<p>The CVE details for the vuln are <a href="https://nvd.nist.gov/vuln/detail/CVE-2020-1350">here</a></p>
<p>At the time of writing this blog there is no proof-of-concept code available. Expect this to change once the patch is reverse-engineered.</p>
<p>I will also note that with tools like Shodan out there, it is very easy to find machines on the internet that are likely to be easy to compromise once exploit code is available. This is not a dig at Shodan; on the contrary, Shodan perform a vital service to help expose how piss poor some setups are.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/07/1-1.PNG" alt="Windows DNS flaw is serious - patch now."></p>
</div>]]></content:encoded></item><item><title><![CDATA[Exploring CentOS 8]]></title><description><![CDATA[In this blog I take my first steps into CentOS 8, exploring the versions of OpenSSL and Apache as well as taking a quick look at the supported TLS versions and cockpit.]]></description><link>https://www.gdwnet.com/2020/06/11/exploring-centos-8/</link><guid isPermaLink="false">5ee159b73b2a44069851f6dd</guid><category><![CDATA[CentOS]]></category><category><![CDATA[Linux]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Thu, 11 Jun 2020 15:32:59 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>As something of a CentOS fan I have been wanting to try out CentOS 8 for a little while now and I finally got some time to do so. I am always curious to see what new features and what versions of various packages are included.</p>
<p>The current released version is 8.1 1911 - 8.2 is coming but does not have a release date yet. Installing 8.1 into VMWare with just one vCPU, one gig of RAM and 16GB of disk went very smoothly. I will likely take the install files and add CentOS 8 to the list of OSes I can deploy via WDS. I have not yet tried out the CentOS 7 unattended file against it, but I do not see any reason why it would not work.</p>
<p>Once 8.1 was installed I did a full yum update, which pulled down 89 updates - not too bad considering that 8.1 has been out for a few months. I should point out that CentOS 8 still supports yum, but <a href="https://www.howtoforge.com/centos-8-package-management-with-dnf-on-the-command-line/">DNF</a> should be the go-to tool for package installs as DNF is the next iteration of yum.</p>
<p>For this initial run through, the things I am most curious to check are the versions of OpenSSL and Apache, as I want to move my hosted machines over to TLS 1.3 and that can only be done with a newer version of OpenSSL. I am pleased to see that 8.1 ships with OpenSSL 1.1.1c; version 1.1.1 is the first release to support TLS 1.3, which offers a lot of security improvements over TLS 1.2, including a much better implementation of <a href="https://www.venafi.com/blog/importance-forward-secrecy-tls-13">Perfect Forward Secrecy</a>.</p>
<p>Installing Apache gave me version 2.4.37, and I still had to install mod_ssl to get access to the SSL engine. I will admit that I'm a little surprised that Apache does not come with mod_ssl built in by default.<br>
Once Apache was installed and I had generated a cert from my PFSense CA, I was able to get a test website up and running using HTTPS over TLS 1.3 in just a couple of minutes:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/06/1.PNG" alt="1"></p>
<p>My first attempt to connect to the website failed because the built-in firewall only allows three inbound services by default: dhcpv6-client, ssh and cockpit.</p>
<p>CentOS 8 uses firewalld, which is a fairly easy to use firewall system. I won't go into detail here as <a href="https://www.cyberciti.biz/faq/how-to-set-up-a-firewall-using-firewalld-on-centos-8/#:~:text=CentOS%208%20comes%20with%20a,a%20frontend%20for%20the%20nftables">this article</a> is an excellent go-to source for all things firewalld related. Once I had added HTTP and HTTPS I was able to connect to my website just fine.</p>
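<p>The commands I used to open those services were along these lines (run as root; --permanent makes the change survive a reload):</p>
<pre><code># Allow HTTP and HTTPS through firewalld, then reload the rules
firewall-cmd --permanent --add-service=http
firewall-cmd --permanent --add-service=https
firewall-cmd --reload
</code></pre>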
<p>Now that the test website was up and running over HTTPS I took another look at the firewalld permanent services and was curious what cockpit was and why it was included in the default list. I had not heard of it before. Well, it appears that cockpit is the Linux world's version of <a href="https://www.gdwnet.com/2017/10/09/trying-out-project-honolulu/">Microsoft's Windows Admin Centre - formerly Honolulu</a> that I covered some time back.</p>
<p>Installation via dnf install cockpit was straightforward, as was replacing the default self-signed cert with an internally trusted one, and once that was done I had full access to cockpit on CentOS.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/06/2.PNG" alt="2"></p>
<p>It is a nice addition and it is easy to retrofit to CentOS 7, although some features are missing there. I am told that some features are missing on CentOS 8 as well; when running cockpit on Ubuntu there is a dashboard option that does not appear on CentOS 8. Cockpit is certainly an interesting addition and something I will be looking to understand better and use more in the future.</p>
<p>One thing I was surprised to see was that 8.1 still allows a login as root over SSH. Other distros have blocked this by default and I would like to see CentOS do the same. Logging in as root is a security risk because it encourages people not to bother creating accounts with sudo-style rights. Other than that minor quibble, I will admit to quite liking CentOS 8 and I will be deploying it to replace my CentOS 6 and 7 boxes.</p>
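<p>Blocking root logins yourself is a one-line change; a minimal sketch, assuming the stock OpenSSH config location:</p>
<pre><code># In /etc/ssh/sshd_config, change (or add) the following line:
PermitRootLogin no

# Then reload sshd to pick up the change:
# sudo systemctl reload sshd
</code></pre>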
</div>]]></content:encoded></item><item><title><![CDATA[Handy Apps - Windows terminal]]></title><description><![CDATA[Exploring the windows terminal and how much better than the cmd prompt it is.]]></description><link>https://www.gdwnet.com/2020/05/29/handy-apps-windows-terminal/</link><guid isPermaLink="false">5ed0d94c3b2a44069851f6d5</guid><category><![CDATA[handy apps]]></category><category><![CDATA[Windows 10]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Fri, 29 May 2020 14:54:44 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Microsoft released the Windows terminal some time back and I have been using it since about version 0.1</p>
<p>If you have not seen the Windows terminal before you can think of it as an upgrade to the command prompt. In many ways, it makes the command prompt more browser-like with the addition of tabs. The Windows terminal allows for simultaneous sessions across various CLI shells. It is possible to have sessions in the standard cmd prompt, PowerShell and Linux all at the same time.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/05/1.png" alt="1"></p>
<p>The really nice thing about Windows terminal is just how configurable it is. If you hit settings you will not see a traditional settings menu; instead you are dropped into a JSON file. While a little scary to look at if you have not seen one before, this approach makes Windows terminal almost infinitely configurable.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/05/2.PNG" alt="2"></p>
<p>There is already a whole list of additional command line profiles that you can just add to the JSON file. There is a good Git bash option <a href="https://stackoverflow.com/questions/56839307/adding-git-bash-to-the-new-windows-terminal">here</a>.</p>
<p>Installing it is very easy. You can get it from the Microsoft store. Personally, I am not that much of a fan of the MS store, so it is awesome to see Microsoft starting to support <a href="https://www.gdwnet.com/2016/06/13/have_you_used_chocolatey/">chocolatey</a> installs: a quick &quot;choco install microsoft-windows-terminal&quot; will grab the latest build. Updates are just as easy to run via a &quot;choco upgrade microsoft-windows-terminal&quot;. Chocolatey will need to run as a local admin for this to work.</p>
<p>There are also a hell of a lot of customisation options that I will touch on in future blog articles. For now, I just want to highlight how much better Windows terminal is than the standard command line. As an example, Windows terminal supports a split pane option and you can get a feel for how powerful this is with the command:</p>
<p>wt -d c:\ ;split-pane -p &quot;command prompt&quot; ; split-pane -p &quot;ubuntu&quot;</p>
<p><img src="https://www.gdwnet.com/content/images/2020/05/3.png" alt="3"></p>
<p>This is three panes. On the left is a standard Windows cmd prompt, top right is the Ubuntu shell and bottom right is PowerShell.</p>
<p>I will be using the Windows terminal in a lot more articles. For now I just wanted to highlight how handy this app is. You should go get it right now!</p>
</div>]]></content:encoded></item><item><title><![CDATA[The problem with VPN's]]></title><description><![CDATA[A review of some issues seen on our corporate VPN in light of covid-19 and remote working.]]></description><link>https://www.gdwnet.com/2020/03/25/the-problem-with-vpns/</link><guid isPermaLink="false">5e79e21f3b2a44069851f6be</guid><category><![CDATA[vpn]]></category><category><![CDATA[Covid-19]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Wed, 25 Mar 2020 17:48:42 GMT</pubDate><media:content url="https://www.gdwnet.com/content/images/2020/03/free-vpn-icon-1640098.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://www.gdwnet.com/content/images/2020/03/free-vpn-icon-1640098.jpg" alt="The problem with VPN's"><p>Like a lot of IT people over the past couple weeks I've been mostly fighting VPN issues and I thought it might be handy to go through a few of the issues I have seen and what (if anything) we have been able to do to fix them.</p>
<p>While I do not want to go into any details about the VPN software we run (for hopefully obvious reasons), I will explain that the VPN solution itself has two modes: split tunnelling (the default), which sends specific routes down the VPN and everything else straight out to the internet, and &quot;send all&quot;, where all traffic is tunnelled to the datacentre where the VPN terminates.</p>
<p>The datacentre itself has a pretty standard shared 1gbit internet line. This line is used for traffic in and out as well as being the link over which all VPN traffic flows. We do offer services over that link as well.</p>
<p>Because of the number of people now working from home, and concerns that someone not using split tunnelling might accidentally flood the link by deciding to stream a high-def movie, the decision was made to disable the send all option, so now everyone has to connect using split tunnelling only. At the same time, additional bandwidth limitations were put in place to prevent any one person from saturating the link by doing something silly like downloading half the file server.</p>
<p>With these mitigations in place we still saw a few issues and I thought it might be worth listing them:</p>
<ol>
<li><strong>Whitelisted Sites</strong></li>
</ol>
<p>I suspect that, like a lot of companies, we have site-to-site VPNs to specific partners. Those partners whitelist our external IP addresses, which is fine when people are in the office or when they use the 'send all' tunnel down to the datacentre. The problem we have seen is that with the send all option disabled, we have had to add a lot of additional routes to the VPN so that traffic to those whitelisted sites now travels down the VPN. While relatively easy to work around with the routing fix, the sheer number of them was a surprise.</p>
<ol start="2">
<li><strong>Bandwidth, especially with remote sites</strong></li>
</ol>
<p>We have a number of remote sites that are linked via site-to-site VPNs. Often, those sites are too small to have their own hardware and so they do not have a VPN endpoint. This means that when their staff work from home they VPN into the datacentre, which can introduce a large amount of latency for them. While far from ideal, for the moment it is the only option we have. We have looked at cloud for this but there are cost concerns.</p>
<ol start="3">
<li><strong>Legacy apps</strong></li>
</ol>
<p>Not exactly a VPN issue, but we discovered a few internal legacy apps that are very sensitive to bandwidth. When connecting to them over the VPN they will sometimes time out because they expect the user to be on pretty much a full 1Gbit link like they would be in the office. In some cases, providing a desktop that users can RDP to has resolved the issue; in other cases, devs have had to take a look under the hood of some seriously old apps.</p>
<ol start="4">
<li><strong>Monitoring</strong></li>
</ol>
<p>For the most part, the VPN solution has only had basic monitoring - is it up? Does it have enough disk space? And so on. Now that we have a lot more people working from home, infosec have requested much more detailed monitoring, which has come with its own challenges. We have largely been able to exploit an API and some text manipulation to get the data we needed.</p>
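<p>I will not name the product, so treat the following as a purely illustrative sketch of the approach rather than our actual tooling - the endpoint, token and JSON field names are all made up:</p>
<pre><code># Hypothetical example: poll a VPN appliance API and count active sessions
TOKEN='replace-with-an-api-token'
curl -s -H "Authorization: Bearer $TOKEN" \
  https://vpn.example.com/api/v1/sessions | jq '.sessions | length'
</code></pre>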
<p>Overall, the user base has adapted well to remote working and the VPN solution has coped admirably. There have been challenges and frustrations but things are starting to head the right way.</p>
</div>]]></content:encoded></item><item><title><![CDATA[BCP and Covid-19]]></title><description><![CDATA[A few thoughts on how business need to be aware of things that threaten the day to day running of the company including threats like Covid-19]]></description><link>https://www.gdwnet.com/2020/03/03/bcp-and-covid-19/</link><guid isPermaLink="false">5e5e161c3b2a44069851f6ae</guid><category><![CDATA[BCP]]></category><category><![CDATA[Dr]]></category><category><![CDATA[Covid-19]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Tue, 03 Mar 2020 12:28:12 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Who remembers the 25th May 2018?</p>
<p>That was the date that GDPR became law; it was also the day that the vast majority of companies sent out emails saying &quot;We have updated our privacy policy&quot;. GDPR was known about for at least two years prior to the launch date, so why did so many companies wait until the very last minute (and in some cases, several days after the last minute) to review and update their privacy policies?</p>
<p>BCP (Business Continuity Planning) is often discussed alongside DR (Disaster Recovery) and while the two certainly go hand in hand during a systems outage or a disaster that takes out a datacentre/cloud provider, BCP is something<br>
that needs to be considered at least monthly, if not weekly or even daily under certain circumstances.</p>
<p>The whole point of BCP is keeping the business running if something happens. Note that I say &quot;something happens&quot; and not &quot;something unexpected&quot;, because BCP isn't there just for the unexpected but also for the expected. What do I mean by that? Well, GDPR is a great example.<br>
GDPR was known about for two years, and yet how many companies ignored it until the final few weeks? While I don't have official figures for this, I will bet that the majority did, based on the flood of emails I received in the 24-hour period before GDPR went live.</p>
<p>A good company would allow its BCP team to pull together the people needed to ensure that GDPR did not cause any disruption to the normal flow of work, and that same BCP team should be keeping watch for anything else that could potentially harm or disrupt normal business.</p>
<p>This leads me to the current concerns around Covid-19, the coronavirus currently at risk of becoming a pandemic. Now, I'm not going to comment on anything medical here; there is some <a href="https://www.nhs.uk/conditions/coronavirus-covid-19/">good advice on the NHS website</a> and, if in doubt, do speak to your local health providers.<br>
What I'm more interested in is how businesses will ensure continuity of service in the event of a Wuhan-style lockdown, or should the general advice become to self-isolate where possible.</p>
<p>A good BCP needs to consider several things:</p>
<ol>
<li>Can staff work remotely?</li>
<li>If staff can work remotely, is the capacity there on any remote access systems or do they need augmenting?</li>
<li>Can additional facilities (e.g. AWS workspaces) be brought online in the cases where remote staff may not have corporate laptops?</li>
<li>How do we cope if office support staff (cleaners, security, etc) are unable to come in?</li>
<li>Do we need to slow down work plans and delay delivery schedules to accommodate sick team members?</li>
<li>How will we handle local system issues where a staff member would be needed to attend site to replace a part?</li>
<li>Do we have sufficient spare parts to handle a slowdown in the supply chain?</li>
</ol>
<p>I know a few companies that are deeply concerned about a breakdown in the supply chain. Many companies shun keeping spares on site, seeing it as a sunk cost with no real pay off because the supply chain has not really let us down in the past. However, that ignores a couple of things. Firstly, the whole supply chain is based on a JIT (just in time) model, so any disruption anywhere along that line can have a real impact. Secondly, as the supply chain often involves transiting several countries, there is a real risk of something like Covid-19 or even a no-deal Brexit causing significant disruption.</p>
<p>This is not to say that panic buying spare parts is the right solution, rather that a balanced approach should be taken over the course of time and stocks slowly increased. There is something of an irony that panic buying could disrupt the very supply chain that companies rely on long before covid-19 (or some other issue) really hits the supply chain.</p>
<p>If the worst does hit and Covid-19 turns into a pandemic, it becomes more important than ever to support staff if they are caught up in the middle of it.<br>
I fear that too many companies will use guilt and intimidation to force sick people to work rather than letting them fully recover. I can also foresee cases of companies abandoning staff who might be on<br>
overseas travel should major disruption hit. Don't do this: support your staff and be understanding around deadlines, schedules and the like. The IT industry talks a lot about being agile and flexible, and this is one of the times when it can really show itself to be agile, flexible and, hopefully, supportive.</p>
<p>The last thing I want to say is simply this - do take care of yourselves and your families before any work commitments, and I hope that Covid-19 ends up blowing itself out in a relatively short time frame.</p>
</div>]]></content:encoded></item><item><title><![CDATA[LDAPS on Windows Servers]]></title><description><![CDATA[<div class="kg-card-markdown"><p>As some are probably aware, Microsoft were planning on releasing a patch in March 2020 that will make some changes to LDAP. Since that announcement MS have admitted that they have had a lot of feedback on this upcoming change and so have scaled back their plans. Their blog article</p></div>]]></description><link>https://www.gdwnet.com/2020/02/05/ldaps-on-windows-servers/</link><guid isPermaLink="false">5e3abf743b2a44069851f69d</guid><category><![CDATA[Security]]></category><category><![CDATA[LDAP]]></category><category><![CDATA[Windows Server]]></category><dc:creator><![CDATA[Gary Williams]]></dc:creator><pubDate>Wed, 05 Feb 2020 14:48:50 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>As some are probably aware, Microsoft were planning on releasing a patch in March 2020 that will make some changes to LDAP. Since that announcement MS have admitted that they have had a lot of feedback on this upcoming change and so have scaled back their plans. Their blog article is pretty good and worth a read just to understand what is coming. You can read it <a href="https://techcommunity.microsoft.com/t5/core-infrastructure-and-security/ldap-channel-binding-and-ldap-signing-requirements-march-update/ba-p/921536/page/2">here</a>.<br>
I will admit that even though this article is a few months old (I assume that the date is in US format, so 4th Nov 2019 - people, please use dates that the whole world recognises), I wasn't aware of the planned changes to LDAP until a few weeks ago.</p>
<p>It is worth having a read through the MS article and checking to see if your AD environment uses LDAP for any services. If it does, it is a very good idea to move over to LDAPS, because plain LDAP does not use any encryption. If it is being used against AD then passwords will be going over the wire in plain text.</p>
<p>Fortunately, enabling LDAPS on AD servers is not a difficult task. All it needs is a cert that supports server authentication and that is it.</p>
<p>If you run a Windows CA environment then the chances are that you already have the necessary certs in place as the Windows CA can do these for you. If not, it is very easy to add them using a different CA host.</p>
<p>The first thing to do is to test LDAP and LDAPS just to confirm the current status. I really like the LDAP browser application for this. It is free and you can download it from <a href="https://www.ldapadministrator.com/download.htm">here</a> (just make sure you click on the LDAP Browser tab as that is the free one).</p>
<p>Once downloaded, install the app, launch it and create a profile, add in the name of one of your AD servers then click on the 'credentials' tab and either select &quot;Currently logged in user&quot; or select &quot;other credentials&quot; and then &quot;GSS negotiate&quot; from the drop down. LDAP on Active Directory does require an authenticated user, it cannot work with an anonymous user.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/02/1.png" alt="1"></p>
<p>Once complete, hit OK and you should get a connection to the LDAP server.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/02/2.png" alt="2"></p>
<p>That means that everything is working on port 389 and this should be the same for all your AD servers. LDAP should work right out of the box.</p>
<p>The next thing to do is test the connection against LDAPS and to do this, it is just a matter of changing the profile to use a secure connection:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/02/3.png" alt="3"></p>
<p>When I run this test against one of my lab AD servers I get a COM error:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/02/4.png" alt="4"></p>
<p>This means that my server is not able to talk LDAPS. Fortunately, enabling it is pretty simple.</p>
<p>To get LDAPS working, I need to install a cert on that server - specifically one that includes &quot;Server Authentication&quot; as a certificate function. Fortunately, my personal internal CA of choice (PFSense) supports this type of cert, so all I need to do in PFSense is create one. I could create one cert for each domain controller, or I could create a wildcard cert, but I am not a fan of either option, so what I will do is create a single cert that covers both AD servers.</p>
<p>In my PFSense internal CA, all I have to do is create a server certificate and list my two AD servers in the Alternative Names section, as this ensures that the cert will work against both servers.</p>
<p><img src="https://www.gdwnet.com/content/images/2020/02/5.png" alt="5"></p>
<p>Once that has been done, I can download the private key and the cert from PFSense. Now, Windows being Windows, I need to convert the cert and private key files into a PFX file, as that is what Windows prefers, and to do this I just use the online tool at <a href="https://www.sslshopper.com/ssl-converter.html">https://www.sslshopper.com/ssl-converter.html</a></p>
<p>At this point, a few people might be horrified that the CA holds the private key and that I am using an online service to do the cert conversion, especially as I will have to hand over my private key. Well, two things to remember: the CA is under my control, so I am not too bothered about it holding the private key, and these certs are internal only. To get my internal clients to trust the CA I need to push out the CA public key to their trusted cert stores. If this were an external, public-facing site I would not be anywhere near as happy to hand over the private key as I am with certs from my internal CA.</p>
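<p>For anyone who would still rather keep the private key off the internet entirely, the same conversion can be done locally with OpenSSL. A minimal sketch, where the file names are just placeholders:</p>
<pre><code># Bundle the cert and private key exported from the CA into a PFX file
# (you will be prompted for an export password)
openssl pkcs12 -export -in ldaps-cert.crt -inkey ldaps-cert.key -out ldaps-cert.pfx
</code></pre>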
<p>Once I've got my PFX file, all I need to do is copy it over to the domain controller and add it into the local machine cert store. It must be the local machine store, as LDAPS is &quot;owned&quot; by the server itself and not by any admin who logs into the server:</p>
<p><img src="https://www.gdwnet.com/content/images/2020/02/6.png" alt="6"></p>
<p>And you should now be able to connect to the server using LDAPS.</p>
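<p>A quick sanity check from any domain-joined machine is simply to confirm that the domain controller is now listening on port 636 (the hostname below is a placeholder):</p>
<pre><code># Confirm the LDAPS port is reachable; a full test still needs an actual LDAPS bind
Test-NetConnection -ComputerName dc01.example.local -Port 636
</code></pre>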
<p>Ben Hooper covers this process on the Windows CA and Let's encrypt side plus he has some very handy powershell commands for interrogating your AD servers for LDAPS. His blog article is worth a read and it can be found <a href="https://astrix.co.uk/news/2020/1/31/how-to-set-up-secure-ldap-for-active-directory">here</a></p>
</div>]]></content:encoded></item></channel></rss>