The guy on the street is talking about incubators and accelerators.
You bump into Dave McClure at your favorite coffee shop.
The car salesman sold software for 20 years.
People are talking about open compute in the public bathrooms.
You see the same faces at Starbucks at the same spots every day, using it like an office space.
The Google Product manager you spoke to a couple of days ago lives right next door.
You meet more Founders than Employees.
Your car breaks down with a software bug.
You drive into an “Infinite Loop”.
Dear Partner Manager at Google / DoubleClick - We’ve been trying to reach you for the past month to get our platform, Reduce Data, certified. In the first response, I got a form within an email without a submit button.
In response to my second email a week later, I was asked to type out the responses line by line, which I did, and I have not received any response since.
I have tried to reach out to folks I know at Google, but it looks like it will take some time before I can get that channel to work for me.
In contrast, last year before the holidays, I got in touch with Facebook and immediately got a response from many senior folks in FB Partner Management. They were eager to sign on new advertising partners, got on calls the following week, and even replied through the weekend, with real names at the end of the emails and contact info in case I wanted to get in touch with them.
This brings me to one key question: Is Google too big to respond to partners, customers and the marketplace in general? I wish it were not true, but it seems to me that it has already become too big to be nimble.
In any case, a request: I am still waiting for a response from an unnamed person in the Google DoubleClick certification team. We are losing valuable business because my company, Reduce Data, is not yet certified.

Almost every day, I bump into entrepreneurs from India in various parts of Silicon Valley. Mountain View, Sunnyvale, Palo Alto, San Francisco, Menlo Park …they’re everywhere.
Surprisingly, many of them are ‘fresh off the boat’. That includes me as well, who came in late July 2012 and got a long term work visa a couple of months later.
In my discussions with Mukund Mohan (EIR, Microsoft Accelerator), it seems that a little over 300 companies have moved just in the last couple of months alone.
Many are gaining good traction, getting funded and hiring a lot of employees (both in India and in the US).
There are several things driving this:
a) Cities like Bangalore have become very expensive.
b) It is probably easier to hire and retain in Silicon Valley than in Bangalore (and sometimes in Chennai).
c) Most software product companies are built for global markets; and there is no reason to build it out of Bangalore or Chennai when you can do it in Silicon Valley
d) Silicon Valley is probably the best place to be in the world to build a technology company (You have to be here to believe it); and the lure of Silicon Valley culture is very strong - It is probably the only place in the world that has a high concentration of technologists, entrepreneurs and venture capitalists. All of it being in one place helps!
e) VP-level jobs are at $100k in Bengaluru and Mumbai.
f) The infrastructure issues that you face (traffic, commute times, power cuts etc) are a serious impediment to building any kind of business.
Having said that, these companies, including mine, aren’t really shutting down their India offices and moving everything to the Valley. We are maintaining an engineering team, and sometimes an operations team as well, back in India.
But the centre of gravity of these companies is definitely moving west to Silicon Valley and while this is really good for the companies, it is definitely not good news for India in general.
I logged in to Godaddy.com to change a domain and guess what? All the domains were missing.
I call customer service; they say I deleted the domains but won’t confirm the IP from which they were deleted.
“Godaddy is a large business and we track millions of customers across the world, so we don’t possibly store the IP or region or computer name from which the domain names were deleted.”
Another one says, “I didn’t say we don’t save this data, it is just that I don’t have access to it”.
Okay - I keep saying I didn’t delete them. I deleted the debit cards that were on file, so it’s possible that Godaddy deleted all my domains along with my debit cards, but no - the agent on the phone doesn’t accept that that’s what actually happened.
“We will restore all of your domains except one domain which was about to expire - we have already released that and it costs $80 to restore it”.
Me - “So you’re holding my property Illegally and asking me to pay money to release it”
Agent - “Sir that’s the charge the registrar charges …we can come halfway and discount $40”
Me - “Why did you release a domain that belonged to me? The registration was still active. And two days before the domain expired, I renewed the .co domain at $30 for a year”.
The agent, “Since the domain was close to expiry, we released it”.
Me - “So you released my private property back to the registrar without even reconfirming with me at that moment”.
Agent - “Yes - You cancelled it”
Me - “But the other domains are still intact”
Agent - “Yes but the last domain was nearing expiry, so we released it”
Me- “But I just renewed that like 2 days ago, which means it should not have been released”
Agent - “Doesn’t matter. You cancelled it again”.
After nearly 30 mins of argument, the agent doesn’t relent. He offers to give a discount but will effectively charge me $30 + $11 for the domain including registration.
I remember that I had requested them to “unlock” that domain so that I could transfer it to Namecheap.com. I never got around to transferring it, but Godaddy either has a serious issue with their system or has found a way to make a quick buck out of customers abandoning ship.
Come to think of it, they might have done more harm by shutting down my key domains by simply releasing them (my private property) back to the registrar.
I cannot believe that in this day and age, someone could be so cheap with their own long-term customers. I’ve learned a valuable business lesson today, and that is to never trust a company like Godaddy ever again.
How many times has this happened to you - You search and buy a product. Then you’re constantly hounded again and again for the next several days with ads of the same product appearing all over. Chances are that it has happened more than once.
Would you buy the same vacuum cleaner twice? Would you want to buy that same book from Amazon or that amazing pair of shoes from Zappos again? I don’t think so, but today many of these ads tend to appear again even after a purchase.
Re-targeting is the method of displaying ads again to users who have seen something of interest but did not complete the transaction.
Re-targeting is useful, but displaying the same ad even after I (or any user) have already bought the product is a waste of advertiser spends and a great way to annoy any user.
Why does this happen? Many re-targeting solutions don’t really exchange necessary data that tells the Ad Networks if the user has purchased the item or not.
And without this, what happens is that you get a constant barrage of ads of the same item displayed on every other site.
This has been happening for a while now [http://www.adexchanger.com/data-driven-thinking/personalized-retargeting-overkill/] and is therefore not really a new issue.
To give you an example, I did an actual transaction yesterday evening (1/22/2013) by buying a Vonage line.
I really don’t want two phone connections today. Please forgive me for even buying the first one :-).
This is a classic case of overuse of re-targeting and probably just an inefficient tool wasting advertiser spends. What’s troubling is that this has happened on a prominent ad network [Google] and not some small random player.
So how does one fix this?
The solution to this problem is a little difficult but doable. Advertisers can exchange data with ad networks (using cross browser cookies or other mechanisms) to avoid such re-targeting issues.
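As a rough illustration of the data exchange described above, here is a minimal sketch that removes users who have already bought a product from a re-targeting audience. The record shapes and the function name `buildRetargetingAudience` are hypothetical and do not correspond to any ad network’s actual API.

```javascript
// Hypothetical record shapes: { userId, productId }.
// viewers   = users who looked at a product
// purchases = completed transactions reported back by the advertiser
function buildRetargetingAudience(viewers, purchases) {
  // Index purchases as "userId:productId" for fast lookups.
  const purchased = new Set(
    purchases.map((p) => `${p.userId}:${p.productId}`)
  );
  // Keep only viewers who have NOT bought that product.
  return viewers.filter(
    (v) => !purchased.has(`${v.userId}:${v.productId}`)
  );
}

const viewers = [
  { userId: "u1", productId: "vonage-line" },
  { userId: "u2", productId: "vonage-line" },
];
const purchases = [{ userId: "u1", productId: "vonage-line" }];

// u1 already bought the line, so only u2 stays in the audience.
const audience = buildRetargetingAudience(viewers, purchases);
console.log(audience);
```

The hard part in practice is not this filter but getting the purchase data to the ad network in the first place, which is exactly the exchange that is missing today.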
A simpler way would be to use a tool / network that has solutions to such problems.
Another option (which also happens to be a shameless plug ;-))
It is also recommended that an advertiser use an independent audit system such as Reduce Data to verify ad spends.
Beyond that, Reduce Data can help identify steps in the campaign funnel causing media waste and give specific recommendations that can help optimize ad spends.
If you would like more information about Reduce Data, please head to our website at http://www.reducedata.com or our blog at http://blog.reducedata.com.
The majority of Internet services are supported by advertising. I think it is OK if a few customers decide to take steps to block ads on their browsers. But when an ISP does that, it creates a ridiculous situation (Fast Company: http://www.fastcompany.com/3004452/french-isp-free-blocks-all-web-advertising).
I think Google and other large players should limit access to free services (Gmail, Search etc) by putting up a SOPA-like blackout for at least 1 day. This blockade should not be seen as a use of force or coercion but rather as a gentle reminder to users that ad revenue is important to Internet-based free services, and that they should request their ISP not to unilaterally block advertising. I also suggest that the services be limited only briefly and that the user can go past the message and continue to use the product after seeing it.
I believe that an action like this will raise awareness of ad-supported services, and hopefully French consumers will force the French ISP to reconsider its decision.
This blog entry is written in response to a blog by Suhail Doshi of Mixpanel titled Bullshit Metrics (http://sufficientlyadvanced.net/bullshit-metrics)
Suhail Doshi of Mixpanel calls user signups, page views and other similar metrics “bullshit metrics”, saying that these metrics don’t really correlate with the success of startups.
Maybe Suhail forgot that only in May, Mixpanel itself had touted that it measured 7 billion actions (http://gigaom.com/2012/05/10/mixpanel-raises-10m-in-bid-to-dominate-data-geekery/). What does that mean anyway? Isn’t that a bullshit metric as well?
Metrics such as user signups and page views are important. Getting enough users to sign up does matter: retention, active users and other such metrics can only be measured after the signups have occurred in the first place.
Page views cannot be wished away when most Internet businesses are still dependent on advertising as a source of revenue:
The blog talks about Tumblr’s 20B impressions each month as one example of a bullshit metric. Now if we all agree that revenue is an important way to identify a key metric, then I guess we all agree that advertising revenue, which is driven by ad impressions, is a key metric. Ad impressions are directly proportional to page views. This means that page views is a metric that cannot be wished away. For many, this may not be the single most important metric, but it is an important metric nevertheless.
Newer ways of measuring engagement, especially in media, such as likes or re-tweets, are also important, but similarly they aren’t necessarily the only metrics that can be correlated with success.
Many Internet businesses depend on advertising revenue (including social media giants like Facebook and Twitter), and what matters to them are page views, which lead to ad impressions, clicks, conversions, cost per conversion, brand lift, etc.
The survey results below clearly highlight the industry standard:

Metrics used by Brand Marketers in North America to determine effectiveness of online ads:

Active users are an important metric for Facebook. But Page Views are important in order to justify ad impressions:
Page views are also measured even through photo flips because this helps Facebook justify the various number of impressions of ads it serves up to each user. As the pages are flipped, the ads change because FB follows the standard practice that was set by various publishers for a long time – display different ads on different pages and earn more.
Facebook’s charges are impression-based or sometimes click-based (Cost Per Mille - CPM - which is the cost of 1,000 ad impressions, or Cost Per Click - CPC). Page views are therefore inherently a very important metric.
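To make the two pricing models concrete, here is a small sketch of the arithmetic. The function names and the sample prices are made up for illustration and are not Facebook’s actual rates; eCPM is included because it lets you compare a click-priced campaign with an impression-priced one.

```javascript
// CPM: the advertiser pays a fixed price per 1,000 impressions.
function cpmCost(impressions, cpm) {
  return (impressions / 1000) * cpm;
}

// CPC: the advertiser pays a fixed price per click.
function cpcCost(clicks, cpc) {
  return clicks * cpc;
}

// eCPM: what a campaign effectively cost per 1,000 impressions,
// whichever way it was priced.
function effectiveCpm(totalCost, impressions) {
  return (totalCost * 1000) / impressions;
}

// Hypothetical campaign: 50,000 impressions and 200 clicks.
console.log(cpmCost(50000, 2));        // 100 (dollars at a $2 CPM)
console.log(cpcCost(200, 0.5));        // 100 (dollars at $0.50 per click)
console.log(effectiveCpm(100, 50000)); // 2
```

Either way, the bill is driven by how many impressions were served, which brings us back to page views.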
Facebook ad screen showing how the ads are sold:

What is changing is that advertisers have been saying that things like Click Through Rate (CTR) or eCPM don’t matter, but none of that has led to any kind of One Key Metric. It is all leading to several different metrics depending upon the kind of campaign run, but again they are all linked to page views.
Engagement, for example videos played, is an important metric. I do not disagree. But it may not be the only key metric. The survey below clearly shows that no single metric dominates for the ad agencies / advertisers who are the primary source of revenue for someone like (Google) YouTube.

Conclusion:
I don’t think everyone is going to start dropping existing metrics and rush to find that single key metric that they need to focus on. There are a lot of people who do not agree with this and are already labeling it a fad. I don’t have an opinion on One Key Metric.
But calling key metrics that actually are directly related to the business’s success as “bullshit metrics” is being stupid or just being ignorant.
Microsoft Research - Demos of some awesome technologies…must see!
There are thousands of blogs and books on performance optimization. Yet, I thought it might be appropriate to put in some of my learning into a blog. So here it goes, a starter guide to performance tuning of web apps.
There are a lot of simple optimizations that can be done at the page level. The simple optimizations are:
Do not use Rails, Play or any other app server to serve static assets. Not even Apache. Instead use Nginx or another reverse proxy, which can handle this with ease, cache the assets (on client browsers) and Gzip them whenever possible.
Nginx configuration to serve static assets is given below:
location ~* ^.+\.(jpg|jpeg|gif|png|ico|css|zip|js|mov|html)$ {
    autoindex on;
    root /home/yourpath;
    expires 30d;
    # If the file is not available, route the request to your app server
    if (!-f $request_filename) {
        proxy_pass http://yourdynamicserver;
    }
    break;
}
Often, reverse proxies like nginx offer simple Gzip compression, where data is sent in a compressed format. Most modern browsers support this and it can be enabled using simple configuration; an example for Nginx is given below:
Gzip compression
# output compression saves bandwidth
gzip on;
gzip_min_length 1000;
gzip_proxied expired no-cache no-store private auth;
# text/html is always compressed, so it does not need to be listed here
gzip_types text/plain application/xml text/css text/javascript;
The HTTP 1.1 spec recommends a limit of about 2 connections per domain, which means that additional simultaneous requests to the same domain wait in a queue. One way to tune this (disclaimer: I have not specifically tried this) is to serve the different requests within your page from different domains.
http://www.stevesouders.com/blog/2008/03/20/roundup-on-parallel-connections/
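A minimal sketch of this idea (sometimes called domain sharding) is given below, assuming three hypothetical hostnames that all point at the same static server. A deterministic hash is used so that a given asset always maps to the same hostname; otherwise the browser would end up caching the same file under several URLs. Treat it as an illustration rather than a tested setup.

```javascript
// Hypothetical hostnames, all assumed to serve the same static files.
const SHARD_HOSTS = [
  "static1.example.com",
  "static2.example.com",
  "static3.example.com",
];

function shardUrl(assetPath) {
  // Simple deterministic string hash, kept within 32 bits.
  let hash = 0;
  for (let i = 0; i < assetPath.length; i++) {
    hash = (hash * 31 + assetPath.charCodeAt(i)) >>> 0;
  }
  // The same path always picks the same host.
  const host = SHARD_HOSTS[hash % SHARD_HOSTS.length];
  return `http://${host}${assetPath}`;
}

console.log(shardUrl("/img/logo.png"));
console.log(shardUrl("/css/site.css"));
```

You would call `shardUrl` wherever your templates emit image, script or stylesheet URLs.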
Most people who start building simple applications start by building dynamic, database-connected applications. My suggestion: don’t build dynamic pages unless you have to. Let’s consider a use case. You have a news website. Guess how often news needs to be updated? Not very often. If so, your app should be a CMS that outputs HTML files that are served statically. Beyond static files, other kinds of caching could also be used to speed up your web app.
A cache is typically a temporary data storage area on disk or in memory (RAM).
Caches can be used to speed up web applications.
A brief Overview of Memcached:
Memcached is a distributed in memory store for storing and retrieving data really fast. Read more about memcached here: http://www.slideshare.net/azifali/memcached-presentation-5729628
HTTP caching, also sometimes referred to as page caching, is the process of putting a cache in front of your application server and caching pieces of your application. Caches like Varnish and Squid are available. For a detailed article on HTTP caching, visit: http://blog.octo.com/en/http-caching-with-nginx-and-memcached/. I will, however, talk about some forms of page caching below, using the file system and using memcached.
The key point to consider in building scalable web apps is to minimize computing power used per request.
The easiest thing to do in situations where the content is the same for all users is to cache the entire page and deliver it via the file system.

You can also use the above pattern to store data in memcached and have nginx connect directly to memcached and serve the files. The result will be much faster than fetching files from a file system. Remember, the ‘HTML Generator’ is a component that you will need to write. This component will fetch results and store them in memcached with the file name as the key. We can then modify the nginx configuration to look in memcached first when a request comes in. If the file is not available in memcached, the request can be passed to the file system or the underlying application server.

location ~* \.(html)$ {
    access_log off;
    expires max;
    add_header Last-Modified "Thu, 26 Mar 2000 17:35:45 GMT";
    set $memcached_key $uri;
    memcached_pass 127.0.0.1:11211;
    # On a cache miss, fall back to the backend
    error_page 404 = @fetch;
}
# Named location: internal only, and it keeps the original request URI
location @fetch {
    access_log off;
    expires max;
    add_header Last-Modified "Thu, 26 Mar 2000 17:35:45 GMT";
    proxy_pass http://backend;
}
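The ‘HTML Generator’ mentioned above is left for you to write. A minimal sketch of what it could look like follows, with a small `Map` wrapper standing in for a real memcached client; `InMemoryCache`, `renderNewsPage` and `publishPage` are hypothetical names. The one detail that matters is that pages are stored under the request URI, because the nginx configuration looks them up with `$uri` as the key.

```javascript
// Stand-in for a memcached client: anything with set(key, value) would do.
class InMemoryCache {
  constructor() {
    this.store = new Map();
  }
  set(key, value) {
    this.store.set(key, value);
  }
  get(key) {
    return this.store.get(key);
  }
}

// Hypothetical renderer: turns a list of headlines into a full HTML page.
function renderNewsPage(headlines) {
  const items = headlines.map((h) => `<li>${h}</li>`).join("");
  return `<html><body><ul>${items}</ul></body></html>`;
}

// Render once, then store under the URI that nginx will use as the key.
function publishPage(cache, uri, headlines) {
  const html = renderNewsPage(headlines);
  cache.set(uri, html);
  return html;
}

const cache = new InMemoryCache();
publishPage(cache, "/news.html", ["Story one", "Story two"]);
console.log(cache.get("/news.html"));
```

You would run something like `publishPage` whenever the content changes, not on every request.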
There are scenarios where the entire page cannot be cached for all users. In such scenarios, it is better to cache a specific data set, a specific part of the page, a precomputed result and so forth. This cache could be specific to each user or a common piece that is used for all users.
A cache is typically an in-memory system such as memcached.
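Example: Loading some dataset into memory using Node.js code (assuming the memcache npm module and an already open MySQL connection):
var memcache = require('memcache');
var client = new memcache.Client(11211, '127.0.0.1');
client.connect();
connection.query("SELECT business_name, id FROM business", function (error, rows, fields) {
    if (error) throw error;
    for (var i = 0; i < rows.length; i++) { // store each business name under its id
        client.set(String(rows[i].id), rows[i].business_name, function (error, result) {});
    }
});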
——
A partial segment could be a table that is repeatedly shown to users. This could be generated and stored in memcached for use at any later time.
Example: Creating a list of links and storing it into memcached using JavaScript / Node.js:
var news = ""; // start from an empty string, not undefined
for (var i = 0; i < items.length; i++)
{
    news = news + items[i] + "<br/>";
}
memcachedvariable.set("news_links", news); // set news links into memcached.
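Both examples above only write into the cache. The reading side usually follows a “check the cache first, compute on a miss” pattern, which could be sketched as below. A plain `Map` stands in for memcached so that the sketch runs on its own; `getOrCompute` and `buildNewsLinks` are hypothetical names, and a real memcached client would be asynchronous.

```javascript
// Return the cached value if present; otherwise compute it once and store it.
function getOrCompute(cache, key, compute) {
  if (cache.has(key)) {
    return cache.get(key); // cache hit: no work done
  }
  const value = compute(); // cache miss: do the expensive work
  cache.set(key, value);
  return value;
}

// Hypothetical expensive step: building the partial segment.
let buildCount = 0;
function buildNewsLinks() {
  buildCount++;
  return ["Story one", "Story two"].join("<br/>");
}

const partials = new Map();
const first = getOrCompute(partials, "news_links", buildNewsLinks);
const second = getOrCompute(partials, "news_links", buildNewsLinks);
console.log(first === second, buildCount); // true 1
```

The second call never touches `buildNewsLinks`, which is the whole point: the computing power per request goes down.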
For a detailed guide on MySQL performance mistakes, please read: http://www.slideshare.net/techdude/how-to-kill-mysql-performance.
Note: Some tips from this slideshow have been compiled into these notes.
Databases are generally harder to tune, simply because databases need to be tuned differently for different use cases. For example, a high-volume read database tends to need different optimization than a high-write database.
There are no exact sets of things to do in tuning a database, but I will try to list some of the most basic things that you would do in order to ensure that the database is up and running well.
Please note: Most of these notes are from MySQL Help with some explanation where necessary
innodb_flush_log_at_trx_commit:
Set this to 0 if you have a high write environment, else set it to 1. By setting it to 0 you’re asking InnoDB to write to the InnoDB log and flush it to disk about once a second instead of on every transaction commit. The trade-off is that up to a second of transactions can be lost in a crash.
innodb_additional_mem_pool_size:
The size in bytes of a memory pool InnoDB uses to store data dictionary information and other internal data structures. The more tables you have in your application, the more memory you need to allocate here. If InnoDB runs out of memory in this pool, it starts to allocate memory from the operating system and writes warning messages to the MySQL error log. The default value is 1MB.
innodb_buffer_pool_size:
The size in bytes of the memory buffer InnoDB uses to cache data and indexes of its tables. The default value is 8MB. The larger you set this value, the less disk I/O is needed to access data in tables.
innodb_commit_concurrency:
The number of threads that can commit at the same time. A value of 0 (the default) permits any number of transactions to commit simultaneously
innodb_file_per_table:
Enable a separate file per table using this variable.
innodb_lock_wait_timeout:
The timeout in seconds an InnoDB transaction may wait for a row lock before giving up. The default value is 50 seconds. A transaction that tries to access a row that is locked by another InnoDB transaction will hang for at most this many seconds before failing with a “Lock wait timeout exceeded” error.
transaction-isolation:
InnoDB supports each of the transaction isolation levels described here using different locking strategies. You can enforce a high degree of consistency with the default REPEATABLE READ level, for operations on crucial data where ACID compliance is important. Otherwise READ COMMITTED works for most use cases.
Under READ COMMITTED, for locking reads (SELECT with FOR UPDATE or LOCK IN SHARE MODE), InnoDB locks only index records, not the gaps before them, and thus permits the free insertion of new records next to locked records.
innodb_log_file_size:
The size in bytes of each log file in a log group. The default value is 5MB. The larger the value, the less checkpoint flush activity is needed in the buffer pool, saving disk I/O. But larger log files also mean that recovery is slower in case of a crash.
Most importantly, ensure that queries are well tested and that enough indexes are available, using the EXPLAIN command.
innodb_thread_concurrency:
InnoDB tries to keep the number of operating system threads concurrently inside InnoDB less than or equal to the limit given by this variable. Once the number of threads reaches this limit, additional threads are placed into a wait state within a FIFO queue for execution. Threads waiting for locks are not counted in the number of concurrently executing threads.
innodb_flush_method:
This variable decides how the innodb data is written to the disk. Recommend O_DIRECT except in the case of SAN based storage.
innodb_lock_wait_timeout:
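Default value is 50 seconds. If you want your app to respond faster in case of write locks, set this value lower. Note that more statements will then fail with lock wait timeouts, and data consistency issues can occur unless your application handles these errors (for example by retrying or rolling back the transaction).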
Key status variables to watch out for:
There are a number of variables that one needs to watch while running MySQL databases. These are called STATUS variables. Here are a few important STATUS variables to watch out for:
Created_tmp_disk_tables:
The number of internal on-disk temporary tables created by the server while executing statements.
If an internal temporary table is created initially as an in-memory table but becomes too large, MySQL automatically converts it to an on-disk table. The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values. If Created_tmp_disk_tables is large, you may want to increase the tmp_table_size or max_heap_table_size value to lessen the likelihood that internal temporary tables in memory will be converted to on-disk tables.
Handler_read_rnd:
The number of requests to read a row based on a fixed position. This value is high if you are doing a lot of queries that require sorting of the result. You probably have a lot of queries that require MySQL to scan entire tables or you have joins that do not use keys properly.
Handler_read_rnd_next:
The number of requests to read the next row in the data file. This value is high if you are doing a lot of table scans. Generally this suggests that your tables are not properly indexed or that your queries are not written to take advantage of the indexes you have.
Innodb_buffer_pool_read_ahead_rnd:
The number of “random” read-aheads initiated by InnoDB. This happens when a query scans a large portion of a table but in random order.
Innodb_row_lock_time:
The total time spent in acquiring row locks, in milliseconds.
Innodb_row_lock_time_avg:
The average time to acquire a row lock, in milliseconds. If this value is high it means your queries are waiting and database needs optimization.
Qcache_hits:
The number of query cache hits. If this is high, that is actually good: it means you’re reading the same data frequently. If this value is low, check whether the query cache is enabled and whether queries are getting written to the cache.
Qcache_inserts:
The number of queries added to the query cache. If this value is high and increasing frequently, cache invalidation is high. If so, stop writing to the query cache by setting query_cache_type=0 in the MySQL configuration file (which is loaded at startup), or set it to 2 to enable the query cache only for queries that begin with SELECT SQL_CACHE.
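Raw counters are hard to judge on their own, so it helps to turn them into ratios. Here is a minimal sketch, assuming the status values have already been fetched (for example with SHOW GLOBAL STATUS) into a plain object. `Created_tmp_tables` and `Com_select` are real MySQL status variables that are not covered in the list above; the hit rate formula is a commonly used approximation, and the sample numbers are made up.

```javascript
// Avoid dividing by zero on a freshly started server.
function ratio(numerator, denominator) {
  return denominator === 0 ? 0 : numerator / denominator;
}

function summarizeStatus(status) {
  // SHOW STATUS returns values as strings, so convert them to numbers.
  const n = (name) => Number(status[name] || 0);
  return {
    // Share of internal temporary tables that spilled to disk.
    tmpTablesOnDisk: ratio(
      n("Created_tmp_disk_tables"),
      n("Created_tmp_tables")
    ),
    // Share of SELECTs answered from the query cache
    // (Com_select does not count cache hits).
    queryCacheHitRate: ratio(
      n("Qcache_hits"),
      n("Qcache_hits") + n("Com_select")
    ),
  };
}

const sample = {
  Created_tmp_disk_tables: "25",
  Created_tmp_tables: "100",
  Qcache_hits: "300",
  Com_select: "100",
};
console.log(summarizeStatus(sample));
// { tmpTablesOnDisk: 0.25, queryCacheHitRate: 0.75 }
```

Watching how these ratios move over time tells you more than any single reading.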
Write asynchronous, non-blocking code wherever possible: As with all apps, a user is typically held up when I/O happens, and people don’t realize in how many places I/O can block.
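As a small illustration of the non-blocking style, the sketch below starts three independent I/O calls at once instead of waiting for each one in turn. `fakeIo` is a made-up stand-in for a real database or HTTP call, and the delays are arbitrary.

```javascript
// Simulated non-blocking I/O call (stand-in for a database or HTTP request).
function fakeIo(name, ms) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`${name} done`), ms);
  });
}

async function loadPageData() {
  // All three calls are in flight together, so the total wait is roughly
  // that of the slowest call rather than the sum of all three.
  const [user, news, ads] = await Promise.all([
    fakeIo("user", 30),
    fakeIo("news", 20),
    fakeIo("ads", 10),
  ]);
  return { user, news, ads };
}

loadPageData().then((data) => console.log(data));
```

The results come back in the order the calls were listed, regardless of which one finishes first.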
I don’t think this is a complete list of how web page optimization is done and I believe that there are more modern techniques in using asynchronous approaches to write high performing web apps.
However, I hope that this list has been a good starter guide to help a small startup get its optimization off the ground. If you have comments or questions, please write to me on Twitter at @azifali or send me an email at asif.ali [at] outlook.com.
Facebook’s analytics using Hbase
Other reference links:
http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html